PDF image conversion

Someone silly sent me a metric fuckton of images embedded in PDFs. They wanted this stuff on the intertubez and they wanted it now. I’m lazy, so I automated the process. Figured I would write it down so I can remember it.

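ImageMagick’s convert tool was my first weapon of choice, but on its own it produced really poor output: it rasterizes the whole PDF page at a low default resolution instead of pulling out the embedded image. So I decided to combine it with the pdfimages tool, which extracts the embedded images as they are. The following snippet assumes the PDF contains only one image of interest.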

pdfimages input.pdf tmp
convert tmp-000.ppm output.jpg

Wrap a loop around it and you’re done. It’s almost too simple to be true.

“Failed to start domain – Host CPU does not provide required features: spec-ctrl”

[root@foo ~]# virsh start bar
error: Failed to start domain bar
error: the CPU is incompatible with host CPU: Host CPU does not provide required features: spec-ctrl

After a recent CentOS update and reboot, certain VMs refused to start, bailing out with the error message above. The interwebz didn’t really offer much in terms of advice. After talking to people with more clue, a working theory was formed: the version combination of kernel/libvirt/kvm/qemu is messed up. Rolling back to an older version was not an option.

The root cause is the Spectre vulnerability and Intel’s mitigation for it. The fix proved to be surprisingly simple: edit the VM definition (“virsh edit bar”) and remove “-IBRS” from the CPU model. Bear in mind that this disables the Indirect Branch Restricted Speculation mitigation for that VM, so consider this a security disclaimer.

To recap:

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Haswell-noTSX-IBRS</model>
  </cpu>

becomes

  <cpu mode='custom' match='exact' check='partial'>
    <model fallback='allow'>Haswell-noTSX</model>
  </cpu>

I’m not sure how this CPU model definition came to be. Did updating libvirt update the definition? Was the definition automatically detected when the VM was created, and did it stop working after an upgrade?

On pointless Java 8 embellishments, or an exercise in simplicity

While performing a code review, I stumbled upon this little gem:

Long num = someFunctionThatNeverReturnsNull();
if(Optional.ofNullable(num).orElse(0l) > 0) {
  // ...
}

The salient part, of course, is the if-statement. It’s rare to come across a single line with so many layers of wrongness.

  1. First, let’s talk about 0l. Depending on your font, that might look like zero one, zero el, o one or o el. By convention in Java, this should be zero EL: 0L. This is easier to read, and the suffix L makes it clear that we’re dealing with a long instead of an integer. That would look something like this:
    if(Optional.ofNullable(num).orElse(0L) > 0) {
      // ...
    }
  2. Second, what’s going on with these data types? 0L is now obviously a primitive long which will be auto-boxed to a Long object. num is a Long object. And then there’s the dangling 0. Which is a primitive integer (int). For one reason or another. Now, I admit that casting to long or int is pretty cheap. This is never going to be a performance issue. But consistency is a good thing. You probably want to compare like with like.
    There’s also a bit of weird auto-(un)boxing going on here. 0L will be auto-boxed to Long. But the result of the orElse() bit will be unboxed to a primitive long. The comparison will then be comparing a long to an int, which causes the int to be widened to a long.
    There isn’t much we can do about the auto-(un)boxing in this case, considering the comparison we want to perform. But we can at least ensure we’re using consistent data type width. So this:

    if(Optional.ofNullable(num).orElse(0L) > 0L) {
      // ...
    }
  3. And lastly, why the fuck are they using Optional in the first place? Never mind the fact that num can’t even be null here. The “plain”, non-embellished form of this statement would have been shorter and easier to read. That would look something like this:
    if(null != num && num > 0L) {
      // ...
    }

Here is the “corrected” version:

long num = someFunctionThatNeverReturnsNull();
if(num > 0L) {
  // ...
}

I can’t fathom why anyone would write garbage like this. It’s pointless. It’s hard to read. It’s ugly. And it’s bloody inefficient.
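
If you want to convince a sceptical colleague that nothing is lost by dropping the Optional, here is a minimal sketch that puts the three forms side by side. The class and method names (OptionalCheck, viaOptional, nullSafe, plain) are made up for illustration; only the expressions inside them come from the review.

```java
import java.util.Optional;

class OptionalCheck {
    // The form from the code review: wrap, unwrap with a default, then compare.
    static boolean viaOptional(Long num) {
        return Optional.ofNullable(num).orElse(0L) > 0L;
    }

    // The plain null-safe form: same result for every input, null included.
    static boolean nullSafe(Long num) {
        return num != null && num > 0L;
    }

    // The simplest form, valid only because the value can never be null.
    static boolean plain(long num) {
        return num > 0L;
    }

    public static void main(String[] args) {
        Long[] samples = {null, -1L, 0L, 1L, Long.MAX_VALUE};
        for (Long sample : samples) {
            System.out.println(sample + ": " + viaOptional(sample) + " / " + nullSafe(sample));
        }
    }
}
```

Both columns agree for every sample, so the Optional buys nothing here.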

On brute force stupidity

Everyone who’s ever managed any internet-facing server is aware of the ridiculous number of brute force SSH login attempts by all kinds of botnets. Some folks decide to move their SSHD to a non-standard port, some rely on complicated shenanigans like port knocking, and some use tools like fail2ban. I’m unfortunate enough to manage a little over a dozen servers, so I decided to have some fun with fail2ban.

[sshd]
enabled = true
banaction = %(banaction_allports)s
 
[recidive]
enabled = true

My configuration is pretty straightforward. You fuck up, you get banned. You fuck up repeatedly, you get banned for a longer time. Nothing special there. Given that I’m running a similar config on many boxes, I decided to compile some data relating to the origins of login attempts. This data was collected over a period of ~2 months on ~12 servers.

Here’s a quick plot of the number of times a certain IP address was banned. Only the top 100 abusers are included, because the chart has a very long tail indeed. I removed the IP addresses from the X-axis because there’s no way to include them without turning the axis into a black blob.

It should be immediately obvious that a relatively small number of IP addresses is responsible for a metric fuckton of unwelcome activity. Remember that this represents the number of times an IP was banned. Left unchecked, the number of attempts increases by orders of magnitude.

The top offender (and the only one whose full IP address I’ll publish) is 116.31.116.38. It’s part of a Chinese subnet. It managed to get banned a staggering 4466 times. That’s more than the next 5 abusers combined.

As the following chart illustrates, a whopping 76% of these IP addresses belong to Chinese subnets.

I daresay the internet would be a slightly better place if those 100 machines were permanently disconnected. It’s likely they’re just unsuspecting folks with compromised machines. But I for one am permanently firewalling all of them on any box I have access to.

Yet Another Battery Widget (Awesome 3.5.1)

Yet another battery widget for Awesome. This one actually works (shock! horror!) on Awesome 3.5.1 on my Thinkpad x230. Your mileage may vary. Colours used are from the excellent Solarized colour scheme. Behold the mighty widget, in all its unobtrusive glory!

[Screenshot: the battery widget]

The implementation is in two parts: a simple shell script to output the battery status, and a bit of rc.lua tweaks to display the widget. This is mostly the result of a bit of copy/pasting from different sources I forgot to bookmark. Oh well.

~/bin/battery.sh:

#!/bin/bash

# Solarized colours
healthy='#859900'
low='#b58900'
discharge='#dc322f'

# Remaining charge as a percentage
capacity=$(< /sys/class/power_supply/BAT0/capacity)
if (( capacity <= 25 ))
then
        capacityColour=$low
else
        capacityColour=$healthy
fi

# Kernel-reported state, e.g. Charging, Discharging or Full
status=$(< /sys/class/power_supply/BAT0/status)

if [[ "$status" = "Discharging" ]]
then
        statusColour=$discharge
        status="▼"
else
        statusColour=$healthy
        status="▲"
fi

# A single line of Pango markup for the textbox widget
echo "<span color=\"$capacityColour\">$capacity%</span> <span color=\"$statusColour\">$status</span>"

Add the following snippets to /path/to/awesome/rc.lua. I’ll attempt to indicate the approximate location at the top of each snippet.

Create the widget..and don’t forget to adjust the path to the battery.sh script.

-- This goes below the line containing mytextclock = awful.widget.textclock()
 
-- Create a battery widget
battery = wibox.widget.textbox()
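function getBatteryStatus()
   -- Run the script and read the single line of markup it prints
   local fd = io.popen("/path/to/battery.sh")
   if not fd then return "" end
   local status = fd:read()
   fd:close()
   -- fd:read() returns nil if the script printed nothing
   return status or ""
end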

Add the widget..

-- This goes above the line containing right_layout:add(mytextclock)
    right_layout:add(battery)

Get the widget to refresh every 30 seconds. Put this somewhere near the end of the config file.

-- Battery status timer
batteryTimer = timer({timeout = 30})
batteryTimer:connect_signal("timeout", function()
  battery:set_markup(getBatteryStatus())
end)
batteryTimer:start()
battery:set_markup(getBatteryStatus())

That’s all! Restart awesome and you’ll see a relatively purdy yet unobtrusive battery status display.

Java Date Performance Subtleties

A recent profiling session pointed out that some of our processing threads were blocking on java.util.Date construction. This is troubling, because it’s something we do many thousands of times per second, and blocked threads are pretty bad!

A bit of digging led me to TimeZone.getDefault(). This, for some insanely fucked up reason, makes a synchronized call to TimeZone.getDefaultInAppContext(). The call is synchronized because it attempts to load the default time zone from the sun.awt.AppContext. What. The. Fuck. I don’t know what people were smoking when they wrote this, but I hope they enjoyed it …

Unfortunately, Date doesn’t have a constructor which takes a TimeZone argument, so it always calls getDefault() instead.

I decided to run some microbenchmarks. I benchmarked four different ways of creating Dates:

// date-short:
    new Date();
// date-long (the deprecated constructor):
    new Date(year, month, date, hrs, min, sec);
// calendar (timeZone is a TimeZone instance):
    Calendar cal = Calendar.getInstance(timeZone);
    cal.set(year, month, date, hourOfDay, minute, second);
    cal.getTime();
// cached-cleared-calendar:
//    Same as calendar, but with Calendar.getInstance() outside of the loop,
//    and a cal.clear() call in the loop.

I tested single-threaded performance, where 1M Dates were created using each method in a single thread, and then multi-threaded performance with 4 threads, each thread creating 250k Dates. In other words: both setups ended up creating the same number of Dates.

[Chart: benchmark results for the four methods, single-threaded and multi-threaded. Lower is better.]

With the exception of date-long, all methods speed up by a factor of 2 when multi-threaded (the machine only has 2 physical cores). The date-long method actually slows down when multi-threaded. This is because of lock contention in the synchronized TimeZone acquisition.

The JavaDoc for Date suggests replacing the date-long call by a calendar call. Performance-wise, this is not a very good suggestion: its single-threaded performance is twice as bad as that of Date unless you reuse the same Calendar instance. Even multi-threaded it’s outperformed by date-long. This is simply not acceptable.

Fortunately, the cached-cleared-calendar option performs very well. You could easily store a ThreadLocal reference to an instance of a Calendar and clear it whenever you need to use it.
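
A minimal sketch of that ThreadLocal idea, written for this post rather than lifted from our code base. The class name CachedCalendar, the toDate helper and the choice of UTC are all assumptions; plug in whatever TimeZone you actually need.

```java
import java.util.Calendar;
import java.util.Date;
import java.util.TimeZone;

class CachedCalendar {
    // One Calendar per thread: Calendar is not thread-safe, so it must not be shared.
    private static final ThreadLocal<Calendar> CALENDAR = new ThreadLocal<Calendar>() {
        @Override
        protected Calendar initialValue() {
            // Assumption: UTC. Passing the zone explicitly means the default
            // time zone is never looked up here.
            return Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        }
    };

    // month is zero-based, exactly as in Calendar.set().
    static Date toDate(int year, int month, int date, int hourOfDay, int minute, int second) {
        Calendar cal = CALENDAR.get();
        cal.clear(); // wipe leftovers from the previous use, milliseconds included
        cal.set(year, month, date, hourOfDay, minute, second);
        return cal.getTime();
    }

    public static void main(String[] args) {
        // One day after the epoch, in UTC.
        System.out.println(toDate(1970, Calendar.JANUARY, 2, 0, 0, 0).getTime());
    }
}
```

The cal.clear() call matters: set() leaves the millisecond field alone, so without it you would inherit whatever the previous caller left behind.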

More important than the raw duration of the Date creation, is the synchronization overhead. Every time a thread has to wait to enter a synchronized block, it could end up being rescheduled or swapped out. This reduces the predictability of performance. Keeping synchronization down to a minimum (or zero, in this case) increases predictability and liveness of the application in general.

Before anyone mentions it: yes, I’m aware that the long Date constructors are deprecated. Unfortunately, they are what Joda uses when converting to Java Dates. I’ve proposed a patch, but while doing a bit more research for this blog post, I’ve come to the conclusion that my patch needs a bit of refining as it is still too slow (though it no longer blocks). In the meantime, I hope that the -kind?- folks at Oracle will reconsider their shoddy implementation.

I’ve also heard rumours that Joda will somehow, magically, replace java.util.Date in JDK 8. Not sure how that’s going to work with backwards compatibility. I’d be much happier if java.util.Date would stop sucking quite as much. And if SimpleDateFormat were made thread-safe. And … the list goes on.

Character sets, time zones and hashes

Character sets, time zones and password hashes are pretty much the bane of my life. Whenever something breaks in a particularly spectacular fashion, you can be sure that one of those three is, in some way, responsible. Apparently the average software developer Just Doesn’t Get It™. Granted, they are pretty complex topics. I’m not expecting anyone to care about the difference between ISO-8859-15 and ISO-8859-1, know about UTC’s subtleties or be able to implement SHA-1 using a ball of twine.

What I do expect, is for sensible folk to follow these very simple guidelines. They will make your (and everyone else’s) life substantially easier.

Use UTF-8..

Always. No exceptions. Configure your text editors to default to UTF-8. Make sure everyone on your team does the same. And while you’re at it, configure the editor to use UNIX-style line-endings (newline, without useless carriage returns).

..or not

Make sure you document the cases where you can’t use UTF-8. Write down and remember which encoding you are using, and why. Remember that iconv is your friend.

Store dates with time zone information

Always. No exceptions. A date/time is entirely meaningless unless you know which time zone it’s in. Store the time zone. If you’re using some kind of age-old RDBMS which doesn’t support date/time fields with TZ data, then you can either store your dates as a string, or store the TZ in an extra column. I repeat: a date is meaningless without a time zone.

While I’m on the subject: store dates in a format described by ISO 8601, ending with a Z to designate UTC (Zulu). No fancy pansy nonsense with the first 3 letters of the English name of the month. All you need is ISO 8601.

Bonus tip: always store dates in UTC. Make the conversion to the user time zone only when presenting times to a user.
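
For illustration, a minimal sketch of the store-in-UTC, convert-on-display idea. It assumes the java.time API (Java 8 or later); the class and method names (UtcStorage, toStored, forUser) are invented for this example.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class UtcStorage {
    // What goes into storage: an ISO 8601 string in UTC, ending in Z.
    static String toStored(Instant instant) {
        return instant.toString();
    }

    // Convert to the user's time zone only when presenting the value.
    static ZonedDateTime forUser(String stored, String userZone) {
        return Instant.parse(stored).atZone(ZoneId.of(userZone));
    }

    public static void main(String[] args) {
        String stored = toStored(Instant.ofEpochSecond(0));
        System.out.println(stored); // 1970-01-01T00:00:00Z
        System.out.println(forUser(stored, "Europe/Brussels"));
    }
}
```

The stored string never changes, no matter where the user happens to be. Only the presentation does.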

Don’t rely on platform defaults

You want your code to be cross-platform, right? So don’t rely on platform defaults. Be explicit about which time zone/encoding/language/.. you’re using or expecting.
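
A small sketch of what being explicit looks like for encodings, assuming Java 7 or later for StandardCharsets. The class ExplicitCharset and its two helpers are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Always name the charset; String.getBytes() without one uses the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // e with acute accent, escaped so the source encoding doesn't matter
        System.out.println(encode(text).length);                              // 2 bytes in UTF-8
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 1 byte in ISO-8859-1
        System.out.println(decode(encode(text)).equals(text));                // true
    }
}
```

The same text is two bytes in one encoding and one byte in the other, which is exactly the kind of difference a platform default will silently pick for you.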

Use bcrypt

Don’t try to roll your own password hashing mechanism. It’ll suck and it’ll be broken beyond repair. Instead, use bcrypt or PBKDF2. They’re designed to be slow, which will make brute-force attacks less likely to be successful. Implementations are available for most sensible programming environments.
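
As a rough sketch, here is PBKDF2 using nothing but the JDK. It assumes Java 8 or later (for the PBKDF2WithHmacSHA256 algorithm name). The class name PasswordHash, the iteration count and the salt size are illustrative assumptions, not recommendations; tune them for your own hardware.

```java
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.security.SecureRandom;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

class PasswordHash {
    // Assumed parameters, for illustration only.
    static final int ITERATIONS = 100_000;
    static final int KEY_BITS = 256;

    // A fresh random salt per password; store it next to the hash.
    static byte[] newSalt() {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);
        return salt;
    }

    static byte[] hash(char[] password, byte[] salt) {
        PBEKeySpec spec = new PBEKeySpec(password, salt, ITERATIONS, KEY_BITS);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec).getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("PBKDF2 is not available on this JVM", e);
        } finally {
            spec.clearPassword();
        }
    }

    // MessageDigest.isEqual is used for the comparison instead of Arrays.equals.
    static boolean verify(char[] password, byte[] salt, byte[] expected) {
        return MessageDigest.isEqual(hash(password, salt), expected);
    }

    public static void main(String[] args) {
        byte[] salt = newSalt();
        byte[] stored = hash("hunter2".toCharArray(), salt);
        System.out.println(verify("hunter2".toCharArray(), salt, stored));
        System.out.println(verify("wrong".toCharArray(), salt, stored));
    }
}
```

Note that nothing here is hand-rolled: the slow part is entirely the library’s job.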

If you have some kind of roll-your-own fetish, then at least use an HMAC.

Problem be gone

Keeping these simple guidelines in mind will prevent entire ranges of bugs from being introduced into your code base. Total cost of implementation: zilch. Benefit: fewer headdesk incidents.

Repeat after me: MySQL is not a filesystem

I came across this gem on DZone this morning. It’s a tutorial on storing images in a MySQL database (using PHP). There are several things in the tutorial that I don’t agree with, but I’ll let those slide. What really bugs me, is how it fails to mention that this is a very bad idea.

A relational database is not a filesystem. Files go on a filesystem. Relational data goes in an RDBMS. Repeat that a couple of times.

The most compelling argument for this is performance. I did a quick test. I did a Google image search on stupidity and downloaded the first 10 images. I then wrote PHP scripts to serve them up in two ways:

1. From a MySQL (MyISAM) table with 2 columns: ID (int, auto_increment) and DATA (mediumblob)
2. From the filesystem, using PHP’s readfile().

The third test method, “FS”, simply loads the image over HTTP directly, without any intermediary scripts.

The results are the average of running Apache Benchmark 10 times: 10 concurrent requests, 1000 requests per run.

[Chart: benchmark results for the MySQL, readfile and FS methods]

As you can see, the MySQL approach is a hell of a lot slower than the more sensible FS approach.

The best way to store your images (or other binary files) is on the filesystem. Every modern web server does a good (or excellent) job of serving up static content. Storing them in a database is by far the worst possible solution. Not only because it’s slow, but also because it complicates database backups: MySQL dumps with binary data don’t compress very well, causing the whole database backup to be slower and larger than needs be.

So please, be sensible. Store your files on a filesystem.

Java 7 Performance

I decided to compare Java 6 & 7 performance for $employer’s $application. Java 7 performs better — as expected. What I did not expect, was that the difference would be so big. Around 10% on average. That’s not bad for something as simple as a version bump.

[Chart: Java 6 vs Java 7]

Ideally I’d like to investigate where this difference comes from. I suspect improved ergonomics have a lot to do with it.

$application uses Apache Solr rather extensively. In fact, a large share of the time is spent in Solr: with indexing it’s probably about 50% of the time, with querying it’s probably closer to 90%. All tests are run in a controlled environment, so I have a fair amount of confidence in these results.

The indexing test inserts 3 million documents in Solr. Creating these documents takes up the bulk of the time. It involves a lot of filesystem access (something which Java versions have very little influence over) and heavily multi-threaded, CPU-intensive processing.

If you’re not using Java 7, you really should consider upgrading. If you’re stuck with people who live in the past, maybe you can convince them with a bunch of pretty performance graphs of your own.