Endless Locale Bugs

This isn’t the first time I’ve ranted about Locale related issues. It probaly won’t be the last, either.

Currently my biggest gripe is with convenience methods in Java which rely on platform defaults, and are thus platform-specific. For instance, the venerable System.out.println() will terminate your line with whatever your platform thinks a line ending should be. Printing dates or numbers will try to use your platform Locale. Writing a string to a file will default to platform encoding.

Some of these defaults can be controlled at run time, others require JVM startup options. This is all horribly wrong. This results in all kinds of unexpected behaviour. It’s error-prone. None of this should be allowed to happen. Any method that assumes platform-specific magic should be deprecated. Which is exactly what I’ll do, as soon as I can figure out how to write a Sonar plugin to detect all this nonsense.

Java Date Performance Subtleties

A recent profling session pointed out that some of our processing threads were blocking on java.util.Date construction. This is troubling, because it’s something we do many thousands of times per second, and blocked threads are pretty bad!

A bit of digging led me to TimeZone.getDefault(). This, for some insanely fucked up reason, makes a synchronized call to TimeZone.getDefaultInAppContext(). The call is synchronized because it attempts to load the default time zone from the sun.awt.AppContext. What. The. Fuck. I don’t know what people were smoking when they wrote this, but I hope they enjoyed it …

Unfortunately, Date doesn’t have a constructor which takes a TimeZone argument, so it always calls getDefault() instead.

I decided to run some microbenchmarks. I benchmarked four different ways of creating Dates:

// date-short:
    new Date();
    new Date(year, month, date, hrs, min, sec);
// calendar:
    Calendar cal = Calendar.getInstance(TimeZone);
    cal.set(year, month, date, hourOfDay, minute, second)
// cached-cleared-calendar:
//    Same as calendar, but with Calendar.getInstance() outside of the loop, 
//    and a cal.clear() call in the loop.

I tested single threaded performance, where 1M Dates were created using each method in a single thread. Then multi-threaded with 4 threads, each thread creating 250k Dates. In other words: both methods ended up creating the same number of Dates.

Lower is beter.
Click to enlarge. Lower is beter.

With exception of date-long, all methods speed up by a factor of 2 when multi-threaded. (The machine only has 2 physical cores). The date-long method actually slows down when multi-threaded. This is because of lock contention in the synchronized TimeZone acquisition.

The JavaDoc for Date suggests replacing the date-long call by a calendar call. Performance-wise, this is not a very good suggestion: its single-threaded performance is twice as bad as that of Date unless you reuse the same Calendar instance. Even multi-threaded it’s outperformed by date-long. This is simply not acceptable.

Fortunately, the cached-cleared-calendar option performs very well. You could easily store a ThreadLocal reference to an instance of a Calendar and clear it whenever you need to use it.

More important than the raw duration of the Date creation, is the synchronization overhead. Every time a thread has to wait to enter a synchronized block, it could end up being rescheduled or swapped out. This reduces the predictability of performance. Keeping synchronization down to a minimum (or zero, in this case) increases predictability and liveness of the application in general.

Before anyone mentions it: yes, I’m aware that the long Date constructors are deprecated. Unfortunately, they are what Joda uses when converting to Java Dates. I’ve proposed a patch, but while doing a bit more research for this blog post, I’ve come to the conclusion that my patch needs a bit of refining as it is still too slow (though it no longer blocks). In the mean while, I hope that the -kind?- folks at Oracle will reconsider their shoddy implementation.

I’ve also heard rumours that Joda will somehow, magically, replace java.util.Date in JDK 8. Not sure how that’s going to work with backwards compatibility. I’d be much happier if java.util.Date would stop sucking quite as much. And if SimpleDateFormat were made thread-safe. And … the list goes on.

Character sets, time zones and hashes

Character sets, time zones and password hashes are pretty much the bane of my life. Whenever something breaks in a particularly spectacular fashion, you can be sure that one of those three is, in some way, responsible. Apparently the average software developer Just Doesn’t Get It™. Granted, they are pretty complex topics. I’m not expecting anyone to care about the difference between ISO-8859-15 and ISO-8859-1, know about UTC‘s subtleties or be able to implement SHA-1 using a ball of twine.

What I do expect, is for sensible folk to follow these very simple guidelines. They will make your (and everyone else’s) life substantially easier.

Use UTF-8..

Always. No exceptions. Configure your text editors to default to UTF-8. Make sure everyone on your team does the same. And while you’re at it, configure the editor to use UNIX-style line-endings (newline, without useless carriage returns).

..or not

Make sure you document the cases where you can’t use UTF-8. Write down and remember which encoding you are using, and why. Remember that iconv is your friend.

Store dates with time zone information

Always. No exceptions. A date/time is entirely meaningless unless you know which time zone it’s in. Store the time zone. If you’re using some kind of retarded age-old RDBMS which doesn’t support date/time fields with TZ data, then you can either store your dates as a string, or store the TZ in an extra column. I repeat: a date is meaningless without a time zone.

While I’m on the subject: store dates in a format described by ISO 8601, ending with a Z to designate UTC (Zulu). No fancy pansy nonsense with the first 3 letters of the English name of the month. All you need is ISO 8601.

Bonus tip: always store dates in UTC. Make the conversion to the user time zone only when presenting times to a user.

Don’t rely on platform defaults

You want your code to be cross-platform, right? So don’t rely on platform defaults. Be explicit about which time zone/encoding/language/.. you’re using or expecting.

Use bcrypt

Don’t try to roll your own password hashing mechanism. It’ll suck and it’ll be broken beyond repair. Instead, use bcrypt or PBKDF2. They’re designed to be slow, which will make brute-force attacks less likely to be successful. Implementations are available for most sensible programming environments.

If you have some kind of roll-your-own fetish, then at least use an HMAC.

Problem be gone

Keeping these simple guidelines in mind will prevent entire ranges of bugs from being introduced into your code base. Total cost of implementation: zilch. Benefit: fewer headdesk incidents.