Character sets, time zones and hashes
Character sets, time zones and password hashes are pretty much the bane of my life. Whenever something breaks in a particularly spectacular fashion, you can be sure that one of those three is, in some way, responsible. Apparently the average software developer Just Doesn't Get It™. Granted, they are pretty complex topics. I'm not expecting anyone to care about the difference between ISO-8859-15 and ISO-8859-1, know about UTC's subtleties or be able to implement SHA-1 using a ball of twine.
What I do expect, is for sensible folk to follow these very simple guidelines. They will make your (and everyone else's) life substantially easier.
Use UTF-8..
Always. No exceptions. Configure your text editors to default to UTF-8. Make sure everyone on your team does the same. And while you're at it, configure the editor to use UNIX-style line-endings (newline, without useless carriage returns).
..or not
Make sure you document the cases where you can't use UTF-8. Write down and remember which encoding you are using, and why. Remember that iconv is your friend.
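To show why the encoding has to be written down: the same byte means different things in ISO-8859-1 and ISO-8859-15. Here's a minimal Python sketch of what iconv does for you, with the source encoding stated explicitly. The function name to_utf8 is made up for this example.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known, documented source encoding to UTF-8."""
    # Decode with the documented encoding; never guess or fall back to a platform default.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    # 0xA4 is the euro sign in ISO-8859-15 and the generic currency sign in ISO-8859-1.
    legacy = b"\xa4 100"
    print(to_utf8(legacy, "iso-8859-15"))  # b'\xe2\x82\xac 100'
    print(to_utf8(legacy, "iso-8859-1"))   # b'\xc2\xa4 100'
```

Same input, two different results. Which one is right depends entirely on the encoding you wrote down.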
Store dates with time zone information
Always. No exceptions. A date/time is entirely meaningless unless you know which time zone it's in. Store the time zone. If you're using some kind of age-old RDBMS which doesn't support date/time fields with TZ data, then you can either store your dates as a string, or store the TZ in an extra column. I repeat: a date is meaningless without a time zone.
While I'm on the subject: store dates in a format described by ISO 8601, ending with a Z to designate UTC (Zulu). No fancy pansy nonsense with the first 3 letters of the English name of the month. All you need is ISO 8601.
Bonus tip: always store dates in UTC. Make the conversion to the user time zone only when presenting times to a user.
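A rough Python sketch of those three rules together: refuse dates without a zone, store ISO 8601 in UTC with a trailing Z, and convert only for display. The function names and the fixed +02:00 offset are assumptions for the example; a real application would look up the user's named time zone.

```python
from datetime import datetime, timedelta, timezone, tzinfo

STORAGE_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601, UTC, second precision


def to_storage(moment: datetime) -> str:
    """Serialise an aware datetime as an ISO 8601 UTC string ending in Z."""
    if moment.tzinfo is None or moment.utcoffset() is None:
        raise ValueError("refusing to store a date without a time zone")
    return moment.astimezone(timezone.utc).strftime(STORAGE_FORMAT)


def from_storage(stored: str) -> datetime:
    """Parse a stored string back into an aware UTC datetime."""
    return datetime.strptime(stored, STORAGE_FORMAT).replace(tzinfo=timezone.utc)


def for_display(stored: str, user_zone: tzinfo) -> datetime:
    """Convert to the user's zone only at presentation time."""
    return from_storage(stored).astimezone(user_zone)


if __name__ == "__main__":
    # Hypothetical user two hours ahead of UTC.
    user_zone = timezone(timedelta(hours=2))
    local = datetime(2012, 7, 1, 14, 30, tzinfo=user_zone)
    stored = to_storage(local)
    print(stored)                                    # 2012-07-01T12:30:00Z
    print(for_display(stored, user_zone).isoformat())  # 2012-07-01T14:30:00+02:00
```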
Don't rely on platform defaults
You want your code to be cross-platform, right? So don't rely on platform defaults. Be explicit about which time zone/encoding/language/.. you're using or expecting.
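In Python, for instance, being explicit looks something like this. It's a small sketch with made-up helper names, but the point is that the encoding and the line endings are spelled out rather than inherited from whatever the OS or locale happens to be.

```python
import os
import tempfile


def write_text(path: str, text: str) -> None:
    """Write text as UTF-8 with UNIX line endings, whatever the platform."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def read_text(path: str) -> str:
    """Read UTF-8 text without any platform-specific newline translation."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return handle.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "note.txt")
        write_text(path, "h\u00e9llo\n")
        with open(path, "rb") as raw:
            # Show the bytes on disk: UTF-8, no carriage return.
            print(raw.read())
```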
Use bcrypt
Don't try to roll your own password hashing mechanism. It'll suck and it'll be broken beyond repair. Instead, use bcrypt or PBKDF2. They're designed to be slow, which will make brute-force attacks less likely to be successful. Implementations are available for most sensible programming environments.
If you have some kind of roll-your-own fetish, then at least use an HMAC.
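For the PBKDF2 route, Python's standard library already ships the primitive, so there is nothing to roll yourself. A minimal sketch follows; the iteration count and the "scheme$iterations$salt$digest" storage format are assumptions made for this example, not a standard.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumption for this sketch; tune it to your hardware


def hash_password(password: str, iterations: int = ITERATIONS) -> str:
    """Derive a salted PBKDF2-HMAC-SHA256 hash and pack it into one string."""
    salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return f"pbkdf2_sha256${iterations}${salt.hex()}${digest.hex()}"


def verify_password(password: str, stored: str) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    scheme, iterations, salt_hex, digest_hex = stored.split("$")
    if scheme != "pbkdf2_sha256":
        return False
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), bytes.fromhex(salt_hex), int(iterations)
    )
    return hmac.compare_digest(candidate, bytes.fromhex(digest_hex))


if __name__ == "__main__":
    stored = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", stored))
    print(verify_password("wrong guess", stored))
```

Storing the iteration count next to the hash means you can raise it later without invalidating existing passwords.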
Problem be gone
Keeping these simple guidelines in mind will prevent entire ranges of bugs from being introduced into your code base. Total cost of implementation: zilch. Benefit: fewer headdesk incidents.
Repeat after me: MySQL is not a filesystem
I came across this gem on DZone this morning. It's a tutorial on storing images in a MySQL database (using PHP). There are several things in the tutorial that I don't agree with, but I'll let those slide. What really bugs me, is how it fails to mention that this is a very bad idea.
A relational database is not a filesystem. Files go on a filesystem. Relational data goes in an RDBMS. Repeat that a couple of times.
The most compelling argument for this, is performance. I did a quick test. I did a Google image search on stupidity and downloaded the first 10 images. I then wrote PHP scripts to serve them up in two ways:
1. From a MySQL (MyISAM) table with 2 columns: ID (int, auto_increment) and DATA (mediumblob)
2. Using readfile.
The third test method, "FS", simply loads the image over HTTP directly, without any intermediary scripts.
The results are the average of running Apache Benchmark 10 times: 10 concurrent requests, 1000 requests per run.
As you can see, the MySQL approach is a hell of a lot slower than the more sensible FS approach.
The best way to store your images (or other binary files) is on the filesystem. Every modern web server does a good (or excellent) job of serving up static content. Storing them in a database is by far the worst possible solution. Not only because it's slow, but also because it complicates database backups: MySQL dumps with binary data don't compress very well, causing the whole database backup to be slower and larger than needs be.
So please, be sensible. Store your files on a filesystem.
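If you still want the database to know about your images, keep the relational bits (name, size, location) in a table and the bytes on disk. Here's a rough Python sketch of that split. It uses sqlite3 as a stand-in for MySQL, and the table layout, function names and hash-based directory scheme are all assumptions for the example.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def create_schema(db: sqlite3.Connection) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY, filename TEXT NOT NULL, "
        "path TEXT NOT NULL, size INTEGER NOT NULL)"
    )


def store_image(db: sqlite3.Connection, root: Path, filename: str, data: bytes) -> int:
    """Write the bytes to the filesystem; record only metadata in the database."""
    digest = hashlib.sha256(data).hexdigest()
    target = root / digest[:2] / digest  # spread files over subdirectories
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    cursor = db.execute(
        "INSERT INTO images (filename, path, size) VALUES (?, ?, ?)",
        (filename, str(target.relative_to(root)), len(data)),
    )
    db.commit()
    return cursor.lastrowid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        connection = sqlite3.connect(":memory:")
        create_schema(connection)
        image_id = store_image(connection, Path(directory), "cat.png", b"not really a png")
        print(connection.execute("SELECT * FROM images WHERE id = ?", (image_id,)).fetchone())
```

The web server can then serve the files straight from the root directory, and the database dump stays small.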
Java 7 Performance
I decided to compare Java 6 & 7 performance for $employer's $application. Java 7 performs better — as expected. What I did not expect, was that the difference would be so big. Around 10% on average. That's not bad for something as simple as a version bump.
Ideally I'd like to investigate where this difference comes from. I suspect improved ergonomics have a lot to do with it.
$application uses Apache Solr rather extensively. In fact, most of the time querying is spent in Solr. With indexing it's probably about 50% of the time. With querying it's probably closer to 90%. All tests are run in a controlled environment, so I have a fair amount of confidence in these results.
The indexing test inserts 3 million documents in Solr. Creating these documents takes up the bulk of the time. It involves a lot of filesystem access -- something which Java versions have very little influence over -- and heavily multi-threaded CPU-intensive processing.
If you're not using Java 7, you really should consider upgrading. If you're stuck with people who live in the past, maybe you can convince them with a bunch of pretty performance graphs of your own.
On Bug Reports & the Urge To Decapitate
There's an old joke about a Manager, an Engineer and a Software Developer: they're in a car and the brakes fail as they go down a mountain road. They miraculously come to a standstill, but now they're stuck, and Somebody Should Do Something™. The Manager suggests they make a Plan, define Goals & Measurable Objectives in order to solve the Critical Problem. The Engineer has his tools with him, so he suggests he take a look at the problem and fix it. The Software Developer, of course, is not convinced and suggests they push the car up hill to see if the problem will manifest itself again ...
You can probably find a bunch of different versions of this on the web, but the punch line is always the same: the developer wants to be able to reproduce the bug. Not just to verify its existence (although it wouldn't be the first time a user reported a critical data loss bug after having pressed the delete button..), but also to have a place to start the bug hunt. Software systems are complex. We have enough layers of abstraction built on top of a bunch of transistors to make Shrek jealous. Something as seemingly simple as displaying a bit of text on a screen is so complex that it can no longer be fully understood by a single person.
Being able to reproduce a bug is the only way to resolve ambiguity. When it isn't possible to describe all the steps that led to the bug, then a thorough description of the problem is the next best thing. Think screenshots, explanations of the expected result, and answers to all of the usual "Which"-questions (Version, OS, Environment, Lunar Phase, ..).
Debugging is hard (and fun). Vague bug reports like "the application is broken" rather make me want to tie a team of horses to your limbs and decapitate your bloody remains ... er, I mean, educate you on the virtues of well-written bug reports.
Now .. let me see if I can fix this "broken" application .. sigh.
Gnuplot data analysis, real world example
Creating graphs in LibreOffice is a nightmare. They're ugly, nearly impossible to customize and creating pivot tables with data is bloody tedious work. In this post, I'll show you how I took the output of a couple of performance test scripts and turned it into reasonably pretty graphs with a few standard command line tools (gnuplot, awk, a bit of (ba)sh and a Makefile).
The Data
I ran a series of query performance tests against data sets of different sizes. The sets contain 10k, 100k, 1M, 10M, 100M and 500M documents. One of the basic constraints is that it has to be easy to add/remove sets. I don't want to faff about with deleting columns or updating pivot tables. If I add a set to my test data, I want it to automagically show up in my graphs.
The output of the test script is a simple tab separated file, and looks like this:
#Set	Iteration	QueryID	Duration
500M	1	101	10.497499465942383
500M	1	102	3.9973576068878174
500M	1	103	9.4201889038085938
500M	1	104	2.8091645240783691
500M	1	105	2.944718599319458
500M	1	106	5.1576917171478271
500M	1	107	5.7224125862121582
500M	1	108	5.7259769439697266
500M	1	109	4.7974696159362793
Each row contains the query duration (in seconds) for a single execution of a single query.
Processing the data
I don't just want to graph random numbers. Instead, for each query in each set, I want the shortest execution time (MIN), the longest (MAX) and the average across iterations (AVG). So we'll create a little awk script to output data in this format. In order to make life easier for gnuplot later on, we'll create a file per dataset.
% head -n 3 output/500M.dat
#SET	QUERY	MIN	MAX	AVG	ITERATIONS
500M	200	0.071	2.699	0.952	3
500M	110	0.082	5.279	1.819	3
Here's the source of the awk script, transform.awk. The code is quite verbose, to make it a bit easier to understand.
# transform.awk: per-set, per-query MIN, MAX and AVG of the Duration column
{
    # Skip empty lines and comment lines (the header starts with '#')
    if ($0 ~ /^[^#]/) {
        key = $1 "_" $3
        sets[$1] = 1
        queries[$3] = 1
        totals[key] += $4
        iterations[key] += 1

        if (1 == iterations[key]) {
            # First measurement for this set/query pair
            minima[key] = $4
            maxima[key] = $4
        } else {
            minima[key] = $4 < minima[key] ? $4 : minima[key]
            maxima[key] = $4 > maxima[key] ? $4 : maxima[key]
        }
    }
}

END {
    for (set in sets) {
        outfile = "output/" set ".dat"
        # awk keeps the file open, so later writes with ">" append to it
        print "#SET\tQUERY\tMIN\tMAX\tAVG\tITERATIONS" > outfile

        for (query in queries) {
            key = set "_" query
            # A query that never ran against this set would divide by zero
            if (!(key in iterations)) continue
            iterationCount = iterations[key]
            average = totals[key] / iterationCount
            printf("%s\t%d\t%.3f\t%.3f\t%.3f\t%d\n", set, query, minima[key], maxima[key], average, iterationCount) > outfile
        }
    }
}
This code will read our input data, calculate MIN, MAX, AVG, number of iterations for each query and dump the contents in a tab-separated dat file with the same name as the set. Again, this is done to make life easier for gnuplot later on.
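If awk isn't your cup of tea, the same aggregation can be sketched in a few lines of Python. This isn't what I actually ran; it just shows the logic. The function name summarise and the sample rows in the demo are made up for illustration.

```python
from collections import defaultdict


def summarise(lines):
    """Return {(set, query): (min, max, avg, iterations)} from tab-separated rows."""
    durations = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        set_name, _iteration, query_id, duration = line.split("\t")
        durations[(set_name, int(query_id))].append(float(duration))
    return {
        key: (min(values), max(values), sum(values) / len(values), len(values))
        for key, values in durations.items()
    }


if __name__ == "__main__":
    # Hypothetical measurements: three iterations of query 101 on the 500M set.
    sample = [
        "#Set\tIteration\tQueryID\tDuration",
        "500M\t1\t101\t1.0",
        "500M\t2\t101\t3.0",
        "500M\t3\t101\t2.0",
    ]
    for (set_name, query), stats in summarise(sample).items():
        print(set_name, query, *stats)
```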
I want to see the effect of dataset size on query performance, so I want to plot averages for each set. Gnuplot makes this nice and easy: all I have to do is name my sets and tell it where to find the data. But ah ... I don't want to tell gnuplot what my sets are, because they should be determined dynamically from the available data. Enter a wee shell script that outputs gnuplot commands.
#!/bin/sh
# Output plot commands for all data sets in the output dir
# Usage: ./plotgenerator.sh column-number
# Example for the AVG column: ./plotgenerator.sh 5

prefix=""
# printf and [ ] are used instead of echo -n and [[ ]], which plain /bin/sh may not support
printf 'plot '
for s in `ls output | sed 's/\.dat$//'` ; do
    printf '%s "output/%s.dat" using 2:%s title "%s"' "$prefix" "$s" "$1" "$s"
    if [ -z "$prefix" ] ; then
        prefix=", "
    fi
done
This script will generate a gnuplot "plot" command. Each datafile gets its own title (this is why we named our data files after their dataset name) and its own colour in the graph. We want to plot two columns: the QueryID, and the AVG duration. In order to make it easier to plot the MIN or MAX columns, I'm parameterizing the second column: the $1 value is the number of the AVG, MIN or MAX column.
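To make it concrete what the generated command looks like, here's the same idea sketched in Python. It's purely illustrative (the shell script above is what gnuplot actually calls), and the function names are made up.

```python
from pathlib import Path


def discover_sets(directory):
    """Return data set names, one per .dat file found in the directory."""
    return sorted(path.stem for path in Path(directory).glob("*.dat"))


def plot_command(set_names, column):
    """Build one gnuplot plot command: QueryID (column 2) against the given column."""
    parts = [
        f'"output/{name}.dat" using 2:{column} title "{name}"'
        for name in set_names
    ]
    return "plot " + ", ".join(parts)


if __name__ == "__main__":
    # Column 5 is AVG in the .dat files described above.
    print(plot_command(["10k", "500M"], 5))
```

Adding a data set is then just a matter of dropping another .dat file in the output directory.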
Plotting
Gnuplot will call the plotgenerator.sh script at runtime. All that's left to do is write a few lines of gnuplot!
Here's the source of average.gnp
#!/usr/bin/gnuplot
reset
set terminal png enhanced size 1280,768
set xlabel "Query"
set ylabel "Duration (seconds)"
set xrange [100:]
set title "Average query duration"
set key outside
set grid
set style data points
eval(system("./plotgenerator.sh 5"))
The result
% ./average.gnp > average.png
Wrapping it up with a Makefile
I don't like having to remember which steps to execute in which order, and instead of faffing about with yet another shell script, I'll throw in another *nix favourite: a Makefile.
It looks like this:
average:
	rm -rf output
	mkdir output
	awk -f transform.awk queries.dat
	./average.gnp > average.png
Now all you have to do, is run
make
whenever you've updated your data file, and you'll end up with a nice'n purdy new graph. Yay!
Having a bit of command line proficiency goes a long way. It's so much easier and faster to analyse, transform and plot data this way than it is using graphical "tools". Not to mention that you can easily integrate this with your build system...that way, each new build can ship with up-to-date performance graphs. Just sayin'!
Note: I'm aware that a lot of this scripting could be eliminated in gnuplot 4.6, but it doesn't ship with Fedora yet, and I couldn't be arsed building it.
What bugs me on the web
2013 is nearly upon us, and the web has come a very long way in the ~15 years I've been a netizen. And yet, even though we've made so many advances, it sometimes feels like we've been stagnant, or worse, regressed in some cases.
Each and every web developer out there should have a long, hard think about how the web has (d)evolved in their lifetime and which way we want to head next. There's an awful lot happening at the moment: web 2.0, HTML 5, Flash's death-throes, super-mega-ultra tracking cookies, EU cookie regulation nonsense, microdata, cloud fun, ... I could go on all day. Needless to say: it's a mixed bunch.
In any event, here's a brief list of 3 things that bug me on the web.
Links are broken
Usability has long been the web's sore thumb, and in spite of any number of government-sponsored usability certification programmes over the years, people still don't seem to give a rat's arse. Websites are still riddled with nasty drop down menus that only work with a mouse. Sometimes they're extra nasty by virtue of being ajaxified. At least Flash menus are finally going the way of the dinosaur.
Pro tip: every single bloody link on your web site should have a working HREF, so people can use it without relying on click handlers, mice or JavaScript, and so people can open the bloody thing in a new tab without going through hell and back.
Bonus points: make your links point to human-readable URLs.
Languages, you're doing it wrong
The web is no longer an English-only or US-only playing field, and companies all over are starting to cotton on to this fact. What they have yet to realise, however, is that people don't necessarily speak the language you think they do. If you rely on geolocation data to serve up translated content: stop. You're doing it wrong. The user determines the language. Believe it or not, people do know which language(s) they speak.
Geolocation, for starters, isn't an exact science. Depending on the kind of device this can indeed be very accurate. Or very much not. Proxies, VPNs, Onion Routers etc can obviously mislead your tracking. Geolocation tells you nothing. It doesn't tell you why that person is there (maybe they're on holiday?). It also doesn't tell you what language is spoken there. This might be a shock to some people, but some countries have more than one official language. Hell, some villages do. Maybe you can find this data somewhere, and correlate it with the location, but you'd be wrong to. Language is a very sensitive issue in some places. Get it right, or pick a sensible default and make clear that it was a guess. Don't be afraid to ask for user input.
Pro tip: My favourite HTTP header: Accept-Language. Every sensible browser sends this header with every request. In most cases, the default is the browser's or OS's language. Which is nearly always the user's first language, and when it's not, at least you know the user understands it well enough to be able to use a browser..
Bonus points: Seriously, use Accept-Language. If you don't, you're a dick.
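Honouring the header isn't hard either. Here's a minimal Python sketch that sorts the languages in an Accept-Language value by their q-weights and picks the first one the site supports. The function names are made up, and it deliberately ignores the finer points of the spec (wildcards, extended language ranges), so treat it as a starting point rather than a complete implementation.

```python
def parse_accept_language(header):
    """Return language tags from an Accept-Language header, most preferred first."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # no q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # unparsable weight: treat as not acceptable
        if quality > 0:
            # Sort by descending quality, then by original position.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def pick_language(header, supported, default):
    """Pick the first supported language, falling back to an explicit default."""
    by_lowercase = {language.lower(): language for language in supported}
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]  # "nl-be" also matches a plain "nl" translation
        for candidate in (tag, primary):
            if candidate in by_lowercase:
                return by_lowercase[candidate]
    return default


if __name__ == "__main__":
    print(pick_language("nl-BE,nl;q=0.8,en;q=0.5", ["en", "fr", "nl"], "en"))
```

When nothing matches you get the default back, which is the "sensible default" case: show it, and let the user change it.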
Clutter is back
Remember how, back in 1999, we all thought Google looked awesome because it was so clean & crisp and didn't get in your face and everyone copied the trend? Well, that seems to have come to an end.
Here's Yahoo in 1997. (I love how it has an ad for 256MB of memory.)
Here's Yahoo now.
The 1997 version was annoying to use (remember screen resolutions in the 90s? No? You're too young to read this, go away) because it was so cluttered.
The 2012 version is worse and makes me want to gouge my eyes out.
Even Google is getting all in your face these days, with search-as-you-type and whatnot. Bah. DuckDuckGo seems to be the exception (at least as far as search engines go). It offers power without wagging it in your face.
Pro tip: don't put a bazillion things on your pages. Duh.
2013 Wishlist
My web-wishlist for 2013 is really quite simple: I want a usable web. Not just for people with the latest and greatest javascript-enabled feast-your-eyes-on-this devices. For everyone. Including those who use text-to-speech, or the blind, or people on older devices. Graceful degradation is key to this. So please, when you come up with a grand feature, think about what we might be giving up on as well. Don't break links. Don't break the back button. Don't break the web.
The 12-Factor App
If you haven't read it yet, I strongly suggest you have a look at this here manifesto/methodology for building better web apps.
Most of the twelve points strike me as Common Sense™. However, the points on Processes and Scaling via the Process Model really struck a chord with me. A very nice sounding chord, too. Maybe an E-minor. Allow me to restate the points briefly and comment on them.
6. Processes — Execute the app as one or more stateless processes
8. Concurrency — Scale out via the process model
This is pretty much the same as the good old "be modular"-advice. Splitting your code into sensible logical entities makes it easier to reason about your application, fix bugs and add features. As a wonderful bonus, it allows you to scale much more easily by creating extra instances of said processes -- who knows, maybe even on some fancy buzzword-riddled cloud service! Like the author mentions, taking a look at the UNIX process model for inspiration - and by extension UNIX tools - is a pretty good idea.
Over the last year or so, I've become a rather big fan of Apache Thrift, which can be summed up as "remote method invocation Done Right". You can use Thrift to communicate between applications written in different languages, running on different machines, over different communications protocols. You want HTTP? You've got it. Binary foo over a socket? Check. Bind C++ services to PHP or JavaScript? Check and check! JSON over carrier pigeon? Go implement it!
While it isn't explicitly mentioned in the 12-factor app, I'd like to think that there is room within this model for multiple programming languages. I've grown so incredibly tired of language obsession. "My language is better than yours". It's just dick-sizing. It's 2012. There's a myriad of languages out there. Some of them more suited for certain tasks than others. So pick the right one for the job!
Bad Press for Agile
So .. Agile's been getting some bad press of late. Now, these guys are just quacks, and I probably shouldn't feed the trolls here, but I never could resist.
Saying "agile doesn't work" or "agile is only out to sell services(training,certification etc)" is obviously a bogus claim. The same could be said of any software methodology. Many waterfall projects have failed, and many have had the help of process improvement engineers and whatnot. Some projects will always fail. A sound development methodology & culture can help you realise imminent failure earlier, or it can help reduce chances of failure. But no methodology is a guarantee for success. A team of idiots run by idiots will always produce crap. No matter how many buzzwords they fit in their job titles or marketing blurbs.
Agile is many things, but no one has ever claimed it to be a silver bullet. As for it being "for lazy devs": all developers are lazy, it's part of the job description. It's why we automate shit. It's why we focus on code and not on hot air.
My recommendation to you: use whatever works for you. And in doing so, you're already on your way to being Agile.
SSH Gateway Shenanigans
I love OpenSSH. Part of its awesomeness is its ability to function as a gateway. I'm going to describe how I (ab)use SSH to connect to my virtual machines. Now, on a basic level, this is pretty easy to do: you can simply forward different ports to different virtual machines. However, I don't want to mess about with non-standard ports. SSH runs on port 22, and anyone who says otherwise is wrong. Or you could give each of your virtual machines a separate IP address, but then, we're running out of IPv4 addresses and many ISPs stubbornly refuse to use IPv6. Quite the pickle!
ProxyCommand to the rescue!
ProxyCommand in ~/.ssh/config pretty much does what it says on the tin: it proxies ... commands!
Host fancyvm
User foo
HostName fancyvm
ProxyCommand ssh foo@physical.box nc %h %p -w 3600 2> /dev/null
This allows you to connect to fancyvm by first connecting to physical.box. This works like a charm, but it has a couple of very important drawbacks:
- If you're using passwords, you have to enter them twice
- If you're using password protected key files without an agent, you have to enter that one twice as well
- If you want to change passwords, you have to do it twice
- It requires configuration on each client you connect from
Almighty command
Another option is the "command=" option in ~/.ssh/authorized_keys on the physical box:
command="bash -c 'ssh foo@fancyvm ${SSH_ORIGINAL_COMMAND:-}'" ssh-rsa [your public key goes here]
Prefixing your key with command="foo" will ensure that "foo" is executed whenever you connect using that key. In this case, it will automagically connect you to fancyvm when you log in to physical.box using your SSH key. This has a small amount of setup overhead on the server side but it's generally the way I do things. The only real drawback here is that it's impossible to change your public key, which isn't too bad, as long as you keep it secure.
The Actual Shenanigans
The command option is wonderful, but some users can't or won't use SSH key authentication. That's a bit trickier, and here's the solution I've come up with -- but if you have a better one, please do share!
We need three things:
- A nasty ForceCommand script on the physical box
- A user on the physical box (with a passwordless ssh key pair)
- A user on the VM, with the above user's public key in ~/.ssh/authorized_keys
This will grant us the magic ability to log in to the VM by logging in to the physical box. We only have to log in once (because the second part of the login is done automagically by means of the key files). A bit of trickery will also allow us to change the gateway password, which was impossible with any of our previous approaches.
Let's start with a change in the sshd_config file:
Match User foo
ForceCommand /usr/local/bin/vmlogin foo fancyvm "$SSH_ORIGINAL_COMMAND"
This will force the execution of our magic script whenever the user connects. And don't worry, things like scp will work just fine.
And then there's the magic script, /usr/local/bin/vmlogin:
#!/bin/bash
user=$1
host=$2
command="${3:-}"
if [ "$command" = "passwd" ] ; then
bash -c "passwd"
exit 0
fi
command="'$command'"
bash -c "ssh -e none $user@$host $command"
And there you have it, that's all the magic you really need. Everything works exactly as if you were connecting to just another machine. The only tricky bit is changing the gateway password: you have to explicitly provide the passwd command when connecting, like so:
ssh foo@physical.box passwd
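The decision the script makes is simple enough to sketch in Python, which also sidesteps the nested quoting: the command is passed along as a single argument instead of being pasted into a bash -c string. This is an illustration of the logic only, not the script I run, and the function names are made up.

```python
import os


def build_argv(user, host, original_command):
    """Return the argv the gateway should execute for one incoming connection."""
    if original_command == "passwd":
        # Change the password on the gateway itself rather than on the VM.
        return ["passwd"]
    argv = ["ssh", "-e", "none", f"{user}@{host}"]
    if original_command:
        # Passed as one argument; the remote shell on the VM parses it,
        # exactly as if the user had connected to the VM directly.
        argv.append(original_command)
    return argv


def run(user, host, original_command):
    """Replace the current process with the chosen command (does not return)."""
    argv = build_argv(user, host, original_command)
    os.execvp(argv[0], argv)
```

With no command at all you get an interactive login on the VM; with a command (scp's included) it is handed through untouched.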
Symfony2 Manual Login
The Symfony2 documentation is a bit lacking at times, and it took me a while to figure out how to force a user to log in without going through the motions of the security framework and a login form. Useful in cases where you want to switch to a different user, or maybe one-time auto login URLs or whatever.
The code is pretty simple, and in the end all you really have to do is register a Token with the SecurityContext -- and the session if you want to persist the login across multiple requests. The only prerequisite is having access to the Container.
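// Requires these use statements at the top of the class file:
//   use Symfony\Component\DependencyInjection\Exception\InactiveScopeException;
//   use Symfony\Component\Security\Core\Authentication\Token\UsernamePasswordToken;
//   use Symfony\Component\Security\Core\User\UserInterface;
public function forceLogin(UserInterface $user, $firewallName)
{
    // $this->container = the SF2 DI container
    // $this->securityContext = the SF2 SecurityContext
    $token = new UsernamePasswordToken($user, null, $firewallName, $user->getRoles());
    $this->securityContext->setToken($token);

    try {
        // Persist the token in the session so the login survives across requests
        $this->container->get('request')->getSession()->set('_security_' . $firewallName, serialize($token));
    } catch (InactiveScopeException $e) {
        // No worries, no need (or way) to set the token on the session if there is no request object!
    }
}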



