Gnuplot data analysis, real world example

Creating graphs in LibreOffice is a nightmare. They’re ugly, nearly impossible to customize, and creating pivot tables with data is bloody tedious work. In this post, I’ll show you how I took the output of a couple of performance test scripts and turned it into reasonably pretty graphs with a few standard command line tools (gnuplot, awk, a bit of (ba)sh and a Makefile).

The Data

I ran a series of query performance tests against data sets of different sizes. The sets contain 10k, 100k, 1M, 10M, 100M and 500M documents. One of the basic constraints is that it has to be easy to add/remove sets. I don’t want to faff about with deleting columns or updating pivot tables. If I add a set to my test data, I want it to automagically show up in my graphs.

The output of the test script is a simple tab separated file, and looks like this:

#Set	Iteration	QueryID	Duration
500M	1	101	10.497499465942383
500M	1	102	3.9973576068878174
500M	1	103	9.4201889038085938
500M	1	104	2.8091645240783691
500M	1	105	2.944718599319458
500M	1	106	5.1576917171478271
500M	1	107	5.7224125862121582
500M	1	108	5.7259769439697266
500M	1	109	4.7974696159362793

Each row contains the query duration (in seconds) for a single execution of a single query.

Processing the data

I don’t just want to graph random numbers. Instead, for each query in each set, I want the shortest execution time (MIN), the longest (MAX) and the average across iterations (AVG). So we’ll create a little awk script to output data in this format. In order to make life easier for gnuplot later on, we’ll create a file per dataset.

% head -n 3 output/500M.dat

#SET	QUERY	MIN	MAX	AVG	ITERATIONS
500M	200	0.071	2.699	0.952	3
500M	110	0.082	5.279	1.819	3

Here’s the source of the awk script, transform.awk. The code is quite verbose, to make it a bit easier to understand.

{
        # Skip comment lines; aggregate everything else.
        if($0 ~ /^[^#]/) {
                key = $1"_"$3
                sets[$1] = 1
                queries[$3] = 1
                totals[key] += $4
                iterations[key] += 1
 
                if(1 == iterations[key]) {
                        minima[key] = $4
                        maxima[key] = $4
                } else {
                        minima[key] = $4 < minima[key] ? $4 : minima[key]
                        maxima[key] = $4 > maxima[key] ? $4 : maxima[key]
                }
        }
}
 
END {
 
        for(set in sets) {
                outfile = "output/"set".dat"
                print "#SET\tQUERY\tMIN\tMAX\tAVG\tITERATIONS" > outfile
                for(query in queries) {
                        key = set"_"query
                        # Guard against sets that are missing a query,
                        # which would otherwise cause a division by zero.
                        if(!(key in iterations))
                                continue
                        iterationCount = iterations[key]
                        average = totals[key] / iterationCount
                        printf("%s\t%d\t%.3f\t%.3f\t%.3f\t%d\n", set, query, minima[key], maxima[key], average, iterationCount) >> outfile
 
                }
        }
}

This code will read our input data, calculate MIN, MAX, AVG, number of iterations for each query and dump the contents in a tab-separated dat file with the same name as the set. Again, this is done to make life easier for gnuplot later on.
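
Running it is a one-liner. The script expects the output directory to exist, and queries.dat is the name of the raw test output (the same name the Makefile below uses):

% mkdir -p output
% awk -f transform.awk queries.dat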

I want to see the effect of dataset size on query performance, so I want to plot the averages for each set. Gnuplot makes this nice and easy: all I have to do is name my sets and tell it where to find the data. But ah… I don’t want to tell gnuplot what my sets are, because they should be determined dynamically from the available data. Enter a wee shell script that outputs gnuplot commands.

#!/bin/sh
 
# Output plot commands for all data sets in the output dir
# Usage: ./plotgenerator.sh column-number
# Example for the AVG column: ./plotgenerator.sh 5
 
prefix=""
 
echo -n "plot "
for s in `ls output | sed 's/\.dat//'` ;
do
        echo -n "$prefix \"output/$s.dat\" using 2:$1 title \"$s\""
 
        # Plain [ ] keeps this /bin/sh compatible.
        if [ "$prefix" = "" ] ; then
                prefix=", "
        fi
done

This script will generate a gnuplot “plot” command. Each data file gets its own title (this is why we named our data files after their dataset name) and its own colour in the graph. We want to plot two columns: the QueryID and the AVG duration. To make it easy to plot the MIN or MAX columns instead, the column number is parameterized: the $1 value is the number of the column to plot (MIN is column 3, MAX is 4, AVG is 5).
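
With the sets above, the generated command looks something like this (the set order depends on whatever ls feels like, which gnuplot doesn’t care about):

% ./plotgenerator.sh 5
plot "output/10k.dat" using 2:5 title "10k", "output/100k.dat" using 2:5 title "100k", "output/1M.dat" using 2:5 title "1M", ...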

Plotting

Gnuplot will call the plotgenerator.sh script at runtime. All that’s left to do is write a few lines of gnuplot!

Here’s the source of average.gnp

#!/usr/bin/gnuplot
reset
set terminal png enhanced size 1280,768
 
set xlabel "Query"
set ylabel "Duration (seconds)"
set xrange [100:]
 
set title "Average query duration"
set key outside
set grid
 
set style data points
 
eval(system("./plotgenerator.sh 5"))
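
Swapping the column number gets you the other graphs; a min.gnp would be identical except for the title and that last line:

eval(system("./plotgenerator.sh 3"))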

The result

% ./average.gnp > average.png

(Graph: average query duration per query, one coloured series per data set.)

Wrapping it up with a Makefile

I don’t like having to remember which steps to execute in which order, and instead of faffing about with yet another shell script, I’ll throw in another *nix favourite: a Makefile.

It looks like this:

average:
        rm -rf output
        mkdir output
        awk -f transform.awk queries.dat
        ./average.gnp > average.png

Now all you have to do is run

make

whenever you’ve updated your data file, and you’ll end up with a nice’n purdy new graph. Yay!
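
If you’d rather have make skip the work when nothing has changed, a dependency-based variant of the same Makefile (assuming the raw data lives in queries.dat) could look like this:

average.png: queries.dat transform.awk plotgenerator.sh average.gnp
        rm -rf output
        mkdir output
        awk -f transform.awk queries.dat
        ./average.gnp > average.png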

Having a bit of command line proficiency goes a long way. It’s so much easier and faster to analyse, transform and plot data this way than it is using graphical “tools”. Not to mention that you can easily integrate this with your build system; that way, each new build can ship with up-to-date performance graphs. Just sayin’!

Note: I’m aware that a lot of this scripting could be eliminated in gnuplot 4.6, but it doesn’t ship with Fedora yet, and I couldn’t be arsed building it.

What bugs me on the web

2013 is nearly upon us, and the web has come a very long way in the ~15 years I’ve been a netizen. And yet, even though we’ve made so many advances, it sometimes feels like we’ve been stagnant, or worse, regressed in some cases.

Each and every web developer out there should have a long, hard think about how the web has (d)evolved in their lifetime and which way we want to head next. There’s an awful lot happening at the moment: web 2.0, HTML 5, Flash’s death-throes, super-mega-ultra tracking cookies, EU cookie regulation nonsense, microdata, cloud fun, … I could go on all day. Needless to say: it’s a mixed bunch.

In any event, here’s a brief list of 3 things that bug me on the web.

Links are broken

Usability has long been the web’s sore thumb, and in spite of any number of government-sponsored usability certification programmes over the years, people still don’t seem to give a rat’s arse. Websites are still riddled with nasty drop-down menus that only work with a mouse. Sometimes they’re extra nasty by virtue of being ajaxified. At least Flash menus are finally going the way of the dinosaur.

Pro tip: every single bloody link on your web site should have a working HREF, so people can use it without relying on click handlers, mice or javascript, and so they can open the bloody thing in a new tab without going through hell and back.

Bonus points: make your links point to human-readable URLs.

Languages, you’re doing it wrong

The web is no longer an English-only or US-only playing field, and companies all over are starting to cotton on to this fact. What they have yet to realise, however, is that people don’t necessarily speak the language you think they do. If you rely on geolocation data to serve up translated content: stop. You’re doing it wrong. The user determines the language. Believe it or not, people do know which language(s) they speak.

Geolocation, for starters, isn’t an exact science: depending on the kind of device, it can be very accurate, or very much not. Proxies, VPNs, onion routers and the like can obviously mislead your tracking. And even accurate geolocation tells you nothing useful. It doesn’t tell you why that person is there (maybe they’re on holiday?). It also doesn’t tell you what language is spoken there. This might be a shock to some people, but some countries have more than one official language. Hell, some villages do. Maybe you can find this data somewhere and correlate it with the location, but you’d be wrong to. Language is a very sensitive issue in some places. Get it right, or pick a sensible default and make clear that it was a guess. Don’t be afraid to ask for user input.

Pro tip: My favourite HTTP header: Accept-Language. Every sensible browser sends this header with every request. In most cases, the default is the browser’s or OS’s language, which is nearly always the user’s first language. And when it’s not, at least you know the user understands it well enough to be able to use a browser.

Bonus points: Seriously, use Accept-Language. If you don’t, you’re a dick.
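
To show how little work this is, here’s a minimal sketch of language negotiation in a plain CGI shell script. HTTP_ACCEPT_LANGUAGE is the standard CGI variable for the header; the serve_* functions are made up for the example, and proper q-value parsing is left as an exercise:

#!/bin/sh
# Grab the first (most preferred) language tag, e.g. "nl-BE" out of
# "nl-BE,nl;q=0.9,en;q=0.8". Fall back to English if the header is absent.
lang=$(printf '%s' "${HTTP_ACCEPT_LANGUAGE:-en}" | cut -d, -f1 | cut -d';' -f1)

# serve_* are whatever your app does; they're hypothetical here.
case "$lang" in
        nl*) serve_dutch ;;
        fr*) serve_french ;;
        *)   serve_english ;;
esac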

Clutter is back

Remember how, back in 1999, we all thought Google looked awesome because it was so clean & crisp and didn’t get in your face and everyone copied the trend? Well, that seems to have come to an end.
Here’s Yahoo in 1997. (I love how it has an ad for 256MB of memory.)
Here’s Yahoo now.

The 1997 version was annoying to use (remember screen resolutions in the 90s? No? You’re too young to read this, go away) because it was so cluttered.
The 2012 version is worse and makes me want to gouge my eyes out.

Even Google is getting all in your face these days, with search-as-you-type and whatnot. Bah. DuckDuckGo seems to be the exception (at least as far as search engines go). It offers power without wagging it in your face.

Pro tip: don’t put a bazillion things on your pages. Duh.

2013 Wishlist

My web-wishlist for 2013 is really quite simple: I want a usable web. Not just for people with the latest and greatest javascript-enabled feast-your-eyes-on-this devices. For everyone. Including those who use text-to-speech, or the blind, or people on older devices. Graceful degradation is key to this. So please, when you come up with a grand feature, think about what we might be giving up as well. Don’t break links. Don’t break the back button. Don’t break the web.

Bad Press for Agile

So… Agile’s been getting some bad press of late. Now, these guys are just quacks, and I probably shouldn’t feed the trolls here, but I never could resist.

Claims like “agile doesn’t work” or “agile is only out to sell services (training, certification, etc.)” are obviously bogus. The same could be said of any software methodology. Many waterfall projects have failed, and many of those had the help of process improvement engineers and whatnot. Some projects will always fail. A sound development methodology and culture can help you realise imminent failure earlier, or it can help reduce the chances of failure. But no methodology is a guarantee of success. A team of idiots run by idiots will always produce crap, no matter how many buzzwords they fit in their job titles or marketing blurbs.

Agile is many things, but no one has ever claimed it to be a silver bullet. As for it being “for lazy devs”: all developers are lazy, it’s part of the job description. It’s why we automate shit. It’s why we focus on code and not on hot air.

My recommendation to you: use whatever works for you. And in doing so, you’re already on your way to being Agile :-).

SSH Gateway Shenanigans

I love OpenSSH. Part of its awesomeness is its ability to function as a gateway. I’m going to describe how I (ab)use SSH to connect to my virtual machines. On a basic level this is pretty easy to do: you can simply forward different ports to different virtual machines. However, I don’t want to mess about with non-standard ports. SSH runs on port 22, and anyone who says otherwise is wrong. Or you could give each of your virtual machines a separate IP address, but then, we’re running out of IPv4 addresses and many ISPs stubbornly refuse to use IPv6. Quite the pickle!

ProxyCommand to the rescue!

ProxyCommand in ~/.ssh/config pretty much does what it says on the tin: it proxies … commands!

Host fancyvm
        User foo
        HostName fancyvm
        ProxyCommand ssh foo@physical.box nc %h %p -w 3600 2> /dev/null 

This allows you to connect to fancyvm by first connecting to physical.box. This works like a charm, but it has a couple of very important drawbacks:

  1. If you’re using passwords, you have to enter them twice
  2. If you’re using password protected key files without an agent, you have to enter that one twice as well
  3. If you want to change passwords, you have to do it twice
  4. It requires configuration on each client you connect from
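
Incidentally, if your client’s OpenSSH is recent enough (5.4 or newer, if memory serves), you can drop the nc dependency and let ssh do the forwarding itself with the -W flag. Same idea, same drawbacks:

Host fancyvm
        User foo
        HostName fancyvm
        ProxyCommand ssh -W %h:%p foo@physical.box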

Almighty command

Another option is the “command=” option in ~/.ssh/authorized_keys on the physical box:

command="bash -c 'ssh foo@fancyvm ${SSH_ORIGINAL_COMMAND:-}'" ssh-rsa [your public key goes here]

Prefixing your key with command="foo" will ensure that “foo” is executed whenever you connect using that key. In this case, it will automagically connect you to fancyvm when you log in to physical.box using your SSH key. This has a small amount of setup overhead on the server side, but it’s generally the way I do things. The only real drawback here is that it’s impossible to change your public key, which isn’t too bad, as long as you keep it secure.
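
In day-to-day use it’s completely transparent. Thanks to SSH_ORIGINAL_COMMAND, remote commands should end up on the VM too:

% ssh foo@physical.box            # drops you into a shell on fancyvm
% ssh foo@physical.box uptime     # runs uptime on fancyvm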

The Actual Shenanigans

The command option is wonderful, but some users can’t or won’t use SSH key authentication. That’s a bit trickier, and here’s the solution I’ve come up with — but if you have a better one, please do share!

We need three things:

  1. A nasty ForceCommand script on the physical box
  2. A user on the physical box (with a passwordless ssh key pair)
  3. A user on the VM, with the above user’s public key in ~/.ssh/authorized_keys

This will grant us the magic ability to log in to the VM by logging in to the physical box. We only have to log in once (because the second part of the login is done automagically by means of the key files). A bit of trickery will also allow us to change the gateway password, which was impossible with any of our previous approaches.

Let’s start with a change in the sshd_config file:

Match User foo
        ForceCommand /usr/local/bin/vmlogin foo fancyvm "$SSH_ORIGINAL_COMMAND"

This will force the execution of our magic script whenever the user connects. And don’t worry, things like scp will work just fine.

And then there’s the magic script, /usr/local/bin/vmlogin:

#!/bin/bash
# vmlogin: forward an incoming gateway login to a VM.
# Called by sshd as: vmlogin <user> <host> "<original command>"

user=$1
host=$2
command="${3:-}"

# Special case: let the user change their password on the gateway itself.
if [ "$command" = "passwd" ] ; then
        bash -c "passwd"
        exit 0
fi

# Everything else is passed straight through to the VM.
command="'$command'"
bash -c "ssh -e none $user@$host $command"

Update 2016

The above script no longer works with SFTP on CentOS 7 with Debian guests. Not sure why, and I’m too lazy to find out. So here’s a script that works around the problem.

#!/bin/bash
# vmlogin, 2016 edition: same as above, plus an SFTP workaround.

user=$1
host=$2
command="${3:-}"

if [ "$command" = "passwd" ] ; then
        bash -c "passwd"
        exit 0
fi

# SFTP has been fucking up. This ought to fix it.
if [ "$command" = "/usr/libexec/openssh/sftp-server" ] || [ "$command" = "internal-sftp" ] ; then
        bash -c "ssh -s -e none $user@$host sftp"
        exit 0
fi

command="'$command'"
bash -c "ssh -e none $user@$host $command"

And there you have it, that’s all the magic you really need. Everything works exactly as if you were connecting to just another machine. The only tricky bit is changing the gateway password: you have to explicitly provide the passwd command when connecting, like so:

ssh foo@physical.box passwd

Symfony2 and Jenkins

I was a bit surprised to see that Symfony2 doesn’t come with an ant build file for Jenkins by default, so I spent a bit of time whipping one up for you (well, for me, really): you can get it here.

Maybe it’ll work for you, maybe it won’t. The project I’m working on has the complete SF2 distribution in version control and all of our bundles in the src folder. It’s easier to test & ship the code this way. If you want to test a specific bundle or don’t want to have SF2 in version control, then you’re on your own :-).

On donating to open source

Some time ago, Gabriel Weinberg, the guy behind the excellent Duck Duck Go search engine, started a bit of a discussion about open source donations. He set up a website where companies can make a pledge to donate x% of their profits to worthy projects. I donated 15% of my company’s estimated profit this year. Half went towards an individual hacker who will remain nameless and genderless. The other half went to the OpenBSD project, because face it, where would you be without OpenSSH?

FOSS Tithe.org

The Agile Samurai: Mini book review

The Pragmatic Programmers have published quite a few books over the years, and my bookshelf contains nearly a dozen of them. The latest addition is The Agile Samurai: How Agile Masters Deliver Great Software.

It’s essentially a high level overview of agile practices, with a focus on the why and the how. Unlike some other books on Agile, this one tries to remain pretty neutral when it comes to methodologies. XP, Scrum, Kanban are all briefly mentioned, but the author managed to boil Agile down to its essentials: common sense, being goal oriented and having a willingness to improve.

The chapter on Agile Planning is a particular treat. It could easily have been called “the idiot-proof guide to agile planning”, because really, it’s that good. Concepts like velocity and burndown are illustrated with pretty graphs. Not only does the book explain how to apply agile planning, but by the end of the chapter you’ll also know why it’s a good idea. The phrase “Why does reality keep messing with my Gantt chart!?!” sums it up pretty nicely.

I have just one problem with the book: the Samurai theme could’ve been explored a bit better. For starters, this here Samurai is wearing his swords on the wrong hip. Second, his name, Master Sensei, is a bit silly; Ō-sensei would’ve been much more appropriate. But in all seriousness, the whole Samurai theme could’ve been expanded on. There are many similarities between software development and martial arts in general, mostly when it comes to drive and focus, a bit less so when it comes to actual sword wielding. Still, it doesn’t detract from the book, so all is well.

All in all, a pretty good book. If you’re an Agile Veteran, you won’t need it, but maybe your pesky manager or team leader could benefit from it …

Automating Maven Releases

Automating Maven releases should be pretty straightforward in non-interactive mode. A bug in the release plugin made it impossible in my situation: every time I provided the release version(s) as command line arguments, the release plugin would choke on me with the following error message:

Error parsing version, cannot determine next version: Unable to parse the version string

The following shell script works around this problem by redirecting input to the Maven execution.

Note: I’m releasing a project with a parent and 2 child modules, which is why I have to specify three versions (+ 1 SCM tag). If you’re not using multiple modules, or are using more, you’ll have to adjust the script accordingly.

#!/bin/sh

releaseVersion=AmazingRelease1
nextVersion=AmazingRelease2

# Feed the interactive prompts through a here document: one release
# version per module (parent + 2 children), the SCM tag (reusing the
# release version), then one development version per module.
mvn \
    release:prepare -P production >> /tmp/build.log 2>&1 << EOS
$releaseVersion
$releaseVersion
$releaseVersion
$releaseVersion
$nextVersion-SNAPSHOT
$nextVersion-SNAPSHOT
$nextVersion-SNAPSHOT
EOS

mvn release:perform -P production >> /tmp/build.log 2>&1

This is an abridged version of our full release script. The full version asks the user to enter the release version once, then releases several versions using different profiles and creates a distribution set with all versions and a bunch of documentation. This works in my situation, but if your release procedure is more complicated then you can just expand on the script :-).

Test design – Equivalence Classes

During a recent job interview, I was asked to write some code — I know, shocking! The idea was that several test cases had been defined, and that I was to implement a relatively simple class that would make the tests pass. The problem was pretty simple, so I won’t bore you with it.

What was shocking, however, was how poorly designed the tests were. Boundary cases were largely untested, and it seemed like someone spent an inordinate amount of time writing useless tests. When I brought this up during the interview, the person who wrote the tests seemed surprised that they weren’t very good, because he got nearly 100% code coverage on the implementation he created.

While code coverage is all fine and dandy, it doesn’t actually say anything about the quality of your tests. Maybe his implementation would’ve worked perfectly, even with strange values and edge-cases. Maybe not. We’ll never know.

Equivalence Partitioning is one of the simplest test-design techniques. As the name pretty much implies, the idea is to partition possible input values into equivalent classes. Sounds like a bunch of gibberish? Let me illustrate with a classic example. Liquor laws.

As you can tell from the image, if you’re under 16, you’re not allowed any alcoholic beverages. Once you turn 16, you’re allowed to have beer and other non-spirits. Once you turn 18, you hit the jackpot and can drink whatever tickles your fancy.

The red, yellow and green areas are the three Equivalence Classes for this problem. Whether you’re newborn, 5, 11 or 15, it doesn’t matter: you’re not getting a drink. And once you’re 18, your age stops mattering entirely.

Once you have this information, you can design a couple of test cases. In this case, you could start off by designing a test case for each class. The exact age for each test you pick doesn’t matter, as long as it’s in the class you’re testing – or outside of it if that’s what you’re testing.

So that’s three easy tests. Then it’s time to apply a bit of Boundary Value Analysis. After all, it’s so very easy to create off-by-one errors.

Boundaries are the areas where equivalence classes meet. The boundaries in this case are 16 and 18. When you look at the boundaries you’ve defined, you’ll want to look very carefully at your specifications again. Someone’s just turned 16 on this very day. Does that mean they can have a drink? Or not? Once you have the answer, create a test case. Then do the same for all other boundaries.

With five test cases, one for each boundary and one for each equivalence class, you’ll have tested this very thoroughly. Additional test cases can be added to test invalid input. What happens if you try to pass a person with a negative age? What if the age is a million years?
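
To make that concrete, here’s a throwaway sketch in shell, this blog’s lingua franca. I’m assuming the law means you qualify on your birthday itself, and that a negative age is invalid input; adjust to your actual spec:

#!/bin/sh
# Classify what someone may drink, one branch per equivalence class.
allowed() {
        age=$1
        if [ "$age" -lt 0 ] ; then echo "invalid"
        elif [ "$age" -lt 16 ] ; then echo "nothing"
        elif [ "$age" -lt 18 ] ; then echo "beer and other non-spirits"
        else echo "anything"
        fi
}

# One test per class (5, 17, 25), one per boundary (16, 18),
# plus one for invalid input (-1).
for age in -1 5 16 17 18 25 ; do
        echo "$age: $(allowed $age)"
done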