Rabbit on the treadmill: Run Rabbit, Run!

July 14, 2009

For the last couple of months I’ve been working on rewriting RabbitMQ’s persister so that it will scale to volumes of data that won’t fit in RAM, and will perform consistently across a wide variety of use cases. This work is coming to a conclusion now, and although the code is not yet released, nor has it even been through QA, benchmarking it thoroughly helps us understand what’s good and what’s bad about the new design. In this post I’m not going to do any before-and-after comparisons — they’ll be coming in due course. Instead, I’m going to use RabbitMQ to benchmark hard discs — an SSD, and a normal rotating hard disc. As someone said at the presentation we gave at the recent Erlang Factory, “using SSDs are just like RAM”. Cue expectations of a turbo-charged, overclocked, overvolted Rabbit, with liquid-nitrogen cooling.

So, I arrived at my desk on Monday morning to find Father Christmas had woken early from his year-long hangover and dropped off an OCZ Vertex SSD 60GB. Now, I have read the massive article on Anandtech about how SSDs work, how most of them except the Intel ones have awful write performance for anything except sequential writes, and how the new OCZ Vertex range has changed that: it has pretty good performance and, usefully, is not as cripplingly expensive as the Intel SSDs.

I pretty much spent Monday getting its firmware upgraded (ugh, Windows, DOS etc.), getting it all set up, and just hammering the hell out of it: playing with lots of different filesystems (ext2, ext3, ext4, xfs, btrfs) and dialling everything up to 11. But by the end of the day, I had a feeling that not all was well. It seemed to be going slower than it had been initially, and I couldn’t really get it to go that much faster. Overnight, I remembered that yes, SSD performance does degrade once all the empty sectors of the disk have been written to once, because to fill in the remaining gaps, the entire sector has to be erased and rewritten. So the degradation was to be expected, and the drive now seemed well worn in. Time to benchmark it properly.

So, the SSD is 60GB, with a 64MB cache, running OCZ’s 1.30 firmware. The spinning disk is a Western Digital Caviar SE16, 320GB with a 16MB cache (WD3200AAKS). Both disks are formatted using ext3, with exactly the same options, and mounted with data=ordered.

Test 1: Start Rabbit, create a queue, set the queue to disk-only mode (this is a new feature), and send in 3 million 1024-byte messages. For measurement, I’m taking microseconds since epoch and running iostat, roughly every 0.4 seconds. It doesn’t matter if the intervals between calls to iostat aren’t totally even, because I’m capturing the timestamp too.
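The measurement loop can be sketched in a few lines of Python (a toy version, not the actual harness used for these graphs): pair every probe reading with its own timestamp, and uneven intervals stop mattering.

```python
import time

def sample(probe, interval=0.4, samples=5):
    """Call `probe` roughly every `interval` seconds, recording a
    (timestamp, reading) pair each time. Because the timestamp is
    captured alongside the reading, the gaps between samples don't
    have to be perfectly even."""
    readings = []
    for _ in range(samples):
        readings.append((time.time(), probe()))
        time.sleep(interval)
    return readings
```

In the real runs, `probe` would shell out to iostat and capture its output; here it can be any callable.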

Writing 3 million 1024-byte messages

I’ve divided the 3-million into 3 runs, each of 1 million messages, but the queue isn’t emptied in between runs — i.e. by the end, the queue did have 3 million messages in it.

What can we see here? Well, the spinning disk is just faster: it gets to the end of each million at least 10 seconds sooner. Is either disk saturated? No. The writes are pretty simple — the message content itself gets appended to plain files, so there is almost no seeking going on there. However, we use mnesia to maintain an index into these files. Mnesia is running our table in disc_copies mode, so from time to time it’ll decide to dump out to the disk. That’ll be in a different part of the disk and will cause some seeking, but it really should be another large bulk write. Also note that I’m plotting just writes; there were no reads going on at all during this test. CPU load is fairly high, but quite a lot of the time XOSView shows that processes are stalled waiting on IO to complete.

So you think that one doesn’t look too bad? We know that SSDs have traditionally been optimised for sequential performance at the expense of random access and latency. So let’s go from 3,000,000 1024-byte messages to 300,000 10,240-byte messages. This should suit the SSD better, right?

Writing 300,000 10,240-byte messages

Wrong! It’s even worse, and I promise you I’ve not got the two sets of data reversed!

Some of you may be wondering why each run seems to write a different amount of data to disk. That puzzled me too. Our best guess is that it’s affected by the filesystem doing coalescing of writes, eg of metadata, and the interaction between the barriers there with the dumps coming from Mnesia. Please let us know if you have further ideas!
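To make the write path concrete, here is a toy Python sketch of the layout described above: message bodies appended to a flat file with a length prefix, and a separate index mapping message id to offset. This is purely illustrative (an assumption about the general shape, not Rabbit’s actual Erlang code), with a plain dict standing in for the mnesia table.

```python
import os
import struct

class AppendStore:
    """Toy append-only message store: payloads go to the end of one
    flat file (sequential writes, minimal seeking), while an index
    maps message id -> (offset, length). In Rabbit the index lives
    in an mnesia disc_copies table; a dict stands in for it here."""

    def __init__(self, path):
        self.f = open(path, "ab+")
        self.index = {}

    def publish(self, msg_id, payload):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(struct.pack(">I", len(payload)))  # 4-byte length prefix
        self.f.write(payload)                          # message body
        self.index[msg_id] = (offset, len(payload))

    def fetch(self, msg_id):
        offset, length = self.index[msg_id]
        self.f.seek(offset + 4)  # skip the length prefix
        return self.f.read(length)
```

The point of the sketch is the access pattern: publishes are pure appends, and only the index dump touches a different part of the disk.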

Finally, one thing we all agree on is that SSDs have awesome read performance. Running the venerable dd shows that the SSD can happily sit there reading at 150MB/s, so this really should outperform the spinning disk, right? The test, then, is to set up an auto-acking consumer and read out those 3 million 1024-byte messages. Here I used free, dd, and /dev/zero to make sure that, before starting, the OS did not have any caches of the files Rabbit would need to be reading from. Also, as the deliveries occur, there will be writes, as we have to update the Mnesia tables to indicate the messages have been delivered (and ack’d).
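For anyone wanting to reproduce the cache-eviction step: the dd trick amounts to streaming more zeros through the page cache than you have RAM, then checking with free that the cache has shrunk. A hypothetical Python equivalent (the sizes below are placeholders; in practice you want something comfortably larger than RAM):

```python
import os

def evict_page_cache(scratch_path, total_bytes, block=1024 * 1024):
    """Stream `total_bytes` of zeros into a scratch file so the kernel
    evicts older cached pages, then delete the scratch file. Returns
    the number of bytes written (rounded up to a whole block)."""
    zeros = b"\0" * block
    written = 0
    with open(scratch_path, "wb") as f:
        while written < total_bytes:
            f.write(zeros)
            written += block
    os.remove(scratch_path)
    return written
```

This mirrors what `dd if=/dev/zero of=scratch bs=1M count=...` does; it is a sketch of the idea, not the exact commands used.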

Reading 3 million 1024-byte messages

So here we find the SSD and spinning disk basically matched for performance. A result! The best thing about this graph is how easy it is to see the size of the Mnesia table. It starts off with 3 million rows in it, so when it is dumped, the “step” in the writes is quite large. As the table gets smaller and smaller, the dumps get smaller too. Brilliant!

Some quick maths is even more exciting: 3 million 1024-byte messages suggests that we should read, well, about 3GB. We actually seem to read a bit more, but there are some fixed overheads in the file format (length prefixes, trailing status bytes, etc.), so the amount of data read seems about right. What’s a little surprising is that in the course of reading 3GB, we actually write out nearly 6GB in updates to the Mnesia table. Of course, this won’t amount to 6GB of disk space, because we’re constantly rewriting the same table, but it is nevertheless rather eye-opening.
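Spelling out that arithmetic:

```python
msgs, size = 3_000_000, 1024
payload_gib = msgs * size / 1024**3   # raw message bodies
print(f"{payload_gib:.2f} GiB")       # ~2.86 GiB, i.e. "about 3GB"

# ~6GB of Mnesia updates written while reading those bodies back:
amplification = 6.0 / payload_gib
print(f"{amplification:.1f}x")        # roughly 2x as much written as read
```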

Looking back at the writing graphs, we see a similar story. The raw data being written is about double the amount of message data: each trace amounts to about 1GB (i.e. 1 million 1024-byte messages or 100,000 10,240-byte messages), and yet we see between 2GB and 2.5GB of data actually being sent to disk. This is, if not alarming, then certainly somewhat eye-opening.

The bottom line, however, is no, SSDs are not just like RAM, and certainly as far as Rabbit is concerned, for high-throughput operation they are nowhere near viable as replacements for spinning disks. Write latency may have improved over the initial models, and random access may have too, but at the end of the day, for our particular access patterns of reads and writes, the spinning disks still win.


11 Comments

  1. Tim says:

    While a very interesting read, you need to pick better graphing software and/or line colors because your first and third graphs are unreadable :p

  2. matthew says:

    @Tim

    If you click on the graphs, they get bigger! I would render them SVGs except I’m not quite sure how widespread SVG rendering is. Is it supported on IE yet?

  3. Artur Bergman says:

    Hi,

    Have you done similar tests with the Intels?

    Also, is this data continuously fsynced() or not?

    Cheers
    Artur

  4. matthew says:

    @Artur,

    We have not tested with the Intels, no. Really, I would expect that right now the fastest throughput would be achieved by a RAID0 array of good fast spinning disks. Certainly best performance per money.

    The data is fsynced only when necessary – this is something I’ve heavily optimised inside the new persister. As such, in these write tests there’ll be under a dozen fsyncs going on, at least as far as the raw message content is concerned, and on the read test, there’ll be no fsyncs going on. That said, I wouldn’t be at all surprised to find that when mnesia does its dumps to disk from time to time, there are fsyncs going on there.
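To illustrate the coalescing idea (a toy Python sketch of the general technique, not the Erlang persister code): buffer writes and only fsync when a caller actually needs durability, so thousands of writes can share a handful of fsyncs.

```python
import os

class CoalescedSync:
    """Toy fsync-coalescing wrapper: writes just mark the file dirty;
    an fsync happens only when durability is explicitly required,
    and only if there is unsynced data."""

    def __init__(self, f):
        self.f = f
        self.syncs = 0
        self.dirty = False

    def write(self, data):
        self.f.write(data)
        self.dirty = True

    def require_durable(self):
        if self.dirty:
            self.f.flush()
            os.fsync(self.f.fileno())
            self.syncs += 1
            self.dirty = False
```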

  5. Artur Bergman says:

    @matthew

    For our uses db and http cache, the intel SSDs give us significantly better price/performance than spinning media.

  6. matthew says:

    @Artur

    That’s interesting. My immediate guess is that your access patterns are very read-dominated, but I also have a suspicion you’re about to tell me they’re not?!

  7. Justin Pitts says:

    Did you run this on Linux? If so, what kernel version, and what IO scheduler?

  8. matthew says:

    @Justin

    All good questions which I should have included in the article, sorry. It’s a 64-bit 2.6.30 kernel.

    CONFIG_DEFAULT_IOSCHED=”cfq”

    cat /sys/block/sd*/queue/scheduler

    noop anticipatory deadline [cfq]
    noop anticipatory deadline [cfq]
    noop anticipatory deadline [cfq]

    So it looks like I was using the cfq. But I hadn’t actually thought about that at all I’m afraid…

  9. matthew says:

    Ugh, epic win for markdown interpretation of comments… sigh.

  10. Artur Bergman says:

    @matthew they are more read heavy yes, but the DB servers do a fair amount of writes

Yeah, you want to use the noop scheduler, and probably a non-journaling filesystem, plus turn off readahead

    also, in the code, madvise helps a lot

  11. matthew says:

    @Artur

    Err, yes, I would love access to madvise. I would also love access even to mmap. However, Erlang provides me with neither. All it gives is basic fread/fwrite. Haskell, it ain’t…!

    I’ll try again with different IO schedulers and ext2 (or ext3 with no journal) when I manage to put the Rabbit back together again — it’s all over the floor right now in lots of little bits…
