Cranial Surgery: Giving Rabbit more Memory

Many users of Rabbit have been asking us how Rabbit copes with many large messages in queues, to the extent that the total size of these messages exhausts the available physical memory (RAM). As things stand at the moment, the answer is: not very well. Although we have a persistence mechanism, that is not quite an answer either, because whilst it does ensure that messages are written to disk, it does not remove messages from RAM. So, we’ve been looking at writing a disk-based queue so that, should RAM become tight, we can start to push messages out to disk and collect them from there later.

However, there is this thing called swap, and it seems wise to test how Rabbit copes when we just allow it to expand into swap. The current releases of Rabbit monitor memory usage, and by default use Channel.Flow to tell publishing clients to stop sending messages when memory gets tight. However, if you start up Rabbit with -rabbit memory_alarms false then the memory monitoring does not occur and so clients will not be told to stop sending messages when we run out of memory. This means we can just start hammering more and more messages into Rabbit and exhaust RAM. Cue fitting an extra 160GB hard disc to be used solely as swap.
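
As a point of reference for how that flag takes effect: arguments of the form -rabbit Key Value on the Erlang command line become entries in the rabbit application’s environment. A minimal check from the Erlang shell of a running node might look like this (the key name is taken from the flag above; this is illustrative rather than a documented interface):

    %% "-rabbit memory_alarms false" on the command line sets the
    %% 'memory_alarms' key in the rabbit application's environment;
    %% application:get_env/2 reads it back ('undefined' if unset).
    application:get_env(rabbit, memory_alarms).
    %% => {ok, false} when started with the flag above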

Quickly, we hit another problem. The OTP platform, which sits atop Erlang and provides a series of common behaviours for Erlang processes, has a couple of places where it specifies default timeout values of 5 seconds on replies coming back to messages. When the whole computer is stalled swapping out pages, these timeouts can often be exceeded, and so we went through the code base and set all such timeouts to infinity. This does not alter behaviour in the non-this-computer-is-in-a-lot-of-pain case, but when the computer is unwell, it allows Erlang to soldier on regardless (albeit somewhat more slowly!). For the brave souls among you who wish to test this for yourself, hg clone/pull from the usual repository and update to the latest on the default branch.
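
The best-known such timeout is in gen_server:call, whose two-argument form waits 5 seconds for a reply before crashing the caller; passing infinity as the third argument is the sort of change described above:

    %% Equivalent to gen_server:call(Pid, Request, 5000): the caller
    %% exits with a timeout error if no reply arrives within 5 seconds.
    Reply1 = gen_server:call(Pid, Request),
    %% With 'infinity', the caller waits indefinitely, so a reply that
    %% is merely delayed by heavy swapping no longer kills the caller.
    Reply2 = gen_server:call(Pid, Request, infinity).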

Unless it turns out that swapping just works extremely well, it’s pretty likely that we’re going to be writing our own disk-backed queue, and if we do, we need to be able to demonstrate that it was worthwhile — i.e. it works better than just using swap. Thus we need to measure the performance when using swap to give us something to compare against. So, we have two tests. Before getting on to the differences, I’ll start by mentioning the similarities. All message payloads are 10MB in size. Both the client and server are run on the same machine and are communicating using the loopback network device. The machine has an Intel Core2 Quad CPU Q9400 running at 2.66GHz, and 4GB of RAM. When the tests are started, about 3GB of that RAM is available. Each test is started on a fresh running instance of Rabbit, with an empty database. The kernel is the Debian stock 64-bit 2.6.28 kernel, and I’m using Erlang R12B-3 (Debian version: 1:12.b.3-dfsg-4). When fetching messages, Basic.Get is used and no-ack is turned on. I used the Erlang AMQP client.
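
For the fetch side, here is a sketch of what a no-ack basic.get looks like in the Erlang client (module and record names follow the client as it exists today and may differ from the 2009 API; fetch_one/2 is a hypothetical helper):

    %% Assumes the client's records are in scope, e.g. via
    %% -include_lib("amqp_client/include/amqp_client.hrl").
    fetch_one(Channel, Queue) ->
        %% no_ack = true: the broker treats the message as delivered
        %% the moment it is sent, so no basic.ack follows.
        case amqp_channel:call(Channel, #'basic.get'{queue = Queue,
                                                     no_ack = true}) of
            {#'basic.get_ok'{}, Content} -> {ok, Content};
            #'basic.get_empty'{}         -> empty
        end.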

The first test type is pushing in N messages and then pulling them back out again. I capture the elapsed time for each action (be it a publish or a get), and then graph the results.
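
The timing itself is nothing fancy; a sketch of the sort of wrapper involved (time_op/1 is a hypothetical helper; note that timer:now_diff/2 reports microseconds, which becomes relevant in the update at the end of this post):

    %% Times a single operation, returning {Microseconds, Result}.
    %% erlang:now() yields {MegaSecs, Secs, MicroSecs}; timer:now_diff/2
    %% computes the difference between two such values in microseconds.
    time_op(Fun) ->
        T0 = erlang:now(),
        Result = Fun(),
        {timer:now_diff(erlang:now(), T0), Result}.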

So, when N is 64, 128 or 256, it’s not really too exciting. This is easily explained: 256 10MB messages easily fit into the 3GB of RAM available. Thus there’s not much to report on. First let’s see the cumulative time graphs. Note the axes — we have time on the y-axis, not the x-axis. So a steeper gradient means slower performance. (Click on any of the images to get them a bit bigger.)

Next we can take the first differential of these graphs and see how much time is being spent on each operation. The y-axis is now logarithmic:
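
The differencing is nothing more than successive subtraction over the cumulative series; a minimal helper, assuming an ordered list of cumulative timings:

    %% Per-operation times from a cumulative series: t(n) - t(n-1).
    diffs([A, B | Rest]) -> [B - A | diffs([B | Rest])];
    diffs(_)             -> [].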

In all cases we see that getting messages is slightly faster than publishing messages, and that as the number of messages in the system, and hence memory used, increases, we see slightly bigger spikes — this, I’m guessing, is the garbage collector having more work to do, but so far, nothing too surprising. Now let’s see what happens when we ramp up to 512 messages. This is 5GB of data; there’s only 4GB of RAM in the box, and only 3GB is free at the start of the test. So it’s pretty certain we’re going to hit swap.

Everything’s going along just fine until we get to about 310 messages in the Rabbit, and then performance starts to become somewhat less predictable. Fetching messages is on the whole slower than before, though on the differential graph we do see some spikes showing that there are periods where performance recovers. Presumably this correlates with large numbers of pages being swapped back in, allowing Rabbit to run reasonably quickly for short periods of time.

Just for fun, I also did this with N as 1024, though as it took 20 mins to run, I only did this test once:

It’s clear here that publishing when we’ve run out of RAM isn’t too bad, and this makes sense — all that is required is that a page is swapped out and we’re given a new page to write to. Getting messages is much slower as we may have to both read from and write to swap.

The next test is more interesting. For a given N, start by publishing N messages, then publish-and-fetch-a-message 2N times, and finally drain the remaining N messages. Fewer graphs this time, just one before we hit swap, where N is 64:
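
As a sketch, the shape of this test is as follows (publish_one/1 and get_one/1 being hypothetical helpers wrapping basic.publish and the basic.get call sketched earlier):

    %% Phase 1: fill the queue with N messages.
    %% Phase 2: 2N rounds of publish-then-get, holding the depth at N.
    %% Phase 3: drain the remaining N messages.
    run_test(Channel, N) ->
        lists:foreach(fun(_) -> publish_one(Channel) end,
                      lists:seq(1, N)),
        lists:foreach(fun(_) ->
                          publish_one(Channel),
                          get_one(Channel)
                      end,
                      lists:seq(1, 2 * N)),
        lists:foreach(fun(_) -> get_one(Channel) end,
                      lists:seq(1, N)).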

Note that for the middle segment, the time is for publishing and getting a message. Now, as soon as N is 256, we start running out of memory. This happens only in the middle segment and, again, does make sense — although we can fit 310 messages in memory, as we are publishing and getting, the memory is (presumably) much more fragmented, and as such we can fit in fewer messages. We’re also at the mercy of the garbage collector to reclaim messages that we no longer need to hold.

In the cumulative graph here, we can see that it starts off pretty much the same as when N is 64 — the gradient gets a bit steeper when we start publishing and getting, but when we get to about 330 messages, suddenly we hit the first step, when we run out of memory and start making use of swap. Now let’s look at N as 512. Again, this one took so long that I only ran it once:

Again, the step where we start swapping is clearly visible at 310, though of course in this test we’re still ramping up and just publishing messages at that point. Interestingly, in the one-in-one-out phase of the test, performance seems to repeat its pattern (in the differential graph). Whilst we’ve had some guesses, we’re really not too sure what’s going on here, though it’s likely very specific to the swap algorithm, the kernel, and their interaction with the garbage collector. Fun.

So it’s good to see that nothing really goes wrong: it does keep working, and if you don’t need Rabbit to be amazingly fast but want lots and lots of big messages in your bunny, then this is perhaps a good enough solution. Certainly pairing Rabbit with a good SSD swap disk may work well enough for you. For others though, we now have a repeatable set of metrics that allow us to test different designs for a disk-backed queue.

Update
------
Some of you may have noticed that when I first published this, all the graphs had y-axes that said milliseconds, not microseconds. Publishing a message does not take over 100 seconds, fear not. I had just managed not to read the documentation on what Erlang’s now() function returns, and had failed to question whether the values really were milliseconds. Fortunately, I’d saved all the graphs in PostScript, so a quick find and replace in emacs and everything’s better!

by matthew on 02/04/09
  1. I think the other benefit of disk paging will be in the case of 32-bit systems, where Erlang only gets at most 3GB of RAM. On those systems, swapping isn’t even an option; you exhaust your address space and then erl dies. Hopefully disk-based paging will allow for queue sizes that exceed the system’s actual address space.

    I’ve also noticed that Rabbit’s memory usage is about quadruple its disk usage: if the Rabbit is using 1GB of RAM to hold all of its persistent messages, /var/lib/rabbitmq/mnesia/rabbit will only have about 256MB of data in it. I have no idea why that is, but it seems like if disk paging means storing more messages in the disk format and not in the RAM format, you will actually have less data on disk, which is generally a good thing.

  2. @jay, good observations. The RAM overhead you’re seeing could be the cached pre-parsed message data that’s carried around with the in-core message representations. Pretty much only the raw message data is stored in the journal. One thing we could look at in future is reducing the amount of pre-parsed data we carry around with each message.

  3. martin sustrik
    on 03/04/09 at 6:44 am

    What a nice article! Would you mind if I link to it from our website?

  4. @martin, thank you very much, and yes, of course you can link to it externally.

  5. [...] behaves when it runs out of physical memory and starts swapping. Have a look at their results here. Although the measurement was done for RabbitMQ, you should expect similar results for any [...]

  6. B. Factor
    on 09/04/09 at 5:55 pm

    @jay 32-bit systems are history. I personally wouldn’t waste one second of development effort trying to work around a 32-bit address space limitation, especially in an enterprise application like RabbitMQ.

  7. There’s a good argument made to rely upon the OS for such things over on the Varnish project’s wiki:

    http://varnish.projects.linpro.no/wiki/ArchitectNotes

    Sadly it seems (in Linux anyway) that non-LRU style caching algorithms aren’t readily available outside of some experimental patches. See: http://linux-mm.org/AdvancedPageReplacement

    How expensive would it be to touch some of the oldest messages in the queue to ‘hint’ the kernel that that particular bit of memory is dirty and is a bad candidate for swap? It could improve read performance in such a scenario somewhat.

  8. @pauln, I do agree, and it’s certainly a bad idea to reinvent the wheel unless the new one is going to be super round. In this case, it’s a little more complex.

    Firstly, we don’t know how Erlang lays out binary data in memory. We would like to think that if it makes entire pages of binary data (which can contain no pointers) then it shouldn’t have to scan those at all during GC. Further, we would hope that it would make entire pages of binary, and not interleave it with other data. However, we don’t know any of these assumptions to be true. The spikes that you see in the above graphs, at seemingly regular intervals, really do smell of GC runs to me, and unless we’re pretty lucky, it’s likely the GC run will cause swapped-out pages to be paged back in. This could well explain the massive divergence we see in the graphs where we really hit swap.

    The other, and even more important, issue is that we don’t believe LRU is the right thing here. If you think about the way in which a queue of messages grows, it will grow in a fully compacted way at one end, and the reader will punch holes in it, mainly at the other end. So really, if anything, you want freshly written pages to be pushed straight out to disk, and then other pages to be slowly read back in as space becomes available and as they get closer to the read end of the queue. This is absolutely not what any OS swap algorithm can do, and these sorts of observations are what is driving the design of our own disk-based queue.

 
 

