Memory matters – even in Erlang
Some time ago we got an interesting bug report for RabbitMQ. Surprisingly, unlike other complex bugs, this one is easy to describe:
At some point basic.get suddenly starts being very slow – about 9 times slower!
Basic.get doesn’t do anything complex – it just pops a message from a queue. This behaviour was quite unexpected. Our initial tests confirmed that we have a problem when a queue contains thousands of elements:
queuelength: 90001 basicget 3333 times took: 1421.250ms queuelength: 83335 basicget 3333 times took: 1576.664ms queuelength: 60004 basicget 3333 times took: 1403.086ms queuelength: 53338 basicget 3333 times took: 9659.434ms [ look at that! ] queuelength: 50005 basicget 3333 times took: 9885.598ms queuelength: 46672 basicget 3333 times took: 8562.136ms
Let me repeat that. Usually popping a message from a queue takes Xms. At some point, it slows down to 9*Xms.
It turned out that the problem is with the queue:len() function, which is executed during the basic.get. Actually, queue:len() calls only erlang:length() builtin. At some point it switches to the “slow” mode.
Erlang:length() is a builtin that iterates through a linked list and counts it’s length. It’s complexity is O(N), where N is the length of the list. This function is implemented in the VM so it’s expected to be very, very fast.
The problem is not with erlang:length() being slow. It’s about being unpredictably slow. Let’s take a look at Erlang interpreter source code (erlbifguard.c:ertsgclength1). Here’s the main loop for erlang:length():
i=0
while (islist(list)) {
i++;
list = CDR(list_val(list));
}
It does nothing unusual – it just iterates through list elements. However, recompiling Erlang with some debugging information confirms that the problem is indeed here:
clockgettime(CLOCKREALTIME, &t0);
while (islist(list)) {
i++;
list = CDR(listval(list));
}
clockgettime(CLOCKREALTIME, &t1);
tdms = TIMESPECNSECSUBTRACT(t1, t0) / 1000000.0;
if (i > 200000 || tdms > 2.0) {
fprintf(stderr, "gclength1(%p)=%i %.3fms\n\r", reg[live], i, td_ms);
}
gclength1(0x7f4dbfa7fc19)=499999 2.221ms gclength1(0x7f4dbfa7fc19)=499999 2.197ms gclength1(0x7f4dbfa7fc19)=499999 2.208ms (hibernation) gclength1(0x7f4db0572049)=499999 13.793ms gclength1(0x7f4db0572049)=499999 12.806ms gclength1(0x7f4db0572049)=499999 12.531ms
This confirms Matthias’ initial guess – the slowdown starts after Erlang process hibernation.
For those who aren’t Erlang experts: Hibernation is an operation that compacts an Erlang process. It does aggressive garbage collection and reduces the memory footprint of a process to absolute minimum.
The intended result of hibernation is recovering free memory from the process. However its side effect is a new memory layout of objects allocated on the heap.
Ah, how could I have forgotten! The memory is nowadays slow! What happens, is that before hibernation list elements are aligned differently, more dense. Whereas after hibernation they are sparse. It’s easy to test it – let’s count the average distance between pointers to list elements:
gclength1(0x7f5c626fbc19)=499999 2.229ms avg=16.000 dev=0.023 gclength1(0x7f5c626fbc19)=499999 3.349ms avg=16.000 dev=0.023 gclength1(0x7f5c626fbc19)=499999 3.345ms avg=16.000 dev=0.023 (hibernation) gclength1(0x7f5c61f7d049)=499999 13.800ms avg=136.000 dev=0.266 gclength1(0x7f5c61f7d049)=499999 12.726ms avg=136.000 dev=0.266 gclength1(0x7f5c61f7d049)=499999 12.367ms avg=136.000 dev=0.266
- Before hibernation list elements are aligned exactly one after another, values are somewhere else.
- After hibernation list elements are interleaved with values.
This behavior does make sense. In most cases when you traverse the list, you actually do something with the values. After hibernation, when you access list item, the value will be already loaded to the CPU cache.
Knowing the mechanism, it’s easy to write a test case that reproduces the problem.
The average distance between pointers in my case is constant – the standard deviation is negligible. This information has a practical implication – we can “predict” where the next pointer will be. Let’s use that information to “fix” the Erlang VM by prefetching memory!
while (is_list(list)) {
i++;
list2 = CDR(list_val(list));
__builtin_prefetch((char*)list2 + 128*((long)list2-(long)list));
list = list2;
}
length: 300001 avg:0.888792ms dev:0.061587ms length: 300001 avg:0.881030ms dev:0.040961ms length: 300001 avg:0.875158ms dev:0.019436ms hibernate length: 300001 avg:14.861762ms dev:0.150635ms length: 300001 avg:14.833733ms dev:0.017405ms length: 300001 avg:14.884861ms dev:0.220119ms
length: 300001 avg:0.742822ms dev:0.029322ms length: 300001 avg:0.739149ms dev:0.012897ms length: 300001 avg:0.739465ms dev:0.014417ms hibernate length: 300001 avg:7.543693ms dev:0.284355ms length: 300001 avg:7.342802ms dev:0.330158ms length: 300001 avg:7.265960ms dev:0.053176ms
The test runs only a tiny bit faster for the “fast” case (dense conses) and twice as fast for the “slow” case (sparse conses).
Should this patch be merged into mainline Erlang? Not really. I have set the prefetch multiplier value to 128 and I don’t even know if it’s optimal. This was only an experiment. But it was fun to see how low-level system architecture can affect high-level applications.

Will Farr
on 28/02/10 at 9:54 pm
Apparently, the garbage collector of the T Scheme implementation used a depth-first heap traversal scheme by Clark that put the cons cells close together in memory after a GC. You can read about it here: http://www.paulgraham.com/thist.html . (The paragraph describing it begins “T also used a pretty cool GC.”
Thijs (Shenzhen)
on 01/03/10 at 2:07 am
my god, why does the Erlang VM use O(N) counting here? I actually never looked at the code, but I was expecting the VM to keep a “length” field…
Masklinn
on 01/03/10 at 9:54 am
Because it’s a cons cell?
Why would it?
Paul Crowley
on 01/03/10 at 10:39 am
Thanks for the link to the article on T Scheme! At one point it asserts “Norman Ramsey at harvard has cleverly shown that you can implement mark&sweep with exactly the same asymptotic costs as stop©.” Anyone know where I can read more about this?
ryeguy
on 02/03/10 at 7:19 pm
I’m not understanding something here. Hibernate is a BIF to erlang. Does it automatically call it once in awhile or something (maybe during a GC sweep)? How else would it be getting called?
Sylvain Hellegouarch
on 03/03/10 at 8:02 am
I’m wondering, why do you actually need the queue length anyway?
Amadiro
on 03/03/10 at 8:02 am
ryeguy: It doesn’t get called automatically, nope. Either he calls it himself or he specified to his behaviour to hibernate.
Thijs: That’s not really possible, it would break many other more interesting optimisations, like sublist-reusing etc.
marek
on 03/03/10 at 9:01 am
We use gen_server clone which does hibernation automatically after some timeout. See: http://hg.rabbitmq.com/rabbitmq-server/file/default/src/gen_server2.erl
It’s just simpler to count length, rather store it somewhere. But apparently counting length isn’t very good.
matthew
on 03/03/10 at 1:51 pm
@Sylvain. We don’t need it internally at all, it’s simply because various AMQP methods like queue.declare and basic.get return the queue length in the result. Also people tend to like being able to inspect how long the queue has become. But no, operationally, we don’t need it.
That said, we do need it in the new persister branch because controlling sending messages out to disk uses various statistics about the queue, in which the queue length is involved. That’s why in the new persister branch, we carry the queue length explicitly.
@Amadiro. Why couldn’t you hold the count with the reference to the queue? Thus even if you’re doing sublist sharing, you can still have different counters for each pointer into the queue.
Sylvain Hellegouarch
on 03/03/10 at 2:51 pm
Fair enough. I suppose it’s one aspect of functional programing for you :)
Nick
on 22/05/10 at 11:12 am
A sidecount making it O(1): Why not?
alex
on 09/12/10 at 3:53 am
Why do you need length to pop the next message at most you all you need is whether queue is empty or not?
And some people advertise suitability of Erlang for high performance communication systems.