Asynchronous libraries performance

December 15, 2008

Recently I found some interesting libevent benchmarks. To me they show terrifying results. The blood-freezing fact is that the more connections you have, the higher the cost of adding a new connection to the asynchronous loop.

It means that if you have one connection registered with the asynchronous loop, the cost of registering a callback is small; when you have a million idle connections, the cost of registering is big.

It would be horrible if a similar situation appeared at the kernel level. It won’t be as bad as in this benchmark, but still, as a programmer I’d expect creating and destroying connections to always take constant time.

How do asynchronous libraries work?

The basic chunk of data passed to an async lib is a tuple of three items:

   event = (socket_descriptor, callback_function, timeout)

This structure is often referred to as an event. When a programmer registers such a structure with the asynchronous loop, he expects that his callback will be executed when something happens on the descriptor, or when the timeout occurs. That’s what asynchronous programming is all about.

An asynchronous library offers a way of adding such events to, and removing them from, its main data structure – a priority queue. This queue is sorted by timeouts, so the library always knows which event will time out first.

The asynchronous loop waits for something to happen on the descriptors or for a timeout.
When something happens, it executes the necessary callbacks. See this pseudocode:

the_main_loop:
   nearest_event = find_event_with_smallest_timeout()
   events_triggered = select(<fds...>, nearest_event.timeout)
   if events_triggered: # has anything happened on the wire?
       for event in events_triggered:
           remove_from_datastructure(event)
           execute_callback(event, timed_out=False)
   else: # nothing happened, the event timed out
       remove_from_datastructure(nearest_event)
       execute_callback(nearest_event, timed_out=True)

In 95% of cases the body of the programmer’s callback function looks like this:

   def my_callback(fd):
       <read response from descriptor>
       <write new request to descriptor>
       add_to_event_loop(fd, my_callback)
       return

When a timeout occurs, in most cases the callback just cleans up memory. But timeouts usually don’t happen often.

This pseudocode means that:

  • when anything happens on a socket, the event structure is always removed from the priority queue
  • if a socket becomes readable, there’s a high probability that a new event will be registered for that socket

Putting these conclusions back into the pseudocode:

the_main_loop:
   # less often
   nearest_event = find_event_with_smallest_timeout()
   events_triggered = select(<fds...>, nearest_event.timeout)
   if events_triggered:
       for event in events_triggered:
           # very often, we have data on the socket!
           remove_from_datastructure(event)
           execute_callback(event, timed_out=False)
   else:
       # a timeout happens rather rarely
       remove_from_datastructure(nearest_event)
       execute_callback(nearest_event, timed_out=True)
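For the curious, here is a minimal runnable version of that loop in Python. It’s only an illustration (the helper names are mine): it keeps the events in a binary-heap priority queue keyed by the absolute deadline and waits with select(); a real library would use epoll or kqueue and wouldn’t scan the heap linearly.

    import heapq
    import select
    import time

    events = []   # heap of (deadline, seq, fd, callback); seq only breaks ties
    seq = 0

    def add_to_event_loop(fd, callback, timeout):
        global seq
        seq += 1
        heapq.heappush(events, (time.time() + timeout, seq, fd, callback))

    def the_main_loop():
        # callbacks here take (fd, timed_out), mirroring execute_callback above
        while events:
            deadline, _, fd, callback = events[0]            # nearest timeout
            wait = max(0.0, deadline - time.time())
            readable, _, _ = select.select([e[2] for e in events], [], [], wait)
            if readable:                                     # data on the wire
                for ready_fd in readable:
                    # find and remove this fd's event, then run its callback;
                    # a real library keeps an index instead of scanning
                    i = next(j for j, e in enumerate(events) if e[2] == ready_fd)
                    _, _, _, cb = events.pop(i)
                    heapq.heapify(events)
                    cb(ready_fd, timed_out=False)
            else:                                            # nothing happened: timeout
                heapq.heappop(events)
                callback(fd, timed_out=True)

With a binary heap underneath, the push in add_to_event_loop and the removal on socket activity both cost O(log N) – which is exactly the cost discussed below.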

The problem

The problem is that asynchronous libraries currently tend to use a binary heap as the representation of the internal priority queue. This means that the cost of registering and removing an item (event) in this structure is logarithmic in the number of already registered events.

   binary heap             cost         frequency in my use case
   add item                O(log(N))    very often
   remove item             O(log(N))    very often
   event loop iteration    O(1)         less often

We can clearly see that the data structure was optimized for the smallest possible time per event loop iteration, while I think it should be optimized to reduce the cost of adding and removing events.

Let me put the same thing in different words. In the real world, in most cases something happens on a socket before its timeout occurs. Connections very rarely expire because of a timeout.

Considering this, it’s reasonable to use a data structure with O(1) time for adding and removing items, and it’s not so bad to pay a somewhat higher cost inside the loop.

As you can imagine, others have faced this problem before me. The data structure that can manage timers with a better average cost is called a timer wheel. It is used inside the Linux kernel to manage operating system timers, and it’s described in a paper from 1996.

The timer wheel algorithm assumes that events are processed at a fixed interval. For the kernel this was natural: timeouts were checked on every kernel tick, usually every 10ms. On the other hand, waking an asynchronous library up every 10ms would be bad.

I believe it’s possible to avoid waking up on every tick. To achieve this, when we look for the smallest timeout we should scan the buckets and find the first one that’s not empty. The cost of that is constant: it depends only on the number of buckets.

I suggest using buckets of 1ms. If you need better granularity than that, no asynchronous library can help you anyway.

Another issue is how to handle events very far in the future that don’t fit into the timer wheel buckets. A reasonable compromise is to keep them in a binary tree (or heap). For those timers the cost of manipulation would be logarithmic.
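To make this concrete, here is a rough sketch of the structure I have in mind (the names and constants are made up, and it uses a single flat wheel of 32768 one-millisecond buckets instead of the hierarchy shown in the diagram below): O(1) add and remove for timers that fit the wheel, a constant-cost scan for the next timeout, and an overflow heap for everything else.

    import heapq
    import itertools
    import time

    TICK = 0.001                       # 1ms buckets
    SLOTS = 32768                      # wheel horizon: SLOTS * TICK, about 32.8 s

    class Event:
        def __init__(self, fd, callback):
            self.fd, self.callback = fd, callback
            self.slot = None           # bucket index, if the event is on the wheel

    wheel = [set() for _ in range(SLOTS)]   # timers that fit into the horizon
    far_heap = []                           # (deadline, seq, event) for the rest
    far_seq = itertools.count()             # tie-breaker so events never compare

    def add_timer(event, timeout):
        """O(1) if the deadline fits the wheel, O(log N) otherwise."""
        deadline = time.time() + timeout
        if timeout < SLOTS * TICK:
            event.slot = int(deadline / TICK) % SLOTS
            wheel[event.slot].add(event)
        else:
            heapq.heappush(far_heap, (deadline, next(far_seq), event))

    def remove_timer(event):
        """O(1): the event remembers its bucket (far-future removal omitted)."""
        if event.slot is not None:
            wheel[event.slot].discard(event)
            event.slot = None

    def next_timeout(now):
        """Scan for the first non-empty bucket: O(number of buckets), a constant."""
        base = int(now / TICK)
        for i in range(SLOTS):
            if wheel[(base + i) % SLOTS]:
                return i * TICK
        return SLOTS * TICK            # nothing on the wheel: sleep one full horizon

    def pull_due_from_far_heap(now):
        """On every wake-up, move far-future timers that now fit into the wheel."""
        horizon = now + SLOTS * TICK
        while far_heap and far_heap[0][0] < horizon:
            deadline, _, event = heapq.heappop(far_heap)
            event.slot = int(deadline / TICK) % SLOTS
            wheel[event.slot].add(event)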

   timer wheels            limited time scale    unlimited time scale
                                                 timeouts fit in       timeouts far in
                                                 timer wheels          the future
                                                 (optimistic)          (pessimistic)
   add item                O(1)                  O(1)                  O(log(N))
   remove item             O(1)                  O(1)                  O(log(N))
   event loop iteration    O(K)                  O(K)                  O(K)

(K is the number of buckets in all timer wheels. This is a constant factor.)

This diagram shows the structure I’m suggesting.

Data structure diagram

There are a few disadvantages to this approach:

  • The library must be woken up regularly to check whether there are any new events on the long-term heap. In the case of my diagram that would be every 33.8 sec, but of course you can add another level of timer wheels to extend this range.
  • If there’s a single event in the third timer wheel, the library would need to be woken up three times to dispatch it.
  • The granularity is fixed to 1ms.

The method I’m suggesting is quite complicated, but in my opinion having a constant O(1) cost of managing events is worth it. Unfortunately, for events far in the future the cost will always be logarithmic; in the end we must sort the timers somewhere.

You may ask who needs such improvements; not everybody has thousands of connections. I agree that most users don’t care. But there are some companies that could save money on this improvement, like Facebook, which apparently uses libevent a lot.


14 Comments

  1. Bogdan says:

    “It would be horrible if a similar situation would appear on the kernel level.”

    Well …

    http://patraulea.com/binds.png

  2. marek says:

    Bogdan: http://patraulea.com/binds.png

    That looks scary. Are you sure the units are correct? 300 seconds at 800k sockets?

  3. Yusuf says:

    Have you talked to Niels Provos about this?

    Also, have you looked at libev?

    http://libev.schmorp.de/

  4. marek says:

    Yusuf: Have you talked to Niels Provos about this?

    I’d be happy to hear his opinion.

    Yusuf: Also, have you looked at libev?

    Yes, I even refer to the libev benchmark in the first sentence. Both libraries use a binary heap, according to this benchmark.

  5. Paul Crowley says:

    There’s clearly something I’m missing here. From what you describe, socket listeners are added and removed very frequently, but timeout events are much rarer. Since only timeout events go into a heap, why is it worth optimising the heap? Surely it’s the data structure that manages socket callbacks that needs to be made faster?

  6. marek says:

    Paul Crowley: Since only timeout events go into a heap…

    Yes, I made a few assumptions that may be unclear.
    * I personally prefer a style of programming where I always register a socket together with a timeout.
    * Sure, in some cases you need to register only a callback, without any timeout.
    * On the other hand, you can optimize timeouts: you can have a single registered timeout that fires events on a number of sockets.

    From an asynchronous library user’s point of view, I’d like to move all the complexity into the async lib, so that a programmer can forget about the low-level problems.

    For example, the third point (one timeout manages many sockets) is a clear case where the complexity is moved into the user’s code.

    What I want to say is that it should be possible to create a scalable async library that deals with this complexity and still works at a reasonable cost.

    In the end we could say that if you need a timeout event, you should just use alarm(). Isn’t that a working solution? Then the async lib wouldn’t have to deal with the complicated timeouts. But what’s the point of using an async lib in that case?

    IMO the point of using an async lib is to move the complexity there, so you can forget about epoll/kqueue/select and timeout quirks.

  7. Bogdan says:

    @marek: The units seem right. But the first 50K sockets get opened almost instantly. I don’t think opening 1M sockets all at once is a realistic scenario.

    The source for the test is at http://patraulea.com/temp/manybinds/ .

  8. Beef says:

    The algorithm as given isn’t O(1) everywhere if you look closely. Finding the next timeout is around O(log n), and inserting is only O(1) if you disregard the management overhead for the buckets, after which you end up at something around O(log n) again.

    There is a much simpler solution: change your timeouts less often, or use a simple queue if all your timeouts have the same timeout.

    The libev documentation explains various strategies and their advantages/disadvantages.

    (Besides, there are other heaps besides binary heaps that achieve true amortised O(1) on those operations; the given algorithm is just a crude approximation to that.)

    (see http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod)

  9. beef says:

    Note also that the benchmark you refer to compares libev and libevent, using the same benchmark.

    libev offers more efficient means of updating timers that do not depend on the absolute number of watchers, but libevent doesn’t. That skews the benchmark, which was designed to compare libevent and libev through the libevent API.

    (but still, the solution to this timeout problem is using the correct algorithm, as explained in the libev docs)

  10. beef says:

    For some reason, my first comment didn’t show up, but my second one did. The summary is:

    the given algorithm is NOT O(1), it has hidden costs. In general, the problem simply isn’t solvable in O(1), and in fact, the amortised costs are around O(log n) for all operations.

    Libev also documents various strategies that depend on the problem, which allow you to achieve true O(1) with libev when your problem class allows it, and to do much better than the naive algorithm of this article in the case of varying inactivity timeouts. For the general case, there are also better (but more complicated) data structures than binary heaps available, which are much more efficient than the given algorithm (research the various forms of heaps, for example).

    In summary, the problem of this article can be solved very simply, in true O(1), without throwing in a very complicated hybrid algorithm.

    See http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod#Besmartabout_timeouts

  11. marek says:

    the given algorithm is NOT O(1), it has hidden costs.

    I can’t agree with this. I believe that you can manage all the timers that fit into the timer wheels in constant time.

    From the paper about timer wheels: “This paper shows that by using a circular buffer or timing wheel, it takes O(1) time to start, stop, and maintain timers within the range of the wheel.”

    In general, the problem simply isn’t solvable in O(1)

    That’s correct in general. But I made a few assumptions:
    – the majority of timers fit into a certain time range (like 33.8 seconds in my example)
    – the clock resolution is no better than 1ms

  12. 0xABADC0DA says:

    My problem with libevent is this… for instance, you want to send at 128k/s. To avoid doing writes of a couple bytes you have a minimum of 4k per write (except the last write). Initially, you wait for 31ms for 4k to become ‘available’. The callback happens at T+35ms so you try to write 4588 bytes, but say the socket can only send 1500 bytes immediately. So now you are 3,088 bytes ‘behind’ and only have to wait for 11ms.

    IIRC the only way to wait for a condition NOT happening is by constantly adding/removing the callback for the fd and adding/removing a timer for the send rate, which effectively means that the library assumes you will send at the maximum rate on every descriptor.

  13. beef says:

    @0x…

    yes that is true. you could try libev, which lazily updates the kernel state, so starting or stopping an fd watcher might not actually cause kernel calls or kernel overhead.

    in general, some overhead cannot be avoided, but libev was written with rate-limiting applications in mind and can work around this pattern of starting/stopping fd watchers regularly.

    however, you can, as usual, be more efficient by writing at a fixed rate and querying the kernel about the actual speed of the connection (or actually, the congestion window), and that way, keep latency down and the kernel buffer empty, to get very smooth rates.
