technology from back to front

The fine art of holding a file descriptor

People tend to like certain software packages to be scalable. This can have a number of different meanings, but mostly it means that as you throw more work at the program, it may require some more resources, in terms of memory or CPU, but it nevertheless just keeps on working. Strangely enough, it’s fairly difficult to achieve this with finite resources. With things like memory, the classical hierarchy applies: as you use up more and more of the faster memory, you start to spill to slower memory, i.e. to disk. The assumption tends to be that one always has enough disk space.

Other resources are even more limited, and are harder to manage. One of these is file descriptors. They are especially difficult to manage in a VM such as Erlang’s, where lots of other systems are potentially using file descriptors and you have no control over them. In the released versions of Rabbit, Rabbit’s persister used only one or two file descriptors, and the queues themselves used none. This would be an obvious scalability issue, in that it would restrict access to messages stored on disk to a single process, were it not for the fact that released versions of Rabbit hold all messages in memory all the time, thus negating that problem (at the expense of a larger scalability issue). However, the upside is that as more queues appear, you don’t need more file descriptors, so almost all file descriptors can be set aside for network sockets. If you need to allow more network connections to Rabbit than your OS permits by default, then just raise the ulimit (and often the default Erlang process limit too) and away you go.

With the new persister, each queue requires at least two file descriptors, and can use a number of file descriptors bounded only by the number of messages stored on disk. Whilst this has removed all sorts of bottlenecks, it has also made running out of file descriptors rather more likely. Introducing a central file descriptor allocator would reintroduce the kind of bottleneck we have sought to avoid, so we have developed an alternative scheme for managing file descriptors. This scheme can go wrong: it is only probabilistically correct. But, given the way in which Rabbit works, it seems to be working very successfully.

Firstly, there is a central process. However, it is never asked for a file descriptor; instead, every process that opens a file asynchronously tells the central process that it has opened one, and likewise tells it when it closes a file descriptor. With both of these messages, the process includes one further piece of information: the timestamp at which its least recently used open file descriptor was last used. Thus each process maintains a mapping from file descriptor to the timestamp at which that descriptor was last used, and the smallest value in this mapping is the value included with these messages to the central process. Whenever a process uses a file descriptor, it takes a new timestamp and updates this mapping; no communication with the central process happens on use of a file descriptor.
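A minimal sketch of the bookkeeping this implies on the file-opening side might look as follows (the module, function and message names here are illustrative, not the actual API): a map of file descriptor to last-used timestamp, updated locally on every use, with the oldest timestamp reported only when a file is opened or closed.

```erlang
%% Illustrative sketch only: per-process bookkeeping of open file
%% descriptors and their last-used timestamps. The real module differs.
-module(fd_client_sketch).
-export([opened/3, closed/3, used/2]).

%% Ages is a map of Fd => last-used timestamp.

opened(Server, Fd, Ages0) ->
    Ages = maps:put(Fd, erlang:monotonic_time(), Ages0),
    %% Asynchronously tell the central process; include our oldest timestamp.
    gen_server:cast(Server, {opened, self(), oldest(Ages)}),
    Ages.

closed(Server, Fd, Ages0) ->
    Ages = maps:remove(Fd, Ages0),
    gen_server:cast(Server, {closed, self(), oldest(Ages)}),
    Ages.

%% Using a file descriptor only updates local state; no message is sent.
used(Fd, Ages) ->
    maps:put(Fd, erlang:monotonic_time(), Ages).

oldest(Ages) when map_size(Ages) =:= 0 -> undefined;
oldest(Ages) -> lists:min(maps:values(Ages)).
```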

The central process detects the limit imposed by the OS (the ulimit) on the number of file descriptors that can be opened, and imposes an artificial limit 100 below the real one. This gives some buffer space, and leaves the rest of the Erlang VM some file descriptors that are beyond our control. When we reach this artificial lower limit, the central process does the following calculation: for every process that has some open files, it finds the difference between the current timestamp and the most recently reported least-recently-used file descriptor timestamp, and averages these ages to give the average time since the least recently used file descriptors were used. It then asynchronously sends messages to all the processes with open files, telling them to close any file descriptor that has not been used for more than this average time.
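A sketch of that calculation, under the assumption that the central process tracks, for each process with open files, the most recently reported least-recently-used timestamp (again, names and message shapes are illustrative):

```erlang
%% Illustrative sketch only. Oldest maps each process with open files to
%% the least-recently-used timestamp it last reported; OpenCount is the
%% total number of open descriptors; Limit is the artificial limit.
-module(fd_server_sketch).
-export([maybe_reduce/3]).

maybe_reduce(Oldest, OpenCount, Limit)
  when OpenCount >= Limit, map_size(Oldest) > 0 ->
    Now = erlang:monotonic_time(),
    Ages = [Now - Ts || Ts <- maps:values(Oldest)],
    AverageAge = lists:sum(Ages) / length(Ages),
    %% Asynchronously ask every process with open files to close any
    %% descriptor that has not been used within the average age.
    [Pid ! {close_older_than, AverageAge} || Pid <- maps:keys(Oldest)],
    ok;
maybe_reduce(_Oldest, _OpenCount, _Limit) ->
    ok.
```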

The first time the processes receive this message, they may very well find that they don’t have any file descriptors that have been idle for that long. This is because the central process is only informed of processes’ least recently used file descriptor timestamps when a process opens or closes a file descriptor; if the process then uses that file descriptor, the central process immediately has out-of-date information. Thus, whenever a process receives a request to close file descriptors older than the calculated average, it always informs the central process of the timestamp of its current least recently used file descriptor. At this point the central process is brought up to date, and if it finds that it’s still at or over the limit of open file descriptors, it recalculates the average age (which will now be smaller than before) and asks all the processes again to close file descriptors older than the new, smaller average age.
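The process-side half of that exchange might look roughly like this (illustrative names again): close anything older than the requested age, then report the now-current oldest timestamp back, whether or not anything was closed.

```erlang
%% Illustrative sketch only: how a process might react to a request to
%% close descriptors older than Age, always reporting its (possibly
%% newer) oldest timestamp back so the central process can re-average.
-module(fd_close_sketch).
-export([handle_close_request/3]).

handle_close_request(Server, Age, Ages0) ->
    Now = erlang:monotonic_time(),
    %% Close (and forget) every descriptor not used within Age.
    Stale = [Fd || {Fd, Ts} <- maps:to_list(Ages0), Now - Ts > Age],
    [ok = file:close(Fd) || Fd <- Stale],
    Ages = maps:without(Stale, Ages0),
    %% Report our up-to-date least recently used timestamp, whether or
    %% not anything was closed; the central process may then lower the
    %% average and ask again.
    gen_server:cast(Server, {report_oldest, self(), oldest(Ages)}),
    Ages.

oldest(Ages) when map_size(Ages) =:= 0 -> undefined;
oldest(Ages) -> lists:min(maps:values(Ages)).
```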

What this means is that processes are never blocked from opening files, even when Rabbit is over the limit of file descriptors. However, immediately after opening a file, when it goes to receive its next message, a process may find a request from the central process asking it to close the file it’s just opened. Thus the limit is enforced softly, in a way which impacts performance as little as possible. This is the reason for the lower artificial limit: to guard against the possibility of lots of processes opening files at the same time and pushing us over the hard limit before they, or any other process, receive the close request from the central process. It can still go wrong: if a process is hell-bent on opening as many files as possible then it can do so, hit the hard OS limit, and crash the VM. Cooperation from the processes is therefore vital: for example, if a process never opens more than one file descriptor before checking its mailbox again, then you’re very likely to be safe in this scheme.

All of this is implemented in a module called file_handle_cache.erl which is available in the new persister branch of RabbitMQ. This module also wraps many of the functions of Erlang’s file module, providing many more optimisations (at the cost of, for example, only ever being able to append to files). These optimisations aim to reduce to an absolute minimum the number of OS calls. So, much better control of write buffers is provided, and seeks which would position the file handle at the same location as it currently is are optimised out. Further calls are provided, e.g. to throw away the write buffer contents without writing them out to disk.
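As a rough usage sketch, a queue process writing a message through the module might look something like the following; the exact function names, arities and option lists are assumptions about the branch’s API rather than a definitive reference.

```erlang
%% Rough usage sketch; the file_handle_cache function names and arities
%% here are assumptions, not a definitive reference to the branch.
-module(fhc_usage_sketch).
-export([write_message/2]).

write_message(Path, Bin) ->
    {ok, Ref} = file_handle_cache:open(Path, [raw, binary, write], []),
    %% Appends go into the module's write buffer rather than straight
    %% to the OS; redundant seeks are optimised away internally.
    ok = file_handle_cache:append(Ref, Bin),
    %% Flush the buffer and fsync before declaring the message safe.
    ok = file_handle_cache:sync(Ref),
    file_handle_cache:close(Ref).
```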

When the request to close a file comes in, the file_handle_cache module works out which file descriptors to close. If it finds a file handle to close, it flushes any outstanding writes, closes the file, and keeps track of the last state the handle was in. The next time the process decides to use that file descriptor, the module can silently reopen the file and seek to the last location. As a result, code using this module never needs to find out whether or not the central process has asked it to close files, nor what, if anything, the result of that request was. The result is a system which dynamically closes old and unused file descriptors but without imposing arduous constraints on the client: the module manages all the state of the file descriptors.
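One way to picture this “soft close” is as a handle record that remembers its path, open modes and current offset, so that a later use can transparently reopen and seek back; this is an illustrative sketch rather than the module’s actual representation.

```erlang
%% Illustrative sketch only: a "soft closed" handle remembers enough
%% state (path, modes, offset) to be reopened transparently on next use.
-module(soft_close_sketch).
-export([soft_close/1, ensure_open/1]).

%% Modes is assumed not to truncate on reopen (e.g. it includes read).
-record(handle, {path, modes, offset, io_device}). %% io_device() | closed

soft_close(H = #handle{io_device = closed}) ->
    H;
soft_close(H = #handle{io_device = Dev}) ->
    ok = file:sync(Dev),   %% flush outstanding writes before closing
    ok = file:close(Dev),
    H#handle{io_device = closed}.

%% Called before every read/write: reopen and seek back if the handle
%% was closed behind the caller's back, otherwise use the live descriptor.
ensure_open(H = #handle{io_device = closed,
                        path = Path, modes = Modes, offset = Offset}) ->
    {ok, Dev} = file:open(Path, Modes),
    {ok, Offset} = file:position(Dev, Offset),
    H#handle{io_device = Dev};
ensure_open(H) ->
    H.
```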

Finally, there are some file descriptors which we have decided, after careful consideration, not to arbitrarily close. These are network sockets to AMQP clients. For these, it is indeed right to have a central process controlling whether further sockets can be created. This is simply implemented as a pair of synchronous calls (acquire and release) to the central process, which lower and raise, respectively, the artificial limit on the number of allowed file descriptors.
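A minimal sketch of that acquire/release pair, assuming the central process is a gen_server whose state is the artificial limit (names are illustrative):

```erlang
%% Illustrative sketch only: network sockets take a slot synchronously,
%% shrinking the artificial file descriptor limit while they hold it.
-module(socket_slots_sketch).
-behaviour(gen_server).
-export([start_link/1, acquire/0, release/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link(Limit) ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, Limit, []).

acquire() -> gen_server:call(?MODULE, acquire, infinity).
release() -> gen_server:call(?MODULE, release, infinity).

init(Limit) -> {ok, Limit}.

%% Each acquired socket lowers the limit available for ordinary file
%% descriptors; releasing it raises the limit again.
handle_call(acquire, _From, Limit) -> {reply, ok, Limit - 1};
handle_call(release, _From, Limit) -> {reply, ok, Limit + 1}.

handle_cast(_Msg, Limit) -> {noreply, Limit}.
```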

by matthew on 23/03/10
  1. Hi Matthew, nice post. What’s the rationale behind the appendtowrite() function? You’re already enforcing append-only writes in the Erlang code, so why not use O_APPEND and skip all lseek calls when you’re writing?

  2. Ah, so I see that this is "bug21763"

    http://hg.rabbitmq.com/rabbitmq-server/rev/5c473bfae335

    but the tracker is not public, is it?

  3. Oh weirdness – I did exactly the same thing to deal with a slightly different situation (managing access to a pool of queues from a CGI — making sure that two CGI requests never hit the same queue). This was ages ago, and involved some quickly thrown together perl, but it has been working like a charm for us for years now…

 
 

