Yet another Key-Value database

By: on August 21, 2009
Following the NoSQL movement, I became a fan of key-value databases. Usually there’s nothing interesting to say as they work fine out-of-the-box. But in a project I was recently working on K-V store started to be a major bottleneck.

I must note that my use case is pretty specific. In a K-V database I store about ten million records. The keys are pretty small, about 12 bytes on average. The probability of hitting every single key is more or less the same, which means that adding a cache layer will not make things go faster. The values are quite small, up to few kilobytes. The access pattern is also not common: I only do record updates. This requires reading a value and writing a new one few bytes larger. So, unlike normal use cases, the number of writes is substantial.
I experimented with Berkeley DB and Tokyo Cabinet, but I noticed that both engines do similar things internally. Let’s take a look at what happens there.
Berkeley DB
In my experiments K-V databases were I/O bound. To see what’s happening inside I used strace. In this strace log it’s clearly visible that BDB is doing random reads and writes all over the file (the last parameter to pread/pwrite syscalls is the file offset):
pread64 (3, ..., 4096, 32632832)
pwrite64(3, ..., 4096, 35389440)
pread64 (3, ..., 4096, 25407488)
pwrite64(3, ..., 4096, 43790336)
pread64 (3, ..., 4096, 23085056)
pwrite64(3, ..., 4096, 43831296)
pread64 (3, ..., 4096, 28508160)
pwrite64(3, ..., 4096, 10280960)
Doing such a random writes is not a good idea; the system quickly gets stuck on synchronizing dirty pages to disk (this is the output of strace -c):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 87.58    0.013548           1      9228           pwrite64
  0.98    0.000152           0      8475           pread64
Strace reveals similar pattern in TokyoCabinet:
pread64 (3, ..., 48,   108776352)
pwrite64(3, ..., 5,    108772688)
pwrite64(3, ..., 4224, 131860800)
pread64 (3, ..., 48,   81600048)
pwrite64(3, ..., 5,    81600048)
pwrite64(3, ..., 432,  126273472)
pread64 (3, ..., 48,   96069904)
pwrite64(3, ..., 5,    96066128)
pwrite64(3, ..., 4256, 133663632)
 % time     seconds  usecs/call     calls    errors syscall
 ------ ----------- ----------- --------- --------- ----------------
  93.76    0.010568           1      9296           pwrite64
   1.90    0.000214           0     22891           pread64
TokyoCabinet can mmap(2) the whole file and do no read/write syscalls at all. That gives a boost, but dirty pages will eventually need to be written to the disk. I suspect it is done in a fairly unpredictable order.
YDB, Why not?
Both databases try to reuse disk space from deleted records. They overwrite places that are no longer used and we see writes with some random offsets. In my opinion it’s strange: I have a few-hundred-gigs hard drive and I don’t really care how much disk space is wasted. But I really do care about every disk-seek. What if in a database we’d only appended data and never modified already written records? This way we could save half of disk seeks on random writes and do only a sequential append once in a while. Writes will on average cost nothing, but we’d pay the cost of disk space as it (almost) never is reclaimed. That’s the idea behind my experimental, yet another, K-V database: YDB.
Like previous databases, let’s show YDB in action:
pread64(4, ..., 68, 31661472)
write  (5, ..., 104)
write  (5, ..., 96)
pread64(4, ..., 60, 19781776)
write  (5, ..., 100)
pread64(4, ..., 60, 52593732)
write  (5, ..., 104)
As we’re doing writes on an append-only file descriptor, we don’t need to set the write offset and we can avoid using pwrite(2). On the other hand, YDB doesn’t do any caching, so it’s flooding kernel with a lot of tiny write syscalls, way more than it should. But with an append-only file descriptor the writes are really cheap. As opposed to TC/BDB they aren’t the most expensive syscall:
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
31.60    0.000963           0     24520           write
2.23    0.000068           0      4944           pread64
Features and anti-features
The design I chose has some advantages and disadvantages:
  • YDB is append-only key-value database (aka: log-like or journaled).
  • The index is in RAM, so retrieving any item costs at most 1 disk seek.
  • The index is in RAM, so all key values must fit into RAM. This means that on 32 bit systems the index can’t be larger than about 1.5 gigabytes.
  • YDB is not a general-purpose database, it will fall behind other K-V stores in most of the benchmarks. Especially in disk usage, starting time and memory footprint.
  • YDB is designed for spinning disks, but it still can be faster than competition on cheap SSD drives. This is due to the fact that random writes are very expensive on broken SSD’s, while sequential writes are extremely fast.
  • There is a simple Garbage Collector that periodically crawls the logs. It appends the data from old logs to recent one and removes old, no longer used logs.
  • When GC is running you should expect YDB to run about two times slower for a while.
  • Even though we have GC, YDB shouldn’t run in an environment with limited disk space. Assume at least 5-8 times more disk space than you need.
  • This means that you store only about 100GB of usable data on 800GB hard drive. This is how it works.
  • YDB will be considerably faster only if you have a lot of relatively small keys and do a lot of writes. Otherwise TC/BDB can be good enough.
Works for me
Let’s try to emulate my use case. Here are some technical details about a test:
  • I start with an empty database.
  • I do a lot of small updates and end up in having 2MLN of items.
  • Total size of keys and metadata is about 100MB.
  • Average value is 142 bytes long.
Output database sizes:
  • TC 260MiB
  • BDB 320MiB
  • YDB 492MiB
I did the tests on my quad core Linux machine with 4 gigs of RAM and some generic spinning disk inside.  BDB was configured to use 96MiB of cache, TC had some pretty high caching options (cache value of 12MLN, xmsiz value of 750MB). These settings were taken from these benchmarks.
Finally, my benchmark took (lower is better):
  • TC: 610 seconds
  • BDB: 1043 seconds
  • YDB:  271 seconds
Second benchmark
YDB is terrible in most of common benchmarks. It wastes a lot of disk space, it starts slowly. On the other hand, when you have a lot of random writes YDB behaves well. I included simple benchmark software in YDB sources. Feel free to check it yourself!
 $ cd ydb/benchmarks
 $ make
 Running benchmark: more_than_ram2
    tc: 569204ms
   bdb: 816312ms
   ydb:  99022ms
This benchmark, simply sets 16K keys to some random values which vary in size between 4KiB to 20KiB. This is repeated 16 times. In this particular benchmark, after flushing disk caches, YDB was about 5 times faster than TC and 8 times faster than BDB.
Some reasons you might want to try YDB:
  • You are doing a lot of writes to K-V store.
  • Your database size is substantial.
  • You’re running out of disk seeks.
So, if you have too much free disk space and your K-V database is I/O stuck, give YDB a try! Why not?