LShift’s standard version control platform these days is Mercurial, but just before we adopted it, I started a project using Trac and Subversion, mostly because that’s what Trac does out of the box.
Later, we branched the project to add a large new project, and during that branch we converted from using ant to Maven and modularised the project, resulting in a lot of moved files. This made what we were doing on the branch a lot easier, but left us with a merge that subversion wasn’t capable of, even though we had used svn mv to move all the files.
What was capable of the merge was Mercurial. I imported the whole subversion repository using hg convert. See the convert extension documentation. It works exactly as described, but make sure you have 1.0.1 or later - I had problems with earlier versions.
The merge went reasonably well, so I was left with a merged version in a Mercurial repository. I was going to switch to using Mercurial, and its Trac integration, when I discovered that couldn’t cope with multiple repositories. The Trac instance was managing several different source projects, which would have to go into several mercurial repositories, which I couldn’t merge together in any satisfactory way.
There are several projects around to address this (I’ll probably cover them in another post), none of which are ready for production yet. I decided the most expedient thing would be to try and generate a patch for my merge, and apply it to the subversion repository.
A conventional patch would lose the version history of all the moved files, so I decided a git diff would do the job. You can certainly, with some patience, get git-svn to do this, and understand what it was doing. Lacking that patience, I wrote a script to do the job. It parses the git diff and deals with any directory creation needed, calls to svn mv, svn add, and svn rm as required by the diff. It actually turns out to be a bit more work than I was expecting, so I’ve published it here.
Usually, search engines can look through data outdated by a few days. But Twitter search seems to be returning real time search results. That’s why it’s interesting how it works.
In this post I’ll present a short introduction to full-text search engines and my private thoughts about a possible implementation of a better one.
Let’s start from the beginning.
It’s really simple. When you search for a single keyword, like ‘britney’, the search engine finds a list of documents that contain this keyword. This list is called an inverted index. I call this list a ‘hitlist’. For example:
‘britney’ -> [document_1, document_3, document_44, document_555]
When you enter two keywords, like ‘britney spears’, you really want to get a list of documents that contains both keywords ‘britney’ and ‘spears’. Under the hood, search engine retrieves inverted indexes for both keywords. Than it looks for documents that appear in both retrieved indexes.
‘britney’ -> [document_1, document_3, document_44, document_555] ‘spears’ -> [document_2, document_3, document_5, document_44]Computed result:
‘britney’ AND ‘spears’ -> [document_3, document_44]
The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
Imagine that inverted indexes (I call them `hitlists`) are big. In a pessimistic case a hitlist will contain references to all of your documents. For example the word ‘the’ would probably appear in every document, so the hitlist for this word would be huge.
To reduce RAM and CPU usage while comparing hitlists, they should be stored in a linear chunk of memory. That makes adding and removing items from hitlists very expensive.
That’s the reason why creating indexes is usually very different process from querying them. Normally creating an index is a separate stage. Once this stage is finished you can use computed, usually immutable, index.
The problem is that it becomes very hard to create indexes when you have more data then RAM. That’s why huge data sets are usually split between many reasonably sized indexes.
On the other hand splitting indexes not very good. Remember, that reading a hitlist from many indexes requires reading data from many index files. Remember that disk operations - disk seeks, are really expensive. Consider:
As you can see, it’s better to have a hitlist in one chunk and in one index.
As mentioned before, once this chunk is stored on the disk, it should be immutable - updating it can be very expensive. That’s why usually full-text search engine serve data outdated by a few days. Every now and then the indexes are recreated from scratch.
However, Twitter search serves realtime data. It’s very different from what normal search engines can do.
That’s why I was wondering how Twitter search works. Serving realtime data is not a thing that normal full-text search engines can do.
But let’s take a closer look at page loading times for a simple query ‘text search’:
Hmm. Timeout on a search is not a very good result. For me it means that they just use some kind of sql-based full text search. It works fine when data is cached in memory. But once you try to read old data - it becomes impossible to read data from disks in reasonable time.
So it looks like Twitter is not really using a full text search, if we understand the term ‘full text search’ as defined at the beginning of this post. Rather Twitter is using some kind of a hack around a SQL database. This seems to be working fine for the first few result pages.
I don’t know in general.
I was thinking about a particular Twitter-like use case. I’d also like add some other assumptions:
Even with these assumptions, I’m not sure if the problem is solvable. I’m not sure if anyone has ever done a working and scalable implementation for that problem (correct me if I’m wrong!).
Based on these requirements, let’s try to design the API for such a realtime full-text search engine:
As mentioned before, we don’t have the initial phase of creating and optimizing indices by design. That’s not a particularly bad thing, but it means that in the beginning you need to feed this search engine with current offline data and then update and add new records on-the-fly.
It means that we actually don’t really care about the speed of adding a document to index when we’re in operation. On the web quite a small amount of data is created or changed. So unless adding document is particularly slow (more than a second), we really don’t care about the speed.
On the other hand, when we feed initial data to the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.
Let’s count how fast adding documents needs to be, in a situation when we’d like to index a dozen gigabytes of text. This experiment requires some assumptions:
Okay. The counting begins.
12 gigabytes / 5.1 letters_per_word = 2.45 * 10^9 words
The main metric of indexing speed is not the number of unique documents, it’s not even the number of unique keywords. What we’re interested in, is the number of unique tuples (document_id, keyword).
Let’s assume that adding one tuple document-keyword takes 1ms:
(2.45*10^9 words * 0.001 seconds)/60.0 sec/60.0 min/24.0 hours = 30.5 days
Ouch. Reindexing of 12 gigabytes of data would take 31 days. That’s not good.
But we can do it faster! Let’s assume that adding one tuple costs 0.1 ms: we do our job in 3.5 days. That’s more reasonable. We can spend 4 days of work to reindex all of the data.
The problem is that 0.1ms per tuple means that we need to create a really fast system.
Adding one tuple (document_id, keword) costs us a disk seek - we need to save data. That’s around 8ms. But we need to have 0.1ms per added tuple to make everything run in reasonable time.
The result is easy to predict - we need to have 80 disks and add 80 tuples in parallel, leading to 0.1ms average cost.
To sum up: indexing 12.5 gigabytes of data would require 3.5 days of work and 80 disks, possibly on multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.
But, maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:
Now we can choose to cache hitlists for a period of time. During this time the hitlists would grow bigger before being written to disk. That’s not really realtime indexing, but it’s a reasonable compromise.
On the other hand - how many disk seeks could we save using this trick?
It’s not so easy to answer this question, but fortunately, we can simplify our model and get an answer. Let’s forget about the latency of cacheing the hitlist and think about caching a particular number of hitlists.
I prepared an experiment:
Here’re the results for few a chunks of data from wikipedia:
| DATA |
500MB-chunk#1
|
500MB-chunk#2
|
1GB
|
| optimum - unique words |
1036547
|
971534
|
1634688
|
| worst - total words |
85016295
|
84660053
|
171409013
|
|
16 hitlists cached
|
66941279
|
67396522
|
134685743
|
|
256 hitlists cached
|
34076915
|
33975288
|
68475626
|
|
4096 hitlists cached
|
15648343
|
14814646
|
31308282
|
|
16384 hitlists cached
|
8190192
|
7800053
|
16150592
|
|
65536 hitlists cached
|
3764562
|
3565387
|
7415305
|
|
0.5M hitlists cached
|
1425247
|
1329566
|
2639209
|
|
1M hitlists cached
|
1211102
|
1130571
|
2107068
|
|
16M hitlists cached
|
1040521
|
975186
|
1650374
|
| BOOST - better than worst scenario |
|
|
|
|
16 hitlists
|
1.27
|
1.26
|
1.27
|
|
256 hitlists
|
2.49
|
2.49
|
2.50
|
|
4096 hitlists
|
5.43
|
5.71
|
5.47
|
|
16384 hitlists
|
10.38
|
10.85
|
10.61
|
|
65536 hitlists
|
22.58
|
23.74
|
23.12
|
|
0.5M hitlists
|
59.65
|
63.67
|
64.95
|
|
1M hitlists
|
70.20
|
74.88
|
81.35
|
|
16M hitlists
|
81.71
|
86.81
|
103.86
|
| ANTI BOOST - worst than optimistic scenario |
|
|
|
|
worst case
|
82.01
|
87.14
|
104.86
|
|
16 hitlists
|
64.58
|
69.37
|
82.39
|
|
256 hitlists
|
32.88
|
34.97
|
41.89
|
|
4096 hitlists
|
15.10
|
15.25
|
19.15
|
|
16384 hitlists
|
7.90
|
8.03
|
9.88
|
|
65536 hitlists
|
3.63
|
3.67
|
4.54
|
|
0.5M hitlists
|
1.37
|
1.37
|
1.61
|
|
1M hitlists
|
1.17
|
1.16
|
1.29
|
|
16M hitlists
|
1.00
|
1.00
|
1.01
|
Results:
It means that the optimum size of the cache is around 0.5 to 1 mln hitlists. This scenario should not use more than 0.5 GB of RAM. This cache size could give us around 70 times fewer updates to disk. Sounds reasonable.
That means that we could possibly index 12.5 gigabytes of data in 4 days - using one disk. But to achieve that we’d need very fast software, that would be only I/O bound.
Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.
By scalable I mean being able to scale horizontally - you could add a new disk or a new server and the software would just work without any special configuration changes.
Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.
I believe that the architecture of this project should look like this:
It’s clear that the actual indexing, querying or other operations on hitlists are very different than actually storing data on disks. That’s why I think that the biggest challenge is not really the search engine logic but rather a scalable and ultra fast persistence layer.
Scalable and persistent key-value databases are a very long topic. We would need such a system as a persistence layer for our search engine.
The most important features are:
There are various open-source projects that can be used as a scalable persistent key-value storage, but no project fully satisfies my needs. Here’s the list of what I found at this point (I’d be happy to update this list if I missed something yesterday Richard Jones wrote better comparison):
There are some very interesting ideas about how hitlists could be represented and stored.
The basic solution is to store each hitlist as a list of sorted integers. Like:
‘britney’ -> [1,3,4,44,122,123]
To reduce disk usage, this list could just be compressed using, for example, gzip. Unfortunately gzip doesn’t give us a good compression ratio. My tests show a ratio of only around 1.16x.
However, we could modify the hitlist to only store the differences between elements. Our ‘britney’ hit list would then look like:
‘britney’ -> [1,2,1,40,78,1]
The gzip compression ratio is now much better. It’s around 1.63x. That’s not astonishing, but it could be worth the CPU power wasted on compression.
A totally different approach is to store hitlists as bitmaps. It can be memory inefficient, but it makes binary operations AND, OR, NAND very very fast. To reduce memory penalty we could use compressed bitmaps. It could be a perfect way of storing data for very long hitlists. On the other hand it could be worse for small and sparse hitlists.
I’m still not sure which representation is the best; maybe some kind of a hybrid solution. There’s for sure a place for optimization in this area.
There are a lot of open questions for this problem. One of them is what the persistence layer should look like. The next thing is the internal representation of hitlists, how to make them always sorted and whether sorting should be done on retrieval or on update. How to be able to retrieve only part of them if they are compressed. Yet another idea is to use solid state drives as storage to reduce disk seek problem.
There’s a lot of work to be done in this area.
I think that such a full text search engine could fit perfectly as piece of infrastructure in many websites.
In an earlier post I mentioned that I was using CorePy for my cryptographic fiddlings.
Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
Well, I still think CorePy is very cool, but it turned out to introduce some tricky problems of its own, and by taking more care about how I use it I have sped up my program by nearly three orders of magnitude.
At first I thought that it was simply that my assembly was so fast that the Python excution time was dominating. In fact, Python is a little faster than I thought, and it was CorePy itself that was taking the bulk of the time. My inner loop executed in less than 1000 clock cycles, but the per-call overhead was far greater - more like 21,000 cycles. In addition to this per-call overhead, however, I had to reckon with a cost of nearly 1500 cycles for every 32-bit integer I loaded and stored from CorePy’s native arrays. This contrasts with a cost of less than 200 clock cycles to store one in a native Python list, or around 250 cycles for a NumPy array. If you have CorePy installed, you can see for yourself. So I had a very powerful machine, but I could only talk to it through a very narrow pipe.
My workaround can be found in the latest version of trivium-corepy. In addition to the assembly for computing Trivium itself, there’s now assembly specific to performing the cube attack. The code now takes a list of bit indexes to flip. Looping over these indicies it runs Trivium, XORs the output bits into a result area, and then flips the indexed bit in all 128 instances of Trivium inside the buffer before doing it all again. This means that each run lasts much longer, making the 21000 cycle overhead less significant. It’s still better to have more than one run, though, because of the cost of the writes needed to set up this index array - in fact, the sweet spot is reached when the total penalty from the writes to set up the index array is equal to the total penalty from the per-call overhead, because it is easy to double one by halving the other. So the indices are divided into three groups - those that will be handled in Python, those that will be handled by looping in assembly, and those that will be calculated in parallel by our 128 simultaneous instances of Trivium.
It’s with this technique that I’ve been able to find the problematic maxterms in the original Dinur and Shamir paper, and verify that the replacement maxterms I recently received from Dinur all work fine. I’ve even set the code on finding new maxterms, and found some for output bits up to 714, though the techniques I’m using are fairly basic (for example, I don’t yet take advantage of the fact that we get lots of output bits at once for free). Overall I think that I’m still glad of what CorePy gives me, but I think that it could be simpler still to write fast programs if those overheads could be significantly reduced.
You are currently browsing the LShift Ltd. blog archives for January, 2009.