technology from back to front

Archive for January, 2009

Mix and match version control

LShift’s standard version control platform these days is Mercurial, but just before we adopted it, I started a project using Trac and Subversion, mostly because that’s what Trac does out of the box.

Later, we created a branch for a large new piece of work, and on that branch we converted the build from ant to Maven and modularised the project, which meant moving a lot of files. This made the work on the branch a lot easier, but left us with a merge that Subversion wasn't capable of, even though we had used svn mv to move all the files.

Mercurial, however, was capable of the merge. I imported the whole Subversion repository using hg convert (see the convert extension documentation). It works exactly as described, but make sure you have Mercurial 1.0.1 or later – I had problems with earlier versions.

The merge went reasonably well, so I was left with a merged version in a Mercurial repository. I was going to switch to using Mercurial, and its Trac integration, when I discovered that it couldn't cope with multiple repositories. The Trac instance was managing several different source projects, which would have to go into several Mercurial repositories, and I couldn't merge those together in any satisfactory way.

There are several projects around to address this (I’ll probably cover them in another post), none of which are ready for production yet. I decided the most expedient thing would be to try and generate a patch for my merge, and apply it to the subversion repository.

A conventional patch would lose the version history of all the moved files, so I decided a git diff would do the job, since it records renames. You can certainly, with some patience, get git-svn to do this and understand what it is doing. Lacking that patience, I wrote a script to do the job: it parses the git diff, handles any directory creation needed, and issues the svn mv, svn add and svn rm calls required by the diff. It turned out to be a bit more work than I was expecting, so I've published it here.
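
Roughly, the idea looks like the following sketch – a hypothetical, much-simplified version of the published script. It only replays renames, assumes the diff was produced with rename detection (git diff -M) and is fed on stdin, and relies on Subversion 1.5's --parents option; content changes still need to be applied separately.

    #!/usr/bin/env python
    # Hypothetical sketch: replay the renames recorded in `git diff -M` output
    # against a Subversion working copy, so the moved files keep their history.
    import subprocess
    import sys

    def replay_renames(diff_lines):
        rename_from = None
        for line in diff_lines:
            line = line.rstrip("\n")
            if line.startswith("rename from "):
                rename_from = line[len("rename from "):]
            elif line.startswith("rename to ") and rename_from is not None:
                rename_to = line[len("rename to "):]
                # svn mv records the move, so history follows the file
                subprocess.check_call(
                    ["svn", "mv", "--parents", rename_from, rename_to])
                rename_from = None

    if __name__ == "__main__":
        replay_renames(sys.stdin)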

by david on 21/01/09

My thoughts on real time full-text search

Search engines can usually get away with serving data that is a few days out of date, but Twitter search seems to return real-time results. That makes it interesting to ask how it works.

In this post I’ll present a short introduction to full-text search engines and my private thoughts about a possible implementation of a better one.

Let’s start from the beginning.

How does normal full-text search work?

It’s really simple. When you search for a single keyword, like britney, the search engine finds a list of documents that contain this keyword. This list is called an inverted index. I call this list a ‘hitlist’. For example:

britney -> [document_1, document_3, document_44, document_555]

When you enter two keywords, like britney spears, you really want a list of documents that contain both keywords "britney" and "spears". Under the hood, the search engine retrieves the inverted indexes for both keywords. Then it looks for documents that appear in both retrieved indexes.

"britney" -> [document_1, document_3, document_44, document_555]
 "spears" -> [document_2, document_3, document_5, document_44]

Computed result:

"britney" AND "spears" -> [document_3, document_44]

The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
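
For example, once the hitlists are sorted, the AND is just a single merge walk over both lists. A minimal sketch in Python (not tied to any particular engine):

    def intersect(a, b):
        # Merge-walk two sorted hitlists in O(len(a) + len(b)) time --
        # this is the AND operation from the example above.
        i = j = 0
        result = []
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                result.append(a[i])
                i += 1
                j += 1
            elif a[i] < b[j]:
                i += 1
            else:
                j += 1
        return result

    # intersect([1, 3, 44, 555], [2, 3, 5, 44]) -> [3, 44]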

So what’s the problem?

Imagine that inverted indexes (I call them `hitlists`) are big. In a pessimistic case a hitlist will contain references to all of your documents. For example the word ‘the’ would probably appear in every document, so the hitlist for this word would be huge.

To reduce RAM and CPU usage while comparing hitlists, they should be stored in a linear chunk of memory. That makes adding and removing items from hitlists very expensive.

That's the reason why creating indexes is usually a very different process from querying them. Normally creating an index is a separate stage; once that stage is finished you can use the computed, usually immutable, index.

The problem is that it becomes very hard to create indexes when you have more data than RAM. That's why huge data sets are usually split across many reasonably sized indexes.

On the other hand, splitting indexes is not ideal: reading a hitlist that spans many indexes means reading data from many index files, and disk operations – disk seeks – are really expensive. Consider:

  • reading one hitlist from one index: 1-2 disk seeks – 8-20ms
  • reading one hitlist from ten indexes: 10-20 disk seeks – 80-200ms

As you can see, it’s better to have a hitlist in one chunk and in one index.

As mentioned before, once this chunk is stored on the disk, it should be immutable – updating it can be very expensive. That's why full-text search engines usually serve data that is a few days out of date. Every now and then the indexes are recreated from scratch.

However, Twitter search serves realtime data. It’s very different from what normal search engines can do.

On the other hand Twitter search serves realtime data

Serving realtime data is not something a normal full-text search engine can do, so I was wondering how Twitter search manages it.

But let’s take a closer look at page loading times for a simple query ‘text search’:

  • 2nd page: 0.44 seconds
  • 20th page: 6.72 seconds
  • 30th page: timeout (proxy error)

Hmm. A timeout on a search is not a very good result. To me it suggests that they just use some kind of SQL-based full-text search. That works fine while the data is cached in memory, but once you hit older data it becomes impossible to read it from disk in reasonable time.

So it looks like Twitter is not really using a full text search, if we understand the term ‘full text search’ as defined at the beginning of this post. Rather Twitter is using some kind of a hack around a SQL database. This seems to be working fine for the first few result pages.

Is it possible to create true realtime full-text search indexing?

I don’t know in general.

I was thinking about a particular Twitter-like use case. I'd also like to add some other assumptions:

  • There shouldn't be an explicit indexing stage. Once you add data to the search engine, it's ready to be served.
  • The search engine is always ready to serve queries.
  • We're talking about pretty big scale. For a small scale (an index that fits in RAM) Lucene or Zope could do realtime search indexing pretty well.
  • There aren't any particularly strong latency requirements for queries. It would be good to serve most requests quickly, say within 100ms, but it's fine if someone has to wait a few seconds now and then.
  • Document_ids are monotonic and timestamp-based: the newer the document, the bigger the document_id. This simplifies keeping hitlists sorted.
  • We're talking about a Twitter-like scenario: a lot of tiny documents.
  • Changes to the data should be visible in query results almost instantly. Let's say that in the pessimistic case the delay shouldn't be longer than 5 minutes.

Even with these assumptions, I’m not sure if the problem is solvable. I’m not sure if anyone has ever done a working and scalable implementation for that problem (correct me if I’m wrong!).

Based on these requirements, let's try to design the API for such a realtime full-text search engine (a toy sketch of the intended semantics follows the list):

  • add_to_index(document_id, keywords)
  • del_from_index(document_id)
  • search(query) -> returns list of document_ids
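
To pin down what these calls should mean, here is a toy in-memory sketch – purely hypothetical, ignoring persistence, sharding and compression entirely:

    from collections import defaultdict

    class ToyRealtimeIndex:
        # Toy, in-memory sketch of the API above. Purely illustrative.
        def __init__(self):
            self.hitlists = defaultdict(set)   # keyword -> set of document_ids
            self.documents = {}                # document_id -> keywords

        def add_to_index(self, document_id, keywords):
            self.documents[document_id] = set(keywords)
            for keyword in keywords:
                self.hitlists[keyword].add(document_id)

        def del_from_index(self, document_id):
            for keyword in self.documents.pop(document_id, ()):
                self.hitlists[keyword].discard(document_id)

        def search(self, query):
            # AND semantics; newest (largest, timestamp-based) ids first.
            keywords = query.split()
            if not keywords:
                return []
            hits = set.intersection(*(self.hitlists[k] for k in keywords))
            return sorted(hits, reverse=True)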

Initial feed of data

As mentioned before, we don’t have the initial phase of creating and optimizing indices by design. That’s not a particularly bad thing, but it means that in the beginning you need to feed this search engine with current offline data and then update and add new records on-the-fly.

It means that we don't really care about the speed of adding a document to the index once we're in operation. On the web only a small amount of data is created or changed at any moment, so unless adding a document is particularly slow (more than a second), the speed doesn't matter much.

On the other hand, when we feed initial data to the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.

Let’s count how fast adding documents needs to be, in a situation when we’d like to index a dozen gigabytes of text. This experiment requires some assumptions:

  • Let's index 12.5 gigabytes of text: an average social network could have about that much data.
  • Let’s consider Twitter’s data model – a document is at most 140 characters long.
  • On average, an English word is about 5.1 letters long.

Okay. The counting begins.

12.5 gigabytes / 5.1 letters_per_word ≈ 2.45 * 10^9 words

The main metric of indexing speed is not the number of unique documents, it’s not even the number of unique keywords. What we’re interested in, is the number of unique tuples (document_id, keyword).

Let’s assume that adding one tuple document-keyword takes 1ms:

(2.45 * 10^9 words * 0.001 seconds) / (60 * 60 * 24 seconds_per_day) ≈ 28 days

Ouch. Reindexing 12.5 gigabytes of data would take about a month. That's not good.

But we can do it faster! If adding one tuple costs 0.1ms instead, the job takes about 3 days. That's more reasonable: we can afford to spend a few days reindexing all of the data.

The problem is that 0.1ms per tuple means that we need to create a really fast system.

Disk seeks

Adding one tuple (document_id, keyword) costs us a disk seek – we need to save the data. That's around 8ms. But we need to get down to 0.1ms per added tuple to make everything run in reasonable time.

The result is easy to predict – we need to have 80 disks and add 80 tuples in parallel, leading to 0.1ms average cost.

To sum up: indexing 12.5 gigabytes of data would take about 3 days and require 80 disks, possibly spread across multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.

Optimizing

But, maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:

  • add_to_index(document_id, keywords, synchronization_latency=60seconds)

Now we can choose to cache hitlists for a period of time. During this time the hitlists would grow bigger before being written to disk. That’s not really realtime indexing, but it’s a reasonable compromise.
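
As a rough illustration of that relaxed API – the BufferedIndexWriter class and the storage.append call are hypothetical, standing in for the persistence layer discussed later:

    import time
    from collections import defaultdict

    class BufferedIndexWriter:
        # Sketch of the relaxed API: postings are buffered in memory and only
        # appended to storage once they are older than synchronization_latency.
        # Queries would also have to consult the in-memory buffer.
        def __init__(self, storage, synchronization_latency=60):
            self.storage = storage             # hypothetical persistence layer
            self.latency = synchronization_latency
            self.pending = defaultdict(list)   # keyword -> buffered document_ids
            self.first_buffered = {}           # keyword -> time of first buffered posting

        def add_to_index(self, document_id, keywords):
            now = time.time()
            for keyword in keywords:
                self.pending[keyword].append(document_id)
                self.first_buffered.setdefault(keyword, now)
            self.flush(now)

        def flush(self, now=None):
            now = time.time() if now is None else now
            due = [k for k, t in self.first_buffered.items() if now - t >= self.latency]
            for keyword in due:
                # one append per flushed hitlist, instead of one per posting
                self.storage.append(keyword, self.pending.pop(keyword))
                del self.first_buffered[keyword]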

On the other hand – how many disk seeks could we save using this trick?

It's not so easy to answer this question, but fortunately we can simplify our model and get an answer. Let's forget about the latency of caching a hitlist and instead think about caching a fixed number of hitlists.

How much can we win by delaying synchronization?

I prepared an experiment:

  • Take a few gigabytes of a text-only English Wikipedia dump.
  • Count the number of unique words – this is the number of hitlists produced. In the optimistic case of unlimited memory we would have exactly this number of disk seeks.
  • Count the total number of words. This is the pessimistic number of disk seeks, if we synchronize to disk for every tuple. On the other hand, this model doesn't use any memory at all.
  • Assume we keep the X most often updated hitlists in memory, and count the number of disk seeks in that scenario.
  • For each value of X, count the win against the pessimistic case and the loss against the optimistic case (a sketch of this simulation follows the list).
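
Here is a small simulation along those lines – hypothetical: it evicts the least frequently used hitlist, which is one way of reading "most often updated", and counts one disk seek per flushed hitlist:

    from collections import Counter

    def count_disk_seeks(words, cache_size):
        # Simulate keeping `cache_size` hitlists in memory while indexing a
        # stream of words, counting how many hitlist flushes (disk seeks)
        # happen. Assumes cache_size >= 1.
        cached = {}            # keyword -> number of buffered postings
        frequency = Counter()  # how often each keyword has been seen so far
        disk_seeks = 0
        for word in words:
            frequency[word] += 1
            if word in cached:
                cached[word] += 1          # absorbed by the cache, no seek
                continue
            if len(cached) >= cache_size:
                # evict the least frequently used hitlist and flush it to disk
                victim = min(cached, key=frequency.__getitem__)
                cached.pop(victim)
                disk_seeks += 1
            cached[word] = 1
        return disk_seeks + len(cached)    # flush whatever is left at the end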

Here are the results for a few chunks of data from Wikipedia:

DATA – number of disk seeks

                          500MB-chunk#1   500MB-chunk#2          1GB
optimum (unique words)          1036547          971534      1634688
worst (total words)            85016295        84660053    171409013
16 hitlists cached             66941279        67396522    134685743
256 hitlists cached            34076915        33975288     68475626
4096 hitlists cached           15648343        14814646     31308282
16384 hitlists cached           8190192         7800053     16150592
65536 hitlists cached           3764562         3565387      7415305
0.5M hitlists cached            1425247         1329566      2639209
1M hitlists cached              1211102         1130571      2107068
16M hitlists cached             1040521          975186      1650374

BOOST – times better than the worst scenario

                          500MB-chunk#1   500MB-chunk#2          1GB
16 hitlists                        1.27            1.26         1.27
256 hitlists                       2.49            2.49         2.50
4096 hitlists                      5.43            5.71         5.47
16384 hitlists                    10.38           10.85        10.61
65536 hitlists                    22.58           23.74        23.12
0.5M hitlists                     59.65           63.67        64.95
1M hitlists                       70.20           74.88        81.35
16M hitlists                      81.71           86.81       103.86

ANTI BOOST – times worse than the optimistic scenario

                          500MB-chunk#1   500MB-chunk#2          1GB
worst case                        82.01           87.14       104.86
16 hitlists                       64.58           69.37        82.39
256 hitlists                      32.88           34.97        41.89
4096 hitlists                     15.10           15.25        19.15
16384 hitlists                     7.90            8.03         9.88
65536 hitlists                     3.63            3.67         4.54
0.5M hitlists                      1.37            1.37         1.61
1M hitlists                        1.17            1.16         1.29
16M hitlists                       1.00            1.00         1.01

Results:

  • With 16 million hitlists cached in memory we are very close to the optimum (ANTI BOOST ≈ 1). That's not surprising, since there are not many more unique words in English than 16 million.
  • With 1 million hitlists in memory we are only 16-29% worse than optimum (and with 0.5 million, 37-61% worse).
  • For 1 million hitlists we are 70x-81x better than the pessimistic scenario.

It means that the optimum cache size is around 0.5 to 1 million hitlists. Such a cache should not use more than 0.5 GB of RAM, and it gives us around 70 times fewer updates to disk. Sounds reasonable.

That means that we could possibly index 12.5 gigabytes of data in around 4 days using a single disk. But to achieve that we'd need very fast software that is purely I/O bound.

Architecture

Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.

By scalable I mean being able to scale horizontally – you could add a new disk or a new server and the software would just work without any special configuration changes.

Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.

I believe the architecture of this project should be split into two layers: the search engine logic on top of a scalable persistence layer.

It's clear that the actual indexing, querying and other operations on hitlists are very different from actually storing data on disk. That's why I think the biggest challenge is not really the search engine logic but rather a scalable and ultra-fast persistence layer.

Scalable, persistent key-value storage

Scalable and persistent key-value databases are a very long topic. We would need such a system as a persistence layer for our search engine.

The most important features are (a sketch of the resulting interface follows the list):

  • Simple key-value storage is enough for our needs.
  • There must be an “append” operation. We don't want to retrieve a multi-megabyte hitlist just to add one item to it.
  • As fast as possible: latency must be kept very low.
  • We don't need redundancy – speed is more important. In case of a disk failure we can afford to recreate the index from scratch.
  • No eventual consistency: we need the current value right when we ask for it.
  • No transactions – we don't need them. Optimistic locking is enough.
  • It should cost at most one disk seek to retrieve a record.
  • At most two disk seeks to add data (one to retrieve the record, another to save it).
  • I like the memcached binary protocol, so it would be nice to have it as the interface.
  • Scalable – adding a new server doesn't require stopping the service and needs minimal configuration changes.
  • The client API for this storage needs to be able to retrieve data from many servers in parallel. Remember that every interaction with storage could take a few milliseconds if it requires a disk seek.
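
Put together, the persistence layer boils down to an interface roughly like this – a hypothetical sketch; the names and signatures are mine, not an existing library's:

    from abc import ABC, abstractmethod
    from typing import Dict, List

    class HitlistStore(ABC):
        # Hypothetical interface for the persistence layer described above:
        # a key-value store with an append operation, so a hitlist can grow
        # without being read back in full.

        @abstractmethod
        def get(self, keyword: str) -> bytes:
            """Fetch one hitlist blob (ideally a single disk seek)."""

        @abstractmethod
        def append(self, keyword: str, postings: bytes) -> None:
            """Append postings to a hitlist (at most two disk seeks)."""

        @abstractmethod
        def multi_get(self, keywords: List[str]) -> Dict[str, bytes]:
            """Fetch several hitlists, in parallel across servers."""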

There are various open-source projects that can be used as a scalable, persistent key-value store, but no project fully satisfies my needs. I'd be happy to hear about anything I've missed – yesterday Richard Jones wrote a good comparison of the options.

Making it even faster

There are some very interesting ideas about how hitlists could be represented and stored.

The basic solution is to store each hitlist as a list of sorted integers. Like:

'britney' -> [1,3,4,44,122,123]

To reduce disk usage, this list could be compressed using, for example, gzip. Unfortunately gzip doesn't give a good compression ratio here – my tests show a ratio of only around 1.16x.

However, we could modify the hitlist to only store the differences between elements. Our ‘britney’ hit list would then look like:

'britney' -> [1,2,1,40,78,1]

The gzip compression ratio is now much better – around 1.63x. That's not astonishing, but it may well be worth the CPU power spent on compression.
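
The delta trick itself is tiny; a purely illustrative sketch:

    def delta_encode(hitlist):
        # Store the difference between consecutive (sorted) document_ids;
        # small deltas compress much better than the raw ids.
        deltas, previous = [], 0
        for doc_id in hitlist:
            deltas.append(doc_id - previous)
            previous = doc_id
        return deltas

    def delta_decode(deltas):
        # Recover the original sorted hitlist from its deltas.
        hitlist, total = [], 0
        for delta in deltas:
            total += delta
            hitlist.append(total)
        return hitlist

    # delta_encode([1, 3, 4, 44, 122, 123]) -> [1, 2, 1, 40, 78, 1]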

A totally different approach is to store hitlists as bitmaps. That can be memory-inefficient, but it makes the binary operations AND, OR and NAND very fast. To reduce the memory penalty we could use compressed bitmaps. This could be a perfect way of storing very long hitlists; on the other hand it could be worse for small, sparse ones.

I’m still not sure which representation is the best; maybe some kind of a hybrid solution. There’s for sure a place for optimization in this area.

Conclusion

There are a lot of open questions for this problem. One of them is what the persistence layer should look like. The next is the internal representation of hitlists: how to keep them always sorted, whether sorting should happen on retrieval or on update, and how to retrieve only part of a hitlist when it is compressed. Yet another idea is to use solid state drives as storage to reduce the disk-seek problem.

There’s a lot of work to be done in this area.

I think that such a full-text search engine could fit perfectly as a piece of infrastructure in many websites.





by marek on 20/01/09

CorePy problems and solutions

In an earlier post I mentioned that I was using CorePy for my cryptographic fiddlings.

Rather than writing the code in assembler in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleague Majek drew attention to it, and if I had, I doubt it would have been finished within a few hours of starting, as this was.

Well, I still think CorePy is very cool, but it turned out to introduce some tricky problems of its own, and by taking more care about how I use it I have sped up my program by nearly three orders of magnitude.

At first I thought that it was simply that my assembly was so fast that the Python execution time was dominating. In fact, Python is a little faster than I thought, and it was CorePy itself that was taking the bulk of the time. My inner loop executed in less than 1,000 clock cycles, but the per-call overhead was far greater – more like 21,000 cycles. In addition to this per-call overhead, I had to reckon with a cost of nearly 1,500 cycles for every 32-bit integer I loaded and stored from CorePy's native arrays. This contrasts with a cost of less than 200 clock cycles to store one in a native Python list, or around 250 cycles for a NumPy array. If you have CorePy installed, you can see for yourself. So I had a very powerful machine, but I could only talk to it through a very narrow pipe.
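
For a flavour of how such numbers can be measured, here is a rough micro-benchmark of my own (not the one referred to above, and it omits the CorePy array case, which needs CorePy installed):

    import timeit

    # Rough per-element store cost for a Python list and a NumPy array.
    setup = ("import numpy; "
             "lst = [0] * 1000; "
             "arr = numpy.zeros(1000, dtype=numpy.uint32)")
    n = 1000000
    list_time = timeit.timeit("lst[500] = 42", setup=setup, number=n)
    numpy_time = timeit.timeit("arr[500] = 42", setup=setup, number=n)
    print("python list store: %.1f ns" % (list_time / n * 1e9))
    print("numpy array store: %.1f ns" % (numpy_time / n * 1e9))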

My workaround can be found in the latest version of trivium-corepy. In addition to the assembly for computing Trivium itself, there's now assembly specific to performing the cube attack. The code takes a list of bit indexes to flip. Looping over these indices, it runs Trivium, XORs the output bits into a result area, and then flips the indexed bit in all 128 instances of Trivium inside the buffer before doing it all again. This means that each run lasts much longer, making the 21,000-cycle overhead less significant. It's still better to have more than one run, though, because of the cost of the writes needed to set up this index array – in fact, the sweet spot is reached when the total penalty from the writes to set up the index array equals the total penalty from the per-call overhead, because it is easy to double one by halving the other. So the indices are divided into three groups – those handled in Python, those handled by looping in assembly, and those calculated in parallel by our 128 simultaneous instances of Trivium.

It’s with this technique that I’ve been able to find the problematic maxterms in the original Dinur and Shamir paper, and verify that the replacement maxterms I recently received from Dinur all work fine. I’ve even set the code on finding new maxterms, and found some for output bits up to 714, though the techniques I’m using are fairly basic (for example, I don’t yet take advantage of the fact that we get lots of output bits at once for free). Overall I think that I’m still glad of what CorePy gives me, but I think that it could be simpler still to write fast programs if those overheads could be significantly reduced.

by Paul Crowley on 18/01/09
