My thoughts on real time full-text search
Usually, search engines serve data that is a few days out of date. But Twitter search seems to return results in real time, which makes it interesting to look at how it works.
In this post I’ll present a short introduction to full-text search engines and my private thoughts about a possible implementation of a better one.
Let’s start from the beginning.
How does normal full-text search work?
It’s really simple. When you search for a single keyword, like ‘britney’, the search engine finds a list of documents that contain this keyword. This list is called an inverted index. I call this list a ‘hitlist’. For example:
‘britney’ -> [document1, document3, document44, document555]
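To make the idea concrete, here is a minimal sketch of building such hitlists in memory. The function name and the whitespace tokenizer are assumptions made for illustration; a real engine would tokenize far more carefully.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to a sorted list of ids of documents containing it.

    `documents` is a dict of document id -> text (hypothetical input format).
    """
    index = defaultdict(list)
    # Visiting ids in ascending order keeps every hitlist sorted.
    for doc_id in sorted(documents):
        # set() so a word repeated in one document is recorded once.
        for keyword in set(documents[doc_id].lower().split()):
            index[keyword].append(doc_id)
    return dict(index)


docs = {1: "Britney Spears", 2: "spears and shields", 3: "britney again"}
index = build_inverted_index(docs)
print(index["britney"])  # [1, 3]
```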
When you enter two keywords, like ‘britney spears’, you really want to get a list of documents that contain both keywords ‘britney’ and ‘spears’. Under the hood, the search engine retrieves inverted indexes for both keywords. Then it looks for documents that appear in both retrieved indexes.
‘britney’ -> [document1, document3, document44, document555]
‘spears’ -> [document2, document3, document5, document44]
Computed result:
‘britney’ AND ‘spears’ -> [document3, document44]
The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
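The linear-time AND over presorted hitlists can be sketched as a two-pointer merge. This is a generic textbook approach with a made-up function name, not necessarily what any particular search engine does internally.

```python
def intersect_sorted(a, b):
    """Return the ids present in both sorted lists, in sorted order."""
    result = []
    i = j = 0
    # Each step advances at least one pointer, so the loop runs
    # at most len(a) + len(b) times.
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


britney = [1, 3, 44, 555]
spears = [2, 3, 5, 44]
print(intersect_sorted(britney, spears))  # [3, 44]
```

If the lists were not sorted, the same operation would need hashing or a sort first, which is why the presorting requirement matters.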
So what’s the problem?
Imagine that inverted indexes (I call them hitlists) are big. In a pessimistic case a hitlist will contain references to all of your documents. For example, the word ‘the’ would probably appear in every document, so the hitlist for this word would be huge.
To reduce RAM and CPU usage while comparing hitlists, they should be stored in a linear chunk of memory. That makes adding and removing items from hitlists very expensive.
That’s the reason why creating indexes is usually a very different process from querying them. Normally creating an index is a separate stage. Once this stage is finished you can use the computed, usually immutable, index.
The problem is that it becomes very hard to create indexes when you have more data than RAM. That’s why huge data sets are usually split between many reasonably sized indexes. On the other hand, splitting indexes is not very good either. Remember that reading a hitlist from many indexes requires reading data from many index files, and that disk operations (disk seeks in particular) are really expensive. Consider:
- reading one hitlist from one index: 1-2 disk seeks – 8-20ms
- reading one hitlist from ten indexes: 10-20 disk seeks – 80-200ms
As you can see, it’s better to have a hitlist in one chunk and in one index.
As mentioned before, once this chunk is stored on the disk, it should be immutable – updating it can be very expensive. That’s why full-text search engines usually serve data outdated by a few days. Every now and then the indexes are recreated from scratch.
However, Twitter search serves realtime data. It’s very different from what normal search engines can do.
On the other hand Twitter search serves realtime data
That’s why I was wondering how Twitter search works. Serving realtime data is not a thing that normal full-text search engines can do. But let’s take a closer look at page loading times for a simple query ‘text search’:
- 2nd page: 0.44 seconds
- 20th page: 6.72 seconds
- 30th page: timeout (proxy error)
Hmm. A timeout on a search is not a very good result. To me it suggests that they just use some kind of SQL-based full-text search. It works fine when data is cached in memory. But once you try to read old data, it becomes impossible to read it from disk in reasonable time.
So it looks like Twitter is not really using a full text search, if we understand the term ‘full text search’ as defined at the beginning of this post. Rather Twitter is using some kind of a hack around a SQL database. This seems to be working fine for the first few result pages.
Is it possible to create true realtime full-text search indexing?
I don’t know in general. I was thinking about a particular Twitter-like use case. I’d also like to add some other assumptions:
- There shouldn’t be an explicit indexing stage. Once you add the data to the search engine, they’re ready to be served.
- The search engine is always ready to serve queries.
- We’re talking about a pretty big scale. For a small scale (the index fits in RAM) Lucene or Zope could do realtime search indexing pretty well.
- There aren’t any particularly strong latency requirements for queries. It would be good to serve most of the requests in a short time, like 100ms, but it’s nothing bad if someone has to wait a few seconds now and then.
- Documentids are monotonic and based on timestamp. That would mean that the newer the document, the bigger the documentid. This simplifies the process of sorting hitlists a lot.
- We’re talking about a Twitter-like scenario: A lot of tiny documents.
- Changes to the data should be visible in query results instantly. Let’s say that in the pessimistic case we shouldn’t get longer delays than 5 minutes.
Even with these assumptions, I’m not sure if the problem is solvable. I’m not sure if anyone has ever done a working and scalable implementation for that problem (correct me if I’m wrong!).
Based on these requirements, let’s try to design the API for such a realtime full-text search engine:
- search(query) -> returns list of documentids
Initial feed of data
As mentioned before, we don’t have the initial phase of creating and optimizing indices by design. That’s not a particularly bad thing, but it means that in the beginning you need to feed this search engine with current offline data and then update and add new records on-the-fly.
It means that we actually don’t really care about the speed of adding a document to the index when we’re in operation. On the web quite a small amount of data is created or changed. So unless adding a document is particularly slow (more than a second), we really don’t care about the speed.
On the other hand, when we feed initial data to the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.
Let’s count how fast adding documents needs to be, in a situation when we’d like to index a dozen gigabytes of text. This experiment requires some assumptions:
- Let’s index about 12.5 gigabytes of text: an average social network could have about that much data.
- Let’s consider Twitter’s data model – a document is at most 140 characters long.
- On average, an English word is about 5.1 letters long.
Okay. The counting begins.
12.5 gigabytes / 5.1 letters per word = 2.45 * 10^9 words
The main metric of indexing speed is not the number of unique documents, it’s not even the number of unique keywords. What we’re interested in is the number of unique (documentid, keyword) tuples.
Let’s assume that adding one document-keyword tuple takes 1ms:
(2.45 * 10^9 words * 0.001 seconds) / 60 / 60 / 24 = 28.4 days
Ouch. Reindexing 12.5 gigabytes of data would take about 28 days. That’s not good.
But we can do it faster! Let’s assume that adding one tuple costs 0.1 ms: we do our job in about 3 days. That’s more reasonable. We can spend 4 days of work to reindex all of the data.
The problem is that 0.1ms per tuple means that we need to create a really fast system.
Adding one tuple (document_id, keyword) costs us a disk seek, since we need to save the data. That’s around 8ms. But we need to get down to 0.1ms per added tuple to make everything run in reasonable time.
The result is easy to predict – we need to have 80 disks and add 80 tuples in parallel, leading to 0.1ms average cost.
To sum up: indexing 12.5 gigabytes of data would require 3 to 4 days of work and 80 disks, possibly on multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.
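The back-of-the-envelope arithmetic in this section can be reproduced with a few lines. This sketch assumes decimal gigabytes (10^9 bytes) and ignores spaces between words, so its day counts are rough and shift with those assumptions. The function names are made up for illustration.

```python
SECONDS_PER_DAY = 60 * 60 * 24


def tuple_count(text_bytes, letters_per_word=5.1):
    """Rough number of (documentid, keyword) tuples in the text."""
    return text_bytes / letters_per_word


def indexing_days(tuples, microseconds_per_tuple):
    """Days needed to add all tuples one after another."""
    return tuples * microseconds_per_tuple / 1_000_000 / SECONDS_PER_DAY


def disks_needed(seek_microseconds, target_microseconds_per_tuple):
    """Parallel disks needed so the average cost per tuple hits the target."""
    # Ceiling division on integers avoids floating point surprises.
    return -(-seek_microseconds // target_microseconds_per_tuple)


tuples = tuple_count(12.5e9)
for cost in (1000, 100):  # 1ms and 0.1ms per tuple
    print(f"{cost} us per tuple: {indexing_days(tuples, cost):.1f} days")
print("disks:", disks_needed(8000, 100))
```

Changing `letters_per_word` or the per-tuple cost shows quickly how sensitive the estimate is to each assumption.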
Optimizing
But maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:
- add_to_index(document_id, keywords, synchronization_latency=60seconds)
Now we can choose to cache hitlists for a period of time. During this time the hitlists would grow bigger before being written to disk. That’s not really realtime indexing, but it’s a reasonable compromise.
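One way to picture the delayed synchronization is a toy write-back buffer. Everything here is an assumption made for illustration: the class name, the in-memory dict standing in for the disk, and the rule that a flush happens once the oldest unsynced addition is older than the latency.

```python
import time
from collections import defaultdict


class BufferedIndex:
    """Toy index that delays writes; `storage` stands in for the disk."""

    def __init__(self, clock=time.monotonic):
        self.storage = defaultdict(list)  # keyword -> persisted hitlist
        self.pending = defaultdict(list)  # keyword -> ids not yet persisted
        self.first_pending_at = None      # time of the oldest unsynced add
        self.writes = 0                   # simulated hitlist writes to disk
        self.clock = clock

    def add_to_index(self, document_id, keywords, synchronization_latency=60):
        now = self.clock()
        if self.first_pending_at is None:
            self.first_pending_at = now
        for keyword in set(keywords):
            self.pending[keyword].append(document_id)
        # Flush once the oldest buffered addition has waited long enough.
        if now - self.first_pending_at >= synchronization_latency:
            self.flush()

    def flush(self):
        for keyword, ids in self.pending.items():
            self.storage[keyword].extend(ids)  # one write per hitlist
            self.writes += 1
        self.pending.clear()
        self.first_pending_at = None
```

In this sketch a query would have to merge `storage` and `pending` to see fresh documents, and anything still in `pending` is lost on a crash. That is the price of the compromise.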
On the other hand – how many disk seeks could we save using this trick?
It’s not so easy to answer this question, but fortunately, we can simplify our model and get an answer. Let’s forget about the latency of caching the hitlists and think about caching a particular number of hitlists.
How much can we win by delaying synchronization?
I prepared an experiment:
- Take a few gigabytes of a text-only English Wikipedia dump.
- Count the number of unique words – this is the number of hitlists produced. In the optimistic case of unlimited memory we would have exactly this number of disk seeks.
- Count the total number of words. This is a pessimistic number of disk seeks, if we synchronize to disk for every tuple. On the other hand for this model we don’t have to use any memory.
- Assume that we keep the X most often updated hitlists in memory, and count the number of disk seeks in this scenario.
- For every cache size X, compute the win against the pessimistic case and the loss against the optimistic case.
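The steps above can be sketched as follows. This is one possible reading of the model, not the script actually used for the experiment: a cached hitlist costs one seek in total, an uncached one costs a seek per update, and the input is assumed to be a non-empty list of words.

```python
from collections import Counter


def seek_counts(words, cache_size):
    """Estimate disk seeks when the most updated hitlists stay in memory."""
    counts = Counter(words)
    worst = sum(counts.values())  # one seek per (document, keyword) tuple
    optimum = len(counts)         # one seek per hitlist
    cached = counts.most_common(cache_size)
    uncached_updates = worst - sum(n for _, n in cached)
    # Cached hitlists are written once; the rest are written on every update.
    seeks = len(cached) + uncached_updates
    return {
        "optimum": optimum,
        "worst": worst,
        "seeks": seeks,
        "boost": worst / seeks,         # gain over the pessimistic case
        "anti_boost": seeks / optimum,  # loss against the optimistic case
    }


sample = "the cat and the dog and the bird".split()
print(seek_counts(sample, cache_size=1))
```

Running it over a real text dump with growing cache sizes would show where the gain starts to flatten out.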
Here are the results for a few chunks of data from Wikipedia:
optimum = unique words
worst = total words
BOOST = how many times better than the worst scenario
ANTI BOOST = how many times worse than the optimistic scenario
- With 16 million hitlists cached in memory we are very close to the optimum (ANTI BOOST = 1). That’s not surprising, since English does not have many more than 16 million unique words.
- With 1 million hitlists in memory we are only 30-60% worse than optimum.
- For 1 million hitlists we are 70x-81x better than the pessimistic scenario.
It means that the optimum size of the cache is around 0.5 to 1 million hitlists. This scenario should not use more than 0.5 GB of RAM. This cache size could give us around 70 times fewer updates to disk. Sounds reasonable.
That means that we could possibly index 12.5 gigabytes of data in 4 days using one disk. But to achieve that we’d need very fast software that is purely I/O bound.
Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.
By scalable I mean being able to scale horizontally – you could add a new disk or a new server and the software would just work without any special configuration changes.
Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.
I believe that the architecture of this project should look like this:
It’s clear that the actual indexing, querying or other operations on hitlists are very different than actually storing data on disks. That’s why I think that the biggest challenge is not really the search engine logic but rather a scalable and ultra fast persistence layer.
Scalable, persistent key-value storage
Scalable and persistent key-value databases are a very broad topic. We would need such a system as a persistence layer for our search engine. The most important features are:
- Simple key-value storage is enough for our needs.
- There must be an “append” operation. We don’t want to retrieve a few-megabyte hitlist just to add one item to it.
- As fast as possible. Latency must be kept very low.
- We don’t need redundancy – speed is more important. In case of a disk failure we can afford to recreate the index from scratch.
- No eventual consistency. We need the proper value right when we ask for it.
- No transactions, we don’t need them. Optimistic-locking is enough.
- It should cost at most one disk seek to retrieve a record.
- At most two disk seeks to add data (one to retrieve the record, another to save it)
- I like memcached binary protocol, so it could be nice to have it as an interface.
- Scalable – adding a new server doesn’t require stopping the service and needs minimal configuration changes.
- The client API for this storage needs to be able to retrieve data from many servers in parallel. Remember that every interaction with storage could take a few milliseconds if it requires a disk seek.
- MemcacheDB, http://memcachedb.org/ (not scalable, berkeleydb)
- SimpleDB, http://aws.amazon.com/simpledb/ (not open, way too slow, eventual consistency)
- BigTables from GAE, http://code.google.com/intl/en/appengine/docs/python/datastore/ (not open, way too slow, bulk updates not possible)
- Project Voldemort, http://project-voldemort.com/ (java, quite new, berkeleydb, replicated, eventual consistency?)
- Ringo, http://github.com/tuulos/ringo/tree/master (erlang, experimental, for immutable data, replicated)
- Cassandra project (java, p2p, replicated)
- Dynomite (erlang, no documentation)
- CouchDB (erlang, not scalable)
- HBase (java, experimental, a lot of unneeded features, latency?)
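The storage features listed in this section (plain key-value access, an append operation, optimistic locking instead of transactions) could be captured by an interface like the one below. It is a hypothetical in-memory stand-in with made-up method names, not the API of any of the projects listed above.

```python
class InMemoryStore:
    """Hypothetical stand-in for the persistence layer; not a real client."""

    def __init__(self):
        self._data = {}  # key -> (version, bytes)

    def get(self, key):
        """Return (version, value); a missing key is version 0, empty value."""
        return self._data.get(key, (0, b""))

    def append(self, key, chunk):
        """Add bytes to the end of a value.

        A real store would do this server-side, so the client never has to
        download a large hitlist just to extend it.
        """
        version, value = self.get(key)
        self._data[key] = (version + 1, value + chunk)

    def compare_and_set(self, key, expected_version, value):
        """Optimistic locking: write only if the version is still the same."""
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


store = InMemoryStore()
store.append("britney", b"\x01")
store.append("britney", b"\x03")
print(store.get("britney"))  # (2, b'\x01\x03')
```

A version counter per key is only one way to get optimistic locking; it is chosen here because it keeps the sketch short.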
Making it even faster
There are some very interesting ideas about how hitlists could be represented and stored.
The basic solution is to store each hitlist as a list of sorted integers. Like:
‘britney’ -> [1,3,4,44,122,123]
To reduce disk usage, this list could just be compressed using, for example, gzip. Unfortunately gzip doesn’t give us a good compression ratio. My tests show a ratio of only around 1.16x.
However, we could modify the hitlist to only store the differences between elements. Our ‘britney’ hit list would then look like:
‘britney’ -> [1,2,1,40,78,1]
The gzip compression ratio is now much better. It’s around 1.63x. That’s not astonishing, but it could be worth the CPU power wasted on compression.
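The difference encoding can be sketched like this. Packing ids as 32-bit unsigned integers is an assumption made for the example, and the compression ratio will depend heavily on the real data, so no ratio is claimed here.

```python
import gzip
import struct


def delta_encode(hitlist):
    """Turn sorted ids into gaps: the first id, then the differences."""
    previous = 0
    gaps = []
    for doc_id in hitlist:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the original sorted ids from the gaps."""
    total = 0
    hitlist = []
    for gap in gaps:
        total += gap
        hitlist.append(total)
    return hitlist


def pack(numbers):
    """Serialize as little-endian unsigned 32-bit integers (an assumption)."""
    return struct.pack("<%dI" % len(numbers), *numbers)


britney = [1, 3, 4, 44, 122, 123]
gaps = delta_encode(britney)
print(gaps)  # [1, 2, 1, 40, 78, 1]
compressed = gzip.compress(pack(gaps))  # what would be written to disk
```

Small gaps repeat far more often than large absolute ids, which is what gives a general-purpose compressor more to work with.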
A totally different approach is to store hitlists as bitmaps. It can be memory inefficient, but it makes the binary operations AND, OR and NAND very fast. To reduce the memory penalty we could use compressed bitmaps. It could be a perfect way of storing data for very long hitlists. On the other hand it could be worse for small and sparse hitlists.
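As a small illustration of the bitmap idea, Python’s arbitrary-precision integers can play the role of uncompressed bitmaps, with bit N set when document N is in the hitlist. The helper names are made up, and a production system would use a dedicated (possibly compressed) bitmap structure instead.

```python
def to_bitmap(hitlist):
    """Set bit N for every document id N in the hitlist."""
    bitmap = 0
    for doc_id in hitlist:
        bitmap |= 1 << doc_id
    return bitmap


def from_bitmap(bitmap):
    """List the positions of all set bits, in ascending order."""
    hitlist = []
    doc_id = 0
    while bitmap:
        if bitmap & 1:
            hitlist.append(doc_id)
        bitmap >>= 1
        doc_id += 1
    return hitlist


britney = to_bitmap([1, 3, 44, 555])
spears = to_bitmap([2, 3, 5, 44])
print(from_bitmap(britney & spears))  # [3, 44]
```

The AND itself is a single operation over the whole bitmap, but note that the bitmap for a hitlist containing document 555 needs 556 bits even though it holds only four documents, which is the sparse-hitlist penalty mentioned above.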
I’m still not sure which representation is the best; maybe some kind of a hybrid solution. There’s for sure a place for optimization in this area.
There are a lot of open questions for this problem. One of them is what the persistence layer should look like. Another is the internal representation of hitlists: how to keep them always sorted, whether sorting should be done on retrieval or on update, and how to retrieve only part of a hitlist if it is compressed. Yet another idea is to use solid state drives as storage to reduce the disk seek problem.
There’s a lot of work to be done in this area.
I think that such a full-text search engine could fit perfectly as a piece of infrastructure in many websites.