My thoughts on real time full-text search

Search engines usually serve data that is a few days out of date, but Twitter search seems to return real-time results. That makes it interesting to ask how it works.

In this post I’ll give a short introduction to full-text search engines and my own thoughts on a possible implementation of a better one.

Let’s start from the beginning.

How does normal full-text search work?

It’s really simple. When you search for a single keyword, like britney, the search engine looks up the list of documents that contain this keyword. This list comes from an inverted index; I’ll call it a ‘hitlist’. For example:

britney -> [document_1, document_3, document_44, document_555]

When you enter two keywords, like britney spears, you really want a list of documents that contain both keywords “britney” and “spears”. Under the hood, the search engine retrieves the hitlists for both keywords. Then it looks for documents that appear in both retrieved lists.

"britney" -> [document_1, document_3, document_44, document_555]
 "spears" -> [document_2, document_3, document_5, document_44]

Computed result:

"britney" AND "spears" -> [document_3, document_44]

The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
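
To make this concrete, here is a minimal sketch in Python (with made-up document ids) of ANDing two presorted hitlists in a single linear pass:

    def intersect(hits_a, hits_b):
        # Intersect two presorted hitlists in O(len(a) + len(b)).
        result = []
        i = j = 0
        while i < len(hits_a) and j < len(hits_b):
            if hits_a[i] == hits_b[j]:
                result.append(hits_a[i])
                i += 1
                j += 1
            elif hits_a[i] < hits_b[j]:
                i += 1
            else:
                j += 1
        return result

    britney = [1, 3, 44, 555]
    spears = [2, 3, 5, 44]
    print(intersect(britney, spears))   # prints [3, 44]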

So what’s the problem?

Imagine that the hitlists are big. In the pessimistic case a hitlist will contain references to all of your documents. For example, the word ‘the’ probably appears in every document, so the hitlist for this word would be huge.

To reduce RAM and CPU usage while comparing hitlists, each one should be stored in a contiguous chunk of memory. That makes adding and removing items from hitlists very expensive.

That’s the reason why creating indexes is usually a very different process from querying them. Normally creating an index is a separate stage; once this stage is finished you can use the computed, usually immutable, index.

The problem is that it becomes very hard to create indexes when you have more data than RAM. That’s why huge data sets are usually split between many reasonably sized indexes.

On the other hand, splitting indexes is not ideal. Reading one hitlist from many indexes means reading from many index files, and disk operations – disk seeks – are really expensive. Consider:

  • reading one hitlist from one index: 1-2 disk seeks – 8-20ms
  • reading one hitlist from ten indexes: 10-20 disk seeks – 80-200ms

As you can see, it’s better to have a hitlist in one chunk and in one index.

As mentioned before, once this chunk is stored on disk it should be immutable – updating it can be very expensive. That’s why full-text search engines usually serve data that is a few days out of date: every now and then the indexes are recreated from scratch.

On the other hand, Twitter search serves realtime data

That’s why I was wondering how Twitter search works. Serving realtime data is not something that normal full-text search engines can do.

But let’s take a closer look at page loading times for a simple query ‘text search’:

  • 2nd page: 0.44 seconds
  • 20th page: 6.72 seconds
  • 30th page: timeout (proxy error)

Hmm. A timeout on a search is not a very good result. To me it suggests that they just use some kind of SQL-based full-text search. That works fine while the data is cached in memory, but once you try to read old data it becomes impossible to fetch it from disk in reasonable time.

So it looks like Twitter is not really doing full-text search, if we understand the term ‘full-text search’ as defined at the beginning of this post. Rather, Twitter is using some kind of hack around a SQL database. That seems to work fine for the first few result pages.

Is it possible to create true realtime full-text search indexing?

I don’t know in general.

I was thinking about a particular Twitter-like use case. I’d also like to add some other assumptions:

  • There shouldn’t be an explicit indexing stage. Once you add data to the search engine, it’s ready to be served.
  • The search engine is always ready to serve queries.
  • We’re talking about pretty big scale. For a small scale (the index fits in RAM) Lucene or Zope could do realtime search indexing pretty well.
  • There aren’t any particularly strong latency requirements for queries. It would be good to serve most requests quickly, say within 100ms, but it’s acceptable if someone has to wait a few seconds now and then.
  • Document_ids are monotonic and based on a timestamp: the newer the document, the bigger the document_id. This simplifies keeping hitlists sorted a lot (see the sketch after this list).
  • We’re talking about a Twitter-like scenario: A lot of tiny documents.
  • Changes to the data should be visible in query results instantly. Let’s say that in the pessimistic case the delay should never exceed 5 minutes.
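
To illustrate the timestamp-based id assumption, here is one hypothetical way of minting document_ids (the 20-bit sequence field is my own arbitrary choice): the high bits carry the creation time, so newer documents always get bigger ids and hitlists stay ordered by age.

    import itertools
    import time

    _sequence = itertools.count()

    def new_document_id():
        # Milliseconds since the epoch in the high bits, a wrapping
        # sequence number in the low 20 bits to break ties within
        # the same millisecond.
        return (int(time.time() * 1000) << 20) | (next(_sequence) & 0xFFFFF)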

Even with these assumptions, I’m not sure the problem is solvable. I’m not sure if anyone has ever built a working and scalable implementation of it (correct me if I’m wrong!).

Based on these requirements, let’s try to design the API for such a realtime full-text search engine:

  • add_to_index(document_id, keywords)
  • del_from_index(document_id)
  • search(query) -> returns list of document_ids
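
As a toy, purely in-memory sketch of that interface (Python; a real implementation would keep the hitlists in the persistent storage discussed later, not in a dict):

    class RealtimeSearchEngine:
        def __init__(self):
            self.index = {}   # keyword -> sorted hitlist of document_ids

        def add_to_index(self, document_id, keywords):
            for keyword in keywords:
                # document_ids are monotonic, so appending keeps the hitlist sorted
                self.index.setdefault(keyword, []).append(document_id)

        def del_from_index(self, document_id):
            for hitlist in self.index.values():
                if document_id in hitlist:
                    hitlist.remove(document_id)

        def search(self, query):
            hitlists = [self.index.get(k, []) for k in query.split()]
            if not hitlists:
                return []
            # AND all the hitlists together
            result = set(hitlists[0])
            for hitlist in hitlists[1:]:
                result &= set(hitlist)
            return sorted(result)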

Initial feed of data

By design, there is no initial phase of creating and optimizing indices. That’s not a particularly bad thing, but it means that at the start you need to feed the search engine with the current offline data and then add and update records on the fly.

It also means that once we’re in operation we don’t really care much about the speed of adding a document to the index. On the web, relatively little data is created or changed at any moment, so unless adding a document is particularly slow (more than a second), the speed doesn’t matter.

On the other hand, when we feed the initial data into the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.

Let’s work out how fast adding documents needs to be if we want to index a dozen or so gigabytes of text. The estimate requires some assumptions:

  • Let’s index 12.5 gigabytes of text: an average social network could have about that much data.
  • Let’s consider Twitter’s data model – a document is at most 140 characters long.
  • On average, an English word is about 5.1 letters long.

Okay. The counting begins.

12.5 gigabytes / 5.1 letters per word ≈ 2.45 * 10^9 words

The main metric of indexing speed is not the number of unique documents, nor even the number of unique keywords. What we’re interested in is the number of unique (document_id, keyword) tuples.

Let’s assume that adding one (document_id, keyword) tuple takes 1ms:

(2.45 * 10^9 words * 0.001 seconds) / 60.0 sec / 60.0 min / 24.0 hours ≈ 28 days

Ouch. Reindexing 12.5 gigabytes of data would take about a month. That’s not good.

But we can do it faster! If adding one tuple costs 0.1ms instead, the job is done in under 3 days. That’s more reasonable: we can afford to spend a few days of work to reindex all of the data.

The problem is that 0.1ms per tuple means that we need to create a really fast system.

Disk seeks

Adding one (document_id, keyword) tuple costs us a disk seek – we need to save the data. That’s around 8ms. But we need 0.1ms per added tuple to make everything run in reasonable time.

The conclusion is easy to predict – we need 80 disks and must add 80 tuples in parallel to bring the average cost down to 0.1ms.

To sum up: indexing 12.5 gigabytes of data would require about 3 days of work and 80 disks, possibly spread across multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.

Optimizing

But maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:

  • add_to_index(document_id, keywords, synchronization_latency=60 seconds)

Now we can choose to cache hitlists for a period of time, during which they grow bigger before being written to disk. That’s not strictly realtime indexing, but it’s a reasonable compromise.
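
A minimal sketch of that idea (Python; storage_append is a stand-in for whatever persistence layer we end up with): updates are buffered per keyword and only flushed once the configured latency has passed. Note that, to keep queries realtime, the search path would also have to consult these in-memory buffers.

    import time
    from collections import defaultdict

    class BufferedIndexWriter:
        def __init__(self, storage_append, synchronization_latency=60.0):
            self.storage_append = storage_append   # writes (keyword, ids) to disk
            self.latency = synchronization_latency
            self.pending = defaultdict(list)       # keyword -> buffered document_ids
            self.dirty_since = {}                  # keyword -> time of first buffered update

        def add_to_index(self, document_id, keywords):
            now = time.time()
            for keyword in keywords:
                self.pending[keyword].append(document_id)
                self.dirty_since.setdefault(keyword, now)
            self.flush(now)

        def flush(self, now=None):
            # Write out every hitlist that has been dirty for longer than `latency`.
            now = now if now is not None else time.time()
            for keyword in list(self.dirty_since):
                if now - self.dirty_since[keyword] >= self.latency:
                    self.storage_append(keyword, self.pending.pop(keyword))
                    del self.dirty_since[keyword]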

On the other hand – how many disk seeks could we save using this trick?

It’s not so easy to answer that question, but fortunately we can simplify our model and get an answer. Let’s forget about the latency of caching a hitlist and instead think about caching a particular number of hitlists.

How much can we win by delaying synchronization?

I prepared an experiment (a sketch of the simulation follows the list):

  • Take a few gigabytes of a text-only English Wikipedia dump.
  • Count the number of unique words – this is the number of hitlists produced. In the optimistic case of unlimited memory we would have exactly this number of disk seeks.
  • Count the total number of words. This is the pessimistic number of disk seeks, if we synchronize to disk for every tuple. On the other hand, this model doesn’t need any memory.
  • Assume that we keep the X most often updated hitlists in memory and count the number of disk seeks in this scenario.
  • For each value of X, compute the win against the pessimistic case and the loss against the optimistic case.
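
Here is a sketch of that simulation in Python. One assumption of mine: instead of tracking the ‘most often updated’ hitlists exactly, it keeps the most recently updated ones (LRU), which is close enough for an estimate; every eviction and every final flush counts as one disk seek.

    from collections import OrderedDict

    def count_disk_seeks(words, cache_size):
        cache = OrderedDict()   # keyword -> number of buffered updates
        seeks = 0
        for word in words:
            if word in cache:
                cache[word] += 1
                cache.move_to_end(word)        # still buffered, no disk access
            else:
                if len(cache) >= cache_size:
                    cache.popitem(last=False)  # evict the coldest hitlist...
                    seeks += 1                 # ...which costs one write (seek)
                cache[word] = 1
        return seeks + len(cache)              # final flush of whatever is still buffered

    # words = open('wikipedia.txt').read().split()
    # count_disk_seeks(words, 65536) can then be compared against
    # len(set(words)) (the optimum) and len(words) (the worst case).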

Here are the results for a few chunks of data from Wikipedia:

DATA – disk seeks

                          500MB-chunk#1   500MB-chunk#2         1GB
optimum – unique words          1036547          971534     1634688
worst – total words            85016295        84660053   171409013
16 hitlists cached             66941279        67396522   134685743
256 hitlists cached            34076915        33975288    68475626
4096 hitlists cached           15648343        14814646    31308282
16384 hitlists cached           8190192         7800053    16150592
65536 hitlists cached           3764562         3565387     7415305
0.5M hitlists cached            1425247         1329566     2639209
1M hitlists cached              1211102         1130571     2107068
16M hitlists cached             1040521          975186     1650374

BOOST – how much better than the worst scenario

                          500MB-chunk#1   500MB-chunk#2         1GB
16 hitlists                        1.27            1.26        1.27
256 hitlists                       2.49            2.49        2.50
4096 hitlists                      5.43            5.71        5.47
16384 hitlists                    10.38           10.85       10.61
65536 hitlists                    22.58           23.74       23.12
0.5M hitlists                     59.65           63.67       64.95
1M hitlists                       70.20           74.88       81.35
16M hitlists                      81.71           86.81      103.86

ANTI BOOST – how much worse than the optimistic scenario

                          500MB-chunk#1   500MB-chunk#2         1GB
worst case                        82.01           87.14      104.86
16 hitlists                       64.58           69.37       82.39
256 hitlists                      32.88           34.97       41.89
4096 hitlists                     15.10           15.25       19.15
16384 hitlists                     7.90            8.03        9.88
65536 hitlists                     3.63            3.67        4.54
0.5M hitlists                      1.37            1.37        1.61
1M hitlists                        1.17            1.16        1.29
16M hitlists                       1.00            1.00        1.01

Results:

  • With 16 million hitlists cached in memory we are very close to the optimum (ANTI BOOST ≈ 1). That’s not surprising, since the data contains far fewer than 16 million unique words, so at that point every hitlist stays in memory.
  • With 0.5-1 million hitlists in memory we are only about 20-60% worse than the optimum.
  • With 1 million hitlists we are 70x-81x better than the pessimistic scenario.

This suggests the optimum cache size is around 0.5 to 1 million hitlists. Such a cache should not use more than about 0.5 GB of RAM, and it could give us around 70 times fewer disk writes. Sounds reasonable.

That means we could possibly index 12.5 gigabytes of data in about 4 days using a single disk. But to achieve that we’d need very fast software that is purely I/O bound.

Architecture

Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.

By scalable I mean being able to scale horizontally – you could add a new disk or a new server and the software would just work without any special configuration changes.

Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.

I believe the architecture should separate two concerns. The actual indexing, querying and other operations on hitlists are very different from actually storing data on disk. That’s why I think the biggest challenge is not the search-engine logic, but rather a scalable and ultra-fast persistence layer.

Scalable, persistent key-value storage

Scalable, persistent key-value databases are a big topic in their own right. We would need such a system as the persistence layer for our search engine.

The most important features are:

  • Simple key-value storage is enough for our needs.
  • There must be an “append” operation. We don’t want to retrieve a multi-megabyte hitlist just to add one item to it.
  • As fast as possible; latency must be kept very low.
  • We don’t need redundancy – speed is more important. In case of a disk failure we can afford to recreate the index from scratch.
  • No eventual consistency – we need the correct value as soon as we ask for it.
  • No transactions; we don’t need them. Optimistic locking is enough.
  • Retrieving a record should cost at most one disk seek.
  • Adding data should cost at most two disk seeks (one to retrieve the record, another to save it).
  • I like the memcached binary protocol, so it would be nice to have it as the interface.
  • Scalable – adding a new server shouldn’t require stopping the service and should need minimal configuration changes.
  • The client API for this storage needs to be able to retrieve data from many servers in parallel. Remember that every interaction with the storage could take a few milliseconds if it requires a disk seek.
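
Boiled down to code, the whole interface the search engine needs from this storage is tiny (a sketch; the names are mine, not any existing product’s API):

    class HitlistStorage:
        # One key (keyword) maps to one hitlist blob.

        def get(self, keyword):
            # Return the stored hitlist for `keyword`; at most one disk seek.
            raise NotImplementedError

        def append(self, keyword, document_ids):
            # Append document_ids to the stored hitlist without reading the
            # whole value back; at most two disk seeks.
            raise NotImplementedError

        def get_many(self, keywords):
            # Fetch several hitlists, ideally in parallel, one request per server.
            return {keyword: self.get(keyword) for keyword in keywords}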

There are various open-source projects that could be used as a scalable, persistent key-value store, but none of the ones I found fully satisfies these needs (I’d be happy to hear about anything I missed – yesterday Richard Jones wrote a better comparison).

Making it even faster

There are some very interesting ideas about how hitlists could be represented and stored.

The basic solution is to store each hitlist as a sorted list of integers, like:

'britney' -> [1,3,4,44,122,123]

To reduce disk usage, this list could just be compressed using, for example, gzip. Unfortunately gzip doesn’t give us a good compression ratio. My tests show a ratio of only around 1.16x.

However, we could modify the hitlist to store only the differences between consecutive elements. Our ‘britney’ hitlist would then look like:

'britney' -> [1,2,1,40,78,1]

The gzip compression ratio is now much better – around 1.63x. That’s not astonishing, but it could be worth the CPU power spent on compression.
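
A small sketch of the delta trick in Python (the compression-ratio helper is only meaningful on long hitlists; tiny lists are dominated by gzip overhead):

    import gzip

    def delta_encode(hitlist):
        # [1, 3, 4, 44, 122, 123] -> [1, 2, 1, 40, 78, 1]
        if not hitlist:
            return []
        return [hitlist[0]] + [b - a for a, b in zip(hitlist, hitlist[1:])]

    def delta_decode(deltas):
        hitlist, total = [], 0
        for d in deltas:
            total += d
            hitlist.append(total)
        return hitlist

    def gzip_ratio(numbers):
        raw = b"".join(n.to_bytes(4, "little") for n in numbers)
        return len(raw) / len(gzip.compress(raw))

    assert delta_decode(delta_encode([1, 3, 4, 44, 122, 123])) == [1, 3, 4, 44, 122, 123]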

A totally different approach is to store hitlists as bitmaps. This can be memory-inefficient, but it makes the binary operations AND, OR, NAND very, very fast. To reduce the memory penalty we could use compressed bitmaps. It could be a perfect way of storing very long hitlists; on the other hand it could be worse for small, sparse ones.
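
A tiny illustration of the bitmap idea (a plain Python integer stands in for the bitmap; a real implementation would use a compressed bitmap format): the AND of two keywords becomes a single bitwise operation.

    def to_bitmap(hitlist):
        bitmap = 0
        for document_id in hitlist:
            bitmap |= 1 << document_id      # set one bit per document
        return bitmap

    def from_bitmap(bitmap):
        return [i for i in range(bitmap.bit_length()) if bitmap >> i & 1]

    britney = to_bitmap([1, 3, 44, 555])
    spears = to_bitmap([2, 3, 5, 44])
    print(from_bitmap(britney & spears))    # prints [3, 44]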

I’m still not sure which representation is best; maybe some kind of hybrid. There’s certainly room for optimization in this area.

Conclusion

There are a lot of open questions here. One is what the persistence layer should look like. Another is the internal representation of hitlists: how to keep them always sorted, whether sorting should be done on retrieval or on update, and how to retrieve only part of a hitlist if it is compressed. Yet another idea is to use solid-state drives as storage to reduce the disk-seek problem.

There’s a lot of work to be done in this area.

I think that such a full-text search engine could fit perfectly as a piece of infrastructure for many websites.





by marek on 20/01/09
  1. Nathan Schrenk
    on 20/01/09 at 8:42 pm

    Google Groups and Google News are two examples of sites that update an index in “real time”. When you send an email to a Google Group it is indexed and available in search results very quickly.

    One technique that can work is to use two different types of indexes: a small in-memory index that holds new data and allows online updates and a large index that is produced offline. The query processor executes each search against both the “new” and the “old” indices in parallel and merges the results. A batch job periodically rebuilds the large index, incorporating data from the small index. When a new large index is brought online the duplicate data can be purged from the small index, freeing memory.

    This example contains 2 indices, but you can extrapolate this to many indices.

  2. My question is: must this real-time indexing engine be consistent? My answer is that it does not. So imagine this scenario. Each new incoming message (140 bytes long – approximately 23 words) is sent as {keyword, {messageid, timestamp}} to approximately 23 destination storages, one for each of the possible keywords (~30000). This can easily be made distributed and redundant. In each storage, messages are stored as {messageid, timestamp} sorted by timestamp. Some number of recent messages are kept in memory. Now querying: for each query one process is spawned; for each keyword in the query it asks the storages, the storages send {messageid, timestamp} pairs ordered from newer to older, and the query engine does the merging (a set operation). Messages can be sent in chunks. When there are enough results, stop and fetch the messages from the message storage by messageid. Nobody cares if some bleeding-edge new message is missing from the result (the requester can’t know the result is incomplete). Nobody cares if some very old message is missing from the result. Nothing very bad happens. Easy, scalable, nice. (And a nice job for Erlang, of course ;-) )

  3. Anonymous
    on 20/01/09 at 11:48 pm

    eBay’s search functionality is real-time, items appear in search results within a few seconds. Check out http://www.softwaresummit.com/2007/speakers/presentations/PritchetteBayCSS2007.pdf starting at slide 32.

  4. The main+delta scheme that Nathan described is very powerful and solves the nearly-realtime-indexing problem well enough. The fabulous open-source fulltext search engine Sphinxsearch supports this kind of indexing, as described here:
    http://www.sphinxsearch.com/docs/current.html#live-updates


  5. * 2nd page: 0.44 seconds
    * 20th page: 6.72 seconds
    * 30th page: timeout (proxy error)

    I am not sure this basic premise of all this is correct. I just did a search and all of the pages took less than 0.2 seconds.

    Also, from what I can gather from my observations, they must be using something like memcache for storing keyword -> [doc_ids]. That’s why they are so fast to pick up the count of new tweets coming into the system. Then when you actually hit refresh it picks up the doc_ids from memcache and fetches those ids from the table. From what I gather they relay new tweets to some other system where all this search stuff happens, so they must be storing all the tweets twice. They also seem to limit the number of tweets per keyword to 1000.

  6. I however may be completely and utterly wrong :)

  7. Piyush: I just did a search and all of them were less than 0.2 sec

    Maybe try page 99? For me it returns a 502, but probably after a few clicks it will become available.

    Stefan: Sphinxsearch supports this kind of indexing

    From docs: “In this case, “live” (almost real time) index updates could be implemented using so called “main+delta” scheme.”

    So every now and then you just reindex all the data that could possibly be new. That’s a working hack at a small scale, but reindexing data many times is not really good. I was using Sphinx a lot and the problem is managing the indexes. This “main+delta” scheme leads to having a lot of small indexes, which you need to merge and reconfigure by hand once they are big enough. Then every few days you need to reindex everything in case the old data was changed or removed.

    Anonymous: eBay’s search functionality is real-time

    Thanks for the link, it’s very interesting.

    Nathan Schrenk: One technique that can work is to use
    two different types of indexes: a small in-memory index
    that holds new data and allows online updates and
    a large index that is produced offline.

    I really like that idea. I’m still wondering if it’s scalable enough and whether there’s off-the-shelf software that can do it.

  8. Nice article. I expect you could improve update and query times dramatically by partitioning the index by keyword across different machines / disks. If you did this using a consistent / known hash, you’d hit exactly the right machines for queries and updates.

    Not sure if a generic KV store would be optimal – maybe Scalaris, because keys are stored in order.

    I also entertained a passing thought about trying to use (counting) bloom filters in the solution somewhere, but didn’t get anywhere.

  9. We implemented a real time search for Barclay’s BBM customer service using Lucene. We do replication by hand – sending all changes to all the instances of the index. Lucene implements all the techniques you described (although it uses caching, rather than keeping any particular overlay of the index in memory). Older systems like Glimpse (which now seems to be completely extinct) do as well.

    As for your database list: Sleepycat BDB Java Edition satisfies your requirements, as long as there is enough memory to hold all the non-leaf nodes in the b-tree. That’s true of any database that uses b-trees. Writes take one seek, but often more than one block write – each write replaces all the ancestors of the changed node – it’s effectively a journal. Because of the journaling property, it can be efficiently replicated: files either get longer, or are completely deleted. Files which should not be present are ignored.

    Sleepycat BDB is now Oracle BDB

    Oracle’s performance figures concentrate on transactions, which you won’t need – since after a failure you can replicate from another index. If you turn off transactions and batch the updates, you might be able to cope with 1000 updates per second on stock hardware.

  10. marek,
    I use the Sphinxsearch main+delta scheme on a single server for indexing 6 mln web documents with approx. 10 GB of text data. The nightly reindex takes no more than 30 minutes. Thus I don’t get many small indices, just one big main index (rebuilt every night) and a small delta index containing all the new data. And I don’t have to bother with merging and reconfiguring; the process runs completely on its own.
    So Sphinxsearch gives me a nearly-live (latency about 5 minutes, but just 1 minute would also be possible) fulltext index on a single server with a single hard disk. I consider this more scalable than having to use 80 disks for the same amount of data – or messing around with caching and compressing the hitlists.

  11. Stefan: I use the Sphinxsearch main+delta scheme on a single server. The nightly reindex takes no more than 30 minutes.

    It’s great to know that Sphinx works in your scenario.

  12. marek,
    please don’t get me wrong. I think your idea of using Erlang and a persistent key-value storage as the building blocks for fulltext search is very interesting. It’s just bad “marketing” for your idea to concentrate on the real time indexing problem, a problem which can easily be solved using other techniques.
    Using distributed key-value storage tied together by Erlang processes could, however, solve the scalability issues that arise when your data grows larger and larger until you need dozens of servers. This is the scenario where you really have to deal with a lot of Sphinxsearch indices and manual work.

  13. I also found FAST, a Microsoft subsidiary. From the overview:

    FAST ESP is designed to run in a low-cost, commodity hardware, grid-like environment. XML Content: native XML indexing.

    There are some interesting numbers in the pdf:

        Highest query rate: 2,000 queries/second
        Fastest update rate: 500 updates/second
        Highest compression: 250M documents/machine
  14. FAST ESP uses something similar to “index + delta”; they call this “partitions”. Basically, you index data into one partition, then when it’s exhausted a new partition is created and filled with data. More knowledge unfortunately falls under the NDA we have with FAST.

  15. For completeness: [old link: http://projects.linpro.no/pipermail/varnish-dev/2009-February/000968.html] search.twitter.com uses varnish

  16. [...] LShift: Thoughts on real time full text search [...]

  17. Very nice article. In the meanwhile Redis 1.0 was released; it has very useful features for solving some of these points:
    * It’s persistent, so if a node goes down it can be restarted without any latency due to a “warm up” phase.
    * It implements sort operations on lists whose results can persist: when a new id comes in, the list can be sorted and remain in that state. Useful if we want to provide fresh results in a LIFO fashion.

  18. I have one question that I haven’t been able to answer, although I’ve run many tests: what determines full-text search speed – the database size, or the number of records in the database?

 
 

