Some time ago we got an interesting bug report for RabbitMQ. Surprisingly, unlike other complex bugs, this one is easy to describe:
At some point basic.get suddenly starts being very slow - about 9 times slower!
Usually, search engines can look through data outdated by a few days. But Twitter search seems to be returning real time search results. That’s why it’s interesting how it works.
In this post I’ll present a short introduction to full-text search engines and my private thoughts about a possible implementation of a better one.
Let’s start from the beginning.
It’s really simple. When you search for a single keyword, like ‘britney’, the search engine finds a list of documents that contain this keyword. This list is called an inverted index. I call this list a ‘hitlist’. For example:
‘britney’ -> [document_1, document_3, document_44, document_555]
When you enter two keywords, like ‘britney spears’, you really want to get a list of documents that contains both keywords ‘britney’ and ‘spears’. Under the hood, search engine retrieves inverted indexes for both keywords. Than it looks for documents that appear in both retrieved indexes.
‘britney’ -> [document_1, document_3, document_44, document_555] ‘spears’ -> [document_2, document_3, document_5, document_44]Computed result:
‘britney’ AND ‘spears’ -> [document_3, document_44]
The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
Imagine that inverted indexes (I call them `hitlists`) are big. In a pessimistic case a hitlist will contain references to all of your documents. For example the word ‘the’ would probably appear in every document, so the hitlist for this word would be huge.
To reduce RAM and CPU usage while comparing hitlists, they should be stored in a linear chunk of memory. That makes adding and removing items from hitlists very expensive.
That’s the reason why creating indexes is usually very different process from querying them. Normally creating an index is a separate stage. Once this stage is finished you can use computed, usually immutable, index.
The problem is that it becomes very hard to create indexes when you have more data then RAM. That’s why huge data sets are usually split between many reasonably sized indexes.
On the other hand splitting indexes not very good. Remember, that reading a hitlist from many indexes requires reading data from many index files. Remember that disk operations - disk seeks, are really expensive. Consider:
As you can see, it’s better to have a hitlist in one chunk and in one index.
As mentioned before, once this chunk is stored on the disk, it should be immutable - updating it can be very expensive. That’s why usually full-text search engine serve data outdated by a few days. Every now and then the indexes are recreated from scratch.
However, Twitter search serves realtime data. It’s very different from what normal search engines can do.
That’s why I was wondering how Twitter search works. Serving realtime data is not a thing that normal full-text search engines can do.
But let’s take a closer look at page loading times for a simple query ‘text search’:
Hmm. Timeout on a search is not a very good result. For me it means that they just use some kind of sql-based full text search. It works fine when data is cached in memory. But once you try to read old data - it becomes impossible to read data from disks in reasonable time.
So it looks like Twitter is not really using a full text search, if we understand the term ‘full text search’ as defined at the beginning of this post. Rather Twitter is using some kind of a hack around a SQL database. This seems to be working fine for the first few result pages.
I don’t know in general.
I was thinking about a particular Twitter-like use case. I’d also like add some other assumptions:
Even with these assumptions, I’m not sure if the problem is solvable. I’m not sure if anyone has ever done a working and scalable implementation for that problem (correct me if I’m wrong!).
Based on these requirements, let’s try to design the API for such a realtime full-text search engine:
As mentioned before, we don’t have the initial phase of creating and optimizing indices by design. That’s not a particularly bad thing, but it means that in the beginning you need to feed this search engine with current offline data and then update and add new records on-the-fly.
It means that we actually don’t really care about the speed of adding a document to index when we’re in operation. On the web quite a small amount of data is created or changed. So unless adding document is particularly slow (more than a second), we really don’t care about the speed.
On the other hand, when we feed initial data to the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.
Let’s count how fast adding documents needs to be, in a situation when we’d like to index a dozen gigabytes of text. This experiment requires some assumptions:
Okay. The counting begins.
12 gigabytes / 5.1 letters_per_word = 2.45 * 10^9 words
The main metric of indexing speed is not the number of unique documents, it’s not even the number of unique keywords. What we’re interested in, is the number of unique tuples (document_id, keyword).
Let’s assume that adding one tuple document-keyword takes 1ms:
(2.45*10^9 words * 0.001 seconds)/60.0 sec/60.0 min/24.0 hours = 30.5 days
Ouch. Reindexing of 12 gigabytes of data would take 31 days. That’s not good.
But we can do it faster! Let’s assume that adding one tuple costs 0.1 ms: we do our job in 3.5 days. That’s more reasonable. We can spend 4 days of work to reindex all of the data.
The problem is that 0.1ms per tuple means that we need to create a really fast system.
Adding one tuple (document_id, keword) costs us a disk seek - we need to save data. That’s around 8ms. But we need to have 0.1ms per added tuple to make everything run in reasonable time.
The result is easy to predict - we need to have 80 disks and add 80 tuples in parallel, leading to 0.1ms average cost.
To sum up: indexing 12.5 gigabytes of data would require 3.5 days of work and 80 disks, possibly on multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.
But, maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:
Now we can choose to cache hitlists for a period of time. During this time the hitlists would grow bigger before being written to disk. That’s not really realtime indexing, but it’s a reasonable compromise.
On the other hand - how many disk seeks could we save using this trick?
It’s not so easy to answer this question, but fortunately, we can simplify our model and get an answer. Let’s forget about the latency of cacheing the hitlist and think about caching a particular number of hitlists.
I prepared an experiment:
Here’re the results for few a chunks of data from wikipedia:
| DATA |
500MB-chunk#1
|
500MB-chunk#2
|
1GB
|
| optimum - unique words |
1036547
|
971534
|
1634688
|
| worst - total words |
85016295
|
84660053
|
171409013
|
|
16 hitlists cached
|
66941279
|
67396522
|
134685743
|
|
256 hitlists cached
|
34076915
|
33975288
|
68475626
|
|
4096 hitlists cached
|
15648343
|
14814646
|
31308282
|
|
16384 hitlists cached
|
8190192
|
7800053
|
16150592
|
|
65536 hitlists cached
|
3764562
|
3565387
|
7415305
|
|
0.5M hitlists cached
|
1425247
|
1329566
|
2639209
|
|
1M hitlists cached
|
1211102
|
1130571
|
2107068
|
|
16M hitlists cached
|
1040521
|
975186
|
1650374
|
| BOOST - better than worst scenario |
|
|
|
|
16 hitlists
|
1.27
|
1.26
|
1.27
|
|
256 hitlists
|
2.49
|
2.49
|
2.50
|
|
4096 hitlists
|
5.43
|
5.71
|
5.47
|
|
16384 hitlists
|
10.38
|
10.85
|
10.61
|
|
65536 hitlists
|
22.58
|
23.74
|
23.12
|
|
0.5M hitlists
|
59.65
|
63.67
|
64.95
|
|
1M hitlists
|
70.20
|
74.88
|
81.35
|
|
16M hitlists
|
81.71
|
86.81
|
103.86
|
| ANTI BOOST - worst than optimistic scenario |
|
|
|
|
worst case
|
82.01
|
87.14
|
104.86
|
|
16 hitlists
|
64.58
|
69.37
|
82.39
|
|
256 hitlists
|
32.88
|
34.97
|
41.89
|
|
4096 hitlists
|
15.10
|
15.25
|
19.15
|
|
16384 hitlists
|
7.90
|
8.03
|
9.88
|
|
65536 hitlists
|
3.63
|
3.67
|
4.54
|
|
0.5M hitlists
|
1.37
|
1.37
|
1.61
|
|
1M hitlists
|
1.17
|
1.16
|
1.29
|
|
16M hitlists
|
1.00
|
1.00
|
1.01
|
Results:
It means that the optimum size of the cache is around 0.5 to 1 mln hitlists. This scenario should not use more than 0.5 GB of RAM. This cache size could give us around 70 times fewer updates to disk. Sounds reasonable.
That means that we could possibly index 12.5 gigabytes of data in 4 days - using one disk. But to achieve that we’d need very fast software, that would be only I/O bound.
Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.
By scalable I mean being able to scale horizontally - you could add a new disk or a new server and the software would just work without any special configuration changes.
Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.
I believe that the architecture of this project should look like this:
It’s clear that the actual indexing, querying or other operations on hitlists are very different than actually storing data on disks. That’s why I think that the biggest challenge is not really the search engine logic but rather a scalable and ultra fast persistence layer.
Scalable and persistent key-value databases are a very long topic. We would need such a system as a persistence layer for our search engine.
The most important features are:
There are various open-source projects that can be used as a scalable persistent key-value storage, but no project fully satisfies my needs. Here’s the list of what I found at this point (I’d be happy to update this list if I missed something yesterday Richard Jones wrote better comparison):
There are some very interesting ideas about how hitlists could be represented and stored.
The basic solution is to store each hitlist as a list of sorted integers. Like:
‘britney’ -> [1,3,4,44,122,123]
To reduce disk usage, this list could just be compressed using, for example, gzip. Unfortunately gzip doesn’t give us a good compression ratio. My tests show a ratio of only around 1.16x.
However, we could modify the hitlist to only store the differences between elements. Our ‘britney’ hit list would then look like:
‘britney’ -> [1,2,1,40,78,1]
The gzip compression ratio is now much better. It’s around 1.63x. That’s not astonishing, but it could be worth the CPU power wasted on compression.
A totally different approach is to store hitlists as bitmaps. It can be memory inefficient, but it makes binary operations AND, OR, NAND very very fast. To reduce memory penalty we could use compressed bitmaps. It could be a perfect way of storing data for very long hitlists. On the other hand it could be worse for small and sparse hitlists.
I’m still not sure which representation is the best; maybe some kind of a hybrid solution. There’s for sure a place for optimization in this area.
There are a lot of open questions for this problem. One of them is what the persistence layer should look like. The next thing is the internal representation of hitlists, how to make them always sorted and whether sorting should be done on retrieval or on update. How to be able to retrieve only part of them if they are compressed. Yet another idea is to use solid state drives as storage to reduce disk seek problem.
There’s a lot of work to be done in this area.
I think that such a full text search engine could fit perfectly as piece of infrastructure in many websites.
Speaking of OMeta/JS and OMeta in general, I’ve implemented an OMeta for Scheme. Currently it runs in MzScheme, but it should be fairly portable, with dependencies only on a handful of commonly-implemented SRFIs. I intend to properly libraryise it — making it into a proper MzScheme module — and to port it to other schemes, most probably starting with SISC.
One interesting feature of this OMeta is that it implements the error-handling mechanisms suggested by Bryan Ford that I implemented previously in a packrat parsing library for Scheme. The packrat-based error-handling techniques seem to generalise fairly well to an OMeta setting.
hg clone http://www.eighty-twenty.org/hgwebdir.cgi/ometa-scheme/.I’m already using it as part of an experimental compiler: the reader, the parser.
Update: Switched from hosting the ometa-scheme code in Darcs to Mercurial.
After my talk on Javascript DVCS at the Osmosoft Open Source Show’n'tell, I went to visit Osmosoft, the developers of TiddlyWiki, to talk about giving TiddlyWiki some DVCS-like abilities. Martin Budden and I sat down and built a couple of prototypes: one where each tiddler is versioned every time it is edited, and one where versions are snapshots of the entire wiki, and are created each time the whole wiki is saved to disk.
| Regular DVCS | SynchroTiddly |
|---|---|
| Repository | The html file contains everything |
| File within repository | Tiddler within wiki |
| Commit a revision | Save the wiki to disk |
| Save a text file | Edit a tiddler |
| Push/pull synchronisation | Import from other file |
If you have Firefox (it doesn’t work with other browsers yet!) you can experiment with an alpha-quality DVCS-enabled TiddlyWiki here. Take a look at the “Versions” tab, in the control panel at the right-hand-side of the page. You’ll have to download it to your local hard disk if you want to save any changes.
It’s still a prototype, a work-in-progress: the user interface for version management is clunky, it’s not cross-browser, there are issues with shadow tiddlers, and I’d like to experiment with a slightly different factoring of the repository format, but it’s good enough to get a feel for the kinds of things you might try with a DVCS-enabled TiddlyWiki.
Despite its prototypical status, it can synchronize content between different instances of itself. For example, you can have a copy of a SynchroTiddly on your laptop, email it to someone else or share it via HTTP, and import and merge their changes when they make their modified copy visible via an HTTP server or email it back to you.
I’ve been documenting it in the wiki itself — if anyone tries it out, please feel free to contribute more documentation; you could even make your altered wiki instance available via public HTTP so I can import and merge your changes back in.
Yesterday I visited Osmosoft, the developers of TiddlyWiki, to chat about getting some DVCS-like functionality into TiddlyWiki. Jeremy mentioned in passing that TiddlyWiki is, if you squint, a slightly cheating kind of a Quine. It struck me this morning that TiddlyWiki has strong similarities to another famous almost-Quine, the Smalltalk system.
TiddlyWiki is a composition of
The HTML/Javascript container of the tiddlers (along with the web browser it runs on!) is like the Smalltalk virtual-machine. The tiddlers themselves are the live objects in the system. The process of running the embedded plugin scripts is the same as running Smalltalk’s post-image-load startup hooks. The plugins-in-tiddlers is the same as the source-in-the-image. The process of saving the TiddlyWiki instance to disk is the same as a Smalltalk instance extracting its live object graph and serializing it to disk.
The main difference I can see is that Smalltalk doesn’t carry so much of its VM around with its images: like Smalltalk’s reliance on an external VM being present, TiddlyWiki can rely on the browser being present for a big chunk of its VM, but has to carry the bootstrapping container code bundled with the stored tiddlers/live-objects. Also, TiddlyWiki instances are (these days!) constructed by a special assembly process from small, separate text files checked into Subversion, whereas Smalltalk images were not historically constructed from scratch very often at all. Finally, TiddlyWiki’s boot process is heavier than Smalltalk’s, because it’s forced by the browser to recompile all the sources in the system, where Smalltalk gets away with having bytecode in the image alongside the sources.
Upon browsing the source to the excellent MochiWeb, I came across a call to a function that, when I looked, wasn’t defined anywhere. This, it turns out, was a clue: Erlang has undocumented syntactic support for late-bound method dispatch, i.e. lightweight object-oriented programming!
The following example, myclass.erl, is a parameterized module, a feature that arrived undocumented in a recent Erlang release. Parameterized modules are explored on the ‘net here and here. (The latter link is to a presentation that also covers an even more experimental module-based inheritance mechanism.)
-module(myclass, [Instvar1, Instvar2]). -export([getInstvar1/0, getInstvar2/0]). getInstvar1() -> Instvar1. getInstvar2() -> Instvar2.
“Instances” of the “class” called myclass can be created with myclass:new(A, B) (which is automatically provided by the compiler, and does not appear in the source code), where A and B become values for the variables Instvar1 and Instvar2, which are implicitly scoped across the entirety of the myclass module body, available to all functions defined within it.
The result of a call to a new method is a simple tuple, much like a record, with the module name in the first position, and the instance variable values in order following it.
Eshell V5.6 (abort with ^G)
1> Handle = myclass:new(123, 234).
{myclass,123,234}
2> Handle:getInstvar1().
123
3> Handle:getInstvar2().
234
While this looks really similar to OO dispatch in other languages, it’s actually an extension to Erlang’s regular function call syntax, and works with other variations on that syntax, too:
4> {myclass,123,234}:getInstvar1().
123
The objects that this system provides are pure-functional objects, which is unusual: many object-oriented languages don’t clearly separate the two orthogonal features of late-binding and mutable state. A well-designed language should let you use one without the other, just as Erlang does here: in Erlang, using parameterized modules for method dispatch doesn’t change the way the usual mechanisms for managing mutable state are used. “Instance variables” of parameterized modules are always immutable, and regular state-threading has to be used to get the effects of mutable state.
I’d like to see this feature promoted to first-class, documented, supported status, and I’d also very much like to see it used to structure the standard library. Unfortunately, it’s not yet very well integrated with existing modules like gb_sets, ordsets and sets. For example, here’s what happens when you try it with a simple lists call:
5> lists:append([1, 2], [3, 4]).
[1,2,3,4]
6> {lists, [1, 2]}:append([3, 4]).
[3,4|{lists,[1,2]}]
Not exactly what we were after. (Although it does give brittle insight into the current internals of the rewrites the system performs: a {foo, ...}:bar(zot) call is translated into foo:bar(zot, {foo, ...}) - that is, the this parameter is placed last in the argument lists.)
A couple of weeks ago, Richard W. M. Jones released JONESFORTH, which I thought was pretty exciting. Today I spent a few hours porting the assembly-language part to PowerPC on Mac OS X 10.3.9. It ended up being 600 non-comment lines of code in total, and took me about eleven hours in total to write and debug. It runs the standard JONESFORTH prelude, up to and including SEE.
You can download the code here: ppcforth.S.m4.
(It’s also available via darcs: darcs get http://www.eighty-twenty.org/~tonyg/Darcs/jonesforth.)
The assembler-macro tricks that the original i386 version uses are sadly unavailable with the default OS X assembler, so I’ve had to resort to using m4 instead; other than that, it’s more-or-less a direct translation of Richard’s original program. To compile it,
m4 ppcforth.S.m4 > ppcforth.S gcc -nostdlib -o ppcforth ppcforth.S rm ppcforth.S
To run it, download the JONESFORTH prelude (save it as jonesforth.f), and
$ cat jonesforth.f - | ./ppcforth JONESFORTH VERSION 14641 OK
Here’s an example session, decompiling the “ELSE” word:
SEE ELSE : ELSE IMMEDIATE ‘ BRANCH , HERE @ 0 , SWAP DUP HERE @ SWAP - SWAP ! ;
I’d like to thank Richard for such an amazingly well-written program: not only is JONESFORTH itself a beautiful piece of software, it’s also an incredibly lucid essay that does a wonderful job of introducing the reader to the concepts and techniques involved in implementing a FORTH.
I’ve recently written an RMI service which has state - transactions. The service is implemented using Sleepycat Java Edition collections, and the transactions map to sleepycat transactions.
The StoredMap class depends on ambient authority: it determines the current transaction from the thread. The methods will all be invoked in on separate threads, so we need to deal with this somehow. This should be pretty simple - surely there will be a ‘join’ method - to join the current thread to the transaction. Nope. So now what?
I’ll just have to keep a thread running for the transaction, and execute transactional code in that thread.
In Java 5, we get Executors and Futures in java.util.concurrent, which make this very simple to implement. In fact, we can add some sugar - I can generate a proxy class for an interface, which executes its methods inside a single thread.
Anyway, I have, and added it to the LShift Java Library. It even mangles stack traces into something you can read.
In normal, direct-style programming in (mostly-)functional languages such as scheme and ML, folding is an operation that crops up all the time in various guises. Most list-manipulation libraries for such languages include implementations of left-fold and right-fold as standard. But what about the situation when you’re programming in continuation-passing style (CPS), such as when you’re writing (or trying to write) a metacircular evaluator? Library support for continuation-passing folds isn’t nearly as common.
Here’s the direct-style left-fold function:
(define (foldl kons knil xs)
(if (null? xs)
knil
(foldl kons (kons (car xs) knil) (cdr xs))))
and here’s the continuation-passing left-fold function:
(define (foldl-k kons knil xs k)
(if (null? xs)
(k knil)
(kons (car xs) knil (lambda (v) (foldl-k kons v (cdr xs) k)))))
Note that kons takes three arguments here, where in the direct-style version, it takes two.
One benefit of having CPS folds available is that they expose more control over the loop. For instance, using a normal fold, there’s no way to terminate the iteration early, but using a CPS fold, your three-argument kons routine can simply omit invoking its continuation parameter (presumably choosing some other continuation to run instead). This means that operations like (short-circuiting) contains?, any, and every can be written with CPS fold, but not with plain direct-style fold:
(define (contains? predicate val elements)
(foldl-k (lambda (elt acc k)
(if (predicate elt val)
#t ;; note: skips the offered continuation!
(k acc)))
#f
elements
(lambda (v) v)))
In every field, and technology is no exception, there is a constant tension between marketing and production attitudes. Marketing is all about making the customer feel that choosing to invest their money with you will make them better off and fulfill their needs. Engineering-driven production, on the other hand, is a constant battle to assure oneself that it is possible, with the resources at hand, to arrive at a satisfactory solution (and making adjustments to the solution or the resources until the equation is solved). Reconciling these two approaches can be very difficult, because the marketing attitude alone is unbounded by the uncomfortable truths of reality, and is therefore always at a risk of promising more than can actually be done. Invariably, if the promises given by people who’s only concern is marketing can’t be met (because they stand in contradiction to what’s actually possible) the marketing value of these promises is diminished, since they end up leaving the customer disappointed.
Take, for example, YAHOO. Earlier today the Internet giant announced that they will, as of May, give users of their web-mail product unlimited storage. As you probably know, nothing is unlimited, or infinite, except (citing a quote attributed to Albert Einstein) the Universe and human stupidity. How then, does YAHOO plan to offer unlimited storage? Surely that would require them to posses unlimited disk-space, in data centers of unlimited size.
What’s going on here? Do the engineers at YAHOO know something we don’t? Probably not. Let’s think for a second what’s really going to happen once YAHOO enables that unlimited storage feature. Most users will continue saving mail at the same amount they did before, which is much less than the 1 GB currently on offer. For these users the change from 1GB to unlimited wont make any difference, and YAHOO wont be suffering from offering them more than they already do.
Other users, the ones who currently almost scratch the top of their allocated 1GB will sigh with relief - no need to delete any of those old emails.
Very few users, perhaps a few hackers, will try to stress-test YAHOO’s offer and will start uploading huge amounts of data. Will they be able to do this ad infinitum? Only time will tell :-) But most likely they wont - YAHOO will have included something in the small letters on their EULA which will allow them to discontinue service to such users, and maybe even chase them in court. We all know where this leads, though - the conflict between those users and YAHOO will become very public, at least for a short while. At the very least, it will reach slashdot, and quite possibly will inspire a public debate about the obligations of service providers offering free use of their online facilities. The resulting effect on public opinion is likely to be embarrassing and even damaging to YAHOO.
Now, we all know why YAHOO is doing this, who their competitors are, which they feel a need to outdo. Google, with its GMail service, has been offering very large amounts of storage space since the very beginning, and the available space is increasing daily, so simply increasing the storage to top whatever it is Google is offering is hardly a satisfactory solution - the only way to be sure that you compete well with Google is to go unlimited.
That is, unless Google responds with its own unlimited offer. Only Google wont (or at least I bet they wont) - and that’s exactly the difference between Google, the Internet wunderkind that has grown so much as to nearly dominate the online world, and YAHOO, the company that probably suffered the most from Google’s growth. Google is an engineering-driven company. It was started by a pair of scientists and is famous for hiring only the best engineers around. When Google is offering 2833MB of storage, it means that there are many people within the company that know exactly what this means, how many disks, in how many data centers, and how is this likely to grow in the context of the company’s global strategy. YAHOO, on the other hand, is very much a marketing-driven company. In YAHOO (I guess), some marketing wizard came up with the brilliant idea of unlimited storage. When the engineers protested politely that there is no such thing as unlimited they were simply hushed. It’s not really unlimited, after all, just sort of unlimited, and anyway, who on earth would take advantage of such an offer … isn’t it illegal or something? We’ll just have to make sure Legal takes care of that.
What does this have to do with LShift, I hear you ask. Well, at LShift we have merged our engineering, marketing and management capacity. If you are an LShift customer, or thinking of becoming one, then whenever you pick up the phone and call us to ask whether this or that can be done, you almost always get to talk to the developer in charge of your project. When we tell you that something can be done, we are positive that it so, and we know why and how to do it. When we tell you that it can’t be done, it is not because we don’t mind disappointing our customers (customer satisfaction is our #1 goal) but because we want to help you get the best solution you can afford. And doing that well, it seems, also happens to be the best marketing strategy we could possibly have.
You are currently browsing the archives for the Reflection category.