
Archive for the ‘Reflection’ Category

Memory matters – even in Erlang

Some time ago we got an interesting bug report for RabbitMQ. Unlike most complex bugs, this one is easy to describe:

At some point basic.get suddenly starts being very slow – about 9 times slower!

Read more…

by marek on 28/02/10

My thoughts on real time full-text search

Search engines can usually only look through data that is a few days out of date, yet Twitter search seems to return real-time results. That makes it interesting to ask how it works.

In this post I’ll present a short introduction to full-text search engines and my private thoughts about a possible implementation of a better one.

Let’s start from the beginning.

How does normal full-text search work?

It’s really simple. When you search for a single keyword, like britney, the search engine looks up the list of documents that contain this keyword. This list is called an inverted index. I call this list a ‘hitlist’. For example:

britney -> [document_1, document_3, document_44, document_555]

When you enter two keywords, like britney spears, you really want a list of documents that contain both “britney” and “spears”. Under the hood, the search engine retrieves the inverted indexes for both keywords. Then it looks for documents that appear in both retrieved indexes.

"britney" -> [document_1, document_3, document_44, document_555]
 "spears" -> [document_2, document_3, document_5, document_44]

Computed result:

"britney" AND "spears" -> [document_3, document_44]

The AND operation must be very fast (usually O(n), where n is the number of documents for a single keyword). To achieve such linear speed, inverted indexes must be presorted.
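
As an illustration (my sketch in Java, not code from any particular engine), the AND step over two already-sorted hitlists is just a linear merge:

import java.util.Arrays;

class Hitlists {
    // Walk both sorted hitlists in step: O(n + m) comparisons.
    static int[] intersect(int[] a, int[] b) {
        int[] out = new int[Math.min(a.length, b.length)];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out[k++] = a[i]; i++; j++; }  // document contains both keywords
        }
        return Arrays.copyOf(out, k);
    }
}

// Hitlists.intersect(new int[]{1, 3, 44, 555}, new int[]{2, 3, 5, 44}) -> {3, 44}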

So what’s the problem?

Imagine that inverted indexes (our hitlists) can get big. In the pessimistic case a hitlist contains references to all of your documents. For example, the word ‘the’ probably appears in every document, so its hitlist would be huge.

To reduce RAM and CPU usage while comparing hitlists, they should be stored in a linear chunk of memory. That makes adding and removing items from hitlists very expensive.

That’s the reason why creating indexes is usually a very different process from querying them. Normally creating an index is a separate stage; once that stage is finished you can use the computed, usually immutable, index.

The problem is that it becomes very hard to create indexes when you have more data than RAM. That’s why huge data sets are usually split across many reasonably sized indexes.

On the other hand, splitting indexes is not ideal. Reading one hitlist from many indexes means reading data from many index files, and disk operations – disk seeks – are really expensive. Consider:

  • reading one hitlist from one index: 1-2 disk seeks – 8-20ms
  • reading one hitlist from ten indexes: 10-20 disk seeks – 80-200ms

As you can see, it’s better to have a hitlist in one chunk and in one index.

As mentioned before, once this chunk is stored on disk it should be immutable – updating it can be very expensive. That’s why full-text search engines usually serve data that is a few days out of date: every now and then the indexes are simply recreated from scratch.

On the other hand, Twitter search serves realtime data

That’s why I was wondering how Twitter search works. Serving realtime data is very different from what normal full-text search engines can do.

But let’s take a closer look at page loading times for a simple query ‘text search’:

  • 2nd page: 0.44 seconds
  • 20th page: 6.72 seconds
  • 30th page: timeout (proxy error)

Hmm. A timeout on a search is not a very good result. To me it suggests that they use some kind of SQL-based full-text search. That works fine while the data is cached in memory, but once you try to read old data it becomes impossible to fetch it from disk in a reasonable time.

So it looks like Twitter is not really doing full-text search, if we understand that term as defined at the beginning of this post. Rather, Twitter is using some kind of a hack around a SQL database, which seems to work fine for the first few result pages.

Is it possible to create true realtime full-text search indexing?

I don’t know in general.

I was thinking about a particular Twitter-like use case. I’d also like to add some other assumptions:

  • There shouldn’t be an explicit indexing stage. Once you add data to the search engine, it’s ready to be served.
  • The search engine is always ready to serve queries.
  • We’re talking about pretty big scale. For a small scale (the index fits in RAM) Lucene or Zope could do realtime search indexing pretty well.
  • There aren’t any particularly strong latency requirements for queries. It would be good to serve most of the requests in a short time, like 100ms, but it’s not a problem if someone has to wait a few seconds now and then.
  • Document_ids are monotonic and based on timestamp. That would mean that the newer the document, the bigger the document_id. This simplifies the process of sorting hitlists a lot.
  • We’re talking about a Twitter-like scenario: A lot of tiny documents.
  • Changes to the data should be visible in query results instantly. Let’s say that in the pessimistic case we shouldn’t get longer delays than 5 minutes.

Even with these assumptions, I’m not sure if the problem is solvable. I’m not sure if anyone has ever done a working and scalable implementation for that problem (correct me if I’m wrong!).

Based on these requirements, let’s try to design the API for such a realtime full-text search engine:

  • add_to_index(document_id, keywords)
  • del_from_index(document_id)
  • search(query) -> returns list of document_ids
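
As a sketch, the API could look like the following Java interface (the names and types here are just illustrative assumptions):

import java.util.List;

public interface RealtimeFullTextIndex {
    // Index a document; it should become searchable almost immediately.
    void addToIndex(long documentId, List<String> keywords);

    // Remove the document from every hitlist it appears in.
    void delFromIndex(long documentId);

    // Return matching document ids, newest (highest id) first.
    List<Long> search(String query);
}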

Initial feed of data

As mentioned before, by design we don’t have an initial phase of creating and optimizing indexes. That’s not a particularly bad thing, but it means that in the beginning you need to feed the search engine with the current offline data and then update it and add new records on-the-fly.

It also means that we don’t really care much about the speed of adding a document to the index once we’re in operation. At any moment only a small amount of data on the web is created or changed, so unless adding a document is particularly slow (more than a second), the speed doesn’t matter.

On the other hand, when we feed initial data to the index we want that operation to complete in reasonable time. Rebuilding indexes from offline data should take at most a few days.

Let’s count how fast adding documents needs to be if we’d like to index a dozen or so gigabytes of text. This experiment requires some assumptions:

  • Let’s index 12.5 gigabytes of text: an average social network could have about that much data.
  • Let’s consider Twitter’s data model – a document is at most 140 characters long.
  • On average, an English word is about 5.1 letters long.

Okay. The counting begins.

12.5 * 10^9 characters / 5.1 letters per word ≈ 2.45 * 10^9 words

The main metric of indexing speed is not the number of unique documents, it’s not even the number of unique keywords. What we’re interested in, is the number of unique tuples (document_id, keyword).

Let’s assume that adding one tuple document-keyword takes 1ms:

2.45 * 10^9 words * 0.001 seconds ≈ 2.45 * 10^6 seconds ≈ 28 days

Ouch. Reindexing 12.5 gigabytes of data would take about a month. That’s not good.

But we can do it faster! Let’s assume that adding one tuple costs 0.1 ms: then the job takes about 3 days. That’s more reasonable. We can spend 3 days of work to reindex all of the data.

The problem is that 0.1ms per tuple means that we need to create a really fast system.

Disk seeks

Adding one tuple (document_id, keyword) costs us a disk seek – we need to save the data. That’s around 8ms. But we need 0.1ms per added tuple to make everything run in reasonable time.

The result is easy to predict – we need 80 disks and to add 80 tuples in parallel, bringing the average cost down to 0.1ms.

To sum up: indexing 12.5 gigabytes of data would require about 3 days of work and 80 disks, possibly spread across multiple machines. That sounds rather expensive. It also means that we need to be able to scale horizontally.

Optimizing

But maybe we don’t need to synchronize the hitlists to disk every time we add a tuple. Let’s change our API a bit:

  • add_to_index(document_id, keywords, synchronization_latency=60seconds)

Now we can choose to cache hitlists for a period of time. During this time the hitlists would grow bigger before being written to disk. That’s not really realtime indexing, but it’s a reasonable compromise.
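
Here is a minimal sketch of that buffering idea (illustrative code; appendToDisk stands in for whatever append call the persistence layer ends up providing):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BufferedIndexWriter {
    private final Map<String, List<Long>> pending = new HashMap<String, List<Long>>();
    private final long latencyMillis;
    private long lastFlush = System.currentTimeMillis();

    BufferedIndexWriter(long latencyMillis) { this.latencyMillis = latencyMillis; }

    void addToIndex(long documentId, List<String> keywords) {
        for (String keyword : keywords) {
            List<Long> hits = pending.get(keyword);
            if (hits == null) {
                hits = new ArrayList<Long>();
                pending.put(keyword, hits);
            }
            hits.add(documentId);                 // grow the in-memory hitlist
        }
        if (System.currentTimeMillis() - lastFlush > latencyMillis) flush();
    }

    void flush() {
        for (Map.Entry<String, List<Long>> entry : pending.entrySet())
            appendToDisk(entry.getKey(), entry.getValue());  // one append per keyword
        pending.clear();
        lastFlush = System.currentTimeMillis();
    }

    // Hypothetical persistence call – stands in for the storage layer's append.
    void appendToDisk(String keyword, List<Long> documentIds) { /* ... */ }
}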

So how many disk seeks could we save using this trick?

It’s not so easy to answer that question, but fortunately we can simplify our model and get an answer. Let’s forget about the latency of caching a hitlist and instead think about caching a particular number of hitlists.

How much can we win by delaying synchronization?

I prepared an experiment:

  • Take a few gigabytes of a text-only English Wikipedia dump.
  • Count the number of unique words – this is the number of hitlists produced. In the optimistic case of unlimited memory we would pay exactly this number of disk seeks.
  • Count the total number of words. This is the pessimistic number of disk seeks, when we synchronize to disk for every tuple. On the other hand, this model doesn’t use any memory.
  • Now assume we keep the X most often updated hitlists in memory, and count the number of disk seeks in this scenario (a sketch of the simulation follows this list).
  • For every value of X, count the win against the pessimistic case and the loss against the optimistic case.
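
Here is a minimal sketch of the simulation (my own illustration, not the exact code used for the numbers below); it approximates “the X most often updated hitlists” with a least-recently-used cache and assumes every cache miss costs exactly one disk seek:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class SeekSimulation {
    public static void main(String[] args) throws Exception {
        final int cachedHitlists = Integer.parseInt(args[1]);   // this is X
        // LRU set of the hitlists currently held in memory.
        Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                    return size() > cachedHitlists;
                }
            };
        long totalWords = 0, diskSeeks = 0;
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        for (String line; (line = in.readLine()) != null; ) {
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (word.length() == 0) continue;
                totalWords++;
                if (cache.get(word) == null) {    // miss: hitlist not in memory
                    diskSeeks++;                  // pay one seek to write it out
                    cache.put(word, Boolean.TRUE);
                }
            }
        }
        in.close();
        System.out.println(totalWords + " words, " + diskSeeks + " disk seeks");
    }
}

With an unlimited cache this collapses to one seek per unique word (the optimum); with no cache at all it costs one seek per word (the worst case).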

Here are the results for a few chunks of data from Wikipedia:

DATA (disk seeks)           500MB-chunk#1   500MB-chunk#2         1GB
optimum – unique words            1036547          971534     1634688
worst – total words              85016295        84660053   171409013
16 hitlists cached               66941279        67396522   134685743
256 hitlists cached              34076915        33975288    68475626
4096 hitlists cached             15648343        14814646    31308282
16384 hitlists cached             8190192         7800053    16150592
65536 hitlists cached             3764562         3565387     7415305
0.5M hitlists cached              1425247         1329566     2639209
1M hitlists cached                1211102         1130571     2107068
16M hitlists cached               1040521          975186     1650374

BOOST – better than the worst scenario

                            500MB-chunk#1   500MB-chunk#2         1GB
16 hitlists                          1.27            1.26        1.27
256 hitlists                         2.49            2.49        2.50
4096 hitlists                        5.43            5.71        5.47
16384 hitlists                      10.38           10.85       10.61
65536 hitlists                      22.58           23.74       23.12
0.5M hitlists                       59.65           63.67       64.95
1M hitlists                         70.20           74.88       81.35
16M hitlists                        81.71           86.81      103.86

ANTI BOOST – worse than the optimistic scenario

                            500MB-chunk#1   500MB-chunk#2         1GB
worst case                          82.01           87.14      104.86
16 hitlists                         64.58           69.37       82.39
256 hitlists                        32.88           34.97       41.89
4096 hitlists                       15.10           15.25       19.15
16384 hitlists                       7.90            8.03        9.88
65536 hitlists                       3.63            3.67        4.54
0.5M hitlists                        1.37            1.37        1.61
1M hitlists                          1.17            1.16        1.29
16M hitlists                         1.00            1.00        1.01


Results:

  • With 16 million hitlists cached in memory we are very close to the optimum (ANTI BOOST ≈ 1). That’s not surprising: the data contains far fewer than 16 million unique words, so every hitlist stays cached.
  • With 0.5–1 million hitlists in memory we are only 16–60% worse than the optimum.
  • With 1 million hitlists we are 70x–81x better than the pessimistic scenario.

This suggests that the sweet spot for the cache is around 0.5 to 1 million hitlists. Such a cache shouldn’t use more than about 0.5 GB of RAM and gives us around 70 times fewer updates to disk. Sounds reasonable.

That means that we could possibly index 12.5 gigabytes of data in about 3 days – using one disk. But to achieve that we’d need very fast software that is purely I/O bound.

Architecture

Of course, the point is to create a real time search engine that is fully scalable. Actually such an approach could simplify the software.

By scalable I mean being able to scale horizontally – you could add a new disk or a new server and the software would just work without any special configuration changes.

Scaling down is more complicated, because it requires moving data. It’s not a requirement for us.

I believe the architecture of this project should be split into two layers: the search-engine logic on top of a scalable storage layer.

It’s clear that the actual indexing, querying and other operations on hitlists are very different from actually storing data on disk. That’s why I think the biggest challenge is not really the search engine logic but rather a scalable and ultra-fast persistence layer.

Scalable, persistent key-value storage

Scalable, persistent key-value databases are a big topic in their own right. We would need such a system as the persistence layer for our search engine.

The most important features are:

  • Simple key-value storage is enough for our needs.

  • There must be an “append” operation. We don’t want to retrieve a hitlist of a few megabytes just to add one item to it.

  • As fast as possible. Latency must be kept very low.
  • We don’t need redundancy – speed is more important. In case of a disk failure we can afford to recreate the index from scratch.
  • No eventual consistency. We need the correct value right when we ask for it.
  • No transactions – we don’t need them. Optimistic locking is enough.
  • It should cost at most one disk seek to retrieve a record.
  • At most two disk seeks to add data (one to retrieve the record, another to save it).
  • I like the memcached binary protocol, so it would be nice to have it as the interface.
  • Scalable – adding a new server doesn’t require stopping the service and needs minimal configuration changes.
  • The client API for this storage needs to be able to retrieve data from many servers in parallel. Remember that every interaction with storage could take a few milliseconds if it requires a disk seek.
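
As a sketch, the client side of such a store might look like this (illustrative names, not an existing product); the Future-based multi-get is what lets the client talk to many servers in parallel:

import java.util.List;
import java.util.Map;
import java.util.concurrent.Future;

public interface HitlistStore {
    // One disk seek at most: fetch the whole stored hitlist for a keyword.
    byte[] get(String key);

    // Append to a stored hitlist without reading it back first.
    void append(String key, byte[] delta);

    // Fire one request per key and collect the answers as they arrive, so
    // ten keys spread over ten servers cost one round of seeks, not ten in series.
    Map<String, Future<byte[]>> multiGet(List<String> keys);
}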

There are various open-source projects that could be used as a scalable persistent key-value store, but no project fully satisfies my needs (yesterday Richard Jones wrote a good comparison of the options).

Making it even faster

There are some very interesting ideas about how hitlists could be represented and stored.

The basic solution is to store each hitlist as a list of sorted integers. Like:

'britney' -> [1,3,4,44,122,123]

To reduce disk usage, this list could simply be compressed using, for example, gzip. Unfortunately gzip doesn’t give us a good compression ratio here – my tests show a ratio of only around 1.16x.

However, we could modify the hitlist to store only the differences between consecutive elements. Our ‘britney’ hitlist would then look like:

'britney' -> [1,2,1,40,78,1]

The gzip compression ratio is now much better, around 1.63x. That’s not astonishing, but it may be worth the CPU power spent on compression.
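
A minimal sketch of the delta encoding (decoding is just a running sum):

class DeltaCoding {
    // [1,3,4,44,122,123] -> [1,2,1,40,78,1]
    static int[] encode(int[] sorted) {
        int[] out = new int[sorted.length];
        int previous = 0;
        for (int i = 0; i < sorted.length; i++) {
            out[i] = sorted[i] - previous;
            previous = sorted[i];
        }
        return out;
    }

    // [1,2,1,40,78,1] -> [1,3,4,44,122,123]
    static int[] decode(int[] deltas) {
        int[] out = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            out[i] = sum;
        }
        return out;
    }
}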

A totally different approach is to store hitlists as bitmaps. It can be memory-inefficient, but it makes the binary operations AND, OR and NAND very fast. To reduce the memory penalty we could use compressed bitmaps. This could be a perfect way of storing very long hitlists; on the other hand it could be worse for small, sparse ones.
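
For illustration, here is the plain bitmap variant using java.util.BitSet: bit i is set when document i contains the keyword, and AND becomes a word-at-a-time bit operation:

import java.util.BitSet;

class Bitmaps {
    // Bit i is set when document i contains the keyword.
    static BitSet toBitmap(int[] documentIds) {
        BitSet bits = new BitSet();
        for (int id : documentIds) bits.set(id);
        return bits;
    }
}

// BitSet britney = Bitmaps.toBitmap(new int[]{1, 3, 44, 555});
// BitSet spears  = Bitmaps.toBitmap(new int[]{2, 3, 5, 44});
// britney.and(spears);   // britney is left with bits 3 and 44 set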

I’m still not sure which representation is the best; maybe some kind of a hybrid solution. There’s for sure a place for optimization in this area.

Conclusion

There are a lot of open questions for this problem. One is what the persistence layer should look like. Another is the internal representation of hitlists: how to keep them always sorted, whether sorting should happen on retrieval or on update, and how to retrieve only part of a hitlist if it is compressed. Yet another idea is to use solid state drives as storage to reduce the disk-seek problem.

There’s a lot of work to be done in this area.

I think that such a full text search engine could fit perfectly as a piece of infrastructure in many websites.





by marek on 20/01/09

OMeta for Scheme

Speaking of OMeta/JS and OMeta in general, I’ve implemented an OMeta for Scheme. Currently it runs in MzScheme, but it should be fairly portable, with dependencies only on a handful of commonly-implemented SRFIs. I intend to properly libraryise it — making it into a proper MzScheme module — and to port it to other schemes, most probably starting with SISC.

One interesting feature of this OMeta is that it implements the error-handling mechanisms suggested by Bryan Ford that I implemented previously in a packrat parsing library for Scheme. The packrat-based error-handling techniques seem to generalise fairly well to an OMeta setting.

* Browse the code here.
* Check it out with hg clone http://www.eighty-twenty.org/hgwebdir.cgi/ometa-scheme/.

I’m already using it as part of an experimental compiler: the reader, the parser.

Update: Switched from hosting the ometa-scheme code in Darcs to Mercurial.

by tonyg on 01/07/08

Adding distributed version control to TiddlyWiki

After my talk on Javascript DVCS at the Osmosoft Open Source Show’n’tell, I went to visit Osmosoft, the developers of TiddlyWiki, to talk about giving TiddlyWiki some DVCS-like abilities. Martin Budden and I sat down and built a couple of prototypes: one where each tiddler is versioned every time it is edited, and one where versions are snapshots of the entire wiki, and are created each time the whole wiki is saved to disk.

Regular DVCS                 SynchroTiddly
Repository                   The HTML file contains everything
File within repository       Tiddler within wiki
Commit a revision            Save the wiki to disk
Save a text file             Edit a tiddler
Push/pull synchronisation    Import from other file

If you have Firefox (it doesn’t work with other browsers yet!) you can experiment with an alpha-quality DVCS-enabled TiddlyWiki here. Take a look at the “Versions” tab, in the control panel at the right-hand-side of the page. You’ll have to download it to your local hard disk if you want to save any changes.

It’s still a prototype, a work-in-progress: the user interface for version management is clunky, it’s not cross-browser, there are issues with shadow tiddlers, and I’d like to experiment with a slightly different factoring of the repository format, but it’s good enough to get a feel for the kinds of things you might try with a DVCS-enabled TiddlyWiki.

Despite its prototypical status, it can synchronize content between different instances of itself. For example, you can have a copy of a SynchroTiddly on your laptop, email it to someone else or share it via HTTP, and import and merge their changes when they make their modified copy visible via an HTTP server or email it back to you.

I’ve been documenting it in the wiki itself — if anyone tries it out, please feel free to contribute more documentation; you could even make your altered wiki instance available via public HTTP so I can import and merge your changes back in.

by tonyg on

TiddlyWiki, Quining, and Smalltalk

Yesterday I visited Osmosoft, the developers of TiddlyWiki, to chat about getting some DVCS-like functionality into TiddlyWiki. Jeremy mentioned in passing that TiddlyWiki is, if you squint, a slightly cheating kind of a Quine. It struck me this morning that TiddlyWiki has strong similarities to another famous almost-Quine, the Smalltalk system.

TiddlyWiki is a composition of

* an HTML page, with embedded Javascript, acting as a container for the pieces of content, called tiddlers, in the wiki;
* some tiddlers, specially marked as plugins, containing source code permitting extension of the skeleton system; and
* other tiddlers containing wiki markup text.

The HTML/Javascript container of the tiddlers (along with the web browser it runs on!) is like the Smalltalk virtual-machine. The tiddlers themselves are the live objects in the system. The process of running the embedded plugin scripts is the same as running Smalltalk’s post-image-load startup hooks. The plugins-in-tiddlers is the same as the source-in-the-image. The process of saving the TiddlyWiki instance to disk is the same as a Smalltalk instance extracting its live object graph and serializing it to disk.

The main difference I can see is that Smalltalk doesn’t carry so much of its VM around with its images: like Smalltalk’s reliance on an external VM being present, TiddlyWiki can rely on the browser being present for a big chunk of its VM, but has to carry the bootstrapping container code bundled with the stored tiddlers/live-objects. Also, TiddlyWiki instances are (these days!) constructed by a special assembly process from small, separate text files checked into Subversion, whereas Smalltalk images were not historically constructed from scratch very often at all. Finally, TiddlyWiki’s boot process is heavier than Smalltalk’s, because it’s forced by the browser to recompile all the sources in the system, where Smalltalk gets away with having bytecode in the image alongside the sources.

by tonyg on 12/06/08

Late-binding with Erlang

Upon browsing the source to the excellent MochiWeb, I came across a call to a function that, when I looked, wasn’t defined anywhere. This, it turns out, was a clue: Erlang has undocumented syntactic support for late-bound method dispatch, i.e. lightweight object-oriented programming!

The following example, myclass.erl, is a parameterized module, a feature that arrived undocumented in a recent Erlang release. Parameterized modules are explored on the ‘net here and here. (The latter link is to a presentation that also covers an even more experimental module-based inheritance mechanism.)

-module(myclass, [Instvar1, Instvar2]).
-export([getInstvar1/0, getInstvar2/0]).
getInstvar1() -> Instvar1.
getInstvar2() -> Instvar2.

“Instances” of the “class” called myclass can be created with myclass:new(A, B) (which is automatically provided by the compiler, and does not appear in the source code), where A and B become values for the variables Instvar1 and Instvar2, which are implicitly scoped across the entirety of the myclass module body, available to all functions defined within it.

The result of a call to new is a simple tuple, much like a record, with the module name in the first position and the instance variable values in order following it.

Eshell V5.6  (abort with ^G)
1> Handle = myclass:new(123, 234).
{myclass,123,234}
2> Handle:getInstvar1().
123
3> Handle:getInstvar2().
234

While this looks really similar to OO dispatch in other languages, it’s actually an extension to Erlang’s regular function call syntax, and works with other variations on that syntax, too:

4> {myclass,123,234}:getInstvar1().
123

The objects that this system provides are pure-functional objects, which is unusual: many object-oriented languages don’t clearly separate the two orthogonal features of late-binding and mutable state. A well-designed language should let you use one without the other, just as Erlang does here: in Erlang, using parameterized modules for method dispatch doesn’t change the way the usual mechanisms for managing mutable state are used. “Instance variables” of parameterized modules are always immutable, and regular state-threading has to be used to get the effects of mutable state.

I’d like to see this feature promoted to first-class, documented, supported status, and I’d also very much like to see it used to structure the standard library. Unfortunately, it’s not yet very well integrated with existing modules like gb_sets, ordsets and sets. For example, here’s what happens when you try it with a simple lists call:

5> lists:append([1, 2], [3, 4]).
[1,2,3,4]
6> {lists, [1, 2]}:append([3, 4]).
[3,4|{lists,[1,2]}]

Not exactly what we were after. (Although it does give a little insight into the rewrites the system currently performs: a {foo, ...}:bar(zot) call is translated into foo:bar(zot, {foo, ...}) – that is, the this parameter is placed last in the argument list.)

by tonyg on 18/05/08

JONESFORTH ported to PowerPC and Mac OS X

A couple of weeks ago, Richard W. M. Jones released JONESFORTH, which I thought was pretty exciting. Today I spent a few hours porting the assembly-language part to PowerPC on Mac OS X 10.3.9. It ended up being 600 non-comment lines of code, and took me about eleven hours in total to write and debug. It runs the standard JONESFORTH prelude, up to and including SEE.

You can download the code here: ppcforth.S.m4.

(It’s also available via darcs: darcs get http://www.eighty-twenty.org/~tonyg/Darcs/jonesforth.)

The assembler-macro tricks that the original i386 version uses are sadly unavailable with the default OS X assembler, so I’ve had to resort to using m4 instead; other than that, it’s more-or-less a direct translation of Richard’s original program. To compile it,

m4 ppcforth.S.m4 > ppcforth.S
gcc -nostdlib -o ppcforth ppcforth.S
rm ppcforth.S

To run it, download the JONESFORTH prelude (save it as jonesforth.f), and

$ cat jonesforth.f - | ./ppcforth 
JONESFORTH VERSION 14641 
OK 

Here’s an example session, decompiling the “ELSE” word:

SEE ELSE
: ELSE IMMEDIATE ' BRANCH , HERE @ 0 , SWAP DUP HERE @ SWAP - SWAP ! ;

I’d like to thank Richard for such an amazingly well-written program: not only is JONESFORTH itself a beautiful piece of software, it’s also an incredibly lucid essay that does a wonderful job of introducing the reader to the concepts and techniques involved in implementing a FORTH.

by tonyg on 04/10/07

Ambient authority and Sleepycat Java Edition and stateful services

I’ve recently written an RMI service which has state – transactions. The service is implemented using Sleepycat Java Edition collections, and the transactions map to sleepycat transactions.

The StoredMap class depends on ambient authority: it determines the current transaction from the thread. The methods will all be invoked on separate threads, so we need to deal with this somehow. This should be pretty simple – surely there will be a ‘join’ method, to join the current thread to the transaction. Nope. So now what?

I’ll just have to keep a thread running for the transaction, and execute transactional code in that thread.

In Java 5, we get Executors and Futures in java.util.concurrent, which make this very simple to implement. In fact, we can add some sugar – I can generate a proxy class for an interface, which executes its methods inside a single thread.

Anyway, I have done so, and added it to the LShift Java Library. It even mangles stack traces into something you can read.
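
For illustration only – this is a sketch, not the actual LShift library code – the core of such a proxy can be built from java.lang.reflect.Proxy plus a single-threaded ExecutorService:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;

public final class SingleThreadProxy {
    // Every call on the returned proxy is executed on the (single-threaded) executor.
    @SuppressWarnings("unchecked")
    public static <T> T wrap(Class<T> iface, final T target, final ExecutorService executor) {
        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, final Method method, final Object[] args)
                    throws Throwable {
                try {
                    return executor.submit(new Callable<Object>() {
                        public Object call() throws Exception {
                            return method.invoke(target, args);
                        }
                    }).get();
                } catch (ExecutionException e) {
                    Throwable cause = e.getCause();
                    if (cause instanceof InvocationTargetException)
                        cause = cause.getCause();   // unwrap to the original exception
                    throw cause;
                }
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(),
                                          new Class<?>[] { iface }, handler);
    }
}

// Usage sketch: all calls on 'store' now run on one dedicated thread.
// MyStore store = SingleThreadProxy.wrap(MyStore.class, realStore,
//                                        Executors.newSingleThreadExecutor());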

by david on 25/06/07

Folds and continuation-passing-style

In normal, direct-style programming in (mostly-)functional languages such as scheme and ML, folding is an operation that crops up all the time in various guises. Most list-manipulation libraries for such languages include implementations of left-fold and right-fold as standard. But what about the situation when you’re programming in continuation-passing style (CPS), such as when you’re writing (or trying to write) a metacircular evaluator? Library support for continuation-passing folds isn’t nearly as common.

Here’s the direct-style left-fold function:

(define (foldl kons knil xs)
  (if (null? xs)
      knil
      (foldl kons (kons (car xs) knil) (cdr xs))))

and here’s the continuation-passing left-fold function:

(define (foldl-k kons knil xs k)
  (if (null? xs)
      (k knil)
      (kons (car xs) knil (lambda (v) (foldl-k kons v (cdr xs) k)))))

Note that kons takes three arguments here, where in the direct-style version, it takes two.

One benefit of having CPS folds available is that they expose more control over the loop. For instance, using a normal fold, there’s no way to terminate the iteration early, but using a CPS fold, your three-argument kons routine can simply omit invoking its continuation parameter (presumably choosing some other continuation to run instead). This means that operations like (short-circuiting) contains?, any, and every can be written with CPS fold, but not with plain direct-style fold:

(define (contains? predicate val elements)
  (foldl-k (lambda (elt acc k)
             (if (predicate elt val)
                 #t ;; note: skips the offered continuation!
                 (k acc)))
           #f
           elements
           (lambda (v) v)))

by tonyg on 11/06/07

Only two things are infinite

In every field, and technology is no exception, there is a constant tension between marketing and production attitudes. Marketing is all about making the customer feel that choosing to invest their money with you will make them better off and fulfill their needs. Engineering-driven production, on the other hand, is a constant battle to assure oneself that it is possible, with the resources at hand, to arrive at a satisfactory solution (and to make adjustments to the solution or the resources until the equation is solved). Reconciling these two approaches can be very difficult, because the marketing attitude alone is unbounded by the uncomfortable truths of reality, and is therefore always at risk of promising more than can actually be done. Invariably, if the promises given by people whose only concern is marketing can’t be met (because they stand in contradiction to what’s actually possible) the marketing value of these promises is diminished, since they end up leaving the customer disappointed.

Take, for example, YAHOO. Earlier today the Internet giant announced that, as of May, it will give users of its web-mail product unlimited storage. As you probably know, nothing is unlimited, or infinite, except (citing a quote attributed to Albert Einstein) the Universe and human stupidity. How, then, does YAHOO plan to offer unlimited storage? Surely that would require them to possess unlimited disk space, in data centers of unlimited size.

What’s going on here? Do the engineers at YAHOO know something we don’t? Probably not. Let’s think for a second about what’s really going to happen once YAHOO enables that unlimited storage feature. Most users will continue saving mail at the same rate they did before, which is much less than the 1 GB currently on offer. For these users the change from 1GB to unlimited won’t make any difference, and YAHOO won’t suffer from offering them more than it already does.

Other users, the ones who currently almost scratch the top of their allocated 1GB, will sigh with relief – no need to delete any of those old emails.

Very few users, perhaps a few hackers, will try to stress-test YAHOO’s offer and will start uploading huge amounts of data. Will they be able to do this ad infinitum? Only time will tell :-) But most likely they won’t – YAHOO will have included something in the small print of its EULA allowing it to discontinue service to such users, and maybe even chase them in court. We all know where this leads, though – the conflict between those users and YAHOO will become very public, at least for a short while. At the very least, it will reach slashdot, and quite possibly will inspire a public debate about the obligations of service providers offering free use of their online facilities. The resulting effect on public opinion is likely to be embarrassing and even damaging to YAHOO.

Now, we all know why YAHOO is doing this, and who the competitor is that they feel a need to outdo. Google, with its GMail service, has been offering very large amounts of storage space since the very beginning, and the available space is increasing daily, so simply increasing the storage to top whatever Google is offering is hardly a satisfactory solution – the only way to be sure that you compete well with Google is to go unlimited.

That is, unless Google responds with its own unlimited offer. Only Google won’t (or at least I bet it won’t) – and that’s exactly the difference between Google, the Internet wunderkind that has grown so much as to nearly dominate the online world, and YAHOO, the company that has probably suffered the most from Google’s growth. Google is an engineering-driven company. It was started by a pair of scientists and is famous for hiring only the best engineers around. When Google offers 2833MB of storage, it means that there are many people within the company who know exactly what this means: how many disks, in how many data centers, and how this is likely to grow in the context of the company’s global strategy. YAHOO, on the other hand, is very much a marketing-driven company. In YAHOO (I guess), some marketing wizard came up with the brilliant idea of unlimited storage. When the engineers protested politely that there is no such thing as unlimited they were simply hushed. It’s not really unlimited, after all, just sort of unlimited, and anyway, who on earth would take advantage of such an offer … isn’t it illegal or something? We’ll just have to make sure Legal takes care of that.

What does this have to do with LShift, I hear you ask. Well, at LShift we have merged our engineering, marketing and management capacity. If you are an LShift customer, or thinking of becoming one, then whenever you pick up the phone and call us to ask whether this or that can be done, you almost always get to talk to the developer in charge of your project. When we tell you that something can be done, we are positive that it is so, and we know why and how to do it. When we tell you that it can’t be done, it is not because we don’t mind disappointing our customers (customer satisfaction is our #1 goal) but because we want to help you get the best solution you can afford. And doing that well, it seems, also happens to be the best marketing strategy we could possibly have.

by Tom Berger on 28/03/07
