technology from back to front

# Looking for needles in haystacks

Here’s a proper programming challenge, simple to state, and in the end quite challenging to solve. A function produces a 64-bit output given a 64-bit input. It mostly behaves like a random function, so given a random input, most outputs occur with probability between 0 and 2-63; in other words, for a randomly-chosen 64-bit y, there are usually between 0 and 2 solutions to the equation f(x) = y. However, a few occur with much greater frequency – more like 2-32 or so. The function is pretty quick and easy to calculate, but its internal structure defies direct analysis. How would you identify these more frequent occurrences by experiment? How big a computer would you need to do it in something like a few hours to a few days? Think about it for a while before you read on.

First of all, to be at all confident of finding these high frequency outputs, you’re going to have to examine on the order of 234 outputs. That’s 128 GB of data. When you produce your last output, you want to be able to determine, at some stage, whether it’s the same as any output you’ve already produced, and I think that this makes storing all previous outputs unavoidable, so you’re going to need a big disk. Fortunately I have a 200 GB disk attached to my home machine that I’m not using yet, so I can manage this part. The tricky thing is, how are you going to arrange your data on disk so you can find duplicates?

You can’t just use the disk as one big hash table, because each write will call for a seek, and 8.5 ms times 234 is about 4.6 years. You could buffer and sort your writes, but my home machine only has 512 MB of memory and it needs some for other things, so you would have to do about 512 buffered writes and crawl through the entire 128GB data structure every time, which would be desperately slow each time. You could just write it all to disk and then sort it, but I suspect you’d be looking at a long wait for a disk-based sort of 16 billion 8-byte items to finish – just reading and re-writing the entire dataset will take over three hours, and you would probably have to do that at least nine times.

So the first idea I had was this: As you generate each datum, hash it and put it into one of 1024 bins based on the hash code, and write all the bins at once. Each bin now only contains 128 MB of data, which is sufficiently little that we can read it all into memory and do duplicate finding that way using a big hash table. I implemented this by writing to 1024 files at once, one file per bin. I even found out how to get around the limitations on how many files you are allowed to have open at a time (with ulimit). However, it was incredibly slow; it used only about 6.0% CPU because writing the files was keeping it busy all the time. I think that thousands of files all growing at once is pretty much a pathological case for any filesystem, because both Reiser3 and XFS showed this behaviour. Caveat: when I tried this I was using 2048 bins because I was having trouble with getting my post-binning collection phase to handle larger files, but I doubt halving the number of bins would have made a huge difference. I tried buffering the bins. I even tried putting a pre-buffer before the buffers that fitted into L2 cache to make more efficient use of main memory, but all to no avail. It was going to take a day or two to generate all the data.

I asked for advice online, and the best suggestion I got was to preallocate the files before writing to them. Instead, I waited until I got into work the next morning to ask my colleague David, who takes a special interest in making disks behave efficiently. After some discussion, we came to a strategy that made far more efficient use of the disks.

We allocate 256 kB of memory to each bin, ie 256 MB in total for all 1024 bins. As soon as any one bin fills up, we write the entire 256 MB data structure to disk in one great sweep, empty all the bins, and continue. This makes slightly inefficient use of disk space, because not all the bins will be full. However, because the function is so close to being random, they are all very nearly full, and so the inefficiency is only around 2%; in other words, instead of making 512 writes we end up making 522. Now the generation phase is as disk-efficient as it can be – all it does is append giant chunks of data to a file at a time – and it takes only a couple of hours at 60% CPU usage.

Of course, the cost is that now there is a collection phase to read the contents of each bin. This phase takes 522 seeks to piece together the contents of each bin, and it has to be repeated 1024 times, once for each bin; the total cost is about an hour. This is in any case comparable to the cost of finding the duplicates in each bin. Since one process is disk-intensive and the other is memory-intensive, it makes sense to do them in parallel; after discussion with David the simplest parallelization turned out to be the humble Unix pipe. “./extractbins 203 binoutput | ./finddups” means that one can be pulling the data off the disk at the same time as the other is placing it into the hash table (which is slow because practically every write is a cache miss, and you have to read before you write introducing latency). Once the hash table is full, streaming through it to find the duplicates is pretty fast, around half a second.

This story doesn’t yet have a happy ending – hardware problems with my home machine mean that I can’t run the collection phase without crashing it, which I think means I need to buy some decent memory for it. But it was very cool to get a twelve-times speedup for the whole process out of half-an-hour’s discussion with a knowledgable colleague.

by
Paul Crowley
on
26/10/05

# Slate

I’ve just taken another look at where the Slate language project has gotten up to. Last time I looked was almost a year ago now. There’s been a lot of progress! The time seems ripe for Slate to get an event loop, so that proper networking and threading can be built on it.

by
tonyg
on
25/10/05

# Big bucks for big bugs

I was recently working on a project where the client wanted us to add some features and updates to a web-facing database which they had paid some big name consultants to develop for them.

Wow, did what we received look impressive. It used some of the latest industry buzz-words: SQL Server, ASP.NET, Web services, to name a few; It came with pages and pages of documentation, with “wireframe models”, field-by-field breakdowns for each page, and even a schema diagram for the database. The documentation itself must have cost quite a bit (at a guess, it must have taken a Word-monkey at least 2 weeks to prepare).

However, when we started digging in, things didn’t look quite as impressive. Firstly, the documentation, detailed and copious as it was, was useless because it didn’t correspond to the actual code. Even the database schema didn’t match – it looked as if the documentation and the implementation had evolved independently.

As we started to take it apart and work out how it worked, more problems started to surface. Here’s a list of some of the more serious bugs we found.

1. The schema for one of the tables was too wide; there were several very large VARCHAR columns (thousands of characters per column) in the definition of this table. The schema defined records which could be as big as 24000 bytes, whereas SQL Server has a limit of 8060 bytes. This means that inserts (and updates) of large records could fail or be truncated. The client never came up against this problem presumably because they hadn’t created a record big enough. How could the developers have just ignored the warning messages that SQL Server must have given?
2. The website had a search page which sloppily used string concatenation to build the corresponding SQL query. Because no efforts were made to clean-up the user-provided input, this part of the website was vulnerable to SQL-injection attacks! In principle, a remote website visitor could type carefully crafted values into a search field and get the website to execute arbitrary SQL queries!

There were other weird parts in the code. For example, at one stage, the code concatenates several fields into a string using the pipe character (“|”) as a separator, and then later splits it again. Unfortunately, it didn’t make any effort to either check for/escape legitimate pipe characters in the field values, nor document that the pipe character shouldn’t be used in these fields. Why do this? VB.NET has perfectly usable record types (look for the Structure keyword – even QuickBASIC had the TYPE keyword since the late 80′s).

Working on this project has just given me more evidence to believe that the big name consultancies don’t give value for money.

by
francis
on
19/10/05

# Truncated differential cryptanalysis of five rounds of Salsa20

eSTREAM have just put my paper online: Truncated differential cryptanalysis of five rounds of Salsa20 (PDF) (discussion, Wikipedia on Salsa20). This doesn’t break the whole cipher, just a seriously reduced version.

Experimentation played a key role in finding this result. I found the first differential by writing a short Python program that implemented a pretty naive differential-characteristic finding strategy. But when I went to test experimentally that the characteristic worked, I found it was there with eight times the frequency I expected. Further experimentation showed that this was due to clustering differential trails resulting in the same characteristic. From there, it was straightforward to just experimentally count the occurrence of lots of characteristics, and then it makes sense to use lots of them to make the attack far more effective. As a result, many fewer differential pairs are required.

The discovery of trails whose probability is not correctly predicted by theory was also a great and exciting surprise. I’m now thinking about how to investigate how to account for these discrepancies; once we can account for them, maybe we can make use of them to build more powerful attacks.

by
Paul Crowley
on
17/10/05

# Spoon and the Object Visualiser

Take a look at this part of NetJam.ORG, in which Craig Latta builds a visual display of a running Smalltalk Image. Be sure to check out the movie of the first 50ms or so of a running image. Smalltalk gets all the cool toys.

by
tonyg
on

# Something about early, and often?

Naturally, almost as soon as I’d posted about my extensions to pregexp, I discovered a bug in said extensions.

It was something I should have caught, too: in the code I had a constant (3) which I couldn’t explain. I had expected it to be 2, not 3, and the need for the extra increment wasn’t clear to me. Now that I’ve caught the bug, I know why the increment seemed to work: it was compensating for a missing end-of-stream detection in another place.

by
tonyg
on
12/10/05

# When will the Haskell community finally get their act together

…and produce a usable package system?

The absence of a standardised and user-friendly mechanism to package &
distribute libraries, and to locate and install such libraries has
been a major stumbling block in the adoption of non-mainstream
programming languages for a long time. I wrote about it [here](http://www.lshift.net/news.20030928lispconference.html) in the
context of Scheme.

Haskell is actually in a much better position than Scheme for
addressing this, since there are far fewer implementations (and a
single one – GHC – with an overwhelming market share), and far fewer
incompatibilities between implementations. I have high hopes for
[Cabal](http://www.haskell.org/cabal/). However, so far it seems to have failed to deliver on its
promises, as evidenced by the following snippet from the installation

> In some setting, it will be complicated to get everything to work with
the database connection. The reason is that some versions of c2hs
(0.13.6 in particular) do not install properly as a package: the
hs-libraries component of the package description is missing with the
consequence that programs do not link properly. If you can fix they
you fly.

> Installing newer versions (0.14.2 or 0.14.3) is not possible out of
the box because they require cabal with version >= 1.0.1. Alas, GHC
6.4.1 ships with 1.0. Installing cabal-1.1.4 does not help either
because…

> …/c2hs-0.14.3 > ./Setup.hs build
> Preprocessing executables for c2hs-0.14.3…
> Building c2hs-0.14.3…
> Chasing modules from: c2hs/toplevel/Main.hs
> Could not find module `CForeign’:

> Oh well.

That’s exactly the kind of thing that will turn off all but the most

by
matthias
on
11/10/05

# xxexpr.ss – an SXML-to-XML converter

The SSAX XML parsing- and processing-library provides robust, high-quality XML reading tools for Scheme, but doesn’t include a general purpose XML writer. Over the past couple of years, a few of my projects have had a need to convert SXML-like data to an XML 1.0 external representation, and so I’ve written a portable SXML-to-XML printing library (both a snapshot and a darcs repository). The library has been used with Chicken, MzScheme, and SISC, and currently includes module wrappers for Mzscheme and SISC (or other psyntax-based Schemes).

The library is parameterized over a choice of double- or single-quotes for attribute printing, and can, if required, be instructed to use explicit close-tags when an empty-tag is encountered. It provides procedures for producing a string representation of an XML fragment, for printing an XML fragment directly to a port, and for pretty-printing an XML fragment with indentation. For example,

(pretty-print-xxexpr
(let ((title “My Page”))
(list
‘(*pi* xml (version “1.0″))
(body (h1 ,title)
(p “Hello, world!”))))))

produces the following output (don’t mind the invalid XML PI, that’s something WordPress is doing – the actual output from the library is well-formed!):

< ?xml version='1.0'?>

# My Page

Hello, world!

You can either browse the code, or retrieve it using

\$ darcs get http://www.lshift.net/~tonyg/xxexpr/

by
tonyg
on

# A SRFI-10-style extension to JSON

Background

The data language JSON is a great replacement for XML for many applications. It’s very similar in spirit to Lisp and Scheme S-expressions, as well as to XML: it is a pure data language, with no intrinsic semantics.

XML doesn’t allow direct literal representation of any data types other than strings and XML nodes (to oversimplify slightly); S-expressions don’t allow direct literal representation of anything but atoms, pairs and vectors; and JSON doesn’t allow direct literal representation of anything but atoms, lists, and dictionaries.

There are currently no extensions to XML to remedy this, to allow programmers to extend the XML reader to support literal notation for other classes of object. For S-expressions, however, there is SRFI 10: Sharp-Comma External Form, which allows the S-expression reader to be extended to allow read-time construction of literal objects.

I’ve implemented a SRFI-10-like construction for JSON, allowing extension of the JSON reader and writer, which makes it easier to use JSON for things like serialization or high-level messaging protocols.

The basic extension is the addition of `@constructor ...` notation to the reader: whenever the JSON parser sees an “`@`” sign, it parses a single identifier, and then a complete JSON object called the argument. The identifier is looked up in a table of named constructor functions, and if a match is found, the corresponding constructor function is called with the argument. If no matching constructor function is found, a parse error is signalled.

The corresponding extension to the writer is the `toJsonString` method. If an object that the `JSON.stringify` function encounters possesses a `toJsonString` implementation, the method is called and its result is returned as the stringification of the object.

Example

As an example, imagine an application where a client and server shared some notion of a database of objects, indexable by some unique identifier. The client side uses proxies to manipulate these objects, and the proxies send JSON RPC requests to the server (perhaps using `XmlHttpRequest`) to query and update the database held by the server. There are two ways we could represent one of these database objects: as just its unique identifier, which requires a manual lookup on both the server and client side whenever a reference to the object is to be sent over the wire, or using the `@constructor` extension.

MyDatabaseObject.prototype.toJsonString = function () {
return “@dbobject ” + this.id;
}

An example message from the client to the server might be

{
“action”: “retrieveNextObject”,
“arg”: @dbobject 42
}

When the server parses the message out of the HTTP request (or other transport), it calls `JSON.parse` with an extra argument:

var requestFromClient = JSON.parse(requestText, {
“dbobject”: function (arg) {
return DB.lookupById(arg);
}
});

When the parser sees the “`@dbobject ...`” part of the input string, it will call the constructor given to `JSON.parse`, which looks up and returns the object in the database by its identifier, making for a convenient, hassle-free deserialization of a complex object.

by
tonyg
on

# Continuation-Passing Style in Javascript

As part of my recent experiments with Javascript, I happened to want a `sleep` function. Interestingly, there isn’t one available in browser-side Javascript environments – instead, there are four functions

setInterval
clearInterval
setTimeout
clearTimeout

and they take functions to call after a certain number of milliseconds have gone by. This leads to programming in continuation-passing style (see here, here, here, and here):

// … some code …
// At this point, we want to pause for one second before continuing.
setTimeout(function () {
// … code here continues running after 1000 milliseconds have gone by.
}, 1000);

It seems that a (laudable!) design decision has been made for avoiding blocking operations in browser scripting languages. It’s not just `sleep`-a-likes, either: the decision affects the design of `XmlHttpRequest`, which is used in AJAX-like server callbacks. The way you receive the server’s response to an `XmlHttpRequest` message is by supplying a callback function. This again leads to a continuation-passing style of programming – see the definition of `AJAJ.proxy` in this file. Here’s an example of `AJAJ.proxy` in action, as part of a simple AJAX+JSON chat application:

function rpc(action, arg, k, kfail) {
var urlPrefix = (new RegExp(“(.*/)([^/]+)”)).exec(document.location)[1];
AJAJ.proxy(urlPrefix + “chat-server/simple-chat.asp”)
({action: action, arg: arg}, k, kfail);
}

function consoleRpc(action, arg) {
rpc(action, arg,
function (r) { console.recordOutput(formatConsoleRpc(action, arg),
JSON.stringify(r)); },
function (e) { console.recordError(formatConsoleRpc(action, arg),
JSON.stringify(e)); });
}

The two functions `k` and `kfail` passed to the `rpc` function serve as the success and failure continuations, respectively. If the `XmlHttpRequest` succeeds, then the function starting with “`function (r) { ...`” will be called with the response from the server; otherwise, the function starting with “`function (e) { ...`” will be called with an exception object that reports on the failure. We’re using closures to extend the available control structures of the language.

by
tonyg
on

#### Categories

You are currently browsing the LShift Ltd. blog archives for October, 2005.

#### Archives

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060