The internet does reliability end to end. That is, when a node A sends a message to node B, the message travels through many other nodes. These intervening nodes make a best effort at delivery, but if they don’t succeed, they just forget about it. B must confirm receipt, and A must try again if it doesn’t.
Communicating securely requires the same approach. We should not guarantee each step is secure, and hence the whole path is secure, but design a system in which only the design of A and B matters.
The more we do our computing in the cloud, the more true this becomes. Simply put, the security of our communications depends on too many people, and the chances of failure multiply.
For example, we might use a messaging service, provided in the cloud. A and B could authenticate to the message server, and ensure privacy by connecting using SSL. The server could implement access controls, to ensure other users of the service won’t receive messages they are not entitled to receive. In this arrangement, you are making some assumptions:
Alternatively, A could encrypt messages using B’s private encryption key, and sign them with its private signing key. Then we are assuming:
Unfortunately good excuses abound:
I’ve recently designed a solution which uses end to end encryption of messages, and a message server to minimise the attack surface of the solution, and ensure privacy while transferring data over the internet. The attack surface is minimised to the extent that the nodes are behind firewalls that allow no incoming connections. Because messages must have valid signatures to be accepted, an attacker cannot use specific payloads in messages to attack the system. They are limited to attacking the code which parses the envelope. This code is extremely simple.
To do this efficiently, we have designed our own cryptographic suite, based on the NSA Suite B specification. Because this specification does not specify an encoding, we have designed our own envelope for messages. The resulting library also includes a sophisticated, extensible trust model. We plan to publish the result as open source.
We see lots of sites that use RabbitMQ, and other message servers, and would benefit from the security this architecture provides. Either these sites allow clients to interact with services via a message server, or use cloud based messaging services. Using cloud based messaging services has significant cost benefits over hosting your own message server, and in this architecture, the decision to do this has no security implications.
In an earlier post I mentioned that I was using CorePy for my cryptographic fiddlings.
Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
Well, I still think CorePy is very cool, but it turned out to introduce some tricky problems of its own, and by taking more care about how I use it I have sped up my program by nearly three orders of magnitude.
At first I thought that it was simply that my assembly was so fast that the Python excution time was dominating. In fact, Python is a little faster than I thought, and it was CorePy itself that was taking the bulk of the time. My inner loop executed in less than 1000 clock cycles, but the per-call overhead was far greater – more like 21,000 cycles. In addition to this per-call overhead, however, I had to reckon with a cost of nearly 1500 cycles for every 32-bit integer I loaded and stored from CorePy’s native arrays. This contrasts with a cost of less than 200 clock cycles to store one in a native Python list, or around 250 cycles for a NumPy array. If you have CorePy installed, you can see for yourself. So I had a very powerful machine, but I could only talk to it through a very narrow pipe.
My workaround can be found in the latest version of trivium-corepy. In addition to the assembly for computing Trivium itself, there’s now assembly specific to performing the cube attack. The code now takes a list of bit indexes to flip. Looping over these indicies it runs Trivium, XORs the output bits into a result area, and then flips the indexed bit in all 128 instances of Trivium inside the buffer before doing it all again. This means that each run lasts much longer, making the 21000 cycle overhead less significant. It’s still better to have more than one run, though, because of the cost of the writes needed to set up this index array – in fact, the sweet spot is reached when the total penalty from the writes to set up the index array is equal to the total penalty from the per-call overhead, because it is easy to double one by halving the other. So the indices are divided into three groups – those that will be handled in Python, those that will be handled by looping in assembly, and those that will be calculated in parallel by our 128 simultaneous instances of Trivium.
It’s with this technique that I’ve been able to find the problematic maxterms in the original Dinur and Shamir paper, and verify that the replacement maxterms I recently received from Dinur all work fine. I’ve even set the code on finding new maxterms, and found some for output bits up to 714, though the techniques I’m using are fairly basic (for example, I don’t yet take advantage of the fact that we get lots of output bits at once for free). Overall I think that I’m still glad of what CorePy gives me, but I think that it could be simpler still to write fast programs if those overheads could be significantly reduced.
It looks like there are errors in the tables at the back of the “cube attack” paper which show how to apply the attack to Trivium: some of the entries don’t work. This could mean simply that there are typos in the table, or it could mean that the attack is somewhat less effective against Trivium than the authors believed. Or, of course, it could mean that there’s a bug in my software.
As discussed in my previous entry, to apply the “cube attack” to a cipher like Trivium, you start by searching for a set of “maxterms”. A maxterm is an affine relation between the key bits and the parity of all the values of an output bit for a particular large set of IVs known as a “cube”. Once you have enough maxterms, you can collect the output bits for all the IVs in all the cubes in the maxterms, then find the parity for each maxterm, and then use linear algebra to invert and find the key bits; so you have a complete set of maxterms when you have as many maxterms as key bits and they are all linearly independent in key bits. In practice you would make do with fewer linearly independent maxterms and make up the difference with exhaustive search.
Trivium has 1152 initialization steps, and no-one currently knows how to break it. Before the cube attack, the best attack on Trivium broke a weakened variant with 672 initialization rounds in 255 steps. The cube attack paper presents three tables of maxterms for Trivium: one set of 63 to break the 672 step variant in 219, one set of 52 to break a stronger 735-step variant in less than 230 steps, and one set of four that might form part of an attack on 770-step Trivium requiring around 236 steps.
However, it looks like 15 of these 119 maxterms presented don’t work. Of these, two work 0% of the time, and so are easily converted to working maxterms by adding or removing a “1+” to the first column. Five appear to be a little unreliable, working 79%-99.2% of the time; a slightly unreliable maxterm is useful to an attacker, but the success probability of the attack is the product of the success probabilities of all the maxterms, so too many unreliable ones will make the attack unlikely to work. But eight seem have empirical success probabilities between 46% and 58%, suggesting that in practice they work 50% of the time ie no better than guesswork.
This could of course be an error in my software; I’d love to see these results verified. You can test for yourself by downloading the latest version of the software here:
http://hg.opensource.lshift.net/trivium-corepy
Updated: In correspondence between myself and Dinur we’ve fixed another two of the maxterms – fixes to the other 11 are on the way.
I present a new implementation of the stream cipher Trivium designed for cryptanalysts, in particular those interested in applying the “cube attack” to Trivium. It generates 128 simultaneous output streams using SSE2 intrinsics, and achieves under 1 cycle/byte, over four times faster than standard implementations. The entire program is in Python; SSE2 machine instructions are generated and called using the tool CorePy, an approach I am happy to recommend to others with similar needs. The code is under the MIT licence and may be found in this Mercurial repository.
Dinur and Shamir’s “cube attack” caused something of a media storm when it was first presented at a rump session at Crypto 2008. The attack may be widely applicable: it takes an extremely general view of the cipher under attack, and can be applied even where the cipher is treated as a “black box” into which key and IV bits are put in and output bits result, with no consideration of its internal structure. Early reports that this attack would break all of our most trusted primitives are false, however – the attack depends on the degree of the polynomial representing the working of the cipher, and for a cipher like AES the degree is far too high for the attack to work.
The one real cipher against which the attack has been applied is Trivium, a personal favourite of mine with real beauty in the design; astonishingly simple, neatly implementable in both hardware and software, and carefully thought out resistance to cryptanalysis. Trivium has 1152 initialization rounds; the cube attack breaks a weakened variant with 735 initialization rounds in less than 230 operations, and the authors say “We believe that by using more advanced techniques, we can break much stronger Trivium variants, and maybe even the original cipher”.
As referenced above, applying the cube attack to a cipher is to a certain extent an empirical business; one searches for a set of “maxterms” that break the cipher, and this involves running the cipher many, many times. It occurred to me that for this use case, we could improve on the standard implementation of the cipher by going back to the roots of bitslicing.
The block cipher DES is designed entirely around efficient hardware implementation, and is notoriously difficult to implement in software. In particular, several parts of the cipher require complex permutations in bits – in hardware, this is just some wires crossing, but in software a fiddly sequence of shifts, rotates and masks. In 1997, Eli Biham presented A Fast New DES Implementation in Software. This implementation achieved its speed – about five times faster than the fastest previously known – by doing 64 simultaneous block encryptions in parallel, treating a normal 64-bit Alpha computer as a SIMD computer, and using each bit position of a word to describe part of the state of one of the 64 simultaneous encryptions. Starting from a description of a logic gate circuit to implement DES, this implementation would use standard logical operations to mirror the work of the circuit. No shifts or rotates are needed – each column of bits is an independent part of the work, and each logical operation carries 64 times its weight in useful work done. This technique was dubbed “bitslicing” by Matthew Kwan, and the name stuck.
Not long after this, the block cipher Serpent was published by Anderson, Biham, and Knudsen. Serpent takes advantage of bitslicing, not for multiple parallel instances, but to make a single instance of the cipher. The state of the cipher is four 32-bit words, and the nonlinearity in the cipher is a four-bit S-box on bit columsn of these words implemented using bitslicing. These alternate with a linear mixing stage based on shifts, rotates and XORs. This proved a popular technique, and Trivium itself is designed to be implemented using bitslicing in a similar way, advancing the state by up to 64 bits at a time using logical operations on words.
However, this implementation is bitsliced in the old fashioned way – it is 128 independent instances of the Trivium cipher, represented as vectors of 128 bits. This has two advantages over a standard Trivium implementation: twice as many bits can be operated on in a single instruction using SSE2 intrinsics, and no shifts or masks are needed. As a result, it operates at over four times the speed, achieving under 1 cycle/byte – 2.4 GBytes/sec – on my 2.4 GHz Intel Core 2 Duo PC.
One unusual feature of this implementation is worth mentioning – it’s in Python. Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
The codebase includes our bitslice implementation and a very simple pure-Python Trivium implementation for testing, as well as code to test one against the other, to benchmark the fast implementation, and to use our fast implementation to test the “cube attack”. Please do let me know if you’re playing with this or putting it to use, I look forward to hearing from you.
A few days ago, in Squaring Zooko’s Triangle, I sketched out a proposal for a new naming scheme which to a limited extent achieved all of the contradictory-seeming aims set out in Zooko’s Triangle. Having discussed the essay with a few folk since, it seems like it might be worthwhile trying to clarify the ideas behind it.
(Updated 2012-04-27 to use Tahoe-LAFS’s superior terminology, and correct some related errors.)
A brief aside before we start: Zooko’s Triangle is actually a tetrahedron. There are four desirable properties of a name, and we know how to achieve any three of these four. You hear it discussed as a triangle because authors take one of the four points as so essential and non-negotiable that it isn’t even discussed, but different authors take different points as essential so it can be confusing to read multiple accounts of it. The four desiderata are
Sacrifice any one of these and there’s an existing system that provides the other three.
In this framework, the original article on Zooko’s Triangle assumes transferability is non-negotiable, Clay Shirky’s essay assumes that specificity is non-negotiable. Petnames assumes that authoritylessness is non-negotiable, and describes a system that tries to achieve all three goals by switching seamlessly between the three different kinds of names that remain.
| Zooko | Shirky | Steigler | Tahoe-LAFS |
|---|---|---|---|
| human-readable | memorable | memorable | memorable |
| decentralized | non-political | - | decentralized |
| secure | - | securely unique | specific |
| - | global | global | transferable |
Now back to my proposal, which attempts to achieve all four. This proposal combines three ideas towards this goal.
I definitely don’t claim that what I propose here does away with the need for systems like Petnames, by the way; it has its own disadvantages and is meant only to complement those systems.
We start with PGP key fingerprints, which look like this:
Specific, decentralized, transferrable, and totally unmemorable.
So, the first idea is to use word lists instead of hexadecimal to improve readability. I went to Diceware for my wordlist, since that wordlist is specifically meant to provide a long list of short memorable words for cryptographic purposes. Perhaps there are other suitable lists – do let me know – but that one is certainly good enough for a demonstration of principle. I stripped out all the ones containing special characters, since they are inconvenient for a variety of reasons and it only makes the list slightly shorter. The resulting list contains 7322 words, so you can encode about 12.8 bits with each word, and 160 bits comes to 13 words:
Well, that’s an improvement, but it’s still a long way from what I’d call “memorable”. It might be memorable if we could cut it down to something much shorter – I aimed for five words:
Now that’s getting close to memorable. It’s also completely insecure. At five words, it’s encoding about 64 bits. If that’s your name, and I want to imitate you, I have to generate and hash 264 public keys before I find one that matches this one; because it’s in bulk the public key generation can be done very cheaply, so the only real cost is the hash. This is comparable to the effort involved in the distributed RC5-64 attack which was successfully completed five years ago and would be perhaps ten times easier today.
Actually, the situation is much worse than this, but I’ll come to that in a moment.
We can foil this attack by increasing the number of words, but at the cost of memorability. Instead we make this attack more expensive in a different way: we increase the cost of generating a name given a public key. One possible technique would be something like key stretching; we could subject the key to an iterated hash with many steps before generating the name from it. If we used for example 224 steps, then generating a name would take 10-20 seconds – acceptable, since generating names is rare – while the difficulty of our attack goes up to 288 steps, well outside the range of even the best funded attacker today.
Unfortunately, this approach also slows down verifying a public key given a name to 10-20 seconds, and that’s much more of a problem because that’s a frequent operation. On a small device like a phone, it would be much worse; again, you can probably sacrifice generating names on your phone but you won’t want to sacrifice verifying them. Instead, the second idea: we use a proof-of-work based scheme. A more complete proposal should probably use one of the more sophisticated proof-of-work schemes out there; I saw a great one recently involving function inversion to force a minimum memory usage on the attacker, increasing the cost of a parallel hardware attack at no cost to real users, but I can’t now find the reference. Again, for proof of concept, I’ve used a pretty simple scheme.
In this scheme, you hash the public key together with a nonce to generate the name, and the combination is only valid if the last 24 bits of the hash are all zero; then you use the first 64 bits to generate the name. You’ll have to try about 224 different nonces before you find a valid one, so name generation takes 10-20 seconds as before, but now you supply the nonce along with the public key to someone verifying your name, and they can verify the name instantly with a single hash function invocation, making sure to check that you’ve done your work and the last 24 bits are zero.
This plan has another advantage too: there’s now more than one name for each public key. That means that if the name the system gives you is a particularly unfortunate choice of words, you can just run it again, wait another 10-20 seconds, and get another one. I suspect this is pretty much an essential property of any such scheme.
We’re not quite finished, because our attacker has another trick up their sleeve: multiple-target attacks. All of the foregoing assumes they’re trying to imitate one person: you. Perhaps they want to imitate either you or your sister? Now every name they generate has two chances of matching – one against your name, and one against your sisters – and so on average it will take them half as long to find a match. What if they’re happy to match any of the 26,000 people who work at the Pentagon? What if they’re happy to match anyone in the world? In a world where public keys are in widespread use by nearly everyone, this could make their attack easier by a factor of 232 – this probably isn’t the most realistic attack scenario, but in crypto it pays to be pessimistic, and this brings the cost of the attack down to a very achievable 256 steps, even given the countermeasures described above.
Again, we could prevent that by making the name longer – another two components pushes the cost back up to over 281. However, we can make a lesser sacrifice of memorability with the third idea: we add a component to the name that you choose. Your name might be a good choice, but you can choose a pseudonym or a handle so long as you think that other people making the same choice will be reasonably rare. We hash this “handle” in along with the nonce and the public key. Now our attacker can’t attack all the names at once – only all the names with the same handle, which hopefully is a much smaller subset. Even if there are a dozen John Smiths working at the Pentagon, the gain for our attacker is vastly reduced – they’re still looking at a work factor of over 284 to hit any of them. Of course our attacker can make up as many names with this prefix as they like, but it doesn’t help them much, because they’ll just end up attacking their own names rather than the names they care about. And this big improvement in security comes at a relatively small price in memorability, since the “handle” is chosen to be meaningful to both sender and recipient.
You can improve memorability further by using a mnemonic. I’ve been able to memorize the randomly generated name I used as an example by imagining a square meter pool of oil dripping slowly onto the opening credits of a film: area-fluid-above-movie-start.
I’d like to get some feedback and look into using a more sophisticated proof-of-work scheme before I propose an implementation of this. However, one further improvement comes to mind: it might make life a lot easier if I used a URI-like syntax. Something like this:
(Note: a more detailed explanation and justification can be found in Squaring Zooko’s Triangle, part two)
Zooko’s Triangle encompasses the three properties you’d like a name to have:
The thesis behind the triangle is that of these three desirable properties, you can have any two. If you’re not familiar with Zooko’s Triangle, Clay Shirky’s essay is the clearest explanation (he independently rediscovered the same idea), and Marc Stiegler’s essay on petnames illustrates it and discusses one solution.
In this essay I propose a kind of name which is entirely decentralized, reasonably secure, and at least somewhat memorable. They look like this:
Paul_Crowley:area-fluid-above-movie-start
The name is in two parts: the handle h (“Paul_Crowley”) and the tag t (“area-fluid-above-movie-start”). Associated with a particular name will be a public key k and a 32-bit nonce n. I can print the name on my business cards, and publish the key and nonce on my website, and you can securely verify using the nonce that the key belongs to the name.
The scheme is based on a fixed, globally shared word list. I got a possible list using the Beale list from Diceware, which is designed for a different but related purpose, and removing all the words that contained special characters; the resulting list contains 7322 words. To fill out the details on this scheme, we’ll also need to fix a hash function and an input encoding for the hash function; SHA-256 and SPKI-like S-expressions seem plausible choices.
Given a name h:t as above, a key k, and the nonce n as above, you verify the match as follows: hash three of these together to generate v = H((h, k, n)). Now verify that the last 23 bits of v are all zero. Finally, treat the first 65 bits as five groups of 13 bits, and use each group of 13 bits as an index into the word list; ensure the resulting 5 words match the five words in the tag.
To generate yourself a name, you’ll need a public key k and a handle h; you should choose a handle that you don’t think many other people will have chosen. Repeatedly try lots of random values for the nonce n until you find one which results in a hash which is zero in the last 23 bits, and for which each of the five groups of 13 bits yields an index small enough that it doesn’t point off the end of the word table (since they can yield 8192 different values but there are only 7322 words in the table). On a standard PC it should take about 10-20 seconds to find a suitable nonce. Of course, you may not like the resulting tag (“paulcrowley:hitch-marty-grail-mild-coot”) in which case you can keep searching for a nonce you like better.
It’s clear that this is decentralized, and you can assess the memorability for yourself; the example I’ve used is randomly generated, though I had a few goes to find one I liked. How is it for security?
Let’s suppose our attacker wants to pretend to be me. They must find a (h, k’, n’) such that H((h, k, n)) matches H((h, k’, n’)) in 88 bits: the 65 bits at the start that make up the tag, and the 23 zero bits at the end. Assuming there’s no way of exploiting some useful property of the hash function, this will take about 288 trials, far outside the reach of even the best-funded attacker.
Our attacker might like to avail themselves of a multiple-target attack (as I’ve described earlier). However, they can only simultanously attack all those with the same handle, which if you’ve chosen your handle well will be few.
This makes sacrifices on both the memorability and security fronts, but it’s the closest thing I’ve seen to something that achieves all three goals.
(Continued in Squaring Zooko’s Triangle, part two)
Hopefully this will be crossposted in several places.
My current plan to change the world involves writing a manifesto for a proposed mailing list to work out crypto standards that actually work and stand a chance of getting widely adopted in the open source world. This is essentially version 0.1.5 of that rant, and may contain some inaccuracies or overstatements; I look forward to your comments and corrections.
Currently there are four crypto standards that see any real use in open source land; in order of deployment, they are:
It’s worth noting one other infuriating consequence of the PKI problems these applications display: none of them really talk to each other. You can buy an X.509 certificate that will do for both your SSL and IPSec applications, if you’re really rich; these certificates will cost you far more than a normal SSL certificate, and for no better reason than that they are more useful and so Verisign and their friends are going to ream you harder for them. Apart from that, each application is an island that will not help you get the others set up at all.
I’ve left out WEP/WPA basically because it’s barely even trying. It should never have existed, and wouldn’t have if IPSec had been any good. <hr/>
I’m now in the position of wanting to make crypto recommendations for the next generation of the Monotone revision control system. I wish I had a better idea what to tell them. They need transport-level crypto for server-to-server connections, but I hesitate to recommend SSL because the poison that is X.509 is hard to remove and it makes all the libraries for using SSL ugly and hard to use. They need to sign things, but I don’t want to recommend OpenPGP: it’s hard to talk to and the Web of Trust is a truly terrible fit for their problem; on top of which, OpenPGP has no systematic way to assert the type of what you’re signing. They need a way for one key to make assertions about another, and we’re going to invent all that from scratch because nothing out there is even remotely suitable.
Monotone has re-invented all the crypto for everything it does, and may be about to again. And in doing so, it’s repeating what many, many open source applications have done before, in incompatible and (always) broken ways, because the existing standards demand too much of them and give back too little in return. As a result, crypto goes unused in practically all the circumstances where it would be useful, and in the rare case that it is used it is at best inconvenient and unnecessarily insecure.
I don’t believe that things are better in the closed source world either; in fact they are probably worse. I just care more about what happens in the open source world.
We can do better than this. Let’s use what we’ve learned in the thirty-odd years there’s been a public crypto community to do something better. Let’s leave the vendors out, with their gratuitous complexity and incompatibility as commercial interests screw up the standards process, and write our own standards that we’ll actually feel like working with. We can make useful software without their support, and it seems in this instance that their support is worse than useless.
A good starting point is SPKI. SPKI has a very nice, clean syntax that’s easy to work with in any programming language, very straightforward semantics, and supports constructions that anticipate the central ideas behind petnames and Zooko’s Triangle. Unfortunately SPKI seems to be abandoned today; the feeling when I last looked at it was that despite their inadequacies, the victory of PKIX and X.509 was now inevitable and resistance was futile.
Well, it turns out that X.509 was so bad that no amount of industry support could turn it into the universal standard for key management applications. There are places that it will simply never be able to go, and in fact these are the vast majority of real crypto applications. On top of which, there is a limit to how far a standard that hardly anyone will ever understand the application of can go.
It’s time we brought back SPKI. But more than that, it’s time we adapted it for the times it finds itself in; take out the parts that complicate it unnecessarily or slow its adoption, extend it to do more than just PKI, and specify how it can talk to the existing broken cryptographic applications in as useful a way as possible. Once we’ve built a working petnames system to serve as a workable PKI, my current feeling is that we should start with no lesser a goal than replacing all of the standards listed above.
Does anyone else think this sounds like a good idea? What other way forward is there?
GF(232-5) is a finite field.
Many of the most useful bits of coding theory and in particular network codes work by treating the data you want to transmit as a vector over a finite field. The link will tell you the full story, but in summary a finite field is a set of finitely many elements, with addition and multiplication operations defined on them such that they are very well behaved: a + b = b + a, (a + b) c = ac + bc and so on.
For any prime p and any positive integer n, there’s exactly one finite field with pn elements; this field is called GF(pn). In particular, since 2 is a prime, there’s conveniently a finite field with 2n elements. And there are straightforward ways to treat, for example, an array of bytes as an array of elements of GF(28); or if it’s more convenient, break up your data into larger chunks and work in, say, GF(216) or GF(232).
Addition in GF(2n) is simply XOR, which is as fast as you could want in both hardware and software. Multiplication hardware for GF(2n) is similarly much simpler than multiplication hardware for ordinary arithmetic, which has made these fields very attractive for a variety of coding and cryptographic applications in hardware.
Unfortunately, doing these multiplies in software is much less efficient, because processors don’t have suitable instructions and hardware dedicated to these jobs. If you’re using coding theory to transmit large amounts of data, this could give rise to quite an inconvenient CPU load.
My colleague Sebastian Egner and myself have been investigating an alternative that could prove considerably more efficient: instead of working in GF(2n), work in GF(p) for some prime p. This is probably the most familiar finite field: you simply do addition and multiplication “mod p“. So you can use the addition and multiplication units on a normal processor to do most of the work. Two problems arise:
Multiply may be fast, but division isn’t. How will we prevent all those “mod p” operations slowing us to a crawl?
We have a vector of bits to encode, and we have to encode it as another vector of bits. Won’t an intermediate representation “mod p“, with its awkward prime order, introduce unacceptable inefficiencies in coding?
Our proposal addresses both these problems. One application in particular motivated our thinking about all this. Dot products over finite fields are at the heart of both the distribution and the authentication in Microsoft’s BitTorrent-like research project, Avalanche. Our approach has the potential to make the Avalanche software several times more efficient.
We propose to work in the finite field GF(232-5) – here p = 232-5 = 4294967291 is the largest prime below 232. So representing a vector in this field for transmission is easy: just transmit it as a stream of 32-bit words. Converting from a vector of bits to a vector in this field is more interesting, but we’ll come to that later – first, we’ll consider how to do arithmetic efficiently in this field.
The basic idea is to put off that expensive “mod p” operation for as long as possible, and then do it as cheaply as we can. The operation we’re most concerned about is the inner product of two vectors – multiply each component of one vector with the corresponding component in the other, and sum all the products. We do this as follows:
The “multiply by five” trick works because 232 ≡ 5 (mod p). This assumes that vectors have 232 or fewer components, though if they have more than 232/25 components the final summing up steps have to be modified slightly.
Sebastian has implemented this approach and compared it to his own highly optimized GF(216) implementation – a matter in which he has no mean expertise – and found it to be over twice as fast. We believe that considerably greater speedups are possible since use of this field makes the work considerably more amenable to implementation using SIMD instruction sets such as SSE2.
So that leaves us with one last problem – how do we convert the bits to be encoded into a vector in this field ready for encoding?
The most efficient encoding possible reads the input as one huge base-2 integer, and converts it to a base-p integer. This encoder has an asymptotic coding rate of about 1 – 2-34, but encoding and decoding is far, far too slow – in fact, the time grows quadratically on the length of the input. A more time-efficient encoding is needed, even at some sacrifice of coding rate.
Our first ideas to this end were based on the idea of escaping. If the data is compressed, then words of value p or above will be rare. We could just escape them in some way – for example, represent every word x where x ≥ p -1 as a pair (p -1, x – p +1). However, this means that the length of the encoded vector is unpredictable, which could be a pain. Worse, there will be pathological files whose encoded size is twice the size as other files of the same length.
We played with a variety of solutions to this problem, some quite efficient, but since then found a much more efficient encoding based on a somewhat different principle: break the data into blocks of a certain maximum size, and XOR the blocks with a constant.
Decoding in this encoding is lighting fast. Given an encoded vector
(y0, y1, … ,yk)
the decoded vector is simply (2y0 ⊕ y1, … 2y0 ⊕ yk). In other words, multiply the first word by two (as a 32-bit integer, not as a member of the finite field) and then XOR it with all the other words.
Note that for many applications there are many more recipients than senders, so while encoding need only be acceptably efficient, we want decoding to be very efficient. It’s hard to imagine that a more efficient decode is possible here.
Why does this work? Does every possible block have a valid encoding, and how do we find it? Suppose first that we set the maximum block size to, say 219 -1 words. By the pidgeonhole principle, there must be at least one 19-bit prefix that does not appear in the block. Let m be this prefix – then we choose y0 = 212(m ⊕ 0×7FFFF). Now XOR every word in the block with 2y0 – the result is guaranteed to have at least one of its first 19 bits clear, and is therefore guaranteed to be below p.
We find m by constructing a 64 kbyte bitmap, and combing through the data marking every prefix that does appear, then searching the bitmap for the first hole; as we note above there is guaranteed to be at least one. This bitmap will just fit into L1 cache on many processors so the search can be quite efficient.
219 -1 is an awkward kind of block size. We can extend it to 219 as follows: if we find that every 19-bit prefix is represented, each of them must appear exactly once. If d0 is the first word of the data to be encoded, then we choose y0 = (d0 ⊕ 0xFFFFFFF8) >> 1. Now the first word will encode as 232 -8 or 232 -7, which fits below p, and the subsequent words will have at least one zero in the first nineteen bits of their encoding and therefore also fit below p.
This approach adds one word to a block of 219 words, so the rate of the code is 1 – 2-19, which is efficient enough for nearly any purpose. However, we can push the rate further if desired. This approach extends straightforwardly to a block size of 229, but a 64 Mbyte bitmap is impractical – the table is randomly accessed, so once it extends beyond the size of the L2 cache the software will crawl due to cache misses. Instead we use a recursive approach:
This gives us a rate of 1 – 2-29.
We can squeeze one more bit out. In a block of 230-1 words, at least one 29-bit prefix must appear at most once. In other words, there must exist a value m such that either
In either instance, we choose y0 = (m ⊕ 0xFFFFFFF8) >> 1. The words not equal to m have a different 29-bit prefix, and so they encode to values below 232 -8 and thus below p. m itself will encode to either 232 -8 or 232 -7, which are also both below p.
In summary: if you need to treat data as vectors over some field on a conventional computer architecture, it can be significantly faster to compute modulo the prime number p = 232-5 than using a field with size a power of two. You will have to encode arbitrary 32-bit words into values modulo p, but the encoding and decoding algorithms can be extremely fast, and at negligible redundancy: the rate of the code can be as high as 1 – 2-30. This is particularly useful for network coding applications.
This paper proposes to reduce the workload of SSL servers by making the clients carry as much of the crypto-related load as possible. I think it’s possible to do even better.
Key agreement: In the above, the server only has to do an RSA public key operation, which is cheap if the exponent is low (eg three). However, we can do even better (and have a stronger security bound too) by using the Rabin operation – modular squaring – instead. This is more than twice as fast as the RSA operation with exponent three. Normally, Rabin encryption is slowed down by certain steps that are needed to handle the fact that modulo a large semi-prime, most numbers that have square roots have four of them, and the recipient has to know which one you mean. However, modern KEM schemes greatly reduce this cost, and Rabin-KEM encryption is just about the fastest public key operation I know of, with the exception of certain probabalistic signature checking schemes.
Signatures: a trick has been missed here. Modern “one-time” signature schemes (eg “Better than BiBa“) can actually sign many messages before the private key must be discarded for security, which in an online/offline signature scheme greatly reduces the number of documents to be signed. For even greater efficiency, a hash tree can be used to sign many one-time keys simultaneously. At the cost of imposing a small latency on all clients, we can even discard the one-time signatures, avoiding a patent, and directly use hash trees; as many clients try to connect, the server can place the documents to be signed in a hash tree and sign them all with one operation. This scheme scales very nicely: the server performs its public key operation at a constant rate of, say, ten per second, and no matter how many clients are trying to connect these signatures will serve to authenticate the server to them all. The clients may have to wait an extra tenth of a second for the connection to complete, but this cost will be small in the cost of connecting to a busy server.
Client puzzles I’m not sure I understand why mixing up the client puzzle step and the public key generation step is beneficial.
With this scheme, the server only has to do one modular squaring per client – and even that only when the client has proven its worth by solving a client puzzle. I wonder if it’s possible to do even better?
Here’s a proper programming challenge, simple to state, and in the end quite challenging to solve. A function produces a 64-bit output given a 64-bit input. It mostly behaves like a random function, so given a random input, most outputs occur with probability between 0 and 2-63; in other words, for a randomly-chosen 64-bit y, there are usually between 0 and 2 solutions to the equation f(x) = y. However, a few occur with much greater frequency – more like 2-32 or so. The function is pretty quick and easy to calculate, but its internal structure defies direct analysis. How would you identify these more frequent occurrences by experiment? How big a computer would you need to do it in something like a few hours to a few days? Think about it for a while before you read on.
First of all, to be at all confident of finding these high frequency outputs, you’re going to have to examine on the order of 234 outputs. That’s 128 GB of data. When you produce your last output, you want to be able to determine, at some stage, whether it’s the same as any output you’ve already produced, and I think that this makes storing all previous outputs unavoidable, so you’re going to need a big disk. Fortunately I have a 200 GB disk attached to my home machine that I’m not using yet, so I can manage this part. The tricky thing is, how are you going to arrange your data on disk so you can find duplicates?
You can’t just use the disk as one big hash table, because each write will call for a seek, and 8.5 ms times 234 is about 4.6 years. You could buffer and sort your writes, but my home machine only has 512 MB of memory and it needs some for other things, so you would have to do about 512 buffered writes and crawl through the entire 128GB data structure every time, which would be desperately slow each time. You could just write it all to disk and then sort it, but I suspect you’d be looking at a long wait for a disk-based sort of 16 billion 8-byte items to finish – just reading and re-writing the entire dataset will take over three hours, and you would probably have to do that at least nine times.
So the first idea I had was this: As you generate each datum, hash it and put it into one of 1024 bins based on the hash code, and write all the bins at once. Each bin now only contains 128 MB of data, which is sufficiently little that we can read it all into memory and do duplicate finding that way using a big hash table. I implemented this by writing to 1024 files at once, one file per bin. I even found out how to get around the limitations on how many files you are allowed to have open at a time (with ulimit). However, it was incredibly slow; it used only about 6.0% CPU because writing the files was keeping it busy all the time. I think that thousands of files all growing at once is pretty much a pathological case for any filesystem, because both Reiser3 and XFS showed this behaviour. Caveat: when I tried this I was using 2048 bins because I was having trouble with getting my post-binning collection phase to handle larger files, but I doubt halving the number of bins would have made a huge difference. I tried buffering the bins. I even tried putting a pre-buffer before the buffers that fitted into L2 cache to make more efficient use of main memory, but all to no avail. It was going to take a day or two to generate all the data.
I asked for advice online, and the best suggestion I got was to preallocate the files before writing to them. Instead, I waited until I got into work the next morning to ask my colleague David, who takes a special interest in making disks behave efficiently. After some discussion, we came to a strategy that made far more efficient use of the disks.
We allocate 256 kB of memory to each bin, ie 256 MB in total for all 1024 bins. As soon as any one bin fills up, we write the entire 256 MB data structure to disk in one great sweep, empty all the bins, and continue. This makes slightly inefficient use of disk space, because not all the bins will be full. However, because the function is so close to being random, they are all very nearly full, and so the inefficiency is only around 2%; in other words, instead of making 512 writes we end up making 522. Now the generation phase is as disk-efficient as it can be – all it does is append giant chunks of data to a file at a time – and it takes only a couple of hours at 60% CPU usage.
Of course, the cost is that now there is a collection phase to read the contents of each bin. This phase takes 522 seeks to piece together the contents of each bin, and it has to be repeated 1024 times, once for each bin; the total cost is about an hour. This is in any case comparable to the cost of finding the duplicates in each bin. Since one process is disk-intensive and the other is memory-intensive, it makes sense to do them in parallel; after discussion with David the simplest parallelization turned out to be the humble Unix pipe. “./extractbins 203 binoutput | ./finddups” means that one can be pulling the data off the disk at the same time as the other is placing it into the hash table (which is slow because practically every write is a cache miss, and you have to read before you write introducing latency). Once the hash table is full, streaming through it to find the duplicates is pretty fast, around half a second.
This story doesn’t yet have a happy ending – hardware problems with my home machine mean that I can’t run the collection phase without crashing it, which I think means I need to buy some decent memory for it. But it was very cool to get a twelve-times speedup for the whole process out of half-an-hour’s discussion with a knowledgable colleague.
You are currently browsing the archives for the Cryptology category.