technology from back to front

Security: Squaring Zooko’s Triangle, part two

A few days ago, in Squaring Zooko’s Triangle, I sketched out a proposal for a new naming scheme which to a limited extent achieved all of the contradictory-seeming aims set out in Zooko’s Triangle. Having discussed the essay with a few folk since, it seems like it might be worthwhile trying to clarify the ideas behind it.

A brief aside: Zooko’s Tetrahedron

A brief aside before we start: Zooko’s Triangle is actually a tetrahedron. There are four desirable properties of a name, and we know how to achieve any three of these four. You hear it discussed as a triangle because authors take one of the four points as so essential and non-negotiable that it isn’t even discussed, but different authors take different points as essential so it can be confusing to read multiple accounts of it. The four desiderata are

  • Security/uniqueness: one person has a better claim to the name than any other, and can establish that claim.
  • Memorability: you can remember the name if you see it on the side of a bus.
  • Transferability: you can give the name to someone else, and it means the same to them as it does do you
  • Authoritylessness: there is no central body to whom ultimate authority over the entire namespace is delegated

Sacrifice any one of these and there’s an existing system that provides the other three.

  • Sacrifice security/uniqueness and you get ordinary names and nicknames. I can claim to be “ciphergoth” with no authority, and you can just take in on trust from me if you like.
  • Sacrifice memorability and you get PGP key fingerprints - I can generate a key with a fingerprint any time I like, and if the system is secure, no-one else can for example sign a message with a key matching that fingerprint.
  • Sacrifice transferability, and I can locally bind a memorable name to a public key, as for example my ssh cache does.
  • Sacrifice authoritylessness and you get domain names, as maladministered by ICANN/Verisign.

In this framework, the original article on Zooko’s Triangle and Clay Shirky’s essay assume that transferability is non-negotiable, while Petnames assumes that authoritylessness is non-negotiable, and describes a system that tries to achieve all three goals by switching seamlessly between the three different kinds of names that remain.

Word lists

Now back to my proposal, which attempts to achieve all four. This proposal combines three ideas towards this goal.

I definitely don’t claim that what I propose here does away with the need for systems like Petnames, by the way; it has its own disadvantages and is meant only to complement those systems.

We start with PGP key fingerprints, which look like this:

ECA1 0308 8103 0F40 3983 FC7E F971 8B0F 70C1 A697

Secure, authorityless, transferrable, and totally unmemorable.

So, the first idea is to use word lists instead of hexadecimal to improve readability. I went to Diceware for my wordlist, since that wordlist is specifically meant to provide a long list of short memorable words for cryptographic purposes. Perhaps there are other suitable lists - do let me know - but that one is certainly good enough for a demonstration of principle. I stripped out all the ones containing special characters, since they are inconvenient for a variety of reasons and it only makes the list slightly shorter. The resulting list contains 7322 words, so you can encode about 12.8 bits with each word, and 160 bits comes to 13 words:

haiti-blotch-emma-wage-dusty-galaxy-dw-skate-mid-quiz-ivan-cramp-lamb

Proof of work

Well, that’s an improvement, but it’s still a long way from what I’d call “memorable”. It might be memorable if we could cut it down to something much shorter - I aimed for five words:

haiti-blotch-emma-wage-dusty

Now that’s getting close to memorable. It’s also completely insecure. At five words, it’s encoding about 64 bits. If that’s your name, and I want to imitate you, I have to generate and hash 264 public keys before I find one that matches this one; because it’s in bulk the public key generation can be done very cheaply, so the only real cost is the hash. This is comparable to the effort involved in the distributed RC5-64 attack which was successfully completed five years ago and would be perhaps ten times easier today.

Actually, the situation is much worse than this, but I’ll come to that in a moment.

We can foil this attack by increasing the number of words, but at the cost of memorability. Instead we make this attack more expensive in a different way: we increase the cost of generating a name given a public key. One possible technique would be something like key stretching; we could subject the key to an iterated hash with many steps before generating the name from it. If we used for example 224 steps, then generating a name would take 10-20 seconds - acceptable, since generating names is rare - while the difficulty of our attack goes up to 288 steps, well outside the range of even the best funded attacker today.

Unfortunately, this approach also slows down verifying a public key given a name to 10-20 seconds, and that’s much more of a problem because that’s a frequent operation. On a small device like a phone, it would be much worse; again, you can probably sacrifice generating names on your phone but you won’t want to sacrifice verifying them. Instead, the second idea: we use a proof-of-work based scheme. A more complete proposal should probably use one of the more sophisticated proof-of-work schemes out there; I saw a great one recently involving function inversion to force a minimum memory usage on the attacker, increasing the cost of a parallel hardware attack at no cost to real users, but I can’t now find the reference. Again, for proof of concept, I’ve used a pretty simple scheme.

In this scheme, you hash the public key together with a nonce to generate the name, and the combination is only valid if the last 24 bits of the hash are all zero; then you use the first 64 bits to generate the name. You’ll have to try about 224 different nonces before you find a valid one, so name generation takes 10-20 seconds as before, but now you supply the nonce along with the public key to someone verifying your name, and they can verify the name instantly with a single hash function invocation, making sure to check that you’ve done your work and the last 24 bits are zero.

This plan has another advantage too: there’s now more than one name for each public key. That means that if the name the system gives you is a particularly unfortunate choice of words, you can just run it again, wait another 10-20 seconds, and get another one. I suspect this is pretty much an essential property of any such scheme.

Handle

We’re not quite finished, because our attacker has another trick up their sleeve: multiple-target attacks. All of the foregoing assumes they’re trying to imitate one person: you. Perhaps they want to imitate either you or your sister? Now every name they generate has two chances of matching - one against your name, and one against your sisters - and so on average it will take them half as long to find a match. What if they’re happy to match any of the 26,000 people who work at the Pentagon? What if they’re happy to match anyone in the world? In a world where public keys are in widespread use by nearly everyone, this could make their attack easier by a factor of 232 - this probably isn’t the most realistic attack scenario, but in crypto it pays to be pessimistic, and this brings the cost of the attack down to a very achievable 256 steps, even given the countermeasures described above.

Again, we could prevent that by making the name longer - another two components pushes the cost back up to over 281. However, we can make a lesser sacrifice of memorability with the third idea: we add a component to the name that you choose. Your name might be a good choice, but you can choose a pseudonym or a handle so long as you think that other people making the same choice will be reasonably rare. We hash this “handle” in along with the nonce and the public key. Now our attacker can’t attack all the names at once - only all the names with the same handle, which hopefully is a much smaller subset. Even if there are a dozen John Smiths working at the Pentagon, the gain for our attacker is vastly reduced - they’re still looking at a work factor of over 284 to hit any of them. Of course our attacker can make up as many names with this prefix as they like, but it doesn’t help them much, because they’ll just end up attacking their own names rather than the names they care about. And this big improvement in security comes at a relatively small price in memorability, since the “handle” is chosen to be meaningful to both sender and recipient.

You can improve memorability further by using a mnemonic. I’ve been able to memorize the randomly generated name I used as an example by imagining a square meter pool of oil dripping slowly onto the opening credits of a film: area-fluid-above-movie-start.

I’d like to get some feedback and look into using a more sophisticated proof-of-work scheme before I propose an implementation of this. However, one further improvement comes to mind: it might make life a lot easier if I used a URI-like syntax. Something like this:

zooko://area-fluid-above-movie-start/Paul Crowley
by
Paul Crowley
on
21/11/07

Security: Squaring Zooko’s triangle

(Note: a more detailed explanation and justification can be found in Squaring Zooko’s Triangle, part two)

Zooko’s Triangle encompasses the three properties you’d like a name to have:

  • secure given the name, you can securely find the person it belongs to in order to, say, check their cryptographic signature on an email
  • memorable you should be able to see a name on a poster, and recognise it later when you come to use it
  • decentralized there should be no central naming authority who can screw things up the way that ICANN/Verisign inevitably do

The thesis behind the triangle is that of these three desirable properties, you can have any two. If you’re not familiar with Zooko’s Triangle, Clay Shirky’s essay is the clearest explanation (he independently rediscovered the same idea), and Marc Stiegler’s essay on petnames illustrates it and discusses one solution.

In this essay I propose a kind of name which is entirely decentralized, reasonably secure, and at least somewhat memorable. They look like this:

Paul_Crowley:area-fluid-above-movie-start

The name is in two parts: the handle h (”Paul_Crowley”) and the tag t (”area-fluid-above-movie-start”). Associated with a particular name will be a public key k and a 32-bit nonce n. I can print the name on my business cards, and publish the key and nonce on my website, and you can securely verify using the nonce that the key belongs to the name.

The scheme is based on a fixed, globally shared word list. I got a possible list using the Beale list from Diceware, which is designed for a different but related purpose, and removing all the words that contained special characters; the resulting list contains 7322 words. To fill out the details on this scheme, we’ll also need to fix a hash function and an input encoding for the hash function; SHA-256 and SPKI-like S-expressions seem plausible choices.

Given a name h:t as above, a key k, and the nonce n as above, you verify the match as follows: hash three of these together to generate v = H((h, k, n)). Now verify that the last 23 bits of v are all zero. Finally, treat the first 65 bits as five groups of 13 bits, and use each group of 13 bits as an index into the word list; ensure the resulting 5 words match the five words in the tag.

To generate yourself a name, you’ll need a public key k and a handle h; you should choose a handle that you don’t think many other people will have chosen. Repeatedly try lots of random values for the nonce n until you find one which results in a hash which is zero in the last 23 bits, and for which each of the five groups of 13 bits yields an index small enough that it doesn’t point off the end of the word table (since they can yield 8192 different values but there are only 7322 words in the table). On a standard PC it should take about 10-20 seconds to find a suitable nonce. Of course, you may not like the resulting tag (”paulcrowley:hitch-marty-grail-mild-coot”) in which case you can keep searching for a nonce you like better.

It’s clear that this is decentralized, and you can assess the memorability for yourself; the example I’ve used is randomly generated, though I had a few goes to find one I liked. How is it for security?

Let’s suppose our attacker wants to pretend to be me. They must find a (h, k’, n’) such that H((h, k, n)) matches H((h, k’, n’)) in 88 bits: the 65 bits at the start that make up the tag, and the 23 zero bits at the end. Assuming there’s no way of exploiting some useful property of the hash function, this will take about 288 trials, far outside the reach of even the best-funded attacker.

Our attacker might like to avail themselves of a multiple-target attack (as I’ve described earlier). However, they can only simultanously attack all those with the same handle, which if you’ve chosen your handle well will be few.

This makes sacrifices on both the memorability and security fronts, but it’s the closest thing I’ve seen to something that achieves all three goals.

(Continued in Squaring Zooko’s Triangle, part two)

by
Paul Crowley
on
10/11/07

Security: A crypto standards manifesto

Hopefully this will be crossposted in several places.

My current plan to change the world involves writing a manifesto for a proposed mailing list to work out crypto standards that actually work and stand a chance of getting widely adopted in the open source world. This is essentially version 0.1.5 of that rant, and may contain some inaccuracies or overstatements; I look forward to your comments and corrections.

Currently there are four crypto standards that see any real use in open source land; in order of deployment, they are:

  • SSL/TLS
  • SSH
  • OpenPGP/GnuPG
  • IPSec
These are the best examples of good practice that we cite when we’re trying to encourage people to use standards rather than make it up, and all of them fail to be any good as a practical, convenient basis by which people writing open source software can make their software more secure through cryptography. All of them suffer from three problems; in order of increasing severity
  • They are all designed long ago, in three cases initially by people who were not cryptographers, and are difficult to adapt to new knowledge in the crypto world about how to build good secure software. As a result, deprecated constructions for which there are no good security reductions are common. They are also generally far less efficient than they need to be, which would be a very minor problem if it didn’t put people off using them.
  • In every case protocols and file formats introduce far more complexity than is needed to get the job done, and often this shows up as complexity for the users and administrators trying to make them work, as well as unnecessary opportunities to make them insecure through misconfiguration.
  • But by far the worst of all is the parlous state of PKI. This of course is something I’ve ranted about before:
    • SSL’s dependence on the disaster that is X.509 makes it insecure, painful for clients, and imposes the ridiculous Verisign Tax on servers, as well as making it very unattractive as a platform for new software development.
    • SSH occasionally shows you a dialog saying “you haven’t connected to this server before, are you sure?” I’m sure someone’s going to tell me they actually check the fingerprints before connecting, but let me assure you, you are practically alone in this. I can’t even share this information across all the machines I log in from, even if I use ssh-agent. The situation for authenticating clients to servers is slightly better, but still involves copying private keys about by hand if you want the most convenience out of it. It makes you copy whole public keys rather than something shorter and more convenient like OpenPGP fingerprints. It certainly doesn’t make use of the basic fact that keys can sign assertions about other keys to make life more convenient.
    • OpenPGP’s authentication is based on the PGP Web of Trust, which is all about binding keys to real names using things like passports. As I’ve argued before, this is a poor match for what people actually want keys to do; it’s a very poor match for authenticating anything other than a person.
    • IPSec is also tied to the X.509 disaster. It is also so complex and hard to set up that AFAICT most IPSec installations don’t use public key cryptography at all.
Perhaps the key management problems in all these applications can be pinned down to one thing: they were all designed and deployed before Zooko’s triangle was articulated with sufficient clarity to understand the options available.

It’s worth noting one other infuriating consequence of the PKI problems these applications display: none of them really talk to each other. You can buy an X.509 certificate that will do for both your SSL and IPSec applications, if you’re really rich; these certificates will cost you far more than a normal SSL certificate, and for no better reason than that they are more useful and so Verisign and their friends are going to ream you harder for them. Apart from that, each application is an island that will not help you get the others set up at all.

I’ve left out WEP/WPA basically because it’s barely even trying. It should never have existed, and wouldn’t have if IPSec had been any good.


I’m now in the position of wanting to make crypto recommendations for the next generation of the Monotone revision control system. I wish I had a better idea what to tell them. They need transport-level crypto for server-to-server connections, but I hesitate to recommend SSL because the poison that is X.509 is hard to remove and it makes all the libraries for using SSL ugly and hard to use. They need to sign things, but I don’t want to recommend OpenPGP: it’s hard to talk to and the Web of Trust is a truly terrible fit for their problem; on top of which, OpenPGP has no systematic way to assert the type of what you’re signing. They need a way for one key to make assertions about another, and we’re going to invent all that from scratch because nothing out there is even remotely suitable.

Monotone has re-invented all the crypto for everything it does, and may be about to again. And in doing so, it’s repeating what many, many open source applications have done before, in incompatible and (always) broken ways, because the existing standards demand too much of them and give back too little in return. As a result, crypto goes unused in practically all the circumstances where it would be useful, and in the rare case that it is used it is at best inconvenient and unnecessarily insecure.

I don’t believe that things are better in the closed source world either; in fact they are probably worse. I just care more about what happens in the open source world.

We can do better than this. Let’s use what we’ve learned in the thirty-odd years there’s been a public crypto community to do something better. Let’s leave the vendors out, with their gratuitous complexity and incompatibility as commercial interests screw up the standards process, and write our own standards that we’ll actually feel like working with. We can make useful software without their support, and it seems in this instance that their support is worse than useless.

A good starting point is SPKI. SPKI has a very nice, clean syntax that’s easy to work with in any programming language, very straightforward semantics, and supports constructions that anticipate the central ideas behind petnames and Zooko’s Triangle. Unfortunately SPKI seems to be abandoned today; the feeling when I last looked at it was that despite their inadequacies, the victory of PKIX and X.509 was now inevitable and resistance was futile.

Well, it turns out that X.509 was so bad that no amount of industry support could turn it into the universal standard for key management applications. There are places that it will simply never be able to go, and in fact these are the vast majority of real crypto applications. On top of which, there is a limit to how far a standard that hardly anyone will ever understand the application of can go.

It’s time we brought back SPKI. But more than that, it’s time we adapted it for the times it finds itself in; take out the parts that complicate it unnecessarily or slow its adoption, extend it to do more than just PKI, and specify how it can talk to the existing broken cryptographic applications in as useful a way as possible. Once we’ve built a working petnames system to serve as a workable PKI, my current feeling is that we should start with no lesser a goal than replacing all of the standards listed above.

Does anyone else think this sounds like a good idea? What other way forward is there?

by
Paul Crowley
on
20/02/07

Security: GF(232-5)

GF(232-5) is a finite field.

Many of the most useful bits of coding theory and in particular network codes work by treating the data you want to transmit as a vector over a finite field. The link will tell you the full story, but in summary a finite field is a set of finitely many elements, with addition and multiplication operations defined on them such that they are very well behaved: a + b = b + a, (a + b) c = ac + bc and so on.

For any prime p and any positive integer n, there’s exactly one finite field with pn elements; this field is called GF(pn). In particular, since 2 is a prime, there’s conveniently a finite field with 2n elements. And there are straightforward ways to treat, for example, an array of bytes as an array of elements of GF(28); or if it’s more convenient, break up your data into larger chunks and work in, say, GF(216) or GF(232).

Addition in GF(2n) is simply XOR, which is as fast as you could want in both hardware and software. Multiplication hardware for GF(2n) is similarly much simpler than multiplication hardware for ordinary arithmetic, which has made these fields very attractive for a variety of coding and cryptographic applications in hardware.

Unfortunately, doing these multiplies in software is much less efficient, because processors don’t have suitable instructions and hardware dedicated to these jobs. If you’re using coding theory to transmit large amounts of data, this could give rise to quite an inconvenient CPU load.

My colleague Sebastian Egner and myself have been investigating an alternative that could prove considerably more efficient: instead of working in GF(2n), work in GF(p) for some prime p. This is probably the most familiar finite field: you simply do addition and multiplication “mod p“. So you can use the addition and multiplication units on a normal processor to do most of the work. Two problems arise:

  • Multiply may be fast, but division isn’t. How will we prevent all those “mod p” operations slowing us to a crawl?

  • We have a vector of bits to encode, and we have to encode it as another vector of bits. Won’t an intermediate representation “mod p“, with its awkward prime order, introduce unacceptable inefficiencies in coding?

Our proposal addresses both these problems. One application in particular motivated our thinking about all this. Dot products over finite fields are at the heart of both the distribution and the authentication in Microsoft’s BitTorrent-like research project, Avalanche. Our approach has the potential to make the Avalanche software several times more efficient.

We propose to work in the finite field GF(232-5) - here p = 232-5 = 4294967291 is the largest prime below 232. So representing a vector in this field for transmission is easy: just transmit it as a stream of 32-bit words. Converting from a vector of bits to a vector in this field is more interesting, but we’ll come to that later - first, we’ll consider how to do arithmetic efficiently in this field.

The basic idea is to put off that expensive “mod p” operation for as long as possible, and then do it as cheaply as we can. The operation we’re most concerned about is the inner product of two vectors - multiply each component of one vector with the corresponding component in the other, and sum all the products. We do this as follows:

  • Multiply together the 32-bit vector components to get a vector of 64-bit products
  • Treat each product as consisting of a 32-bit “high half” and a 32-bit “low half”
  • Sum the “high halves” and “low halves” separately to get two 64-bit sums
  • Multiply the “high half” sum by five, and add it to the low half.
  • Multiply the “high half” of this 64-bit sum by five, and add it to the low half to get a new 64-bit sum
  • Finally, while the sum is p or greater, subtract p.

The “multiply by five” trick works because 232 ≡ 5 (mod p). This assumes that vectors have 232 or fewer components, though if they have more than 232/25 components the final summing up steps have to be modified slightly.

Sebastian has implemented this approach and compared it to his own highly optimized GF(216) implementation - a matter in which he has no mean expertise - and found it to be over twice as fast. We believe that considerably greater speedups are possible since use of this field makes the work considerably more amenable to implementation using SIMD instruction sets such as SSE2.

So that leaves us with one last problem - how do we convert the bits to be encoded into a vector in this field ready for encoding?

The most efficient encoding possible reads the input as one huge base-2 integer, and converts it to a base-p integer. This encoder has an asymptotic coding rate of about 1 - 2-34, but encoding and decoding is far, far too slow - in fact, the time grows quadratically on the length of the input. A more time-efficient encoding is needed, even at some sacrifice of coding rate.

Our first ideas to this end were based on the idea of escaping. If the data is compressed, then words of value p or above will be rare. We could just escape them in some way - for example, represent every word x where xp -1 as a pair (p -1, x - p +1). However, this means that the length of the encoded vector is unpredictable, which could be a pain. Worse, there will be pathological files whose encoded size is twice the size as other files of the same length.

We played with a variety of solutions to this problem, some quite efficient, but since then found a much more efficient encoding based on a somewhat different principle: break the data into blocks of a certain maximum size, and XOR the blocks with a constant.

Decoding in this encoding is lighting fast. Given an encoded vector

(y0, y1, … ,yk)

the decoded vector is simply (2y0y1, … 2y0yk). In other words, multiply the first word by two (as a 32-bit integer, not as a member of the finite field) and then XOR it with all the other words.

Note that for many applications there are many more recipients than senders, so while encoding need only be acceptably efficient, we want decoding to be very efficient. It’s hard to imagine that a more efficient decode is possible here.

Why does this work? Does every possible block have a valid encoding, and how do we find it? Suppose first that we set the maximum block size to, say 219 -1 words. By the pidgeonhole principle, there must be at least one 19-bit prefix that does not appear in the block. Let m be this prefix - then we choose y0 = 212(m ⊕ 0×7FFFF). Now XOR every word in the block with 2y0 - the result is guaranteed to have at least one of its first 19 bits clear, and is therefore guaranteed to be below p.

We find m by constructing a 64 kbyte bitmap, and combing through the data marking every prefix that does appear, then searching the bitmap for the first hole; as we note above there is guaranteed to be at least one. This bitmap will just fit into L1 cache on many processors so the search can be quite efficient.

219 -1 is an awkward kind of block size. We can extend it to 219 as follows: if we find that every 19-bit prefix is represented, each of them must appear exactly once. If d0 is the first word of the data to be encoded, then we choose y0 = (d0 ⊕ 0xFFFFFFF8) >> 1. Now the first word will encode as 232 -8 or 232 -7, which fits below p, and the subsequent words will have at least one zero in the first nineteen bits of their encoding and therefore also fit below p.

This approach adds one word to a block of 219 words, so the rate of the code is 1 - 2-19, which is efficient enough for nearly any purpose. However, we can push the rate further if desired. This approach extends straightforwardly to a block size of 229, but a 64 Mbyte bitmap is impractical - the table is randomly accessed, so once it extends beyond the size of the L2 cache the software will crawl due to cache misses. Instead we use a recursive approach:

  • Pass over the dataset once, counting the frequency of the 10-bit prefixes; this requires only a 4kb table
  • Find the least frequent 10-bit prefix
  • Pass over the dataset a second time copying the words with this prefix to an auxiliary table
  • Recurse over this auxiliary table

This gives us a rate of 1 - 2-29.

We can squeeze one more bit out. In a block of 230-1 words, at least one 29-bit prefix must appear at most once. In other words, there must exist a value m such that either

  • m is the value of a word whose 29-bit prefix does not appear in the data, or
  • m is the value of a word in the data whose 29-bit prefix does not appear anywhere else in the data

In either instance, we choose y0 = (m ⊕ 0xFFFFFFF8) >> 1. The words not equal to m have a different 29-bit prefix, and so they encode to values below 232 -8 and thus below p. m itself will encode to either 232 -8 or 232 -7, which are also both below p.

In summary: if you need to treat data as vectors over some field on a conventional computer architecture, it can be significantly faster to compute modulo the prime number p = 232-5 than using a field with size a power of two. You will have to encode arbitrary 32-bit words into values modulo p, but the encoding and decoding algorithms can be extremely fast, and at negligible redundancy: the rate of the code can be as high as 1 - 2-30. This is particularly useful for network coding applications.

by
Paul Crowley
on
29/11/06

Security: Making the client do all the work

This paper proposes to reduce the workload of SSL servers by making the clients carry as much of the crypto-related load as possible. I think it’s possible to do even better.

Key agreement: In the above, the server only has to do an RSA public key operation, which is cheap if the exponent is low (eg three). However, we can do even better (and have a stronger security bound too) by using the Rabin operation - modular squaring - instead. This is more than twice as fast as the RSA operation with exponent three. Normally, Rabin encryption is slowed down by certain steps that are needed to handle the fact that modulo a large semi-prime, most numbers that have square roots have four of them, and the recipient has to know which one you mean. However, modern KEM schemes greatly reduce this cost, and Rabin-KEM encryption is just about the fastest public key operation I know of, with the exception of certain probabalistic signature checking schemes.

Signatures: a trick has been missed here. Modern “one-time” signature schemes (eg “Better than BiBa“) can actually sign many messages before the private key must be discarded for security, which in an online/offline signature scheme greatly reduces the number of documents to be signed. For even greater efficiency, a hash tree can be used to sign many one-time keys simultaneously. At the cost of imposing a small latency on all clients, we can even discard the one-time signatures, avoiding a patent, and directly use hash trees; as many clients try to connect, the server can place the documents to be signed in a hash tree and sign them all with one operation. This scheme scales very nicely: the server performs its public key operation at a constant rate of, say, ten per second, and no matter how many clients are trying to connect these signatures will serve to authenticate the server to them all. The clients may have to wait an extra tenth of a second for the connection to complete, but this cost will be small in the cost of connecting to a busy server.

Client puzzles I’m not sure I understand why mixing up the client puzzle step and the public key generation step is beneficial.

With this scheme, the server only has to do one modular squaring per client - and even that only when the client has proven its worth by solving a client puzzle. I wonder if it’s possible to do even better?

by
Paul Crowley
on
26/06/06

Security: A trust metric enabled Wikipedia?

Wikipedia has been described as “the encyclopaedia that works in practice but not in theory”. In theory, an encyclopaedia that anyone can edit should suffer from out-of-control trolling and vandalism, and collapse under its own weight in an instant. In practice, that hasn’t happened; vandals attack Wikipedia daily, but it still does very well as a useful source of information, especially if you bear in mind that its primary competition is not Britannica or other paid-for sources, but the rest of the Internet. Nonetheless, a recent controversy over malicious insertion of false information into Wikipedia had some concluding that the very process by which Wikipedia has “incurable flaws”.

Are these flaws incurable? The problem - that anyone can say anything - is one the Internet itself also suffers from. But in 1996/7, something changed which preserved its usefulness in the face of the avalanche of noise and spam that threatened to bury it: Google arrived. To Internet users increasingly finding that their searches were returning nothing but synthesized advertising pages, Google provided a way to pick the wheat from the chaff. The fundamental innovation that made Google possible we now know as a trust metric.

All trust metrics work on the same principle: if I think well of Ann, and Ann thinks well of Bob, I might be more interested in what Bob has to say than just any random Joe. We operate by this principle all the time when we introduce people to others or make recommendations. A trust metric automates the principle, applying it to a very large number of people and recommendations, to find recommendations we would never have been able to find by the manual method. Google’s trust metric simply treated every hyperlink as a recommendation; if lots of trusted pages link to you, then you must be trusted too.

Better than that, Google’s trust metric is attack resistant - it is designed to give good results in the face of active attackers trying to artificially inflate their rank. This simple requirement is one that many trust metrics fail dismally; for example, until a couple of years ago the open source discussion site kuro5hin trivially allowed users to inflate their own rank by creating new “sock puppet” users who rated them highly.

Can attack resistant trust metrics save Wikipedia? Noted free software author Raph Levien thinks so - he has been researching trust metrics for many years, and expresses his frustration that they are not treated as the primary solution to abusive users despite the amazing success of Google in applying them for this purpose.

I agree with him. If you combine trust metrics and wikis, you can get some interesting possibilities.

by
Paul Crowley
on
09/03/06

Security: Looking for needles in haystacks

Here’s a proper programming challenge, simple to state, and in the end quite challenging to solve. A function produces a 64-bit output given a 64-bit input. It mostly behaves like a random function, so given a random input, most outputs occur with probability between 0 and 2-63; in other words, for a randomly-chosen 64-bit y, there are usually between 0 and 2 solutions to the equation f(x) = y. However, a few occur with much greater frequency - more like 2-32 or so. The function is pretty quick and easy to calculate, but its internal structure defies direct analysis. How would you identify these more frequent occurrences by experiment? How big a computer would you need to do it in something like a few hours to a few days? Think about it for a while before you read on.

First of all, to be at all confident of finding these high frequency outputs, you’re going to have to examine on the order of 234 outputs. That’s 128 GB of data. When you produce your last output, you want to be able to determine, at some stage, whether it’s the same as any output you’ve already produced, and I think that this makes storing all previous outputs unavoidable, so you’re going to need a big disk. Fortunately I have a 200 GB disk attached to my home machine that I’m not using yet, so I can manage this part. The tricky thing is, how are you going to arrange your data on disk so you can find duplicates?

You can’t just use the disk as one big hash table, because each write will call for a seek, and 8.5 ms times 234 is about 4.6 years. You could buffer and sort your writes, but my home machine only has 512 MB of memory and it needs some for other things, so you would have to do about 512 buffered writes and crawl through the entire 128GB data structure every time, which would be desperately slow each time. You could just write it all to disk and then sort it, but I suspect you’d be looking at a long wait for a disk-based sort of 16 billion 8-byte items to finish - just reading and re-writing the entire dataset will take over three hours, and you would probably have to do that at least nine times.

So the first idea I had was this: As you generate each datum, hash it and put it into one of 1024 bins based on the hash code, and write all the bins at once. Each bin now only contains 128 MB of data, which is sufficiently little that we can read it all into memory and do duplicate finding that way using a big hash table. I implemented this by writing to 1024 files at once, one file per bin. I even found out how to get around the limitations on how many files you are allowed to have open at a time (with ulimit). However, it was incredibly slow; it used only about 6.0% CPU because writing the files was keeping it busy all the time. I think that thousands of files all growing at once is pretty much a pathological case for any filesystem, because both Reiser3 and XFS showed this behaviour. Caveat: when I tried this I was using 2048 bins because I was having trouble with getting my post-binning collection phase to handle larger files, but I doubt halving the number of bins would have made a huge difference. I tried buffering the bins. I even tried putting a pre-buffer before the buffers that fitted into L2 cache to make more efficient use of main memory, but all to no avail. It was going to take a day or two to generate all the data.

I asked for advice online, and the best suggestion I got was to preallocate the files before writing to them. Instead, I waited until I got into work the next morning to ask my colleague David, who takes a special interest in making disks behave efficiently. After some discussion, we came to a strategy that made far more efficient use of the disks.

We allocate 256 kB of memory to each bin, ie 256 MB in total for all 1024 bins. As soon as any one bin fills up, we write the entire 256 MB data structure to disk in one great sweep, empty all the bins, and continue. This makes slightly inefficient use of disk space, because not all the bins will be full. However, because the function is so close to being random, they are all very nearly full, and so the inefficiency is only around 2%; in other words, instead of making 512 writes we end up making 522. Now the generation phase is as disk-efficient as it can be - all it does is append giant chunks of data to a file at a time - and it takes only a couple of hours at 60% CPU usage.

Of course, the cost is that now there is a collection phase to read the contents of each bin. This phase takes 522 seeks to piece together the contents of each bin, and it has to be repeated 1024 times, once for each bin; the total cost is about an hour. This is in any case comparable to the cost of finding the duplicates in each bin. Since one process is disk-intensive and the other is memory-intensive, it makes sense to do them in parallel; after discussion with David the simplest parallelization turned out to be the humble Unix pipe. “./extractbins 203 binoutput | ./finddups” means that one can be pulling the data off the disk at the same time as the other is placing it into the hash table (which is slow because practically every write is a cache miss, and you have to read before you write introducing latency). Once the hash table is full, streaming through it to find the duplicates is pretty fast, around half a second.

This story doesn’t yet have a happy ending - hardware problems with my home machine mean that I can’t run the collection phase without crashing it, which I think means I need to buy some decent memory for it. But it was very cool to get a twelve-times speedup for the whole process out of half-an-hour’s discussion with a knowledgable colleague.

by
Paul Crowley
on
26/10/05

Security: Truncated differential cryptanalysis of five rounds of Salsa20

eSTREAM have just put my paper online: Truncated differential cryptanalysis of five rounds of Salsa20 (PDF) (discussion, Wikipedia on Salsa20). This doesn’t break the whole cipher, just a seriously reduced version.

Experimentation played a key role in finding this result. I found the first differential by writing a short Python program that implemented a pretty naive differential-characteristic finding strategy. But when I went to test experimentally that the characteristic worked, I found it was there with eight times the frequency I expected. Further experimentation showed that this was due to clustering differential trails resulting in the same characteristic. From there, it was straightforward to just experimentally count the occurrence of lots of characteristics, and then it makes sense to use lots of them to make the attack far more effective. As a result, many fewer differential pairs are required.

The discovery of trails whose probability is not correctly predicted by theory was also a great and exciting surprise. I’m now thinking about how to investigate how to account for these discrepancies; once we can account for them, maybe we can make use of them to build more powerful attacks.

by
Paul Crowley
on
17/10/05

Security: Understanding “Understanding Brute Force”

D J Bernstein’s draft paper Understanding Brute Force argues that the way we currently measure the cost of cryptanalytic attacks is highly misleading. The paper is a good example of Bernstein’s unconventional style, and mixes quite informal writing with very formal and precise descriptions of cryptanalytic methods and costs. Though his conclusions are correct, I think he hasn’t quite put his finger on how people have come to be misled in the past, so I shall have a go here at arguing what I think is the same point in different words.

The traditional argument goes like this: an n-processor machine can never solve a problem more than n times faster than the same machine with only one processor, and sometimes it will be much less than n times faster because some problems don’t parallelize well. So we give the attacker the best advantage by measuring the cost of their attack on a uniprocessor machine, and that’s the best way forward.

The premise is (approximately) correct, but the conclusion is false. The problem is that this reasoning ignores memory cost. Specifically, if your memory costs are high you might be better off spending less on memory and more on processors. By making this error, you can overestimate the efficacy of a cryptanalytic attack.

To highlight this, imagine an attack on a block cipher with a 128-bit key which takes 290 time and requires 290 storage. By current standards, this is a publishable attack which would be considered to “break” the block cipher. But compare this attack with a straightforward parallel brute-force attack: for far less cost, one could build a machine with 264 dedicated processors, which would brute-force through the keyspace of the cipher in 264 steps. Clearly, the cipher is not really any weaker for the existence of this attack.

Or consider the attacks on triple DES presented in my colleague Stefan Lucks’ paper, Attacking Triple Encryption. The final attack presented requires only time equivalent to 290 DES encryptions, which may seem a huge step forward over the 2168 steps required for a brute-force attack. However, it also requires 288 memory. Considered in the above light, this attack seems less impressive: if a dedicated Triple-DES key cracker can be built for the cost of 28 units of storage, then brute force will beat this attack on time and cost quite handily. As it is, the cost of such a unit is likely to be somewhat more, so this attack survives but barely.

This brings us to our first conclusion: the true measure of the cost of an attack is not time, but the product of time and hardware cost. Only by this measure are our comparisons with brute force fair.

Once we accept this measure, however, we open ourselves up to the reverse error; if we persist in considering only serial attacks, we will now underestimate the power available to attackers. Though I don’t understand it well enough to know for sure, it’s possible that the above Triple-DES attack might parallelize quite nicely. If 270 processors each with 218 memory can perform the attack with the full efficacy of parallelism, then only 220 steps are needed to perform the attack, and this is a far lower time and hardware cost than any brute force search. Suddenly this attack becomes far more tempting - but not all attacks are amenable to such parallelization, and the details have to be considered carefully.

From this we draw our second conclusion: it is insufficient to consider serial attacks, and assume that parallel attacks are implicitly covered. Parallel attacks must be considered explicitly. The existence of an attack with hardware cost h and time cost t (total cost ht) does not imply either the existence of an attack with hardware cost h/2 and time cost 2t or one with hardware cost 2h and time cost t/2.

However, one should be careful in asserting that a parallel attack can be “more powerful” than a serial attack. This assertion can be misleading, because it seems to contradict our opening premise which was correct. Rather, this assertion should be taken as the combination of our two conclusions: the correct measure of an attack is not time but the product of time and hardware cost, and it is by this measure that we might find that against some cryptosystems there exist parallel attacks that outperform any serial attack.

Bernstein’s paper also highlights a third point, which touches only tangentally on these arguments about parallelism but also adds to our understanding of brute force: if our attacker has multiple cryptanalytic targets, their life becomes proportionally easier. If our attacker need only break any one of 220 keys to defeat the system, this can make the job up to 220 times easier, because each guess at a key has a 220 times greater chance of being right. This insight was first presented by Biham in 1996 (How to Forge DES-Encrypted Messages in 228 Steps), and is a key component of the attack presented in Extracting a 3DES key from an IBM 4758: the attackers find a way to break into the system if any of 214 DES keys can be discovered, and this make the job of brute-forcing DES sufficiently easy (only 242 steps) that it can be done on an off-the-shelf FPGA evaluation board costing under $1000. Bernstein demonstrates that these advantages can be preserved in a highly parallel attack, and goes on to argue convincingly that the best countermeasure is a larger keyspace - echoing the conclusion of Biham’s original paper on the subject, which suggests that our keys should be twice as long as they need to be to exhaust the computational resources of our attacker.

This leads to an interesting open problem in cryptography. The current effective working definition of what it means for a cipher to be “unbroken” is that there are no distinguishing attacks more effective than brute force. Bernstein suggests that we should modify this definition to allow a cipher to have a key length greater than its design strength; one could declare that a cipher with a 256-bit key has a design strength of only 128 bits, so an attack that required 2160 resources would be considered irrelevant even though it greatly outperforms brute force. The purpose of this distinction would be to gain the resistance against multiple-target cryptanalysis that comes with larger keys without paying the performance cost associated with resisting the impossibly powerful attacker who can muster up to 2256 computing resources.

However, this leaves a problem of defining exactly what security up to a design strength of 128-bits means. If one defines it in the straightforward way - that the cipher is indistinguishable from random by an attacker who can expend up to 2128 resources - then one can achieve this by discarding half the key bits and using a strong 128-bit cipher, which would lend no extra resistance to multiple-target attacks and defeats the purpose of extending the key. I see no more elegant way to do it than directly considering an attacker with access to multiple targets, but I hope that further research will shed more light on this useful question.

by
Paul Crowley
on
21/09/05

Security: Trivium

Of the many new ciphers proposed as part of the ECRYPT Stream Cipher Project, one of the most interesting is Christophe De Cannière and Bart Preneel’s TRIVIUM. TRIVIUM is designed to be very simple, admit a very low gate count implementation in hardware, and be reasonably efficient in both hardware and software, parallelizing in a straightforward way for fast hardware implementations and admitting of a straightforward bitslice implementation in software.

Today’s tip: I now think that the fastest way to write a software implementation for x86 is to store the inverses of the register contents, rather than the register contents themselves. Super bonus points for anyone who can figure out why.

by
Paul Crowley
on
15/08/05
2000-9 LShift Ltd, 1st Floor Office, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK +44 (0)20 7729 7060