I present a new implementation of the stream cipher Trivium designed for cryptanalysts, in particular those interested in applying the “cube attack” to Trivium. It generates 128 simultaneous output streams using SSE2 intrinsics, and achieves under 1 cycle/byte, over four times faster than standard implementations. The entire program is in Python; SSE2 machine instructions are generated and called using the tool CorePy, an approach I am happy to recommend to others with similar needs. The code is under the MIT licence and may be found in this Mercurial repository.
Dinur and Shamir’s “cube attack” caused something of a media storm when it was first presented at a rump session at Crypto 2008. The attack may be widely applicable: it takes an extremely general view of the cipher under attack, and can be applied even where the cipher is treated as a “black box” into which key and IV bits are put in and output bits result, with no consideration of its internal structure. Early reports that this attack would break all of our most trusted primitives are false, however - the attack depends on the degree of the polynomial representing the working of the cipher, and for a cipher like AES the degree is far too high for the attack to work.
The one real cipher against which the attack has been applied is Trivium, a personal favourite of mine with real beauty in the design; astonishingly simple, neatly implementable in both hardware and software, and carefully thought out resistance to cryptanalysis. Trivium has 1152 initialization rounds; the cube attack breaks a weakened variant with 735 initialization rounds in less than 230 operations, and the authors say “We believe that by using more advanced techniques, we can break much stronger Trivium variants, and maybe even the original cipher”.
As referenced above, applying the cube attack to a cipher is to a certain extent an empirical business; one searches for a set of “maxterms” that break the cipher, and this involves running the cipher many, many times. It occurred to me that for this use case, we could improve on the standard implementation of the cipher by going back to the roots of bitslicing.
The block cipher DES is designed entirely around efficient hardware implementation, and is notoriously difficult to implement in software. In particular, several parts of the cipher require complex permutations in bits - in hardware, this is just some wires crossing, but in software a fiddly sequence of shifts, rotates and masks. In 1997, Eli Biham presented A Fast New DES Implementation in Software. This implementation achieved its speed - about five times faster than the fastest previously known - by doing 64 simultaneous block encryptions in parallel, treating a normal 64-bit Alpha computer as a SIMD computer, and using each bit position of a word to describe part of the state of one of the 64 simultaneous encryptions. Starting from a description of a logic gate circuit to implement DES, this implementation would use standard logical operations to mirror the work of the circuit. No shifts or rotates are needed - each column of bits is an independent part of the work, and each logical operation carries 64 times its weight in useful work done. This technique was dubbed “bitslicing” by Matthew Kwan, and the name stuck.
Not long after this, the block cipher Serpent was published by Anderson, Biham, and Knudsen. Serpent takes advantage of bitslicing, not for multiple parallel instances, but to make a single instance of the cipher. The state of the cipher is four 32-bit words, and the nonlinearity in the cipher is a four-bit S-box on bit columsn of these words implemented using bitslicing. These alternate with a linear mixing stage based on shifts, rotates and XORs. This proved a popular technique, and Trivium itself is designed to be implemented using bitslicing in a similar way, advancing the state by up to 64 bits at a time using logical operations on words.
However, this implementation is bitsliced in the old fashioned way - it is 128 independent instances of the Trivium cipher, represented as vectors of 128 bits. This has two advantages over a standard Trivium implementation: twice as many bits can be operated on in a single instruction using SSE2 intrinsics, and no shifts or masks are needed. As a result, it operates at over four times the speed, achieving under 1 cycle/byte - 2.4 GBytes/sec - on my 2.4 GHz Intel Core 2 Duo PC.
One unusual feature of this implementation is worth mentioning - it’s in Python. Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
The codebase includes our bitslice implementation and a very simple pure-Python Trivium implementation for testing, as well as code to test one against the other, to benchmark the fast implementation, and to use our fast implementation to test the “cube attack”. Please do let me know if you’re playing with this or putting it to use, I look forward to hearing from you.
GF(232-5) is a finite field.
Many of the most useful bits of coding theory and in particular network codes work by treating the data you want to transmit as a vector over a finite field. The link will tell you the full story, but in summary a finite field is a set of finitely many elements, with addition and multiplication operations defined on them such that they are very well behaved: a + b = b + a, (a + b) c = ac + bc and so on.
For any prime p and any positive integer n, there’s exactly one finite field with pn elements; this field is called GF(pn). In particular, since 2 is a prime, there’s conveniently a finite field with 2n elements. And there are straightforward ways to treat, for example, an array of bytes as an array of elements of GF(28); or if it’s more convenient, break up your data into larger chunks and work in, say, GF(216) or GF(232).
Addition in GF(2n) is simply XOR, which is as fast as you could want in both hardware and software. Multiplication hardware for GF(2n) is similarly much simpler than multiplication hardware for ordinary arithmetic, which has made these fields very attractive for a variety of coding and cryptographic applications in hardware.
Unfortunately, doing these multiplies in software is much less efficient, because processors don’t have suitable instructions and hardware dedicated to these jobs. If you’re using coding theory to transmit large amounts of data, this could give rise to quite an inconvenient CPU load.
My colleague Sebastian Egner and myself have been investigating an alternative that could prove considerably more efficient: instead of working in GF(2n), work in GF(p) for some prime p. This is probably the most familiar finite field: you simply do addition and multiplication “mod p“. So you can use the addition and multiplication units on a normal processor to do most of the work. Two problems arise:
Multiply may be fast, but division isn’t. How will we prevent all those “mod p” operations slowing us to a crawl?
We have a vector of bits to encode, and we have to encode it as another vector of bits. Won’t an intermediate representation “mod p“, with its awkward prime order, introduce unacceptable inefficiencies in coding?
Our proposal addresses both these problems. One application in particular motivated our thinking about all this. Dot products over finite fields are at the heart of both the distribution and the authentication in Microsoft’s BitTorrent-like research project, Avalanche. Our approach has the potential to make the Avalanche software several times more efficient.
We propose to work in the finite field GF(232-5) - here p = 232-5 = 4294967291 is the largest prime below 232. So representing a vector in this field for transmission is easy: just transmit it as a stream of 32-bit words. Converting from a vector of bits to a vector in this field is more interesting, but we’ll come to that later - first, we’ll consider how to do arithmetic efficiently in this field.
The basic idea is to put off that expensive “mod p” operation for as long as possible, and then do it as cheaply as we can. The operation we’re most concerned about is the inner product of two vectors - multiply each component of one vector with the corresponding component in the other, and sum all the products. We do this as follows:
The “multiply by five” trick works because 232 ≡ 5 (mod p). This assumes that vectors have 232 or fewer components, though if they have more than 232/25 components the final summing up steps have to be modified slightly.
Sebastian has implemented this approach and compared it to his own highly optimized GF(216) implementation - a matter in which he has no mean expertise - and found it to be over twice as fast. We believe that considerably greater speedups are possible since use of this field makes the work considerably more amenable to implementation using SIMD instruction sets such as SSE2.
So that leaves us with one last problem - how do we convert the bits to be encoded into a vector in this field ready for encoding?
The most efficient encoding possible reads the input as one huge base-2 integer, and converts it to a base-p integer. This encoder has an asymptotic coding rate of about 1 - 2-34, but encoding and decoding is far, far too slow - in fact, the time grows quadratically on the length of the input. A more time-efficient encoding is needed, even at some sacrifice of coding rate.
Our first ideas to this end were based on the idea of escaping. If the data is compressed, then words of value p or above will be rare. We could just escape them in some way - for example, represent every word x where x ≥ p -1 as a pair (p -1, x - p +1). However, this means that the length of the encoded vector is unpredictable, which could be a pain. Worse, there will be pathological files whose encoded size is twice the size as other files of the same length.
We played with a variety of solutions to this problem, some quite efficient, but since then found a much more efficient encoding based on a somewhat different principle: break the data into blocks of a certain maximum size, and XOR the blocks with a constant.
Decoding in this encoding is lighting fast. Given an encoded vector
(y0, y1, … ,yk)
the decoded vector is simply (2y0 ⊕ y1, … 2y0 ⊕ yk). In other words, multiply the first word by two (as a 32-bit integer, not as a member of the finite field) and then XOR it with all the other words.
Note that for many applications there are many more recipients than senders, so while encoding need only be acceptably efficient, we want decoding to be very efficient. It’s hard to imagine that a more efficient decode is possible here.
Why does this work? Does every possible block have a valid encoding, and how do we find it? Suppose first that we set the maximum block size to, say 219 -1 words. By the pidgeonhole principle, there must be at least one 19-bit prefix that does not appear in the block. Let m be this prefix - then we choose y0 = 212(m ⊕ 0×7FFFF). Now XOR every word in the block with 2y0 - the result is guaranteed to have at least one of its first 19 bits clear, and is therefore guaranteed to be below p.
We find m by constructing a 64 kbyte bitmap, and combing through the data marking every prefix that does appear, then searching the bitmap for the first hole; as we note above there is guaranteed to be at least one. This bitmap will just fit into L1 cache on many processors so the search can be quite efficient.
219 -1 is an awkward kind of block size. We can extend it to 219 as follows: if we find that every 19-bit prefix is represented, each of them must appear exactly once. If d0 is the first word of the data to be encoded, then we choose y0 = (d0 ⊕ 0xFFFFFFF8) >> 1. Now the first word will encode as 232 -8 or 232 -7, which fits below p, and the subsequent words will have at least one zero in the first nineteen bits of their encoding and therefore also fit below p.
This approach adds one word to a block of 219 words, so the rate of the code is 1 - 2-19, which is efficient enough for nearly any purpose. However, we can push the rate further if desired. This approach extends straightforwardly to a block size of 229, but a 64 Mbyte bitmap is impractical - the table is randomly accessed, so once it extends beyond the size of the L2 cache the software will crawl due to cache misses. Instead we use a recursive approach:
This gives us a rate of 1 - 2-29.
We can squeeze one more bit out. In a block of 230-1 words, at least one 29-bit prefix must appear at most once. In other words, there must exist a value m such that either
In either instance, we choose y0 = (m ⊕ 0xFFFFFFF8) >> 1. The words not equal to m have a different 29-bit prefix, and so they encode to values below 232 -8 and thus below p. m itself will encode to either 232 -8 or 232 -7, which are also both below p.
In summary: if you need to treat data as vectors over some field on a conventional computer architecture, it can be significantly faster to compute modulo the prime number p = 232-5 than using a field with size a power of two. You will have to encode arbitrary 32-bit words into values modulo p, but the encoding and decoding algorithms can be extremely fast, and at negligible redundancy: the rate of the code can be as high as 1 - 2-30. This is particularly useful for network coding applications.
You are currently browsing the archives for the Cryptology category.