It looks like there are errors in the tables at the back of the “cube attack” paper which show how to apply the attack to Trivium: some of the entries don’t work. This could mean simply that there are typos in the table, or it could mean that the attack is somewhat less effective against Trivium than the authors believed. Or, of course, it could mean that there’s a bug in my software.
As discussed in my previous entry, to apply the “cube attack” to a cipher like Trivium, you start by searching for a set of “maxterms”. A maxterm is an affine relation between the key bits and the parity of all the values of an output bit for a particular large set of IVs known as a “cube”. Once you have enough maxterms, you can collect the output bits for all the IVs in all the cubes in the maxterms, then find the parity for each maxterm, and then use linear algebra to invert and find the key bits; so you have a complete set of maxterms when you have as many maxterms as key bits and they are all linearly independent in key bits. In practice you would make do with fewer linearly independent maxterms and make up the difference with exhaustive search.
Trivium has 1152 initialization steps, and no-one currently knows how to break it. Before the cube attack, the best attack on Trivium broke a weakened variant with 672 initialization rounds in 255 steps. The cube attack paper presents three tables of maxterms for Trivium: one set of 63 to break the 672 step variant in 219, one set of 52 to break a stronger 735-step variant in less than 230 steps, and one set of four that might form part of an attack on 770-step Trivium requiring around 236 steps.
However, it looks like 15 of these 119 maxterms presented don’t work. Of these, two work 0% of the time, and so are easily converted to working maxterms by adding or removing a “1+” to the first column. Five appear to be a little unreliable, working 79%-99.2% of the time; a slightly unreliable maxterm is useful to an attacker, but the success probability of the attack is the product of the success probabilities of all the maxterms, so too many unreliable ones will make the attack unlikely to work. But eight seem have empirical success probabilities between 46% and 58%, suggesting that in practice they work 50% of the time ie no better than guesswork.
This could of course be an error in my software; I’d love to see these results verified. You can test for yourself by downloading the latest version of the software here:
Updated: In correspondence between myself and Dinur we’ve fixed another two of the maxterms – fixes to the other 11 are on the way.
Recently I found some pretty libevent benchmarks. For me they show terrifying results. The blood freezing fact is that the more connections you have, the bigger is the cost of adding new connections to asynchronous loop.
It means that if you have 1 connection registered to asynchronous loop, the cost of registering callback would be small; while when you have million of idle connections the cost of registering is big.
It would be horrible if a similar situation would appear on the kernel level. It won’t be as bad as in this benchmark, but still, as a programmer I’d expect that creating and destroying connections is always constant time.
How asynchronous libraries work?
The single chunk of data passed to the async lib is a tuple of three items:
event = (socket_descriptor, callback_function, timeout)
This structure is often referred as an event. When programmer registers such a structure to the asynchronous loop, he expects that his callback will be executed when something happens on a descriptor, or when a timeout occurs. That’s what’s asynchronous programming is all about.
Asynchronous library offers a way of adding and removing such events to and from its main data structure – a priority queue. This queue is sorted by timeouts, so async library always knows which events would be timed out first.
Asynchronous loop is waiting for something happening on descriptors or for timeouts.
When something happens, it executes necessary callbacks. See this pseudocode:
the_main_loop: nearest_event = find_event_with_smallest_timeout() events_triggered = select(<fds...>, nearest_event.timeout) if events_triggered: # has anything happened on the wire? for event in events_triggered: remove_from_datastructure(event) execute_callback(event, timeouted=False) else: # nothing happened, event timeouted remove_from_datastructure(nearest_event) execute_callback(nearest_event, timeouted=True)
In 95% of cases the body of programmers callback function, looks like this:
def my_callback(fd): <read response from descriptor> <write new request to descriptor> add_to_event_loop(fd, my_callback) return
When a timeout occurs, in most cases the callback just cleans up the memory. But usually timeouts don’t happen often.
This pseudocode means that:
Putting this conclusions back into the pseudocode:
the_main_loop: # less often nearest_event = find_event_with_smallest_timeout() events_triggered = select(<fds...>, nearest_event.timeout) if events_triggered: for event in events_triggered: # very often, we have data on the socket! remove_from_datastructure(event) execute_callback(event, timeouted=False) else: # timeout happens rather rarely remove_from_datastructure(nearest_event) execute_callback(nearest_event, timeouted=True)
The problem is that currently asynchronous libraries often use binary heap as a representation of internal priority queue. It means that the cost of registering and removing item (event) from this structure is logarithmic to the number of already registered events.
|binary heap||cost||frequency in my use case|
|add item||O(log(N))||very often|
|remove item||O(log(N))||very often|
|event loop iteration||O(1)||less often|
We can clearly see that the data structure was optimized to have smallest possible time per every event loop iteration. While I think that it should be optimized to reduce the cost of adding and removing events.
I can put the same in different words. In the real world, in most cases something happens on a socket before a timeout occurs. Connections very rarely expire because of a timeout.
Considering this, it’s reasonable to use a data structure that would have O(1) as a time of adding and removing items, and it’s not so bad to have a bit higher cost inside the loop.
As you can imagine others faced that problem before me. The data structure that could be used to manage timers with better average cost is called Timer Wheels. It is used inside Linux kernel to manage operating system timers. It’s described in this paper from 1996.
The timer wheels algorithm assumes that events occur every fixed period of time. For the kernel it was obvious that timeouts were checked every kernel tick, usually 10ms. On the other hand waking up asynchronous library every 10ms would be bad.
I believe it’s possible to avoid waking up every tick. To achieve this, when we seek for a smallest timeout we should look through all the buckets and try to find the first one that’s not empty. The cost of that is constant and depends on the number of buckets.
I suggest to use buckets of 1ms. If you need granularity better than this, no asynchronous library can help you anyway.
Other issue is how to handle events very far in the future, that don’t fit into time wheel buckets. The reasonable compromise is to use a binary tree (or heap). In such case, the cost of manipulating that timers would be logarithmic.
|timer wheels||limited time scale||unlimited time scale|
|timeouts fit in timer wheels (optimistic)||timeouts far in the future
|event loop iteration||O(K)||O(K)||O(K)|
(K is the number of buckets in all time wheels. This is a constant factor.)
This diagram shows the structure I’m suggesting.
There are few disadvantages of this approach:
The method I’m suggesting is quite complicated but in my opinion having a constant O(1) cost of managing events is worth it. Unfortunately, for events far in the future the cost would always need to be logarithmic, in the end we must sort the timers somewhere.
You can ask who needs such improvements, not everybody has thousands connections. I agree that that most users doesn’t care. But there are some companies that could save money on this improvement, like Facebook which apparently uses libevent a lot…
I present a new implementation of the stream cipher Trivium designed for cryptanalysts, in particular those interested in applying the “cube attack” to Trivium. It generates 128 simultaneous output streams using SSE2 intrinsics, and achieves under 1 cycle/byte, over four times faster than standard implementations. The entire program is in Python; SSE2 machine instructions are generated and called using the tool CorePy, an approach I am happy to recommend to others with similar needs. The code is under the MIT licence and may be found in Dinur and Shamir’s “cube attack” caused something of a media storm when it was first presented at a rump session at Crypto 2008. The attack may be widely applicable: it takes an extremely general view of the cipher under attack, and can be applied even where the cipher is treated as a “black box” into which key and IV bits are put in and output bits result, with no consideration of its internal structure. Early reports that this attack would break all of our most trusted primitives are false, however – the attack depends on the degree of the polynomial representing the working of the cipher, and for a cipher like AES the degree is far too high for the attack to work.
The one real cipher against which the attack has been applied is Trivium, a personal favourite of mine with real beauty in the design; astonishingly simple, neatly implementable in both hardware and software, and carefully thought out resistance to cryptanalysis. Trivium has 1152 initialization rounds; the cube attack breaks a weakened variant with 735 initialization rounds in less than 230 operations, and the authors say “We believe that by using more advanced techniques, we can break much stronger Trivium variants, and maybe even the original cipher”.
As referenced above, applying the cube attack to a cipher is to a certain extent an empirical business; one searches for a set of “maxterms” that break the cipher, and this involves running the cipher many, many times. It occurred to me that for this use case, we could improve on the standard implementation of the cipher by going back to the roots of bitslicing.
The block cipher DES is designed entirely around efficient hardware implementation, and is notoriously difficult to implement in software. In particular, several parts of the cipher require complex permutations in bits – in hardware, this is just some wires crossing, but in software a fiddly sequence of shifts, rotates and masks. In 1997, Eli Biham presented A Fast New DES Implementation in Software. This implementation achieved its speed – about five times faster than the fastest previously known – by doing 64 simultaneous block encryptions in parallel, treating a normal 64-bit Alpha computer as a SIMD computer, and using each bit position of a word to describe part of the state of one of the 64 simultaneous encryptions. Starting from a description of a logic gate circuit to implement DES, this implementation would use standard logical operations to mirror the work of the circuit. No shifts or rotates are needed – each column of bits is an independent part of the work, and each logical operation carries 64 times its weight in useful work done. This technique was dubbed “bitslicing” by Matthew Kwan, and the name stuck.
Not long after this, the block cipher Serpent was published by Anderson, Biham, and Knudsen. Serpent takes advantage of bitslicing, not for multiple parallel instances, but to make a single instance of the cipher. The state of the cipher is four 32-bit words, and the nonlinearity in the cipher is a four-bit S-box on bit columsn of these words implemented using bitslicing. These alternate with a linear mixing stage based on shifts, rotates and XORs. This proved a popular technique, and Trivium itself is designed to be implemented using bitslicing in a similar way, advancing the state by up to 64 bits at a time using logical operations on words.
However, this implementation is bitsliced in the old fashioned way – it is 128 independent instances of the Trivium cipher, represented as vectors of 128 bits. This has two advantages over a standard Trivium implementation: twice as many bits can be operated on in a single instruction using SSE2 intrinsics, and no shifts or masks are needed. As a result, it operates at over four times the speed, achieving under 1 cycle/byte – 2.4 GBytes/sec – on my 2.4 GHz Intel Core 2 Duo PC.
One unusual feature of this implementation is worth mentioning – it’s in Python. Rather than writing the code in assember in the traditional way, I took advantage of CorePy to program directly against the x86 ISA in Python. In CorePy, machine instructions, registers and suchlike are first-class objects which can be composed to create executable code. The result is something like writing code to generate assembler input as text, but far more satisfying to work with, resulting in cleaner abstractions and code. We do not shell out to an assembler, but directly call the CorePy code object we generate. I probably would not have written this at all if I had not been inspired to try CorePy when our colleage Majek drew attention to it, and if I had I doubt it would have been finished within a few hours of starting as it was.
The codebase includes our bitslice implementation and a very simple pure-Python Trivium implementation for testing, as well as code to test one against the other, to benchmark the fast implementation, and to use our fast implementation to test the “cube attack”. Please do let me know if you’re playing with this or putting it to use, I look forward to hearing from you.
You are currently browsing the LShift Ltd. blog archives for December, 2008.