## GF(232-5)

**GF**(2^{32}-5) is a finite field.

Many of the most useful bits of coding theory and in particular network codes work by treating the data you want to transmit as a vector over a finite field. The link will tell you the full story, but in summary a finite field is a set of finitely many elements, with addition and multiplication operations defined on them such that they are very well behaved: *a* + *b* = *b* + *a*, (*a* + *b*) *c* = *ac* + *bc* and so on.

For any prime *p* and any positive integer *n*, there’s exactly one finite field with *p*^{n} elements; this field is called **GF**(*p*^{n}). In particular, since 2 is a prime, there’s conveniently a finite field with 2^{n} elements. And there are straightforward ways to treat, for example, an array of bytes as an array of elements of **GF**(2^{8}); or if it’s more convenient, break up your data into larger chunks and work in, say, **GF**(2^{16}) or **GF**(2^{32}).

Addition in **GF**(2^{n}) is simply XOR, which is as fast as you could want in both hardware and software. Multiplication hardware for **GF**(2^{n}) is similarly much simpler than multiplication hardware for ordinary arithmetic, which has made these fields very attractive for a variety of coding and cryptographic applications in hardware.

Unfortunately, doing these multiplies in software is much less efficient, because processors don’t have suitable instructions and hardware dedicated to these jobs. If you’re using coding theory to transmit large amounts of data, this could give rise to quite an inconvenient CPU load.

My colleague Sebastian Egner and myself have been investigating an alternative that could prove considerably more efficient: instead of working in **GF**(2^{n}), work in **GF**(*p*) for some prime *p*. This is probably the most familiar finite field: you simply do addition and multiplication “mod *p*“. So you can use the addition and multiplication units on a normal processor to do most of the work. Two problems arise:

* Multiply may be fast, but division isn’t. How will we prevent all those “mod *p*” operations slowing us to a crawl?

* We have a vector of bits to encode, and we have to encode it as another vector of bits. Won’t an intermediate representation “mod *p*“, with its awkward prime order, introduce unacceptable inefficiencies in coding?

Our proposal addresses both these problems. One application in particular motivated our thinking about all this. Dot products over finite fields are at the heart of both the distribution and the authentication in Microsoft’s BitTorrent-like research project, Avalanche. Our approach has the potential to make the Avalanche software several times more efficient.

We propose to work in the finite field **GF**(2^{32}-5) – here *p* = 2^{32}-5 = 4294967291 is the largest prime below 2^{32}. So representing a vector in this field for transmission is easy: just transmit it as a stream of 32-bit words. Converting from a vector of bits to a vector in this field is more interesting, but we’ll come to that later – first, we’ll consider how to do arithmetic efficiently in this field.

The basic idea is to put off that expensive “mod p” operation for as long as possible, and then do it as cheaply as we can. The operation we’re most concerned about is the inner product of two vectors – multiply each component of one vector with the corresponding component in the other, and sum all the products. We do this as follows:

* Multiply together the 32-bit vector components to get a vector of 64-bit products

* Treat each product as consisting of a 32-bit “high half” and a 32-bit “low half”

* Sum the “high halves” and “low halves” separately to get two 64-bit sums

* Multiply the “high half” sum by five, and add it to the low half.

* Multiply the “high half” of this 64-bit sum by five, and add it to the low half to get a new 64-bit sum

* Finally, while the sum is *p* or greater, subtract *p*.

The “multiply by five” trick works because 2^{32} ≡ 5 (mod *p*). This assumes that vectors have 2^{32} or fewer components, though if they have more than 2^{32}/25 components the final summing up steps have to be modified slightly.

Sebastian has implemented this approach and compared it to his own highly optimized **GF**(2^{16}) implementation – a matter in which he has no mean expertise – and found it to be over twice as fast. We believe that considerably greater speedups are possible since use of this field makes the work considerably more amenable to implementation using SIMD instruction sets such as SSE2.

So that leaves us with one last problem – how do we convert the bits to be encoded into a vector in this field ready for encoding?

The most efficient encoding possible reads the input as one huge base-2 integer, and converts it to a base-*p* integer. This encoder has an asymptotic coding rate of about 1 – 2^{-34}, but encoding and decoding is far, far too slow – in fact, the time grows quadratically on the length of the input. A more time-efficient encoding is needed, even at some sacrifice of coding rate.

Our first ideas to this end were based on the idea of escaping. If the data is compressed, then words of value *p* or above will be rare. We could just escape them in some way – for example, represent every word *x* where *x* ≥ *p* -1 as a pair (*p* -1, *x* – *p* +1). However, this means that the length of the encoded vector is unpredictable, which could be a pain. Worse, there will be pathological files whose encoded size is twice the size as other files of the same length.

We played with a variety of solutions to this problem, some quite efficient, but since then found a much more efficient encoding based on a somewhat different principle: break the data into blocks of a certain maximum size, and XOR the blocks with a constant.

Decoding in this encoding is lighting fast. Given an encoded vector

(*y*_{0}, *y*_{1}, … ,*y*_{k})

the decoded vector is simply (2*y*_{0} ⊕ *y*_{1}, … 2*y*_{0} ⊕ *y*_{k}). In other words, multiply the first word by two (as a 32-bit integer, not as a member of the finite field) and then XOR it with all the other words.

Note that for many applications there are many more recipients than senders, so while encoding need only be acceptably efficient, we want decoding to be very efficient. It’s hard to imagine that a more efficient decode is possible here.

Why does this work? Does every possible block have a valid encoding, and how do we find it? Suppose first that we set the maximum block size to, say 2^{19} -1 words. By the pidgeonhole principle, there must be at least one 19-bit prefix that does not appear in the block. Let *m* be this prefix – then we choose *y*_{0} = 2^{12}(*m* ⊕ 0x7FFFF). Now XOR every word in the block with 2*y*_{0} – the result is guaranteed to have at least one of its first 19 bits clear, and is therefore guaranteed to be below *p*.

We find *m* by constructing a 64 kbyte bitmap, and combing through the data marking every prefix that does appear, then searching the bitmap for the first hole; as we note above there is guaranteed to be at least one. This bitmap will just fit into L1 cache on many processors so the search can be quite efficient.

2^{19} -1 is an awkward kind of block size. We can extend it to 2^{19} as follows: if we find that every 19-bit prefix is represented, each of them must appear exactly once. If *d*_{0} is the first word of the data to be encoded, then we choose *y*_{0} = (*d*_{0} ⊕ 0xFFFFFFF8) >> 1. Now the first word will encode as 2^{32} -8 or 2^{32} -7, which fits below *p*, and the subsequent words will have at least one zero in the first nineteen bits of their encoding and therefore also fit below *p*.

This approach adds one word to a block of 2^{19} words, so the rate of the code is 1 – 2^{-19}, which is efficient enough for nearly any purpose. However, we can push the rate further if desired. This approach extends straightforwardly to a block size of 2^{29}, but a 64 Mbyte bitmap is impractical – the table is randomly accessed, so once it extends beyond the size of the L2 cache the software will crawl due to cache misses. Instead we use a recursive approach:

* Pass over the dataset once, counting the frequency of the 10-bit prefixes; this requires only a 4kb table

* Find the least frequent 10-bit prefix

* Pass over the dataset a second time copying the words with this prefix to an auxiliary table

* Recurse over this auxiliary table

This gives us a rate of 1 – 2^{-29}.

We can squeeze one more bit out. In a block of 2^{30}-1 words, at least one 29-bit prefix must appear at most once. In other words, there must exist a value *m* such that either

* *m* is the value of a word whose 29-bit prefix does not appear in the data, or

* *m* is the value of a word in the data whose 29-bit prefix does not appear anywhere else in the data

In either instance, we choose *y*_{0} = (*m* ⊕ 0xFFFFFFF8) >> 1. The words not equal to *m* have a different 29-bit prefix, and so they encode to values below 2^{32} -8 and thus below *p*. *m* itself will encode to either 2^{32} -8 or 2^{32} -7, which are also both below *p*.

In summary: if you need to treat data as vectors over some field on a conventional computer architecture, it can be significantly faster to compute modulo the prime number p = 2^{32}-5 than using a field with size a power of two. You will have to encode arbitrary 32-bit words into values modulo p, but the encoding and decoding algorithms can be extremely fast, and at negligible redundancy: the rate of the code can be as high as 1 – 2^{-30}. This is particularly useful for network coding applications.

## 6 Comments

This is amazing way of conversion of elements of a 32-bit vector to GF (p)! Do you (or your company) have patented this technique or I can use it in my own software?

We haven’t patented this, AFAIK it’s free to use. Very glad someone found this useful! How did you find this post, and what application do you have in mind? Cheers!

I develop Error Correction Codes libraries for use in the distributed storage systems. I has implemented classic algorithms and my own version of Reed-Solomon codes using arithmetic in GF (2^n). My implementation is very efficient for small matrices. However, for very large amounts of data with large matrices (when the data does not fit in the L1-cache) is more efficient to use arithmetic in GF(p), where p is prime. Then will possible to make all calculations done in the registers of the processor using ordinary multiplication instructions (scalar or SSE). This due to the fact that conventional processors do not have the command to multiply elements of GF (2^n) and multiplication in GF(2^n) takes more registers/clock cycles. Your brilliant algorithm of conversion of usual data to the GF(p) simply the best!

I do not remember how I found a link to your article 🙂

Hi, we have been experimenting with using the approach presented here for a Network Coding application. How may we cite you article here – does an publication exist on this technique for encoding 32-bit values to GF(p) elements?

Best regards,

Morten

Digging up a dead post here, but it seems like an improvement can be made on this approach based on complex math:

Pick Fp s.t. i = -1 is not in Fp.

Work over Fp^2, where x in Fp^2 = a + i*b.

If a, b are 16-bit numbers, then multiplication over Fp^2 is:

(ac – bd) + i(bc + ad).

This can be done efficiently in parallel: (ac | bd) in one 64-bit multiplication for example. Or use SIMD.

The biggest advantage of Fp^2 is that taking the inverse (for division) is done over Fp. This can be accelerated in a number of ways since Fp is only 16 bits.