technology from back to front

Archive for the ‘Java’ Category

Optimizing Salsa20 in BouncyCastle

A while ago I started dabbling in cryptography a bit and, inevitably, ended up toying with the performance of the related algorithms and code. In this post I’d like to share my approach to optimizing BouncyCastle’s implementation of Salsa20.

A few observations regarding Salsa20 (and presumably other modern ciphers) in the context of performance:

  • Salsa20 uses 32-bit integers. 32-bit CPUs are dirt cheap nowadays, so there is no reason to design an algorithm around smaller words.
  • It produces a pseudo-random stream block by block, 16 ints per block. That is a good opportunity to leverage multiple cores and multiple execution units within a core.
  • It only uses basic, fast operations on ints: add, xor and rotation by a constant (which is built from shifts).
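To make the last point concrete, here is a minimal sketch of the Salsa20 quarter-round as defined in the Salsa20 specification. The class and method names are made up for this illustration and this is not BouncyCastle’s code; it only shows that the building block is nothing more than 32-bit additions, xors and constant rotations.

```java
final class Salsa20QuarterRoundSketch {

    /** Applies one Salsa20 quarter-round to four 32-bit words and returns the four results. */
    static int[] quarterRound(int y0, int y1, int y2, int y3) {
        // Each step: add two words (wrapping mod 2^32), rotate left by a constant, xor into a third.
        int z1 = y1 ^ Integer.rotateLeft(y0 + y3, 7);
        int z2 = y2 ^ Integer.rotateLeft(z1 + y0, 9);
        int z3 = y3 ^ Integer.rotateLeft(z2 + z1, 13);
        int z0 = y0 ^ Integer.rotateLeft(z3 + z2, 18);
        return new int[] { z0, z1, z2, z3 };
    }
}
```

A full block applies this quarter-round to the columns and rows of a 4x4 matrix of ints over 20 rounds, which is why the per-operation cost matters so much.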

As far as I could measure, BouncyCastle’s Salsa20 implementation works out at around 20-22 CPU cycles / byte with the latest JVMs. The fastest C implementations, according to DJB’s website, manage around 4-5 CPU cycles / byte. That is a noticeable gap, so let’s see why this is and what can be done about it. The original implementation can be found in the BouncyCastle sources. There are two hotspots in it.

First, the salsaCore method, the actual implementation of the Salsa20 cipher. It produces a chunk of pseudo-random ints and stores it in an internal buffer of bytes. One of the recent commits is an optimisation of this part of the code. Buffering ints in local variables, as seen in the commit, can potentially reduce the number of memory lookups compared with manipulating array elements, as well as the number of bounds checks the JVM has to perform.

Unfortunately, because the algorithm requires 16 such variables, the JIT is likely to produce extra stores / loads to handle them all. Furthermore, underneath this is still good old serial code that makes no use of SIMD execution units. The JITs in Java 7 and Java 8 can utilise SSE/AVX but are quite finicky about when and how they do it, and the existing code doesn’t compel the JIT to make use of SIMD instructions. As the comment in the commit says, this optimization yields about 15%, and that is about as far as we can go this way.

This part of the code nonetheless has the potential to yield more with SIMD, but it has to be approached from a different angle. The topic of SIMD use in the JVM doesn’t seem to be well covered, so I had to resort to experimentation and analysis of the JIT’s disassembly. Explaining it all in proper detail would take far too much space for a single post, so instead I am sharing the full optimized source code and hope it speaks for itself. One last, somewhat generic, note: restructuring the execution flow so that it uses SIMD entails extra data rearrangements, which in turn take up extra CPU and reduce the gains. This is often the case when we try to optimise one part of the picture without changing too much, or when we simply cannot carry out deeper structural changes.

The second hotspot is in the processBytes method, which implements an API entry point and xors an input stream of bytes with the sequence of bytes produced by salsaCore. The problem is that the Salsa20 algorithm operates on 32-bit ints whereas the API takes and outputs streams of bytes. As I mentioned before, the existing code converts the ints produced by the algorithm into a buffer of bytes, which in turn is used to xor an input buffer of bytes. The xoring itself is done byte by byte, and the JIT does in fact produce code that processes it all in 8-bit chunks (including the costly loads / stores). A better approach is to keep the ints produced by the algorithm as ints and consume them as the input bytes arrive, xor’ing input bytes with the respective quarters of the precalculated ints.
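The difference between the two approaches can be sketched roughly as follows. This is an illustration only, with hypothetical class and method names, and not necessarily how the optimized engine does it. It assumes the keystream words are serialised in little-endian order (as Salsa20 specifies) and that the input is no longer than the keystream.

```java
final class KeystreamXorSketch {

    /** Baseline: serialise the ints to bytes (little-endian) first, then xor byte by byte. */
    static void xorViaByteBuffer(int[] keystream, byte[] in, byte[] out) {
        byte[] buf = new byte[keystream.length * 4];
        for (int i = 0; i < keystream.length; i++) {
            int w = keystream[i];
            buf[4 * i]     = (byte) w;
            buf[4 * i + 1] = (byte) (w >>> 8);
            buf[4 * i + 2] = (byte) (w >>> 16);
            buf[4 * i + 3] = (byte) (w >>> 24);
        }
        for (int i = 0; i < in.length; i++) {
            out[i] = (byte) (in[i] ^ buf[i]);
        }
    }

    /** Alternative: keep the keystream as ints and xor each input byte with its quarter of a word. */
    static void xorFromInts(int[] keystream, byte[] in, byte[] out) {
        for (int i = 0; i < in.length; i++) {
            int word = keystream[i >>> 2];   // four consecutive input bytes share one int
            int shift = (i & 3) << 3;        // 0, 8, 16 or 24 bits, lowest byte first
            out[i] = (byte) (in[i] ^ (word >>> shift));
        }
    }
}
```

Both methods produce the same output; the second simply avoids the intermediate byte buffer and the extra stores and loads that come with it.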

To test it all, the optimized engine needs to be dropped alongside the original one (into the org.bouncycastle.crypto.engines package path), and the benchmark needs to be compiled and run against this modified BouncyCastle.

On my laptop with a SandyBridge CPU and Java 1.8.0_11, an example output for larger blocks shows the optimized version running at roughly 2 to 2.5 times the throughput of the original:

        Salsa20   30.6kB :: min= 104mb/s  avg= 109mb/s  max= 111mb/s
  FasterSalsa20   30.6kB :: min= 221mb/s  avg= 227mb/s  max= 241mb/s
        Salsa20   86.5kB :: min=  99mb/s  avg=  99mb/s  max=  99mb/s
  FasterSalsa20   86.5kB :: min= 239mb/s  avg= 239mb/s  max= 239mb/s
        Salsa20   15.6kB :: min=  92mb/s  avg=  92mb/s  max=  92mb/s
  FasterSalsa20   15.6kB :: min= 231mb/s  avg= 231mb/s  max= 231mb/s
        Salsa20   72.4kB :: min=  93mb/s  avg= 100mb/s  max= 111mb/s
  FasterSalsa20   72.4kB :: min= 200mb/s  avg= 207mb/s  max= 221mb/s
        Salsa20    3.8kB :: min=  96mb/s  avg=  97mb/s  max=  98mb/s
  FasterSalsa20    3.8kB :: min= 140mb/s  avg= 193mb/s  max= 207mb/s

Enums: not always the right tool

Enums are a way of encoding a set of ordinal values in a type system. That is, they formalise the notion that a value may be one of a small set of specific values. We’ve had them since at least the 1970s. They’re really useful. So why might they not always be the right tool?

Read more…

Frank Shearar

The unreasoned Javan

I really hate null!

Reflect on that statement. Apparently Tim has a strong dislike for a concept found in lots of programming
languages (even brainiac languages like Haskell) and successfully used in millions of programs. He must be
crazy. I wouldn’t like to have a discussion with him about something contentious like tabs versus spaces.

Read more…


Benchmarking simple JSON generation in Java

What is the fastest way to produce JSON output in Java? Well, if you have a complicated object
tree to turn into JSON I would guess it is probably Jackson.
However, not all JSON output is complicated so maybe we can find quicker and simpler alternatives.

My test class is simple, I call him Thing, he has two fields name and content, he isn’t a Java bean
because he doesn’t need to be (maybe your classes don’t need to be Java beans either!), here he is:

public class Thing {
    public Thing(String name, String content) {
        this.name = name;
        this.content = content;
    }

    public final String name, content;
}

We will use JUnitBenchmarks to test
my theory that you can be simpler and faster than Jackson. JUnitBenchmarks allows unit tests to be run
multiple times and measurements taken, it also allows the code to be warmed up so any JIT compilation
should have been carried out before measurements are taken. I’ve set my tests to run a warmup period of
50 iterations followed by 1000 measurement iterations. The Jackson code being tested looks like this:

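public class JacksonStreamingSerialiser implements Serialiser {

    public String toJson(Thing thing) {
        StringWriter out = new StringWriter();
        try {
            JsonGenerator generator = jsonFactory.createJsonGenerator(out);
            generator.writeStartObject();
            generator.writeStringField("name", thing.name);
            generator.writeStringField("content", thing.content);
            generator.writeEndObject();
            generator.close(); // flushes buffered output into the StringWriter
        }
        catch (IOException e) {
            throw new RuntimeException(e); // assumed handling; a StringWriter should not fail
        }
        return out.toString();
    }

    private final JsonFactory jsonFactory = new JsonFactory();
}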

When tested we get a mean measurement for writing 250 objects using Jackson of 0.95 seconds.
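For readers unfamiliar with the warmup-then-measure scheme, here is a rough plain-Java sketch of the idea. It is not JUnitBenchmarks and does not use its API; the class and method names are hypothetical, and a real benchmarking library takes care of many more details.

```java
import java.util.function.Supplier;

final class WarmupTimingSketch {

    // Results are parked here so the JIT cannot simply discard the work being timed.
    private static volatile Object sink;

    /** Runs the task warmupRounds times unmeasured, then returns one timing in nanoseconds per measured round. */
    static long[] measure(Supplier<?> task, int warmupRounds, int measuredRounds) {
        for (int i = 0; i < warmupRounds; i++) {
            sink = task.get();               // gives the JIT a chance to compile the hot path
        }
        long[] nanos = new long[measuredRounds];
        for (int i = 0; i < measuredRounds; i++) {
            long start = System.nanoTime();
            sink = task.get();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

With the settings used in this post, a call would look something like measure(() -> serialiser.toJson(thing), 50, 1000), where serialiser and thing are whatever is under test.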

json-lib is usually a worse performer than Jackson.
Here is the equivalent code using json-lib:

public class JsonLibSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        JSONObject object = new JSONObject();
        object.put("name", thing.name);
        object.put("content", thing.content);
        return object.toString();
    }
}

When tested we get a mean measurement for writing 250 objects using json-lib of 10.74 seconds. Not so good!

Who needs library code? Maybe that newfangled String.format will be quicker and simpler. Here is the code:

public class StringFormatSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        return String.format("{\"name\":\"%s\",\"content\":\"%s\"}", thing.name, thing.content);
    }
}

When tested we get a mean measurement for writing 250 objects using String.format of 5.26 seconds. Better than
json-lib but Jackson isn’t looking worried!

I guess that format string must be expensive to parse, so let’s try a StringBuilder. Here is the code:

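public class StringBuilderSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        StringBuilder builder = new StringBuilder();
        // Assumed body: emits the same JSON as the String.format version
        builder.append("{\"name\":\"").append(thing.name)
               .append("\",\"content\":\"").append(thing.content).append("\"}");
        return builder.toString();
    }
}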

When tested we get a mean measurement for writing 250 objects using StringBuilder of 0.91 seconds. Finally a
winner, faster than Jackson but you had better quote those strings properly!
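On the point about quoting: a hand-rolled serialiser has to escape string values itself. A minimal sketch of such a helper, with a hypothetical name and covering only the escapes that JSON requires inside a string, might look like this:

```java
final class JsonEscapeSketch {

    /** Escapes a Java string so it can sit between double quotes in a JSON document. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // remaining control characters become a four-digit hex escape
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }
}
```

Calling something like this on every value would add work to the StringBuilder approach, so the timings above would not carry over unchanged.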

People have always told me that StringBuilder is unsynchronised so should be faster than an old fashioned
StringBuffer, so let’s check. Here is the code:

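public class StringBufferSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        StringBuffer buffer = new StringBuffer();
        // Assumed body: emits the same JSON as the String.format version
        buffer.append("{\"name\":\"").append(thing.name)
              .append("\",\"content\":\"").append(thing.content).append("\"}");
        return buffer.toString();
    }
}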

When tested we get a mean measurement for writing 250 objects using StringBuffer of 0.60 seconds. Hang on!
That is the fastest yet! What is going on?

What is going on is that I have been playing fast and loose with statistics by only presenting the mean
times in seconds for each benchmark. By extracting the raw data and applying the power of statistics (well
I looked at the distributions and standard deviations) it turns out you cannot tell the difference between
the StringBuilder and the StringBuffer, so all is well, and String[Builder|Buffer] are both winners! Jackson
is also a winner since for more complex object trees it will allow you to write more maintainable code than
using a StringBuilder combined with loops and conditional logic and is almost as fast as a StringBuilder
(or StringBuffer).

So what have we learnt? Firstly, use Jackson if you have to serialise your objects into JSON. Secondly,
JUnitBenchmarks is a very handy library. Thirdly, if you don’t present a standard deviation with your benchmark
results then your results may not mean what you think they mean.
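As a small illustration of that last point, the mean and the sample standard deviation of a set of timings can be computed with a few lines of Java. This is a sketch with hypothetical names, not part of JUnitBenchmarks:

```java
final class BenchmarkStatsSketch {

    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two samples. */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }
}
```

If the means of two benchmarks differ by less than their standard deviations, as with StringBuilder and StringBuffer here, the difference is probably noise.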







2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us