technology from back to front

Archive for December, 2011

Translating a persistent union-find from ML to Smalltalk

When I wrote my unification library a while back, I tried to add an “or matcher”. That is, something that would allow

| matcher mgu |
matcher := OrUnifier
    left: (TreeNode left: #x asVariable)
    right: (TreeNode right: #x asVariable).

mgu := matcher =? (TreeNode left: (Leaf value: 1)).
mgu at: (#x asVariable) "=> (Leaf value: 1)".

mgu := matcher =? (TreeNode right: (Leaf value: 1)).
mgu at: (#x asVariable) "=> (Leaf value: 1)".

Easy enough… until one tries to use an OrUnifier as an operand on the right-hand side. See, as the unification progresses, if the first option fails, you’d like to backtrack part of the equivalence-relation calculation, and with the imperative union-find in Nutcracker that’s not possible. What to do, what to do?

Read more…

by Frank Shearar on 31/12/11

Finding new albums by old bands

I’m once again visiting music-related problems, with a look at a different aspect of the “discovering new music” problem. Now, the LShift jukebox is very good at introducing me to new and weird artists, and it also occasionally tells me about other work by artists I’m already aware of, but this is all rather haphazard.

The core problem I’m trying to address is as follows: you find a new artist, you really like them, and you buy all their CDs. The problem is that there’s nothing to notify me when they do a new release, and given that new releases are a relatively rare event, I forget to check… essentially I want Songkick, but for CDs (or MP3 albums; the principle still applies). Read more…

by Tom Parker on 29/12/11

Benchmarking simple JSON generation in Java

What is the fastest way to produce JSON output in Java? Well, if you have a complicated object
tree to turn into JSON, my guess is that it’s probably Jackson.
However, not all JSON output is complicated, so maybe we can find quicker and simpler alternatives.

My test class is simple. I call him Thing; he has two fields, name and content, and he isn’t a Java bean
because he doesn’t need to be (maybe your classes don’t need to be Java beans either!). Here he is:

public class Thing {
    public final String name, content;

    public Thing(String name, String content) {
        this.name = name;
        this.content = content;
    }
}
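
Each of the serialisers below implements a common Serialiser interface. The post doesn’t show it, but from the implements clauses it presumably amounts to a single method; a minimal sketch:

// Assumed shape of the interface each serialiser implements; the original
// post doesn't show it, so this is an inference from the implements clauses.
public interface Serialiser {
    String toJson(Thing thing);
}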

We will use JUnitBenchmarks to test
my theory that you can be simpler and faster than Jackson. JUnitBenchmarks lets unit tests be run
multiple times while measurements are taken, and it also allows the code to be warmed up first, so any JIT compilation
should already have been carried out before measurements are taken. I’ve set my tests to run a warmup period of
50 iterations followed by 1000 measurement iterations. The Jackson code being tested looks like this:

// Jackson 1.x imports (org.codehaus.jackson), which is what was current at the
// time; Jackson 2.x moved these classes to com.fasterxml.jackson.core.
import java.io.IOException;
import java.io.StringWriter;

import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonGenerator;

public class JacksonStreamingSerialiser implements Serialiser {

    private final JsonFactory jsonFactory = new JsonFactory();

    public String toJson(Thing thing) {
        StringWriter out = new StringWriter();
        try {
            JsonGenerator generator = jsonFactory.createJsonGenerator(out);
            generator.writeStartObject();
            generator.writeStringField("name", thing.name);
            generator.writeStringField("content", thing.content);
            generator.writeEndObject();
            generator.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

        return out.toString();
    }
}

When tested, we get a mean measurement of 0.95 seconds for writing 250 objects using Jackson.
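
For reference, here is a minimal sketch of how such a JUnitBenchmarks test might be wired up. The class name, the field and the 250-object loop are my own illustration (the post doesn’t show its test class); the warmupRounds and benchmarkRounds values match the 50/1000 iterations described above.

import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
import com.carrotsearch.junitbenchmarks.BenchmarkOptions;
import org.junit.Test;

// Hypothetical harness: swap the field for any of the serialisers in this
// post to benchmark that implementation instead.
public class SerialiserBenchmark extends AbstractBenchmark {

    private final Serialiser serialiser = new JacksonStreamingSerialiser();

    @Test
    @BenchmarkOptions(warmupRounds = 50, benchmarkRounds = 1000)
    public void serialise250Things() {
        for (int i = 0; i < 250; i++) {
            serialiser.toJson(new Thing("name" + i, "content" + i));
        }
    }
}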

json-lib is usually a worse performer than Jackson.
Here is the equivalent code using json-lib:

// json-lib's JSONObject lives in net.sf.json.
import net.sf.json.JSONObject;

public class JsonLibSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        JSONObject object = new JSONObject();
        object.put("name", thing.name);
        object.put("content", thing.content);

        return object.toString();
    }
}

When tested, we get a mean measurement of 10.74 seconds for writing 250 objects using json-lib. Not so good!

Who needs library code? Maybe that new-fangled String.format will be quicker and simpler. Here is the code:

public class StringFormatSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        return String.format("{\"name\":\"%s\",\"content\":\"%s\"}", thing.name, thing.content);
    }
}

When tested, we get a mean measurement of 5.26 seconds for writing 250 objects using String.format. Better than
json-lib, but Jackson isn’t looking worried!

I guess that format string must be expensive to parse, so let’s try a StringBuilder. Here is the code:

public class StringBuilderSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        StringBuilder builder = new StringBuilder();
        builder.append("{\"name\":\"").append(thing.name)
               .append("\",\"content\":\"").append(thing.content)
               .append("\"}");

        return builder.toString();
    }
}

When tested, we get a mean measurement of 0.91 seconds for writing 250 objects using StringBuilder. Finally a
winner, faster than Jackson, but you had better escape those strings properly!
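
To be clear about that caveat: the hand-rolled serialisers emit broken JSON if a name or content value contains a double quote or a backslash, which Jackson handles for you. A minimal escaping helper (my own sketch, not part of the benchmarked code) might look like this:

// Hypothetical helper: escapes backslashes and double quotes inside a JSON
// string literal. A real implementation would also need to escape control
// characters.
static String escapeJson(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
}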

People have always told me that StringBuilder is unsynchronised and so should be faster than an old-fashioned
StringBuffer, so let’s check. Here is the code:

public class StringBufferSerialiser implements Serialiser {
    public String toJson(Thing thing) {
        StringBuffer buffer = new StringBuffer();
        buffer.append("{\"name\":\"").append(thing.name)
              .append("\",\"content\":\"").append(thing.content)
              .append("\"}");

        return buffer.toString();
    }
}

When tested, we get a mean measurement of 0.60 seconds for writing 250 objects using StringBuffer. Hang on!
That is the fastest yet! What is going on?

What is going on is that I have been playing fast and loose with statistics by only presenting the mean
times in seconds for each benchmark. By extracting the raw data and applying the power of statistics (well,
I looked at the distributions and standard deviations), it turns out that you cannot tell the difference between
the StringBuilder and the StringBuffer, so all is well, and String[Builder|Buffer] are both winners! Jackson
is also a winner, since for more complex object trees it will allow you to write more maintainable code than
a StringBuilder combined with loops and conditional logic, and it is almost as fast as a StringBuilder
(or StringBuffer).
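
For anyone wanting to repeat that check on their own numbers, the calculation is just the sample mean and standard deviation of the per-round timings. A sketch with hypothetical helper names (the data here is whatever you export from your benchmark run, not the post’s raw figures):

// Illustrative only: before declaring one approach faster than another,
// compare the difference in means against the spread of the measurements.
static double mean(double[] xs) {
    double sum = 0;
    for (double x : xs) sum += x;
    return sum / xs.length;
}

// Sample standard deviation; assumes at least two measurements.
static double stdDev(double[] xs) {
    double m = mean(xs), sumSq = 0;
    for (double x : xs) sumSq += (x - m) * (x - m);
    return Math.sqrt(sumSq / (xs.length - 1));
}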

So what have we learnt? Firstly, use Jackson if you have to serialise your objects into JSON. Secondly,
JUnitBenchmarks is a very handy library. Thirdly, if you don’t present a standard deviation with your benchmark
results then your results may not mean what you think they mean.

by tim on 28/12/11

Implementing maps and folds using Zippers

Zippers continue to fascinate me. Let’s recap a bit: a zipper is a data structure representing the navigation and mutation of an otherwise immutable structure. We’ve looked at several different implementations here on the LShift blog, in several different languages.

Today, we’re going to see what else we can do with zippers.

Read more…

by Frank Shearar on 23/12/11

LShift awarded Technology Strategy Board funding for Progressive Publishing System

LShift is proud to announce that a significant grant from the
Technology Strategy Board has been awarded for the design and development of the Progressive Publishing System (PPS).

PPS will allow authors, designers, and publishers to automatically convert existing electronic documents into standard e-publishing formats. Existing conversion processes rely on a high degree of manual intervention and a mixture of costly, often proprietary, software components, while delivering unreliable results.

LShift, along with its project partners Mute and Minibar Ventures, applied for the funding at the beginning of 2011. We had already developed a series of in-house technologies to accelerate and automate the creation of our own documents and so joining forces with Mute (a leading arts and technology publisher) and Minibar Ventures (a highly regarded technology start-up advisor) meant that we would be guided in applying our own efforts to other publishing scenarios and have access to existing market intelligence.

With the vast range of electronic document formats, unique production workflows and various methods for handling a document’s “metadata” (information describing each document), LShift saw the development of PPS
as a complex problem well-suited to its expertise. PPS will allow developers to extend the range of supported conversion formats as they change and emerge over time. Not only will complex document structures be supported (such as tables of contents, indexes, footnotes, endnotes and image data) but so will any descriptive data associated with the document. PPS will also support and preserve the data required by direct distribution and sales platforms (Amazon, Nielsen etc.) for content creators and publishers.

Along with its extensive metadata tools, PPS will allow users to preview converted content and ‘teach’ the system custom conversion rules as idiosyncrasies in authorship or materials are discovered over time. The platform will ultimately allow collaboration and document management features to facilitate contributions from individuals and groups at any point in the writing, design and editorial cycle of a document.

The system will go into private beta in February 2012 with several publishers who are active in diverse content creation and distribution channels. They range from traditional book and magazine publishers through to organisations with a high-profile online presence who wish to automate the conversion of legacy materials.

by Administrator on 02/12/11
