Archive for October, 2007

Parsing dates using a set format in Java SE

Tony mentioned a while back the parlous state of Java’s standard time libraries, and I wish I’d remembered before now, because he also pointed to Joda-time, which on the strength of a brief scan seems to me a compelling replacement. Hindsight, &c.

But to the matter at hand: getting a sensible result out of Java’s provisions for date parsing. Specifically, parsing a particular format, rather than a locale-dependent format. The stock way to do this is to use java.text.SimpleDateFormat.parse() with a suitable format string, say "dd MMM yyyy", which gives you a java.util.Date. The trouble starts when you want it to do what you mean; in my case, for 3 Oct 2007 to give you midnight on the 3rd of October 2007 UTC (there’s nothing in Java that represents an interval, so we have to settle for representative instants, and we can’t rely on things like databases to cope with timezones). Why doesn’t this work?

Because of the programmer’s friend the timezone. SimpleDateFormat assumes, if it’s not told, that what it’s parsing is in the system timezone, so in London the string from before is interpreted as midnight on the 3rd of October, British Summer Time; i.e., eleven at night on the 2nd, UTC. That’s not a bad assumption, although I didn’t see it documented anywhere, and fits well with the rest of the formatting APIs and locales and so on. The problem is that there’s no way to stop it behaving like that.

There are some teases, though: for example, java.text.DateFormat.setTimeZone() sounds like it will do what I want; but sadly, it only counts for formatting and not for parsing.

So, to the payoff: The expedient way to “neutralise” a date is to simply apply the system timezone offset and daylight savings offset to the parse result, then assert that it’s UTC.

import java.text.*;
import java.util.*;

public class UTCParse {
    public static void main(String[] args) {
    SimpleDateFormat sdf = new SimpleDateFormat("dd MMM yyyy");
    SimpleTimeZone utc = new SimpleTimeZone(0, "UTC");
    Calendar c = new GregorianCalendar(); // this picks up the system timezone
    try {
        Date d = sdf.parse(args[0]); // this assumes the system timezone
        c.setTimeInMillis(d.getTime());
        if (args.length > 1 && "utc".equals(args[1])) {
                // normalise the value
        c.setTimeInMillis(c.getTimeInMillis() +
                                        c.get(Calendar.ZONE_OFFSET) + 
                                        c.get(Calendar.DST_OFFSET));
        }

        SimpleDateFormat outformat = new SimpleDateFormat("dd MMM yyyy HH:mm Z");
        outformat.setTimeZone(utc);
        System.out.println(outformat.format(new Date(c.getTimeInMillis())));
    }
    catch (ParseException e) {
        System.out.println("Try a date formatted dd MMM yyyy");
    }
    }
}

1 comment October 31st, 2007 mikeb

What’s in a Distinguished Name?

In one of our projects, we access a directory server via LDAP to obtain subject distinguished names (DNs) of X509 certificates. The team that provision the directory recently asked a simple question: what order should the components of the subject DN be in? Below follows my explanation of the answer. Please note that the referenced specifications are not necessarily the latest versions, but are the ones supported by my version of OpenSSL and Sun Directory Server.

The subject DN is defined in RFC 2459 by the X.501 type Name as an ASN.1 structure. It consists of a sequence of Relative Distinguished Names (RDNs), which are themselves sets of attribute type and value pairs. For example, given the following DN:

C=UK, ST=England, L=London, O=LShift Ltd., OU=Development, CN=Lee Coomber/UID=123456

‘C=UK’ is an RDN, ‘C’ is an attribute type, and ‘UK’ is an attribute value. Normally, an RDN only consists of one attribute type / value pair, but more are allowed such as in the example of the final RDN above which consists of the pairs ‘CN=Lee Coomber’ and ‘UID=123456′.

We are ultimately interested in how to compare two DNs and this is defined in RFC 2252 as being a match type of “distinguishedNameMatch”. This is a standard match type defined in X501 which states that a presented DN matches a target DN if, and only if:

  • the number of RDNs is the same
  • each corresponding RDN has the same number of attribute value pairs
  • each of the attribute pairs that are of the same type have matching values as determined by the match rule for that type

Note that the order of the attribute pairs within an RDN does not matter.

The actual format of the certificates is defined in ASN.1 which as a binary format is not readable by anyone you would want to sit and have a drink with. A (almost) human readable form can be obtained using the OpenSSL command “asn1parse”. Using the example “self-signed” certificate given above and the following command:

$ openssl asn1parse -in server.crt

we can see the subject DN as part of the certificate in the diagram below:

DN Breakdown

There are two things to note. The first, is that the order shown reading top-down is the order of the RDNs as they are actually stored in the certificate. The second, is the unambiguous distinction between RDNs and attribute pairs. Given the match definition above we could determine whether a given certificate subject DN matches another by looking at the output of this command for the two certificates (ignoring the details of the attribute value comparisons for the purposes of this discussion).

We use strings to enter the DNs into the LDAP server and the method for representing DNs in this way is defined in RFC 2253. The main considerations are:

  • RDN order is reversed
  • RDNs are separated by commas
  • attribute pair order does not matter
  • attribute pairs in multi-valued RDNs are separated by the plus sign
  • attribute type and value are separared by the equals sign
  • reserved characters are escaped using a backslash

So given our example earlier, the string representation would be:

UID=123456+CN=Lee Coomber,OU=Development,O=LShift Ltd.,L=London,ST=England,C=UK

OpenSSL provides a mechanism for doing this though be warned that by default it uses its own more “readable” mechanism for displaying DNs that are not compliant with the standards described here. To display DNs in accordance with RFC 2253 use the ‘nameopt’ command line switch with the parameter “RFC2253″. For example, to see the formatted subject DN of our test certificate:

$ openssl x509 -subject -nameopt RFC2253 -noout -in server.crt subject= UID=123456+CN=Lee Coomber,OU=Development,O=LShift Ltd.,L=London,ST=England,C=UK

1 comment October 30th, 2007 lee

XML CDATA and escaping

XML’s syntax for CDATA looks like this:

  <![CDATA[some text]]>

Tag syntax within a CDATA section is suspended, so this is well-formed XML:

  <![CDATA[some <more> text]]>

even though it looks like the “<more>” tag is unclosed.

There’s only one thing you can’t say in a CDATA section: “]]>”. But there’s a trick to save us, even here. To print an arbitrary string in a CDATA enclosure, replace each instance of “]]>” with “]]]]><![CDATA[>”, and then put the normal “<![CDATA[”/”]]>” brackets around it:

  my ]]> text

becomes

  <![CDATA[my ]]]]><![CDATA[> text]]>

1 comment October 25th, 2007 tonyg

Nodelists are not like arrays

In JavaScript, it is possible to index into a DOM nodelist by using the square bracket notation: var firstchild = node.childNodes[0];. But, despite Apple saying “[a] nodeList is equivalent to an array”, and the Mozilla guys saying “[a] nodeList is an array of elements”, a nodelist is not like an array.

It is tempting to use JavaScript’s for .. in syntax to iterate through a nodelist; e.g.,

var nodes = parent.childNodes;
for (var i in nodes) {
    doSomethingWith(nodes[i]);
}

This won’t work: In Firefox, you get numbers, then ‘item’ and ‘length’, and with WebKit you get just ‘item’ and ‘length’. If a nodelist was like an array, this is really not what you’d expect!

But, it shouldn’t work, and in fact, Firefox gets this wrong. Looking at the specification for JavaScript DOM bindings, we can see that the NodeList interface has two properties, ‘length’ and ‘item’, and that the dereference using square brackets is just a misleading convenience.

3 comments October 17th, 2007 mikeb

Your very own 32-way SIMD machine

What’s a good way of counting the number of bits set in a word? The obvious answer, adding the low bit to an accumulator, shifting right, and repeating, is O(n) in the number of bits in the word. This is a sequential approach - and we can do better, complexity-wise, by using a parallel algorithm. Let’s assume we are using 32-bit words, and that Xn is just such a 32-bit word:

X0 = input word
X1 = (X0 & 0x55555555) + ((X0 >>  1) & 0x55555555)
X2 = (X1 & 0x33333333) + ((X1 >>  2) & 0x33333333)
X3 = (X2 & 0x0F0F0F0F) + ((X2 >>  4) & 0x0F0F0F0F)
X4 = (X3 & 0x00FF00FF) + ((X3 >>  8) & 0x00FF00FF)
X5 = (X4 & 0x0000FFFF) + ((X4 >> 16) & 0x0000FFFF)
total number of set bits = X5

This algorithm is O(log2 n) in the number of bits in a word.

Every ordinary N-bit-word based sequential machine is a disguised N-way, 1-bit SIMD machine with a slightly odd instruction set. Lots more on data-parallel algorithms here.

What about finding which is the highest bit set in a word?

X0 = input word
X1 = X0 or (X0 >> 1)
X2 = X1 or (X1 >> 2)
X3 = X2 or (X2 >> 4)
X4 = X3 or (X3 >> 8)
X5 = X4 or (X4 >> 16)

… and feed X5 through the parallel counter-of-set-bits algorithm above. The resulting number is the index of the highest set bit in the original word, starting from zero.

7 comments October 15th, 2007 tonyg

JONESFORTH ported to PowerPC and Mac OS X

A couple of weeks ago, Richard W. M. Jones released JONESFORTH, which I thought was pretty exciting. Today I spent a few hours porting the assembly-language part to PowerPC on Mac OS X 10.3.9. It ended up being 600 non-comment lines of code in total, and took me about eleven hours in total to write and debug. It runs the standard JONESFORTH prelude, up to and including SEE.

You can download the code here: ppcforth.S.m4.

(It’s also available via darcs: darcs get http://www.eighty-twenty.org/~tonyg/Darcs/jonesforth.)

The assembler-macro tricks that the original i386 version uses are sadly unavailable with the default OS X assembler, so I’ve had to resort to using m4 instead; other than that, it’s more-or-less a direct translation of Richard’s original program. To compile it,

m4 ppcforth.S.m4 > ppcforth.S
gcc -nostdlib -o ppcforth ppcforth.S
rm ppcforth.S

To run it, download the JONESFORTH prelude (save it as jonesforth.f), and

$ cat jonesforth.f - | ./ppcforth 
JONESFORTH VERSION 14641 
OK 

Here’s an example session, decompiling the “ELSE” word:

SEE ELSE
: ELSE IMMEDIATE ‘ BRANCH , HERE @ 0 , SWAP DUP HERE @ SWAP - SWAP ! ;

I’d like to thank Richard for such an amazingly well-written program: not only is JONESFORTH itself a beautiful piece of software, it’s also an incredibly lucid essay that does a wonderful job of introducing the reader to the concepts and techniques involved in implementing a FORTH.

2 comments October 4th, 2007 tonyg

Proper Unicode support in Erlang RFC4627 (JSON) module

In a previous post I explored some of the options for supporting RFC4627 (JSON) Unicode-in-strings well when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are to be interpreted as UTF-8 encoded text now, whereas before their interpretation was less well defined.

The new module is Erlang-RFC4627 version 1.1.0, and is available as a tarball, a debian package, or by browsing online here. You can also get the code using mercurial:

hg clone http://hg.opensource.lshift.net/erlang-rfc4627/

Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)

Eshell V5.5.5  (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5> 

Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.

5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({’utf-32le’,
        rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
“{\”æ°´\”:123}”
11> 

Notice, on that final example, that Erlang is printing the final UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string are the correct UTF-8 encoding for Unicode code point 27700.

2 comments October 3rd, 2007 tonyg

Too much mail is bad for you

We received a few reports from users of our Erlang-based RabbitMQ message broker who saw sharp decreases in throughput performance when putting the broker under heavy load. We subsequently reproduced these results in our lab. This is not what we expected to see - while some performance degradation is inevitable when running a system at its limits, we had carefully designed RabbitMQ to make such degradation are small and gradual. So clearly the system was behaving in ways we had not anticipated.

We eventually tracked down the problem. The lesson is: if you make synchronous calls inside an Erlang process you’d better make sure its message queue is short.

Continue Reading 8 comments October 1st, 2007 matthias

Calendar

October 2007
M T W T F S S
« Sep   Nov »
1234567
891011121314
15161718192021
22232425262728
293031  

Posts by Month

Posts by Category