technology from back to front

Archive for October, 2007

Parsing dates using a set format in Java SE

Tony mentioned a while back the parlous state of Java’s standard time libraries, and I wish I’d remembered before now, because he also pointed to Joda-time, which on the strength of a brief scan seems to me a compelling replacement. Hindsight, &c.

But to the matter at hand: getting a sensible result out of Java’s provisions for date parsing. Specifically, parsing a particular format, rather than a locale-dependent format. The stock way to do this is to use java.text.SimpleDateFormat.parse() with a suitable format string, say "dd MMM yyyy", which gives you a java.util.Date. The trouble starts when you want it to do what you mean; in my case, for 3 Oct 2007 to give you midnight on the 3rd of October 2007 UTC (there’s nothing in Java that represents an interval, so we have to settle for representative instants, and we can’t rely on things like databases to cope with timezones). Why doesn’t this work?

Because of the programmer’s friend the timezone. SimpleDateFormat assumes, if it’s not told, that what it’s parsing is in the system timezone, so in London the string from before is interpreted as midnight on the 3rd of October, British Summer Time; i.e., eleven at night on the 2nd, UTC. That’s not a bad assumption, although I didn’t see it documented anywhere, and fits well with the rest of the formatting APIs and locales and so on. The problem is that there’s no way to stop it behaving like that.

There are some teases, though: for example, java.text.DateFormat.setTimeZone() sounds like it will do what I want; but sadly, it only counts for formatting and not for parsing.

So, to the payoff: The expedient way to “neutralise” a date is to simply apply the system timezone offset and daylight savings offset to the parse result, then assert that it’s UTC.

import java.text.*;
import java.util.*;

public class UTCParse {
    public static void main(String[] args) {
    SimpleDateFormat sdf = new SimpleDateFormat("dd MMM yyyy");
 SimpleTimeZone utc = new SimpleTimeZone(0, "UTC");
  Calendar c = new GregorianCalendar(); // this picks up the system timezone
  try {
       Date d = sdf.parse(args[0]); // this assumes the system timezone
        c.setTimeInMillis(d.getTime());
     if (args.length > 1 && "utc".equals(args[1])) {
                // normalise the value
       c.setTimeInMillis(c.getTimeInMillis() +
                                        c.get(Calendar.ZONE_OFFSET) + 
                                        c.get(Calendar.DST_OFFSET));
       }

       SimpleDateFormat outformat = new SimpleDateFormat("dd MMM yyyy HH:mm Z");
       outformat.setTimeZone(utc);
     System.out.println(outformat.format(new Date(c.getTimeInMillis())));
    }
   catch (ParseException e) {
      System.out.println("Try a date formatted dd MMM yyyy");
 }
    }
}
by
mikeb
on
31/10/07

What’s in a Distinguished Name?

In one of our projects, we access a directory server via LDAP to obtain subject distinguished names (DNs) of X509 certificates. The team that provision the directory recently asked a simple question: what order should the components of the subject DN be in? Below follows my explanation of the answer. Please note that the referenced specifications are not necessarily the latest versions, but are the ones supported by my version of OpenSSL and Sun Directory Server.

The subject DN is defined in RFC 2459 by the X.501 type Name as an ASN.1 structure. It consists of a sequence of Relative Distinguished Names (RDNs), which are themselves sets of attribute type and value pairs. For example, given the following DN:

C=UK, ST=England, L=London, O=LShift Ltd., OU=Development, CN=Lee Coomber/UID=123456

‘C=UK’ is an RDN, ‘C’ is an attribute type, and ‘UK’ is an attribute value. Normally, an RDN only consists of one attribute type / value pair, but more are allowed such as in the example of the final RDN above which consists of the pairs ‘CN=Lee Coomber’ and ‘UID=123456′.

We are ultimately interested in how to compare two DNs and this is defined in RFC 2252 as being a match type of “distinguishedNameMatch”. This is a standard match type defined in X501 which states that a presented DN matches a target DN if, and only if:

  • the number of RDNs is the same
  • each corresponding RDN has the same number of attribute value pairs
  • each of the attribute pairs that are of the same type have matching values as determined by the match rule for that type

Note that the order of the attribute pairs within an RDN does not matter.

The actual format of the certificates is defined in ASN.1 which as a binary format is not readable by anyone you would want to sit and have a drink with. A (almost) human readable form can be obtained using the OpenSSL command “asn1parse”. Using the example “self-signed” certificate given above and the following command:

$ openssl asn1parse -in server.crt

we can see the subject DN as part of the certificate in the diagram below:

DN Breakdown

There are two things to note. The first, is that the order shown reading top-down is the order of the RDNs as they are actually stored in the certificate. The second, is the unambiguous distinction between RDNs and attribute pairs. Given the match definition above we could determine whether a given certificate subject DN matches another by looking at the output of this command for the two certificates (ignoring the details of the attribute value comparisons for the purposes of this discussion).

We use strings to enter the DNs into the LDAP server and the method for representing DNs in this way is defined in RFC 2253. The main considerations are:

  • RDN order is reversed
  • RDNs are separated by commas
  • attribute pair order does not matter
  • attribute pairs in multi-valued RDNs are separated by the plus sign
  • attribute type and value are separared by the equals sign
  • reserved characters are escaped using a backslash

So given our example earlier, the string representation would be:

UID=123456+CN=Lee Coomber,OU=Development,O=LShift Ltd.,L=London,ST=England,C=UK

OpenSSL provides a mechanism for doing this though be warned that by default it uses its own more “readable” mechanism for displaying DNs that are not compliant with the standards described here. To display DNs in accordance with RFC 2253 use the ‘nameopt’ command line switch with the parameter “RFC2253″. For example, to see the formatted subject DN of our test certificate:

$ openssl x509 -subject -nameopt RFC2253 -noout -in server.crt
subject= UID=123456+CN=Lee Coomber,OU=Development,O=LShift Ltd.,L=London,ST=England,C=UK

by
Lee Coomber
on
30/10/07

XML CDATA and escaping

XML’s syntax for CDATA looks like this:

  <![CDATA[some text]]>

Tag syntax within a CDATA section is suspended, so this is well-formed XML:

  <![CDATA[some <more> text]]>

even though it looks like the “<more>” tag is unclosed.

There’s only one thing you can’t say in a CDATA section: “]]>”. But there’s a trick to save us, even here. To print an arbitrary string in a CDATA enclosure, replace each instance of “]]>” with “]]]]><![CDATA[>", and then put the normal "<![CDATA["/"]]>” brackets around it:

  my ]]> text

becomes

  <![CDATA[my ]]]]><![CDATA[> text]]>
by
tonyg
on
25/10/07

Nodelists are not like arrays

In JavaScript, it is possible to index into a DOM nodelist by using the square bracket notation:
var firstchild = node.childNodes[0];. But, despite Apple saying “[a] nodeList is equivalent to an array”, and the Mozilla guys saying “[a] nodeList is an array of elements”, a nodelist is not like an array.

It is tempting to use JavaScript’s for .. in syntax to iterate through a nodelist; e.g.,

var nodes = parent.childNodes;
for (var i in nodes) {
    doSomethingWith(nodes[i]);
}

This won’t work: In Firefox, you get numbers, then ‘item’ and ‘length’, and with WebKit you get just ‘item’ and ‘length’. If a nodelist was like an array, this is really not what you’d expect!

But, it shouldn’t work, and in fact, Firefox gets this wrong. Looking at the specification for JavaScript DOM bindings, we can see that the NodeList interface has two properties, ‘length’ and ‘item’, and that the dereference using square brackets is just a misleading convenience.

by
mikeb
on
17/10/07

Your very own 32-way SIMD machine

What’s a good way of counting the number of bits set in a word? The obvious answer, adding the low bit to an accumulator, shifting right, and repeating, is O(n) in the number of bits in the word. This is a sequential approach – and we can do better, complexity-wise, by using a parallel algorithm. Let’s assume we are using 32-bit words, and that Xn is just such a 32-bit word:

X0 = input word
X1 = (X0 & 0x55555555) + ((X0 >>  1) & 0x55555555)
X2 = (X1 & 0x33333333) + ((X1 >>  2) & 0x33333333)
X3 = (X2 & 0x0F0F0F0F) + ((X2 >>  4) & 0x0F0F0F0F)
X4 = (X3 & 0x00FF00FF) + ((X3 >>  8) & 0x00FF00FF)
X5 = (X4 & 0x0000FFFF) + ((X4 >> 16) & 0x0000FFFF)
total number of set bits = X5

This algorithm is O(log2 n) in the number of bits in a word.

Every ordinary N-bit-word based sequential machine is a disguised N-way, 1-bit SIMD machine with a slightly odd instruction set. Lots more on data-parallel algorithms here.

What about finding which is the highest bit set in a word?

X0 = input word
X1 = X0 or (X0 >> 1)
X2 = X1 or (X1 >> 2)
X3 = X2 or (X2 >> 4)
X4 = X3 or (X3 >> 8)
X5 = X4 or (X4 >> 16)

… and feed X5 through the parallel counter-of-set-bits algorithm above. The resulting number is the index of the highest set bit in the original word, starting from zero.

by
tonyg
on
15/10/07

JONESFORTH ported to PowerPC and Mac OS X

A couple of weeks ago, Richard W. M. Jones released JONESFORTH, which I thought was pretty exciting. Today I spent a few hours porting the assembly-language part to PowerPC on Mac OS X 10.3.9. It ended up being 600 non-comment lines of code in total, and took me about eleven hours in total to write and debug. It runs the standard JONESFORTH prelude, up to and including SEE.

You can download the code here: ppcforth.S.m4.

(It’s also available via darcs: darcs get http://www.eighty-twenty.org/~tonyg/Darcs/jonesforth.)

The assembler-macro tricks that the original i386 version uses are sadly unavailable with the default OS X assembler, so I’ve had to resort to using m4 instead; other than that, it’s more-or-less a direct translation of Richard’s original program. To compile it,

m4 ppcforth.S.m4 > ppcforth.S
gcc -nostdlib -o ppcforth ppcforth.S
rm ppcforth.S

To run it, download the JONESFORTH prelude (save it as jonesforth.f), and

$ cat jonesforth.f - | ./ppcforth 
JONESFORTH VERSION 14641 
OK 

Here’s an example session, decompiling the “ELSE” word:

SEE ELSE
: ELSE IMMEDIATE ' BRANCH , HERE @ 0 , SWAP DUP HERE @ SWAP - SWAP ! ;

I’d like to thank Richard for such an amazingly well-written program: not only is JONESFORTH itself a beautiful piece of software, it’s also an incredibly lucid essay that does a wonderful job of introducing the reader to the concepts and techniques involved in implementing a FORTH.

by
tonyg
on
04/10/07

Proper Unicode support in Erlang RFC4627 (JSON) module

In a previous post I explored some of the options for supporting RFC4627 (JSON) Unicode-in-strings well when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are to be interpreted as UTF-8 encoded text now, whereas before their interpretation was less well defined.

The new module is available as a tarball (automatically generated from the github repository) or by browsing online here. You can also get the code using git:

git clone git://github.com/tonyg/erlang-rfc4627.git

Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)

Eshell V5.5.5  (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5> 

Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.

5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({'utf-32le',
        rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
"{\"æ°´\":123}"
11> 

Notice, on that final example, that Erlang is printing the final UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string are the correct UTF-8 encoding for Unicode code point 27700.

by
tonyg
on
03/10/07

Too much mail is bad for you

We received a few reports from users of our Erlang-based RabbitMQ message broker who saw sharp decreases in throughput performance when putting the broker under heavy load. We subsequently reproduced these results in our lab. This is not what we expected to see – while some performance degradation is inevitable when running a system at its limits, we had carefully designed RabbitMQ to make such degradation are small and gradual. So clearly the system was behaving in ways we had not anticipated.

We eventually tracked down the problem. Take this example program, which sets up a chain of three processes: a producer sending N messages to a consumer which for every message it receives performs M request/response interaction with an echo process.

The interaction between the consumer and echo process follows the same pattern as what happens under the covers of Erlang’s gen_server:call: a message is sent to the recipient and the caller then waits for a reply using a selective receive, i.e. a receive matching on a specific pattern unique to the expected response, which distinguishes the response from any other messages the caller may have in their incoming message queue.

The interaction between the producer and consumer comes in two flavours. When WithAck = false the producer sends messages to the consumer completely asynchronously. By contrast, when WithAck = true the producer waits for an acknowledgment to every message before proceeding. The consumer sends that acknowledgment as soon as it is has received the message.

Which version is faster? One would think that the asynchronous version is faster than the synchronous version since the former does not have to carry out the additional work of sending/receiving acknowledgments. Wrong.

Eshell V5.5.5  (abort with ^G)
1> mqueue:timed_run(10000, 10, false).
4892400
2> mqueue:timed_run(10000, 10, true).
85830

This has the producer sending 10000 messages, and the consumer calling the echo process 10 times for each message. The synchronous version outperforms the asynchronous version by a factor of 57!

How does this huge discrepancy arise? The culprit is the consumer’s message queue, or, more precisely, its length. Firstly, the asynchronous version of the above test may result in 10000 messages needing to be enqueued , i.e. in the worst case scenario when the consumer does not start any work until the producer has finished sending all messages. However, surely queuing up 10000 messages shouldn’t take nearly five seconds? Indeed it doesn’t, as a few experiments quickly confirm.

The root cause for the excessive time taken by the asynchronous version is the selective receive performed when waiting for the echo response. A selective receive scans the message queue until it finds a match. The echo response will typically be at or near the end of the queue, since it will only have been sent a short while after the echo request. So if the message queue contains a lot of messages from the producer, every time the consumer is waiting for the echo response it has to scan over all these messages. If there are N messages in the consumer’s message queue and it performs M calls to the echo process for each of them, this results in M * N * (N+1)/2 match operations, i.e. a time complexity that is quadratic in the number of messages.

Could Erlang do something smarter here?

One possibility is to use a richer data structure for the message queue, optimised for pattern matching. That is difficult in the general case since patterns can be of arbitrary complexity and cannot always be determined statically. Still, one could envisage a message queue data structure that dynamically optimises itself for matching against the kinds of patterns thrown at it.

Another option is to allow processes to have several, independently addressable message queues, as in, e.g., the join calculus and various other process algebras. Replies to calls could then be sent to a secondary, usually empty, queue. I wonder what an encoding of the join calculus into Erlang would look like …

Meanwhile the lesson is: if you make synchronous calls inside a process you’d better make sure its message queue is short.

PS: The latest release of RabbitMQ, fixing the problem originally reported, is available for download here.

by
matthias
on
01/10/07

Search

Categories

You are currently browsing the LShift Ltd. blog archives for October, 2007.

Feeds

Archives

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us