technology from back to front

Archive for the ‘Standards’ Category

Yahoo doesn’t know what an email address is

Many websites refuse to accept email addresses of the form `someuser+sometext@gmail.com`, despite the fact that the `+sometext` part is perfectly legitimate[1] and is an advertised feature gmail offers for creating pseudo-single-use email addresses from a base email address.

My guess is that the developers of these sites think, because they’re either lazy or incompetent, that email addresses have more restrictions than they in fact have. It’s reasonable (and fairly easy) these days to check the syntax of the DNS part of an email address, because few people use non-DNS or non-SMTP transfer methods anymore, but the mailbox part is extremely flexible and hard to check accurately. A sane thing to do is just trust the user, and send a test mail to validate the address.
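
That division of labour can be sketched in a few lines (my own illustration, not any particular site's code): syntax-check only the DNS part, accept any non-empty mailbox part, and leave real validation to the test mail.

```python
import re

# Hypothetical helper: accept any non-empty mailbox part, and check only
# that the domain looks like a plausible DNS name (RFC 1035's
# letter-digit-hyphen labels). The mailbox part is never second-guessed;
# a confirmation email does the real validation.
DNS_LABEL = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?$")

def plausible_address(addr):
    mailbox, sep, domain = addr.rpartition("@")
    if not sep or not mailbox:
        return False  # no "@" at all, or an empty mailbox part
    labels = domain.split(".")
    return len(labels) >= 2 and all(DNS_LABEL.match(label) for label in labels)
```

Under a check like this, a gmail plus-address sails through, because the part before the `@` is simply trusted.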

I picked on Yahoo in the title of this post: Yahoo are by no means the only offender, but I just signed up for a yahoo account, so they’re for me the most recent. Their signup form also refused to provide any guidance about why they were rejecting the form submission: I had to use my previous experience of sites wrongly rejecting valid email addresses to guess what the problem might be. Fail.


Footnote 1: According to my best reading of the relevant RFCs, anyway. See the definition of `dot-atom` in section 3.2.4 of [RFC 2822](https://www.rfc-editor.org/rfc/rfc2822), referenced in this context by section 3.4.1.


Reverse HTTP == Remote CGI

I’ve been working recently on Reverse HTTP, an approach to making HTTP easier to use as the distributed object system that it is. My work is similar to that of Lentczner and Preston, but was independently invented and is technically a bit different: first, I’m using plain vanilla HTTP as a transport; and second, I’m focussing a little more on the enrollment, registration, queueing and management aspects of the system. My draft spec is here (though as I’m still polishing, please excuse its roughness), and you can try some demos or download and experiment with an implementation of the spec.

Comments welcome!


Streamlining HTTP

HTTP/1.1 is a lovely protocol. Text-based, sophisticated, flexible. It
does tend toward the verbose though. What if we wanted to use HTTP’s
semantics in a very high-speed messaging situation? How could we
mitigate the overhead of all those headers?

Now, bandwidth is pretty cheap: cheap enough that for most
applications the kind of approach I suggest below is ridiculously far
over the top. Some situations, though, really do need a more efficient
protocol: I’m thinking of people having to consume the [OPRA][] feed,
which is fast approaching 1 million messages per second ([1][], [2][],
[3][]). What if, in some bizarre situation, HTTP was the protocol used
to deliver a full OPRA feed?

### Being Stateful

Instead of having each HTTP request start with a clean slate after the
previous request on a given connection has been processed, how about
giving connections a memory?

Let’s invent a syntax for HTTP that is easy to translate back to
regular HTTP syntax, but that [avoids repeating ourselves][DRY] quite
so much.

Each line starts with an opcode and a colon. The rest of the line is
interpreted depending on the opcode. Each opcode-line is terminated
with CRLF.

V:HTTP/1.x           Set HTTP version identifier
B:/some/base/url     Set base URL for requests
M:GET                Set method for requests
<:somename           Retrieve a named configuration
>:somename           Give the current configuration a name
H:Header: value      Set a header
-:/url/suffix        Issue a bodyless request
+:/url/suffix 12345  Issue a request with a body

Opcodes `V`, `B`, `M` and `H` are hopefully self-explanatory. I’ll
explore `<` and `>` below. The opcodes `-` and `+` actually complete
each request and tell the server to process the message.

Opcode `-` takes as its argument a URL fragment that gets appended to
the base URL set by opcode `B`. Opcode `+` does the same, but also
takes an ASCII `Content-Length` value, which tells the server to read
that many bytes after the CRLF of the `+` line, and to use the bytes
read as the entity body of the HTTP request.

`Content-Length` is a slightly weird header, more properly associated
with the entity body than the headers proper, which is why it gets
special treatment. (We could also come up with a syntax for indicating
chunked transfer encoding for the entity body.)

As an example, let’s encode the following `POST` request:

POST /someurl HTTP/1.1
Content-Type: text/plain
Accept-Encoding: identity
Content-Length: 13

hello world

Encoded, this becomes

H:Content-Type: text/plain
H:Accept-Encoding: identity
+: 13
hello world

Not an obvious improvement. However, consider issuing 100 copies of
that same request on a single connection. With plain HTTP, all the
headers are repeated; with our encoded HTTP, the only part that is
repeated is:

+: 13
hello world

Instead of sending (151 * 100) = 15100 bytes, we now send 130 + (20 *
100) = 2130 bytes.
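
To show how little machinery the sender needs, here is a toy encoder for the scheme (the class and method names are mine, not part of any spec); it emits an opcode line only when a value differs from the connection’s remembered state:

```python
# A toy encoder for the stateful scheme above. An opcode line is emitted
# only when it differs from what the connection already remembers, so
# repeated requests shrink to just their "+:" or "-:" line plus any body.
class StatefulEncoder:
    def __init__(self):
        self.state = {}  # what the other end of the connection remembers

    def _set(self, key, line, out):
        if self.state.get(key) != line:
            self.state[key] = line
            out.append(line)

    def request(self, method, base, suffix, headers, body=None):
        out = []
        self._set("V", "V:HTTP/1.1", out)
        self._set("M", "M:" + method, out)
        self._set("B", "B:" + base, out)
        for name, value in headers:
            self._set("H:" + name, "H:%s: %s" % (name, value), out)
        if body is None:
            out.append("-:" + suffix)
            return "\r\n".join(out) + "\r\n"
        out.append("+:%s %d" % (suffix, len(body)))
        return "\r\n".join(out) + "\r\n" + body
```

Feeding the earlier `POST` through one encoder twice reproduces the arithmetic: the first request carries all the opcode lines, while the repeat collapses to `+: 13` plus the 13-byte body.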

The scheme as described so far takes care of the unchanging parts of
repeated HTTP requests; for the changing parts, such as `Accept` and
`Referer` headers, we need to make use of the `<` and `>`
opcodes. Before I get into that, though, let’s take a look at how the
scheme so far might work in the case of OPRA.

### Measuring OPRA

Each OPRA quote update is on average 66 bytes long, making for
around 63MB/s of raw content.

Let’s imagine that each delivery appears as a separate HTTP request:

POST /receiver HTTP/1.1
Content-Type: application/x-opra-quote
Accept-Encoding: identity
Content-Length: 66


That’s 213 bytes long: an overhead of 220% over the raw message

Encoded using the stateful scheme above, the first request appears on
the wire as

H:Content-Type: application/x-opra-quote
H:Accept-Encoding: identity
+: 66
[66 bytes of quote data]

and subsequent requests as

+: 66
[66 bytes of quote data]

for an amortized per-request size of 73 bytes: a much less problematic
overhead of 11%. In summary:

| Encoding     | Bytes per message body | Per-message overhead (bytes) | Size increase over raw content | Bandwidth at 1M msgs/sec |
|--------------|------------------------|------------------------------|--------------------------------|--------------------------|
| Plain HTTP   | 66                     | 147                          | 220%                           | 203.1 MBy/s              |
| Encoded HTTP | 66                     | 7                            | 11%                            | 69.6 MBy/s               |

Using plain HTTP, the feed doesn’t fit on a gigabit ethernet. Using
our encoding scheme, it does.

Besides the savings in terms of bandwidth, the encoding scheme could
also help with saving CPU. After processing the headers once, the
results of the processing could be cached, avoiding unnecessary
repetition of potentially expensive calculations such as routing,
authentication, and authorisation.

### Almost-identical requests

Above, I mentioned that some headers changed, while others stayed the
same from request to request. The `<` and `>` opcodes are intended to
deal with just this situation.

The `>` opcode stores the current state in a named register, and the
`<` opcode loads the current state from a register. Headers that don’t
change between requests are placed into a register, and each request
loads from that register before setting its request-specific headers.
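
A toy model of those two opcodes (the naming and configuration shape are my own invention, purely illustrative): `>` deep-copies the connection’s current configuration into a named register, and `<` replaces the configuration with a stored copy.

```python
import copy

# Illustrative model of the ">" (store) and "<" (load) opcodes.
class Connection:
    def __init__(self):
        self.config = {"version": "HTTP/1.1", "method": "GET",
                       "base": "/", "headers": {}}
        self.registers = {}

    def store(self, name):  # ">:name"
        self.registers[name] = copy.deepcopy(self.config)

    def load(self, name):   # "<:name"
        self.config = copy.deepcopy(self.registers[name])
```

After a `store`, any request-specific headers added later are wiped away again by the matching `load`, which is exactly the behaviour needed for headers that vary per request.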

To illustrate, imagine the following two requests:

GET / HTTP/1.1
Host: example.com
Cookie: key=value
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

GET /style.css HTTP/1.1
Host: example.com
Cookie: key=value
Accept: text/css,*/*;q=0.1

One possible encoding is:

V:HTTP/1.1
M:GET
B:/
H:Host: example.com
H:Cookie: key=value
>:config1
H:Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
-:
<:config1
H:Accept: text/css,*/*;q=0.1
-:style.css

By using `<:config1`, the second request reuses the stored settings
for the method, base URL, HTTP version, and `Host` and `Cookie`
headers, and only has to supply its own `Accept` header before
issuing the request.

### It’ll never catch on, of course — and I don’t mean for it to

Most applications of HTTP do fine using ordinary HTTP syntax. I’m not
suggesting changing HTTP, or trying to get an encoding scheme like
this deployed in any browser or webserver at all. The point of the
exercise is to consider how low one might make the bandwidth overheads
of a text-based protocol like HTTP for the specific case of a
high-speed messaging scenario.

In situations where the semantics of HTTP make sense, but the syntax
is just too verbose, schemes like this one can be useful on a
point-to-point link. There’s no need for global support for an
alternative syntax, since people who are already forming very specific
contracts with each other for the exchange of information can choose
to use it, or not, on a case-by-case basis.

Instead of specifying a whole new transport protocol for high-speed
links, people can reuse the considerable amount of work that’s gone
into HTTP, without paying the bandwidth price.

### Aside: AMQP 0-8 / 0-9

Just as a throwaway comparison, I computed the minimum possible
overhead for sending a 66-byte message using AMQP 0-8 or 0-9. Using a
single-letter queue name, “`q`”, the overhead is 69 bytes per message,
or 105% of the message body. For our OPRA example at 1M messages per
second, that works out at 128.7 megabytes per second, and we’re back
over the limit of a single gigabit ethernet again. Interestingly,
despite AMQP’s binary nature, its overhead is much higher than a
simple syntactic rearrangement of a text-based protocol in this case.

### Conclusion

We considered the overhead of using plain HTTP in a high-speed
messaging scenario, and invented a simple alternative syntax for HTTP
that drastically reduces the wasted bandwidth.

For the specific example of the OPRA feed, the computed bandwidth
requirement of the experimental syntax is only 11% higher than the raw
data itself — nearly 3 times less than ordinary HTTP.






E4X: Not as awful as I thought

Long, long ago, I complained about various warts and infelicities in E4X, the ECMAScript extensions for generating and pattern-matching XML documents. It turns out that two of my complaints were not well-founded: sequence-splicing is supported, and programmatic construction of tags is possible.

Firstly (and I’m amazed I didn’t realise this at the time, as I was using it elsewhere), it’s not a problem at all to splice in a sequence of items, in the manner of Scheme’s unquote-splicing; here’s a working solution to the problem I set myself:

function buildItems() {
  return <>
           <item>Hello</item>
           <item>World!</item>
         </>;
}
var doc = <mydocument>{buildItems()}</mydocument>;

You can even use real Arrays (which is what I tried and failed to do earlier), by guerilla-patching Array.prototype:

Array.prototype.toXMLList = function () {
    var x = <container/>;
    for (var i = 0; i < this.length; i++) {
        x.appendChild(this[i]);
    }
    return x.children();
};

function buildItems() {
    return [<item>Hello</item>,
            <item>World!</item>].toXMLList();
}
var doc = <mydocument>{buildItems()}</mydocument>;

Programmatic construction of tags is done by use of the syntax for plain old unquote, in an unusual position: inside the tag’s angle-brackets:

var tagName = "p";
var doc = <{tagName}>test</{tagName}>;

So in summary, my original expectation that E4X should turn out to be very quasiquote-like wasn’t so far off the mark. It’s enough to get the basics done (ignoring for the minute the problems with namespace prefixes), but it’s still a bit of a bolt-on afterthought; it would have been nice to see it better integrated with the rest of the language.


Astral Plane characters in Erlang JSON/RFC4627 implementation

Sam Ruby examines support for astral-plane characters in various JSON implementations. His post prompted me to check my Erlang implementation of rfc4627. I found that for astral plane characters in utf-8, utf-16, or utf-32, everything worked properly, but the RFC4627-mandated surrogate-pair “\uXXXX” encodings broke. A few minutes hacking later, and:

Eshell V5.5.5  (abort with ^G)
1> {ok, Utf8Encoded, []} =
2> xmerl_ucs:from_utf8(Utf8Encoded).
3> rfc4627:encode(Utf8Encoded).

Much better.

You can get the updated code from


Invitation to AMQP and RabbitMQ Birds of a Feather session

I am guest blogging here on behalf of CohesiveFT. We work with the excellent LShift team on our joint venture, RabbitMQ.

I’m here to invite you to a Birds of a Feather session this coming Thursday, August 30th, at 8pm, in central London. It is FREE and will last for 45 minutes starting at 8pm, followed by the traditional breakout discussions over a beer.

Please do take a look at RabbitMQ if you have not yet done so. It’s a commercial open source product, available under the MPL 1.1 and implementing the Advanced Message Queuing Protocol (AMQP). AMQP is a new way to do business messaging (i.e. “what goes in, must come out”). What’s really cool is that, like HTTP, it is a protocol instead of a language-specific API. This should make interoperability between platforms much easier and less painful (business readers: “systems integration projects take less time and success can be predicted more accurately”). For more information, please see my list of links here.

What is the BOF about – and why come? It’s an informal session about RabbitMQ and AMQP, and how they apply within popular environments such as Spring, Mule, Ruby, AJAX, and other messaging protocols such as FIX.

“Informal” means we’ll be encouraging a conversation between people interested in any of these things. We want to hear from you, and from each other, rather than pushing slideware at people.

Come if you want to:

* Meet the RabbitMQ team and hear about AMQP and its implementation
* Meet Neil Harris and Damian Raffell from MuleSource
* Meet Ben Hale from Interface21, makers of the Spring Framework
* Meet other end users of these technologies, and people from the AMQP Working Group
* Talk about what you want from these products and how they might work together
* See demos of these products working together on Amazon’s EC2 Cloud using CohesiveFT Elastic Server

Details of the BOF are here. Ideally we ask you to register via the web site, but late arrivals are very welcome – if you turn up, we shall get you in. The BOF is offered as part of the popular EJUG series of tech talks and as a tie-in with the most excellent No Fluff Just Stuff conference.

If you cannot come but want to know more about any of these things then you can email us at

Thank-you very much – and we hope to see you on Thursday :-)

Posted by Chris on behalf of Alexis Richardson, CohesiveFT.


RFC 1982 limits itself to powers of two unnecessarily

RFC 1982 defines a “Serial Number Arithmetic”, for use when you have a fixed number of bits available for some monotonically increasing sequence identifier, such as the DNS SOA record serial number, or message IDs in some messaging protocol. It defines all its operations with respect to some power of two, (2^SERIAL\_BITS). It struck me just now that there’s no reason why you couldn’t generalise to any number that simply has two as a factor. You’d simply replace any mention of (2^SERIAL\_BITS) by, say, N, and any mention of (2^(SERIAL\_BITS-1)) by (N/2). The definitions for addition and comparison still seem to hold just as well.

One of the reasons I was thinking along these lines is that in Erlang, it’s occasionally useful to model a queue in an ETS table or in a process dictionary. If one didn’t mind setting an upper bound on the length of one’s modelled queue, then by judicious use of RFC 1982-style sequence number wrapping, one might ensure that the space devoted to the sequence numbering required of the model remained bounded. Using a generalised variant of RFC 1982 arithmetic, one becomes free to choose any number as the queue length bound, rather than any power of two.
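
For concreteness, here is the generalised arithmetic with `(2^SERIAL_BITS)` replaced by an arbitrary even `N` (a sketch of my reading of RFC 1982, not code taken from it):

```python
# RFC 1982 serial number arithmetic with the modulus generalised from
# 2^SERIAL_BITS to any even N. Addition is defined for 0 <= n <= N/2 - 1,
# mirroring the RFC's 0 <= n <= 2^(SERIAL_BITS - 1) - 1.
def serial_add(s, n, N):
    assert 0 <= n <= N // 2 - 1
    return (s + n) % N

def serial_lt(s1, s2, N):
    # The RFC's comparison, with 2^(SERIAL_BITS - 1) replaced by N/2.
    half = N // 2
    return ((s1 < s2 and s2 - s1 < half) or
            (s1 > s2 and s1 - s2 > half))
```

With N = 10 (a queue bound of ten, not a power of two), 8 + 3 wraps around to 1, and 1 is correctly judged to come after 8.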


HTML Email is hard to get right

For a recent project, we developed support for sending
automatically-generated HTML emails. Now, most people do this by
including a message body with MIME-type text/html. For extra
points, sometimes there’s also a text/plain part alongside the
HTML in a multipart/alternative container.

The problem with doing things this way is that you can’t include any
images or other resources (such as CSS) as separate parts of the email
linked to from the main HTML body-part. For that, you need to use the
multipart/related MIME-type. Unfortunately, few commonly-used email
clients render multipart/related HTML-plus-resource aggregations well.

We only tried the arrangement where the multipart/related,
containing the main HTML page and its associated resources, was a
sibling of the text/plain part within the
multipart/alternative container. The inverse arrangement,
with the multipart/alternative as the main document within
the multipart/related part, was something we have yet to
experiment with.

Here’s a picture of the structure of our initial attempts:

 multipart/alternative
  +-- text/plain
  +-- multipart/related
       +-- text/html
       +-- image/gif
       +-- text/css

This worked reasonably well in Thunderbird and Outlook 2002,
but we had consistent reports from our customer that the images and
stylesheet would randomly fail to display in Outlook 2003 (SP2). After
lots of mucking around trying to get Outlook to either work reliably
or fail reliably, we gave up on that line and instead simplified the
structure of our emails, putting the CSS styling inline in the HTML
HEAD element:

 multipart/alternative
  +-- text/plain
  +-- multipart/related
       +-- text/html (with text/css inline in HEAD)
       +-- image/gif

This didn’t work particularly well, either: it seems many email
clients ignore styles set in the HEAD element. Finally, we
moved to applying CSS styling inline, using a style attribute
on each styled element. We were able to use an XSLT transformation to
allow us to write clean HTML and apply the CSS style
attribute automatically. The final structure of the emails we sent:

 multipart/alternative
  +-- text/plain
  +-- multipart/related
       +-- text/html (with text/css copied on to each element!)
       +-- image/gif
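
That final structure is straightforward to build with, for instance, Python’s standard `email` package (a sketch with made-up content, not our production code):

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# multipart/alternative containing text/plain and multipart/related; the
# related part holds the HTML (styles inlined on each element) plus an
# image referenced from it by Content-ID. All content here is made up.
msg = MIMEMultipart("alternative")
msg.attach(MIMEText("Hello, plain-text readers.", "plain"))

related = MIMEMultipart("related")
html = ('<html><body>'
        '<p style="font-family: sans-serif">Hello <img src="cid:logo"/></p>'
        '</body></html>')
related.attach(MIMEText(html, "html"))

logo = MIMEImage(b"GIF89a...", "gif")  # stand-in bytes, not a real GIF
logo.add_header("Content-ID", "<logo>")
related.attach(logo)

msg.attach(related)
```

The `cid:logo` URL in the HTML resolves against the image part’s `Content-ID` header, which is what lets the image travel inside the email rather than being fetched from a web server.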

This seems to work more-or-less reliably across

* Thunderbird
* Outlook 2002
* Outlook 2003 SP2
* Google Gmail
* MS Hotmail

If I was to do it all again, I’d give serious consideration to the
traditional non-multipart text/html solution with images
hosted by some public-facing web server. We managed to get our
multipart-HTML-emails working acceptably, but only by the skin of our
teeth.


* The MIME media type registry
* multipart/alternative: RFC 2045, RFC 2046
* multipart/related: RFC 2387


E4X: I want my S-expressions back

E4X is a new ECMA standard (ECMA-357) specifying an extension to ECMAScript for streamlining work with XML documents.

It adds objects representing XML to ECMAScript, and extends the syntax to allow literal XML fragments to appear in code. It also supports a very XPath-like notation for use in extracting data from XML objects. So far, so good – all these things are somewhat useful. However, there are serious problems with the design of the extension.

If E4X objects were real objects, if there were a means of splicing a sequence of child nodes into XML literal syntax, and if E4X supported XML namespace prefixes properly, most of my objections would be dealt with. As it stands, the overall verdict is “clunky at best”.

These are my main complaints:

* It doesn’t do anything like Scheme’s unquote-splicing, and so using E4X to produce XML objects is verbose, error-prone and dangerous in concurrent settings.

There seems to be no way of splicing in a sequence of items – I’d like to do something like the following:

  function buildItems() {
      return [<item>Hello</item>,
              <item>World!</item>];
  }
  var doc = <mydocument>{buildItems()}</mydocument>;

and have doc contain

  <mydocument>
    <item>Hello</item>
    <item>World!</item>
  </mydocument>

What actually results is the more-or-less useless

  <mydocument>Hello,World!</mydocument>

The closest I can get to the result I’m after is

   function buildItems(n) {
       n.mydocument += <item>Hello</item>;
       n.mydocument += <item>World!</item>;
   }
   var doc = <mydocument></mydocument>;

* It’s full of redundant redundancy – it’s as verbose as XML, when you can do so much better.

* There’s no toXML() method (or similar) for use in papering over the yawning chasm between the XML objects and the rest of the language: you can’t even make a Javascript object able to seamlessly render itself to XML.

* The new types E4X introduces aren’t even proper objects – they’re a whole new class of primitive datum!

* Because they’re not proper objects, you can’t extend the system. You ought to be able to implement to an interface and benefit from the language’s XPath searching and filtering operations. E4X is so close to offering a comprehension facility for Javascript, but it’s been short-sightedly restricted to a single class of non-extensible primitives.

* You can’t even construct XML tags programmatically! If the name of the tag doesn’t appear literally in your Javascript code, you’re out of luck (unless you resort to eval…) [[Update: I was wrong about this - you can write <{expr}> and have the result of evaluating expr substituted into the tag.]]

* E4X XML objects have no notion of namespace prefixes (which are required for quality implementations of XPath and anything to do with XML signatures). Prefixes only turn up in the API as a means of producing (namespaceURI,localname) pairs. This is actually how it should be, but because there’s already broken software out there
that depends on prefix support, by not supporting prefixes properly you preclude ECMAScript+E4X from being used for XML signatures or ECMAScript-native XPath implementations.

In my opinion, E4X violates several programming language design principles: most importantly, those of regularity, simplicity and orthogonality, but also preservation of information, automation and structure. SXML, perhaps in combination with eager comprehensions, provides a far superior model for producing and consuming XML. Sadly, there’s no real alternative for ECMAScript yet – we’re limited either to library extensions, or to using the DOM without any syntactic or library support at all.


XML tunnel-vision

In Ant 1.6, properties can be written in XML files. Can someone tell me why

<property name="some.property" value="some.value"/>

is more desirable than

some.property=some.value

Update: The import feature is what’s new in Ant 1.6 that makes this usage possible. So, the answer is, “because you can conditionally set properties in the imported files” (rather than conditionally evaluating property files). So really my gripe is (again) with the weird, stilted little language that is Ant.







© 2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK. +44 (0)20 7729 7060