Those guys at Arup are taking this seriously and unlike many others who postulate in this area they have the clout and the commercial imperative to influence decisions and make changes. Interesting times ahead…
Many websites refuse to accept email addresses of the form myusername+sometext@gmail.com, despite the fact that the +sometext is perfectly legitimate (see footnote 1) and is an advertised feature Gmail offers for creating pseudo-single-use email addresses from a base email address.
My guess is that the developers of these sites think, because they’re either lazy or incompetent, that email addresses have more restrictions than they in fact have. It’s reasonable (and fairly easy) these days to check the syntax of the DNS part of an email address, because few people use non-DNS or non-SMTP transfer methods anymore, but the mailbox part is extremely flexible and hard to check accurately. A sane thing to do is just trust the user, and send a test mail to validate the address.
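A minimal sketch of that "check the domain, trust the mailbox" approach, in Python. The helper name `plausible_address` and the rule that a domain needs at least two labels are assumptions made for illustration; the real validation is still the test mail.

```python
import re

# One DNS label: letters, digits and hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def plausible_address(address: str) -> bool:
    """Hypothetical helper: loose syntax check to run before sending a test mail."""
    # Split on the LAST '@', because a quoted mailbox part may itself contain '@'.
    mailbox, sep, domain = address.rpartition("@")
    if not sep or not mailbox:
        return False
    labels = domain.split(".")
    # Assumption: require at least two labels, i.e. no bare hostnames.
    if len(labels) < 2 or len(domain) > 253:
        return False
    # The mailbox part is deliberately NOT inspected any further.
    return all(_LABEL.fullmatch(label) for label in labels)
```

With this check, myusername+sometext@gmail.com passes, because nothing about the mailbox part is second-guessed.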
I picked on Yahoo in the title of this post: Yahoo are by no means the only offender, but I just signed up for a yahoo account, so they’re for me the most recent. Their signup form also refused to provide any guidance about why they were rejecting the form submission: I had to use my previous experience of sites wrongly rejecting valid email addresses to guess what the problem might be. Fail.
Footnote 1: According to my best reading of the relevant RFCs, anyway. See the definition of dot-atom in section 3.2.4 of RFC 2822, referenced in this context by section 3.4.1.
I’ve been working recently on Reverse HTTP, an approach to making HTTP easier to use as the distributed object system that it is. My work is similar to the work of Lentczner and Preston, but is independently invented and technically a bit different: one, I’m using plain vanilla HTTP as a transport, and two, I’m focussing a little more on the enrollment, registration, queueing and management aspects of the system. My draft spec is here (though as I’m still polishing, please excuse its roughness), and you can play with some demos or download and play with an implementation of the spec.
Comments welcome!
HTTP/1.1 is a lovely protocol. Text-based, sophisticated, flexible. It does tend toward the verbose though. What if we wanted to use HTTP’s semantics in a very high-speed messaging situation? How could we mitigate the overhead of all those headers?
Now, bandwidth is pretty cheap: cheap enough that for most applications the kind of approach I suggest below is ridiculously far over the top. Some situations, though, really do need a more efficient protocol: I’m thinking of people having to consume the OPRA feed, which is fast approaching 1 million messages per second (1, 2, 3). What if, in some bizarre situation, HTTP was the protocol used to deliver a full OPRA feed?
Instead of having each HTTP request start with a clean slate after the previous request on a given connection has been processed, how about giving connections a memory?
Let’s invent a syntax for HTTP that is easy to translate back to regular HTTP syntax, but that avoids repeating ourselves quite so much.
Each line starts with an opcode and a colon. The rest of the line is interpreted depending on the opcode. Each opcode-line is terminated with CRLF.
| Opcode line | Meaning |
|---|---|
| V:HTTP/1.x | Set HTTP version identifier |
| B:/some/base/url | Set base URL for requests |
| M:GET | Set method for requests |
| <:somename | Retrieve a named configuration |
| >:somename | Give the current configuration a name |
| H:Header: value | Set a header |
| -:/url/suffix | Issue a bodyless request |
| +:/url/suffix 12345 | Issue a request with a body |
Opcodes V, B, M and H are hopefully self-explanatory. I’ll
explore < and > below. The opcodes - and + actually complete
each request and tell the server to process the message.
Opcode - takes as its argument a URL fragment that gets appended to
the base URL set by opcode B. Opcode + does the same, but also
takes an ASCII Content-Length value, which tells the server to read
that many bytes after the CRLF of the + line, and to use the bytes
read as the entity body of the HTTP request.
Content-Length is a slightly weird header, more properly associated
with the entity body than the headers proper, which is why it gets
special treatment. (We could also come up with a syntax for indicating
chunked transfer encoding for the entity body.)
As an example, let’s encode the following POST request:
POST /someurl HTTP/1.1
Host: relay.localhost.lshift.net:8000
Content-Type: text/plain
Accept-Encoding: identity
Content-Length: 13

hello world
Encoded, this becomes
V:HTTP/1.1
B:/someurl
M:POST
H:Host: relay.localhost.lshift.net:8000
H:Content-Type: text/plain
H:Accept-Encoding: identity
+: 13
hello world
Not an obvious improvement. However, consider issuing 100 copies of that same request on a single connection. With plain HTTP, all the headers are repeated; with our encoded HTTP, the only part that is repeated is:
+: 13
hello world
Instead of sending (151 * 100) = 15100 bytes, we now send 130 + (20 * 100) = 2130 bytes.
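To make the byte counts reproducible, here is a rough Python sketch of both encodings. The function names are invented for illustration, and the body is assumed to be "hello world" followed by CRLF, which is what makes the Content-Length 13.

```python
CRLF = b"\r\n"


def plain_request(method, url, version, headers, body=b""):
    """Ordinary HTTP/1.1 wire format for a single request."""
    lines = [f"{method} {url} {version}".encode("ascii")]
    lines += [f"{name}: {value}".encode("ascii") for name, value in headers]
    if body:
        lines.append(f"Content-Length: {len(body)}".encode("ascii"))
    # Header lines, then the blank line, then the entity body.
    return CRLF.join(lines) + CRLF + CRLF + body


def setup_lines(method, base_url, version, headers):
    """Opcode lines V, B, M and H: sent once per connection."""
    ops = [f"V:{version}", f"B:{base_url}", f"M:{method}"]
    ops += [f"H:{name}: {value}" for name, value in headers]
    return b"".join(op.encode("ascii") + CRLF for op in ops)


def request_line(suffix="", body=b""):
    """Opcode - (no body) or opcode + (body of the stated length)."""
    if body:
        return f"+:{suffix} {len(body)}".encode("ascii") + CRLF + body
    return f"-:{suffix}".encode("ascii") + CRLF


headers = [
    ("Host", "relay.localhost.lshift.net:8000"),
    ("Content-Type", "text/plain"),
    ("Accept-Encoding", "identity"),
]
body = b"hello world\r\n"  # 13 bytes, matching Content-Length: 13

plain = plain_request("POST", "/someurl", "HTTP/1.1", headers, body)
setup = setup_lines("POST", "/someurl", "HTTP/1.1", headers)
repeated = request_line("", body)

plain_total = len(plain) * 100                   # 151 * 100
encoded_total = len(setup) + len(repeated) * 100  # 130 + 20 * 100
```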
The scheme as described so far takes care of the unchanging parts of
repeated HTTP requests; for the changing parts, such as Accept and
Referer headers, we need to make use of the < and >
opcodes. Before I get into that, though, let’s take a look at how the
scheme so far might work in the case of OPRA.
Each OPRA quote update is on average 66 bytes long, making for around 63MB/s of raw content.
Let’s imagine that each delivery appears as a separate HTTP request:
POST /receiver HTTP/1.1
Host: opra-receiver.example.com
Content-Type: application/x-opra-quote
Accept-Encoding: identity
Content-Length: 66

blablablablablablablablablablablablablablablablablablablablablabla
That’s 213 bytes long, of which 147 bytes are request line and headers: an overhead of roughly 220% over the raw message content.
Encoded using the stateful scheme above, the first request appears on the wire as
V:HTTP/1.1
B:/receiver
M:POST
H:Host: opra-receiver.example.com
H:Content-Type: application/x-opra-quote
H:Accept-Encoding: identity
+: 66
blablablablablablablablablablablablablablablablablablablablablabla
and subsequent requests as
+: 66
blablablablablablablablablablablablablablablablablablablablablabla
for an amortized per-request size of 73 bytes: a much less problematic overhead of 11%. In summary:
| Encoding | Bytes per message body | Per-message overhead (bytes) | Size increase over raw content | Bandwidth at 1M msgs/sec |
|---|---|---|---|---|
| Plain HTTP | 66 | 147 | 220% | 203.1 MiB/s |
| Encoded HTTP | 66 | 7 | 11% | 69.6 MiB/s |
Using plain HTTP, the feed doesn’t fit on a gigabit ethernet. Using our encoding scheme, it does.
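The figures in the table can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: it assumes the bandwidth column divides by 2**20 (which is what reproduces 203.1 and 69.6), and it ignores Ethernet and TCP framing when estimating what a gigabit link can carry. The exact overhead ratio for plain HTTP is 147/66, about 223%, which the text rounds to 220%.

```python
MIB = 2 ** 20       # assumption: the table's megabytes are 2**20 bytes
RATE = 1_000_000    # messages per second, the round OPRA figure used above
BODY = 66           # average quote size in bytes


def summarise(per_message_overhead):
    """Per-message size, relative overhead and bandwidth for one encoding."""
    size = BODY + per_message_overhead
    return {
        "bytes_per_message": size,
        "overhead_pct": 100 * per_message_overhead / BODY,
        "mib_per_sec": size * RATE / MIB,
    }


plain = summarise(147)    # request line + headers + blank line
encoded = summarise(7)    # just the "+: 66" line and its CRLF

# Raw capacity of gigabit ethernet in the same units (framing ignored).
GIGABIT = 1_000_000_000 / 8 / MIB

for name, row in (("plain", plain), ("encoded", encoded)):
    fits = "fits" if row["mib_per_sec"] < GIGABIT else "does not fit"
    print(f"{name}: {row['mib_per_sec']:.1f} MiB/s, {fits} on gigabit ethernet")
```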
Besides the savings in terms of bandwidth, the encoding scheme could also help with saving CPU. After processing the headers once, the results of the processing could be cached, avoiding unnecessary repetition of potentially expensive calculations such as routing, authentication, and authorisation.
Above, I mentioned that some headers changed, while others stayed the
same from request to request. The < and > opcodes are intended to
deal with just this situation.
The > opcode stores the current state in a named register, and the
< opcode loads the current state from a register. Headers that don’t
change between requests are placed into a register, and each request
loads from that register before setting its request-specific headers.
To illustrate, imagine the following two requests:
GET / HTTP/1.1
Host: www.example.com
Cookie: key=value
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

GET /style.css HTTP/1.1
Host: www.example.com
Cookie: key=value
Referer: http://www.example.com/
Accept: text/css,*/*;q=0.1
One possible encoding is:
V:HTTP/1.1
B:/
M:GET
H:Host: www.example.com
H:Cookie: key=value
>:config1
H:Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
-:
<:config1
H:Referer: http://www.example.com/
H:Accept: text/css,*/*;q=0.1
-:style.css
By using <:config1, the second request reuses the stored settings
for the method, base URL, HTTP version, and Host and Cookie
headers.
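As a sanity check on the register semantics, here is a small Python sketch of a decoder that expands an opcode stream back into ordinary requests. It is not a reference implementation: the function name is made up, and it assumes that H replaces any existing header of the same name, that > snapshots the whole current state, and that state persists after a request is issued.

```python
import copy


def decode(stream: bytes):
    """Expand an opcode stream into a list of request dicts."""
    state = {"version": None, "base": "", "method": None, "headers": {}}
    registers = {}
    requests = []
    pos = 0
    while pos < len(stream):
        end = stream.index(b"\r\n", pos)
        line = stream[pos:end].decode("ascii")
        pos = end + 2
        opcode, _, arg = line.partition(":")
        if opcode == "V":
            state["version"] = arg
        elif opcode == "B":
            state["base"] = arg
        elif opcode == "M":
            state["method"] = arg
        elif opcode == "H":
            name, _, value = arg.partition(":")
            state["headers"][name] = value.strip()
        elif opcode == ">":
            registers[arg] = copy.deepcopy(state)   # save a snapshot
        elif opcode == "<":
            state = copy.deepcopy(registers[arg])   # restore the snapshot
        elif opcode in ("-", "+"):
            suffix, body = arg, b""
            if opcode == "+":
                # "+:<suffix> <length>": read <length> body bytes after the CRLF.
                suffix, _, length = arg.rpartition(" ")
                body = stream[pos:pos + int(length)]
                pos += int(length)
            requests.append({
                "method": state["method"],
                "url": state["base"] + suffix,
                "version": state["version"],
                "headers": dict(state["headers"]),
                "body": body,
            })
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return requests


example = "\r\n".join([
    "V:HTTP/1.1",
    "B:/",
    "M:GET",
    "H:Host: www.example.com",
    "H:Cookie: key=value",
    ">:config1",
    "H:Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "-:",
    "<:config1",
    "H:Referer: http://www.example.com/",
    "H:Accept: text/css,*/*;q=0.1",
    "-:style.css",
]) + "\r\n"

requests = decode(example.encode("ascii"))
```

Because config1 is saved before the first Accept header is set, the second request starts from just the Host and Cookie headers and never sees the first request's Accept value.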
Most applications of HTTP do fine using ordinary HTTP syntax. I’m not suggesting changing HTTP, or trying to get an encoding scheme like this deployed in any browser or webserver at all. The point of the exercise is to consider how low one might make the bandwidth overheads of a text-based protocol like HTTP for the specific case of a high-speed messaging scenario.
In situations where the semantics of HTTP make sense, but the syntax is just too verbose, schemes like this one can be useful on a point-to-point link. There’s no need for global support for an alternative syntax, since people who are already forming very specific contracts with each other for the exchange of information can choose to use it, or not, on a case-by-case basis.
Instead of specifying a whole new transport protocol for high-speed links, people can reuse the considerable amount of work that’s gone into HTTP, without paying the bandwidth price.
Just as a throwaway comparison, I computed the minimum possible
overhead for sending a 66-byte message using AMQP 0-8 or 0-9. Using a
single-letter queue name, “q”, the overhead is 69 bytes per message,
or 105% of the message body. For our OPRA example at 1M messages per
second, that works out at 128.7 megabytes per second, and we’re back
over the limit of a single gigabit ethernet again. Interestingly,
despite AMQP’s binary nature, its overhead is much higher than a
simple syntactic rearrangement of a text-based protocol in this case.
We considered the overhead of using plain HTTP in a high-speed messaging scenario, and invented a simple alternative syntax for HTTP that drastically reduces the wasted bandwidth.
For the specific example of the OPRA feed, the computed bandwidth requirement of the experimental syntax is only 11% higher than the raw data itself, and nearly 3 times lower than that of ordinary HTTP.
Long, long ago, I complained about various warts and infelicities in E4X, the ECMAScript extensions for generating and pattern-matching XML documents. It turns out that two of my complaints were not well-founded: sequence-splicing is supported, and programmatic construction of tags is possible.
Firstly (and I’m amazed I didn’t realise this at the time, as I was using it elsewhere), it’s not a problem at all to splice in a sequence of items, in the manner of Scheme’s unquote-splicing; here’s a working solution to the problem I set myself:
function buildItems() {
  return <>
    <item>Hello</item>
    <item>World!</item>
  </>;
}
var doc = <mydocument>{buildItems()}</mydocument>;
You can even use real Arrays (which is what I tried and failed to do earlier), by guerilla-patching Array.prototype:
Array.prototype.toXMLList = function () {
  // Collect the array's items under a throwaway container element...
  var x = <container/>;
  for (var i = 0; i < this.length; i++) {
    x.appendChild(this[i]);
  }
  // ...and hand back just the children, as an XMLList.
  return x.children();
};
function buildItems() {
  return [<item>Hello</item>,
          <item>World!</item>].toXMLList();
}
var doc = <mydocument>{buildItems()}</mydocument>;
Programmatic construction of tags is done by use of the syntax for plain old unquote, in an unusual position: inside the tag’s angle-brackets:
var tagName = "p";
var doc = <{tagName}>test</{tagName}>;
So in summary, my original expectation that E4X should turn out to be very quasiquote-like wasn’t so far off the mark. It’s enough to get the basics done (ignoring for the minute the problems with namespace prefixes), but it’s still a bit of a bolt-on afterthought; it would have been nice to see it better integrated with the rest of the language.
Sam Ruby examines support for astral-plane characters in various JSON implementations. His post prompted me to check my Erlang implementation of rfc4627. I found that for astral plane characters in utf-8, utf-16, or utf-32, everything worked properly, but the RFC4627-mandated surrogate-pair “\uXXXX” encodings broke. A few minutes hacking later, and:
Eshell V5.5.5 (abort with ^G)
1> {ok, Utf8Encoded, []} =
rfc4627:decode("\"\\u007a\\u6c34\\ud834\\udd1e\"").
{ok,<<122,230,176,180,240,157,132,158>>,[]}
2> xmerl_ucs:from_utf8(Utf8Encoded).
[122,27700,119070]
3> rfc4627:encode(Utf8Encoded).
[34,122,230,176,180,240,157,132,158,34]
4>
Much better.
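For readers who want to check the numbers in that session without an Erlang shell, here is a short Python sketch of the surrogate-pair arithmetic involved. It illustrates the calculation only and is not the code in the rfc4627 module; the helper name is made up.

```python
import json


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# \ud834\udd1e is the pair from the session above.
code_point = combine_surrogates(0xD834, 0xDD1E)
code_points = [0x7A, 0x6C34, code_point]
utf8 = "".join(map(chr, code_points)).encode("utf-8")

# Cross-check against Python's own JSON decoder, which also joins the pair.
decoded = json.loads(r'"\u007a\u6c34\ud834\udd1e"')
```

Here `code_points` matches the list printed by xmerl_ucs:from_utf8, and `list(utf8)` matches the bytes in the decoded binary.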
You can get the updated code using mercurial:
hg clone http://hg.opensource.lshift.net/erlang-rfc4627/
I am guest blogging here on behalf of CohesiveFT. We work with the excellent LShift team on our joint venture, RabbitMQ.
I’m here to invite you to a Birds of a Feather session this coming Thursday, August 30th, in central London. It is FREE and will last for 45 minutes starting at 8pm, followed by the traditional breakout discussions over a beer. Please do take a look at RabbitMQ if you have not yet done so. It’s a commercial open source product, available under the MPL 1.1 and implementing the Advanced Message Queuing Protocol. AMQP is a new way to do business messaging (i.e. “what goes in, must come out”). What’s really cool is that, like HTTP, it is a protocol rather than a language-specific API. This should make interoperability between platforms much easier and less painful (business readers: “systems integration projects take less time and success can be predicted more accurately”). For more information, please see my list of links here.
What is the BOF about - and why come? It’s an informal session about RabbitMQ and AMQP, and how they apply within popular environments such as Spring, Mule, Ruby, AJAX, and other messaging protocols such as FIX.
“Informal” means we’ll be encouraging a conversation between people interested in any of these things. We want to hear from you, and from each other, rather than pushing slideware at people.
Come if you want to:
You can find out details of the BOF here. Ideally we ask you to register via the web site, but late arrivals are very welcome - if you turn up, we shall get you in. The BOF is offered as part of the popular EJUG series of tech talks and as a tie-in with the most excellent No Fluff Just Stuff conference.
If you cannot come but want to know more about any of these things then you can email us at info@rabbitmq.com.
Thank you very much - and we hope to see you on Thursday :-)
Posted by Chris on behalf of Alexis Richardson, CohesiveFT.
RFC 1982 defines a “Serial Number Arithmetic”, for use when you have a fixed number of bits available for some monotonically increasing sequence identifier, such as the DNS SOA record serial number, or message IDs in some messaging protocol. It defines all its operations with respect to some power of two, (2^SERIAL_BITS). It struck me just now that there’s no reason why you couldn’t generalise to any number that simply has two as a factor. You’d simply replace any mention of (2^SERIAL_BITS) by, say, N, and any mention of (2^(SERIAL_BITS-1)) by (N/2). The definitions for addition and comparison still seem to hold just as well.
One of the reasons I was thinking along these lines is that in Erlang, it’s occasionally useful to model a queue in an ETS table or in a process dictionary. If one didn’t mind setting an upper bound on the length of one’s modelled queue, then by judicious use of RFC 1982-style sequence number wrapping, one might ensure that the space devoted to the sequence numbering required of the model remained bounded. Using a generalised variant of RFC 1982 arithmetic, one becomes free to choose any even number as the wrapping point for the sequence numbers, rather than only a power of two.
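A sketch of the generalised arithmetic, in Python rather than Erlang for brevity. It follows the RFC 1982 definitions of addition and comparison with N substituted for 2^SERIAL_BITS and N/2 for 2^(SERIAL_BITS-1); the function names are invented, and, as in the RFC, the comparison of two serials exactly N/2 apart is left undefined (here both orderings come out false).

```python
def _half(modulus: int) -> int:
    """Return N/2, insisting that N is a positive even number."""
    if modulus <= 0 or modulus % 2 != 0:
        raise ValueError("modulus must be a positive even number")
    return modulus // 2


def serial_add(s: int, n: int, modulus: int) -> int:
    """s + n in serial arithmetic; n is limited to 0 .. N/2 - 1 as in RFC 1982."""
    half = _half(modulus)
    if not 0 <= n <= half - 1:
        raise ValueError("increment out of range")
    return (s + n) % modulus


def serial_lt(a: int, b: int, modulus: int) -> bool:
    """True if serial a comes before serial b, allowing for wrap-around."""
    half = _half(modulus)
    return (a < b and b - a < half) or (a > b and a - b > half)


# With N = 10 the sequence wraps after 9, so 1 comes "after" 8.
wrapped = serial_add(8, 3, 10)
```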
For a recent project, we developed support for sending automatically-generated HTML emails. Now, most people do this by including a message body with MIME-type text/html. For extra points, sometimes there’s also a text/plain part alongside the HTML in a multipart/alternative container.
The problem with doing things this way is that you can’t include any images or other resources (such as CSS) as separate parts of the email linked to from the main HTML body-part. For that, you need to use the multipart/related MIME-type. Unfortunately, few commonly-used email clients render multipart/related HTML-plus-resource aggregations well.
We only tried the arrangement where the multipart/related, containing the main HTML page and its associated resources, was a sibling of the text/plain part within the multipart/alternative container. The inverse arrangement, with the multipart/alternative as the main document within the multipart/related part, is something we have yet to experiment with.
Here’s a picture of the structure of our initial attempts:
multipart/alternative
 |
 +-- text/plain
 +-- multipart/related
      |
      +-- text/html
      +-- image/gif
      +-- text/css
This worked reasonably well in Thunderbird and Outlook 2002, but we had consistent reports from our customer that the images and stylesheet would randomly fail to display in Outlook 2003 (SP2). After lots of mucking around trying to get Outlook to either work reliably or fail reliably, we gave up on that line and instead simplified the structure of our emails, putting the CSS styling inline in the HTML HEAD element:
multipart/alternative
 |
 +-- text/plain
 +-- multipart/related
      |
      +-- text/html (with text/css inline in HEAD)
      +-- image/gif
This didn’t work particularly well, either: it seems many email clients ignore styles set in the HEAD element. Finally, we moved to applying CSS styling inline, using a style attribute on each styled element. We were able to use an XSLT transformation to allow us to write clean HTML and apply the CSS style attribute automatically. The final structure of the emails we sent:
multipart/alternative
 |
 +-- text/plain
 +-- multipart/related
      |
      +-- text/html (with text/css copied on to each element!)
      +-- image/gif
This seems to work more-or-less reliably across
If I were to do it all again, I’d give serious consideration to the traditional non-multipart text/html solution with images hosted by some public-facing web server. We managed to get our multipart-HTML-emails working acceptably, but only by the skin of our teeth.
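For reference, the final multipart structure described above can be assembled with Python's standard email package along these lines. This is a sketch rather than the code from the project: the function name, the "logo" Content-ID and the sample HTML are all assumptions, and the inline style attribute stands in for whatever the XSLT step would have produced.

```python
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(plain_text, html, gif_bytes, image_cid="logo"):
    """Build multipart/alternative(text/plain, multipart/related(html, gif))."""
    # The HTML body and the image it refers to travel together.
    related = MIMEMultipart("related")
    related.attach(MIMEText(html, "html", "utf-8"))
    image = MIMEImage(gif_bytes, _subtype="gif")
    # The HTML refers to this part as src="cid:<image_cid>".
    image.add_header("Content-ID", f"<{image_cid}>")
    related.attach(image)

    # Plain text first, richer alternative last.
    message = MIMEMultipart("alternative")
    message.attach(MIMEText(plain_text, "plain", "utf-8"))
    message.attach(related)
    return message


message = build_message(
    "Hello (plain text version)",
    '<p style="color: #333">Hello <img src="cid:logo"></p>',
    b"GIF89a",  # placeholder bytes standing in for a real image
)
structure = [part.get_content_type() for part in message.walk()]
```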
E4X is a new ECMA standard (ECMA-357) specifying an extension to ECMAScript for streamlining work with XML documents.
It adds objects representing XML to ECMAScript, and extends the syntax to allow literal XML fragments to appear in code. It also supports a very XPath-like notation for use in extracting data from XML objects. So far, so good - all these things are somewhat useful. However, there are serious problems with the design of the extension.
If E4X objects were real objects, if there were a means of splicing a sequence of child nodes into XML literal syntax, and if E4X supported XML namespace prefixes properly, most of my objections would be dealt with. As it stands, the overall verdict is “clunky at best”.
These are my main complaints:
It doesn’t do anything like Scheme’s unquote-splicing, and so using E4X to produce XML objects is verbose, error-prone and dangerous in concurrent settings.
There seems to be no way of splicing in a sequence of items - I’d like to do something like the following:
function buildItems() {
  return [<item>Hello</item>,
          <item>World!</item>];
}
var doc = <mydocument>{buildItems()}</mydocument>;
and have doc contain
<mydocument> <item>Hello</item> <item>World!</item> </mydocument>
What actually results is the more-or-less useless
<mydocument>Hello,World!</mydocument>
The closest I can get to the result I’m after is
function buildItems(n) {
  n.mydocument += <item>Hello</item>;
  n.mydocument += <item>World!</item>;
}
var doc = <mydocument></mydocument>;
buildItems(doc);
It’s full of redundant redundancy - it’s as verbose as XML, when you can do so much better.
There’s no toXML() method (or similar) for use in
papering over the yawning chasm between the XML objects and the rest
of the language: you can’t even make a Javascript object able to
seamlessly render itself to XML.
The new types E4X introduces aren’t even proper objects - they’re a whole new class of primitive datum!
Because they’re not proper objects, you can’t extend the system. You ought to be able to implement to an interface and benefit from the language’s XPath searching and filtering operations. E4X is so close to offering a comprehension facility for Javascript, but it’s been short-sightedly restricted to a single class of non-extensible primitives.
You can’t even construct XML tags programmatically! If the name of
the tag doesn’t appear literally in your Javascript code, you’re out
of luck (unless you resort to eval…) [[Update: I was wrong about this - you can write <{expr}> and have the result of evaluating expr substituted into the tag.]]
E4X XML objects have no notion of namespace prefixes (which are required for quality implementations of XPath and anything to do with XML signatures). Prefixes only turn up in the API as a means of producing (namespaceURI,localname) pairs. This is actually how it should be, but because there’s already broken software out there that depends on prefix support, by not supporting prefixes properly you preclude ECMAScript+E4X from being used for XML signatures or ECMAScript-native XPath implementations.
In my opinion, E4X violates several programming language design principles: most importantly, those of regularity, simplicity and orthogonality, but also preservation of information, automation and structure. SXML, perhaps in combination with eager comprehensions, provides a far superior model for producing and consuming XML. Sadly, there’s no real alternative for ECMAScript yet - we’re limited either to library extensions, or to using the DOM without any syntactic or library support at all.