
Parsing with Derivatives

We use many different languages when writing software. Not just the usual kinds – Ruby, Haskell – but data formats like HTML or JSON, and protocols like SIP or HTTP. We have a lot of tools for dealing with these languages – yacc, bison, ANTLR. And Matt Might and his colleagues have added a new spanner to the toolbox.

So in a break from tradition, let’s ask the important question first: why bother? I mean, if you want a parser, you can already reach for any number of techniques – parser generators, parser combinators, whatever. Here are a few reasons why:

  • Non-blocking parsers are network server friendly.
  • Efficient – it’s cubic in the worst case, but usually linear (in the size of the grammar and the size of the input)
  • Composable grammars
  • Handles ambiguous grammars
  • Available in several languages: Scheme, Haskell, Common Lisp (https://github.com/sellout/yid), Java, Smalltalk.
  • Easy to implement (at least once you’ve understood fixed points, memoization, laziness, derivatives of recursive structures)

So what’s all the fuss about? How does it work? Imagine we’re writing an HTTP message parser. We read characters (bytes, really) off a socket. If we’ve just read a ‘G’, we take the derivative of the HTTP parser with respect to ‘G’ to yield a parser that (a) knows it’s read a ‘G’ and (b) can parse the remainder of any HTTP message that starts with a ‘G’. Then we read an ‘E’, and take the derivative of our new parser with respect to ‘E’ to yield a new parser that (a) knows it’s read ‘GE’ and (b) can parse the remainder of any HTTP message that starts with ‘GE’. We repeat this process until we’ve exhausted the input. At this point we have a parser that knows the entire input we’ve read, and can parse the remainder of the message. Since we assume that there is no remainder, we say “OK, what are your parse trees?” and the parser obligingly hands us a parse forest of every possible parse of the input. Hopefully that’s an entire HTTP message.
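
In the Smalltalk used later in this post, that stepping looks roughly like this (a sketch only: httpMessageParser stands in for however you’d actually build the HTTP grammar, which isn’t shown here):

| d httpMessageParser parser |
d := DerivingParser deriverBlock.
"httpMessageParser := ... build the HTTP message grammar here (not shown)."
parser := httpMessageParser.
'GET' do: [:char |
    "Each character read yields a new parser for the rest of the message."
    parser := d value: char value: parser].
parser parseNull. "The parse forest for everything read so far."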

Ah, but note that at every step we read in a byte to yield a new parser. That means that we can pause any time we want. Perhaps we’re dealing with a pathologically slow sender of data. With a normal parser that would be a blocking operation. A blocking operation in your server is a denial of service. But when you parse with derivatives, you can stop parsing any time you like. You have a handle on the partial parse trees thus far, and can resume parsing again at your leisure.
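
Concretely, the read loop might look like this sketch (the socket protocol here is only illustrative; d is the deriver block and parser the current parser):

[socket atEnd] whileFalse: [
    parser := d value: socket next value: parser.
    "We may stop here whenever we like: parser parseNull holds the partial
     parse trees, and we resume simply by taking further derivatives."]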

To see this in action, let’s look at a silly example: the language of 1s consists of the empty string, or any number of 1s. In this example we create such a parser, and read in two 1s (as represented by the two derivatives we take):

| d ones |
d := DerivingParser deriverBlock.
"The language of 1s: the empty string, or a 1 followed by more 1s."
ones := DerivingParser emptyString or: (1 asParser then: [ones] asParser).
ones := d value: 1 value: ones. "Read in a 1."
ones := d value: 1 value: ones. "Read in another 1."

This results in the following cascade of parsers:

[Diagram: at each step, the original parser on the left and its compacted form on the right.]

Empty parsers accept no input, eps parsers accept the empty string, eps* parsers accept the empty string but also store our partial parse trees, and Red (for Reduction) parsers contain functions that act on parse trees (so we’d use them to handle our semantic actions). The other parsers should have obvious meanings.

We can see a few things already: the eps* parsers store the input we’ve read, there’s some structural sharing going on, and what’s with that exploding graph size?! I thought you said it was linear!

One of the tricks that parsing with derivatives requires is compaction: as parsing progresses, we end up producing an awful lot of useless stuff, like that Empty hanging off the top Union. We don’t want cruft in our parsers, so we ruthlessly compact the parser, deeply, recursively, at every step, resulting in the right-hand progression above. Compacting a Cat parser whose first parser can accept precisely one thing yields a Red parser. The reduction performed on the remaining parser is this: to every parse tree returned by the second parser, we prepend the parse tree of the first parser. That’s why we see a Red with an action of [:w2 | singleToken asArray , w2], which later turns into a Red with action [:x | self value: (aUnaryBlock value: x)].
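
That particular rule might look something like this sketch (every class and selector name here is a stand-in, not the actual implementation):

CatParser >> compact
    "Cat -> Red: if the first parser can only ever yield one parse tree,
     collapse to a Red over the (compacted) second parser."
    (first isEpsStar and: [first parseNull size = 1]) ifTrue: [
        | singleTree |
        singleTree := first parseNull anyOne.
        "Prepend the first parser's single tree to every tree the second produces."
        ^ RedParser on: second compact
            reduction: [:w2 | singleTree asArray , w2]].
    ^ self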

To see our parse trees, we start at the leaf nodes: the parse trees of a Literal form the empty set. The parse trees of an eps form a set containing just the empty string. An eps* yields the parse trees it stored. A Union’s parse trees are the set union of its subparsers’, and a Cat’s parse trees are the Cartesian product of its subparsers’. As a result, we can see the parse trees that would result if, at any step, we decided we’d parsed enough:

| d d1 d2 ones |
d := DerivingParser deriverBlock.
ones := nil.
ones := DerivingParser emptyString or: (1 asParser then: [ones] asParser).
d1 := d value: 1 value: ones. "Read in a 1."
d2 := d value: 1 value: d1. "Read in another 1."
{ones parseNull. d1 parseNull. d2 parseNull} collect: #printString.
" =>  #('a Set(#())' 'a Set(#(1))' 'a Set(#(1 1))')"
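
For reference, those parseNull sets might be computed along these lines (again a sketch; the real class names and selectors may well differ):

UnionParser >> parseNull
    "Set union of the subparsers' parse trees."
    ^ left parseNull union: right parseNull

CatParser >> parseNull
    "Cartesian product: pair every tree of the first with every tree of the second."
    | trees |
    trees := Set new.
    first parseNull do: [:t1 |
        second parseNull do: [:t2 | trees add: t1 , t2]].
    ^ trees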
by Frank Shearar on 31/10/12
 
 

