technology from back to front

Un Petit Haskell

It’s time to visit another Smalltalk PEG parser. We’ve seen OMeta2, and now it’s time for a rather different approach to parsing. Lukas Renggli wrote PetitParser, a parser library based on parser combinators.

PetitParser provides a few basic primitive operations. Each operation provides a new parser that extends the previous one. For instance, $a asParser returns a parser that can accept the single token ‘a’. $b asParser behaves similarly, and $a asParser / $b asParser returns a parser than may accept either an 'a' or a 'b'. The sequencing operator allows us to define a grammar accepting only the string ‘ab’ – $a asParser , $b asParser – or any sequence of tokens ‘a’ and ‘b’ – ($a asParser | $b asParser) star.

Something to note is that / is not the same operator as BNF’s |/ is the prioritised choice operator, and means “Try the first production. If that fails, try the second production.” The important difference between it and | is that in the following grammar

| parser prefix other |
prefix := $a asParser.
other := 'aa' asParser.
parser := prefix / other.

if you try parse ‘aa’ you will satisfy prefix‘s rule, and return ‘a’. A series of /-separated parsers will return the first successful match, so if you want “maximum munching” you must ensure that all prefix-containing productions occur before the prefix production. (For instance, sorting a bunch of acceptable tokens by reverse lexicographic order would work.) This first-successful-match rule may occur non-locally, as we will later see.

When a production of the form a / b / c fails, PetitParser will report the parse fail for c. That can make debugging mildly tricky, in the case of a large production: all of the options failed, when one (but which?) should have parsed.

I like throwing parser libraries against real grammars, rather than writing yet another calculator. Given that I’ve been translating a fair amount of Haskell lately, using the Haskell 2010 lexical structure seemed useful.

One of the non-BNF-like things that the Haskell 2010 grammar has is a “yes but” pattern: foo<bar> means “accept a foo, unless the bar pattern matches”. One way to translate that kind of production into PetitParser is like this:

"comment -> dashes [ any <symbol> {any} ] newline"
comment := dashes , (symbol not , any , any star) optional , newline.

Or, “a comment is a bunch of dashes, not followed by a symbol, followed by anything, ending in a newline.” One has to be careful: sometimes a rule forbids something unless it’s followed by something else. For instance:

PetitHaskell>>varid
    "varid -> (small {small | large | digit | ' }) <reservedid>"
    | normalVar varWithReservedPrefix |
    varWithReservedPrefix := reservedid , (small / large / digit / ($' asParser) plus).
    normalVar := reservedid not , (small , (small / large / digit / ($'
asParser)) star).
       
    ^ (varWithReservedPrefix / normalVar) flatten.

Note the basic pattern: “this thing either starts with something forbidden – but has more stuff trailing it in the lexeme – or it doesn’t have something forbidden in it at all.”

There’s something a bit odd going on here, by the way – what are all those instance variables doing? PetitHaskell is a PPCompositeParser which does some tricky things with instance variables. You type your method in exactly like the above, and Squeak will happily ask you if you want to make the undeclared variables instance variables. Then you specify a start production. When your PPCompositeParser initialises, it assigns to each of the instance variables a PPDelegateParser and sets the delegated parser as being the parser defined by the method of the same name. So in the above, “reservedid” is a PPDelegateParser that delegates to the parser defined by PetitHaskell>>reservedid. A mildly complicated implementation that allows for a very readable grammar. Also, this implicitly using delegate parsers allows for mutually recursive rules.

One thing I found really useful in debugging is a println-like trick. Every time this parser executes, it prints some debug information on the Transcript (Smalltalk’s equivalent of stdout):

PPParser>>debug: aString
    ^ self ==> [:x | Transcript showln: ('{1} ({2})' format: {aString. x.}). x].

Earlier I mentioned a non-local effect one can bump into with prioritised choice. Consider a subset of Haskell 2010′s lexing grammar:

program  ->  { lexeme | whitespace }
lexeme   ->  special | other stuff
special    ->  ( | ) | , | ; | [ | ] | ` | { | }
whitespace ->  whitestuff {whitestuff}
whitestuff   ->  whitechar | comment | ncomment
opencom   ->  {-
closecom  ->  -}
ncomment  ->  opencom ANY seq {ncomment ANY seq} closecom

Note that a special includes the ‘{‘ character, and this is also the start character of an ncomment. If we translate the program rule as program := lexeme / whitespace, when we parse ‘{- My First Comment -}’ we will end up processing the ‘{‘ as a special! To avoid debugging pain, write your parser test-first in a bottom-up style. And don’t forget to try process entire valid Haskell files – there’s nothing like the real world to give you data for your tests!

You can lex your own file:

"You ought to be able to parse directly off a FileStream -
FileStream fileNamed: 'foo.hs' do: [:f | PetitHaskell new parse: f]
- but for some reason PPStream doesn't play nicely with Squeak's FileStream."

| s |
s := ''.
FileStream fileNamed: '/home/frank/Downloads/ZipperTraversable.hs' do:
    [:f |
    [f atEnd] whileFalse: [s := s , (f next: 1000)]].
PetitHaskell new parse: s.

And finally, as always, the source:

Installer lukas
    project: 'petit';
    install: 'PetitParser-lr.217'.

Installer ss
    project: 'PetitHaskell';
    install: 'PetitHaskell-fbs.2'.

(Addendum: I can’t get WordPress to stop turning decent characters like single and double quotes into hideous smart quotes. Watch out when copying & pasting into your image.)

by
Frank Shearar
on
19/07/11
  1. Thanks for the interesting post.

    You gave me a new idea that might solve a current problem. I have a complex grammar and for that I like to have productions that are as close as possible to the standard. Meaning I have every rule that is in the specification as production. Otherwise I would fear to loose control easily. The problem is that there 300+ rules and squeak/pharo only allows 254 instance variables.

    In PetitHaskell>>varid you use temporary variables to produce two rules and the combined production. That is maybe the solution to my problem having all productions without wasting instance variables were it isn’t necessary. I would wish I would be capable of finding those things myself.

    For the debugging part. Do you know PPParser>>transform: ? You can wrap even the normal parsing with your transcript code.

    (YourParser new
    transform: [:parser| parser ==> [:nodes| Transcript show: parser name; ... ]])
    parse: someThing

  2. Hi Norbert,

    I didn’t know about #transform: so thanks for pointing it out.

    My #debug: is quite a hack, compared to using #transform:. Your snippet adds debugging to every rule without changing the original grammar. If you wanted more precision in your debugging you could simply only transform those parsers matching some pattern.

  3. About your smart quotes issue, your code chunks are wrapped in a <pre> element only, I suspect there should be a <code> element as well, inside it.

    (please drop my previous comment, and curse WP for not properly escaping HTML in comments…)

  4. Frank,

    yes, you can do a lot with the transform. Each parser has a name attribute which is set to the name of the rule/production. So it should be possible to act upon this “type” information. If you combine this with the >=> operator you can even debug on the position of the stream.

 
 


two × = 14

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us