technology from back to front

Reporting parser errors

We like parsers. One of the things that really kills the vibe with parsers is a rubbish error message.

Given the technical interestingness of parsing with derivatives, can we get useful error messages out of them?

What’s an error look like? Recall that a literal parser like Literal token: $1 will return EpsStar treeSet: {$1} if the next object (*) in the stream is a $1 (the character “1″) and an Empty otherwise. But for the moment let’s stick to parsing streams of characters – source code, say.

What’s a useful error look like? It should tell you two things: the failed expectation – “keyword expected but ‘NOTAKEYWORD’ found” – and where the error occured – “on line 22 at column 44″. We might want more than that, but this is a good start.

It’d be onerous to wind position tracking throughout our parsers, so let’s push all that stuff where it belongs, by decorating the input stream. It’s easy enough to track column/line numbers: if you read in a linefeed/newline character (strong assumption here: Windows machines and many Internet protocols will use carriage return/linefeed pairs, old Mac machines used carriage return, there’s a Unicode line separator character, …) you increment the line number and reset the column number, otherwise you increment the column number. We’ll just skip the accounting and note that we instantiate such a position tracking stream thusly: '123' reading positionTracking. (We’ll leave the problem of tracking one’s position when parsing an arbitrary object graph for another day.)

More interesting is the error itself. First, we realise that deriving an Empty shows that we have an error. It makes sense therefore that we decorate the Empty. In our case, let’s create these parsers thusly: DerivingParser emptyExpected: $1 actual: $2. In the context of the implementation we’ve previously discussed, we need to update DerivingParser >> #deriverBlock:

DerivingParser >> deriverBlock
    | d |
    d := nil.
    d := [:o :p |
        p class caseOf: {
            [Empty] -> [p].
            "Given that what we parse likely doesn't have special markers for
            'nothing' or 'the end of the stream', let's use unique markers
            for same."

            [EmptyString] -> [DerivingParser emptyExpected: #Nothing actual: o].
            [EpsStar] -> [DerivingParser emptyExpected: #EndOfStream actual: o].
            [Literal] -> [(o = p token)
                            ifTrue: [EpsStar treeSet: {o}]
                            ifFalse: [DerivingParser
                                emptyExpected: p token
                                actual: o]].
            [LiteralSet] -> [(p tokens includes: o)
                ifTrue: [EpsStar treeSet: {o}]
                "Note that we can report that the object wasn't one of a set of legal
                values."

                ifFalse: [DerivingParser emptyExpected: p tokens actual: o]].
            "... and so on"
        }
        otherwise: [Error signal: 'derivative not defined for ', p className]] memoized.
    ^ d

Now we can get basic error messages:

| d |
"Remember that this guy lets us say 'the derivative of secondArg with
respect to firstArg. Think of the D_x f(x) mathematical notation."

d := DerivingParser deriverBlock.
(d value: 1 value: (EpsStar treeSet: {2})) error.
"=> 'Expected a <#EndOfStream>, found a <1>'"
(d value: 1 value: DerivingParser empty) error.
"=> 'Unexplained error'" "(What to do when 'this should never happen' happens.)"
(d value: 1 value: (DerivingParser emptyExpected: 1 actual: 2)) error.
"=> 'Expected a <1>, found a <2>'"
(d value: 3 value: (1 asParser or: 2 asParser)) error.
"=> Error: Error!"

Wait, what? Ah, that last expression shows that we can, during derivation, end up with a union parser representing an error. Here we tried to derive a parser accepting a sequence of 2s with respect to 1, so we end up with a union of two empty parsers. However, compaction of a such a union is itself an Empty:

| u |
u := d value: 3 value: (1 asParser or: 2 asParser).
{u left error. u right error}
"=> #('Expected a <1>, found a <3>' 'Expected a <2>, found a <3>')"

OK, so it looks like when we compact parsers we will sometimes need to “merge” errors together:

ParserCompacter >> compactUnion: aUnion
    | c1 c2 |
    c1 := self cached: aUnion left.
    c2 := self cached: aUnion right.
    "Note #isEmptyParser - we'll talk about this later."
    (c1 isEmptyParser and: [c2 isEmptyParser]) ifTrue: [
        "Since both subparsers are empty, we must merge the expected values
         together."

        ^ DerivingParser
            emptyExpected: (c1 expected asArray , c2 expected asArray)
            actual: c2 actual].
    c1 isEmpty ifTrue: [^ c2].
    c2 isEmpty ifTrue: [^ c1].

    ^ ((c1 == aUnion left) and: [c2 == aUnion right])
        ifTrue: [aUnion]
        ifFalse: [c1 or: c2]

If we repeat our experiment, we now see:

| cdu d du u |
d := DerivingParser deriverBlock.
u := 1 asParser or: 2 asParser.
du := d value: 3 value: u.
cdu := ParserCompacter value: du.
cdu error
"=>  'Expected one of <1, 2>, found a <3>'"

One additional wrinkle is that sometimes parsers may be empty because they contain empty subparsers. A reduction parser wrapping an Empty – (Empty expected: 1 actual: 2) reduce: #printString – is empty, but of course a reduction parser isn’t actually the empty parser. However, by the time #compactUnion: is called, the subparsers are already compact – and a Red wrapping an Empty obviously compacts to just the Empty. That’s the reason for #isEmptyParser in our definition of #compactionUnion: rather than #isEmpty:

DerivingParser >> isEmptyParser
    ^ false.

Empty >> isEmptyParser
    ^ true.

So now we have position tracking, and decent error messages. We can put the two together thusly:

| d parse num |
DerivingParser flushCaches.
d := DerivingParser deriverBlock.
"Num takes a collection of (digit) characters, and turns it into a number. So
#($1 $2 $3) => 123."

num := (LiteralSet tokens: ($0 to: $9)) star reduce: [:digits |
    digits inject: 0 into: [:acc :each | (acc * 10) + (each asInteger - $0 asInteger)]].
parse := nil.
parse := [:stream :parser |
    [ | derivative |
    derivative := ParserCompacter value: (d value: stream get value: parser).
    "Throw an error if we end parsing unexpectedly soon."
    derivative isEmptyParser
        ifTrue: [Error signal: 'Error at line ', stream lineNumber asString,
            ' column ', stream columnNumber asString, '. ', derivative error].
    parse
        value: stream
        value: derivative] on: Incomplete do: [parser parseNull]].
[parse value: '12a3' reading positionTracking value: num]
    on: Error do: [:e | e messageText]
"=> 'Error at line 1 column 3. Expected one of <$0, $1, $2, $3, $4, $5, $6, $7, $8, $9>, found a <$a>'"

And of course we want a clean interface that hides all these boring details.

(*) There’s nothing that says we have to limit ourselves to streams of characters, and we’ve seen OMeta’s ability to pattern match over arbitrary objects

by
Frank Shearar
on
20/02/13
 
 


6 + six =

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us