technology from back to front

Turning language recognisers into language generators

I did the Coursera Natural Language Processing course at the beginning of the year. Apart from the introduction to probability it gave me, the thing that sticks most in my mind comes from one of the exercises. In the exercise we had to define a probabilistic parser to parse (an extremely limited subset of) English. That’s not the fun bit though. The fun bit was using the information in the parser to generate “English” sentences. The idea of a grammar generating sentences is right in the standard way of defining languages, but for some reason it hadn’t occured to me to actually build such a thing.

The idea’s simple enough: complicated parsers (language generators) are made up of simpler parsers (language generators) until you hit trivial parsers (language generators).

Of course we don’t want just one sentence, so it makes sense to build streams (we’ll use my favourite stream implementation, Xtreams) so that we may – if desired – generate as many different sentences as we’d like.

DerivingParser >> generate
    "Many parsers don't parse anything."
    ^ #() reading.

    Cat >> generate
    "Generate the Cartesian product of the two subparsers' output."
    ^ ((DelayedStream on: [first generate])
        collecting: [:head |
            (DelayedStream on: [second generate])
                collecting: [:tail | {head. tail}]]) stitching.

    DelayedParser >> generate
    ^ self force generate.

    DelegateParser >> generate
    ^ parser generate.

    EmptyString >> generate
    "The only valid AST is an empty AST."
    ^ #(()) reading

    EpsStar >> generate
    "An eps* node represents a partial parse: it's an internal construct
         and should never appear in an initial grammar."

    ^ #() reading.

    Literal >> generate
    ^ {token} reading.

    Red >> generate
    ^ parser generate collecting: reducer.

    StarParser >> generate
    "Generate ever-increasing parse trees from the underlying parser p:
     first 'no occurrences of p', then p, then pp, and so on."

    | count |
    count := 1.
    ^ DerivingParser emptyString generate ,
        [ | s |
        s := ((1 to: count)
            collect: [:ignore | parser generate]) reading stitching.
        count := count + 1.
        "Return an array of parse trees from the underlying parser."
        {s rest} reading] reading stitching.

    Union >> generate
    ^ (DelayedStream on: [left generate])
        interleave: (DelayedStream on: [right generate]).

where we’ve extended Xtreams:

XTReadStream subclass: #DelayedStream
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'Parsing-Derivatives-Support'.

DelayedStream >> contentsSpecies
    ^ self force contentsSpecies.

DelayedStream >> force
    source isBlock ifTrue: [source := source value reading].
    ^ source.

DelayedStream >> read: anInteger into: aSequenceableCollection at: startIndex
    ^ self force
        read: anInteger
        into: aSequenceableCollection
        at: startIndex.

A DelayedStream is an otherwise normal XTReadStream except that it reads off a source that has not yet been calculated:

hasRun := false.
delay := DelayedStream on: [hasRun := true. #(1)].
self deny: hasRun.
self assert: 1 equals: (delay read: 1).
self assert: hasRun.

One more alteration to Xtreams:

XTReadStream >> interleave: aXTReadStream
    "Interleave my values with those of aXTReadStream. If either of us are
     exhausted, read from the other. If we are both exhausted, signal so
     with an Incomplete."

    | readFromLeft |
    readFromLeft := true.
    ^ self transforming: [:input :output | | next |
        next := [| s |
            s := readFromLeft ifTrue: [input] ifFalse: [aXTReadStream].
            readFromLeft := readFromLeft not.
        [output put: next value get]
            on: Incomplete do: [output put: next value get]].

Note the strange similarity between generating parse trees for a Union, and fair disjunction in logic programming: we generate parse trees for each subgrammar in turn. Consider now the action of a reducing parser. Recall that it takes a set of parse trees, performs some function on those parse trees, and returns the new resulting set of new parse trees. Those who have used Kanren have seen this operation before: it’s logical conjunction!

And finally, we can generate sample sentences from a given language:

| p |
p := ($a asParser or: $b asParser) star.
p := DerivingParser emptyString or:
    ([p] asParser then: ($a asParser or: $b asParser)).
p generate read: 10
"With some alterations in the output, we get the following:
 => {#() .
     #(#() $a) .
     #(#() $b) .
     #(#(#() $a) $a) .
     #(#(#() $a) $b) .
     #(#(#() $b) $a) .
     #(#(#() $b) $b) .
     #(#(#(#() $a) $a) $a) .
     #(#(#(#() $a) $a) $b) .
     #(#(#(#() $a) $b) $a)}"
Frank Shearar

nine × = 9

2000-14 LShift Ltd, 1st Floor, Hoxton Point, 6 Rufus Street, London, N1 6PE, UK+44 (0)20 7729 7060   Contact us