Compacting cyclic parsers
When I wrote my Smalltalk deriving-with-parsers library, I ran into an issue with compaction: cycles in the parser. Self-referencing parsers (corresponding to left- and right-recursive rules) occur naturally, so I couldn’t hide from the problem. I investigated two ways to introduce circularity as well as how to compact these graphs: delegates, and “sutures”.
Let’s first look at delegates:
DerivingParser subclass: #DelegateParser
instanceVariableNames: 'parser'
classVariableNames: ''
poolDictionaries: ''
category: 'Parsing-Derivatives'.
DelegateParser >> parser
parser ifNil: [UnresolvedDelegate accessing: self].
^ parser.
DelegateParser >> parser: aParser
parser := aParser.
DelegateParser >> isResolved
^ parser notNil.
DelegateParser >> isDelegate
^ true.
DelegateParser >> subParsers
^ parser
ifNil: [#()]
ifNotNil: [{parser}].
Compaction maps parsers to other parsers that are hopefully smaller. Most parsers are already compact – it’s only the Union and Cat parsers that might be compacted. The literature contains the rules for compaction. I won’t reproduce them here save one, to serve as an example. Imagine we have constructed this parser partway through a parse, a Red(ucing parser) wrapping a Red wrapping an EpsStar:
((EpsStar treeSet: {1. 2}) reduce: [:x | x * 2]) reduce: [:y | y + 1]
. Obviously a (slightly) simpler parser would compose the two reducing functions. Thus one of the compaction rules for a Red is “the compaction of a Red wrapping a Red wrapping a parser p is a Red wrapping p whose reducer is the composition of the reducers of the two Reds”. In this case that’s
(EpsStar treeSet: {1. 2}) reduce: [:x | (x * 2) + 1]
.
Imagine a simple cyclic grammar:
| ones d |
d := DelegateParser new.
ones := DerivingParser emptyString or: ($1 asParser then: d).
d parser: ones.
We can see that this particular parser is already compact: the leaf parsers are by definition compact, and the Union and Cat parsers are combinations of compact parsers. But consider writing a compacter. We traverse the graph, and find a delegate. The delegate points to a parser “up” the graph, so compaction maps this delegate to another delegate pointing at… oh, wait. We don’t know yet, because we need to complete compaction to know this. Hm. OK. Instead, we’ll come back to it later. Delegates that don’t form a cycle are easy to compact: they’re just delegates pointing to the compaction of the pointed-to parser. So given the basic algorithm
Object subclass: #SutureCompacter
instanceVariableNames: 'cache'
classVariableNames: ''
poolDictionaries: ''
category: 'Parsing-Derivatives-Support'.
SutureCompacter class >> value: aParser
^ self new value: aParser.
SutureCompacter >> initialize
super initialize.
cache := Dictionary new.
SutureCompacter >> value: aParser
| visited |
cache at: aParser ifPresent: [:compacted | ^ compacted].
visited := IdentitySet new.
self compact: aParser.
self resolveDelegates: (cache at: aParser).
^ cache at: aParser.
let’s look at
#compactDelegate:
(called by
#compact:
) and
#resolveDelegates:
a bit closer.
SutureCompacter >> compactDelegate: aDelegateParser
"The compaction of a delegate parser is the compaction of its subparser.
A delegate parser might not be resolvable yet, because if the delegate
closes a cycle, the parser to which it should point will not have been
mapped yet."
| mappedSubparser |
"This delegate completes a cycle. Its subparser hasn't been mapped yet.
Mark its presence, and bail. The post-walk will resolve things."
(cache contains: [:s | s reference = aDelegateParser parser])
ifFalse: [^ DelegateParser new].
mappedSubparser := self cached: aDelegateParser parser.
(aDelegateParser parser == mappedSubparser reference)
ifTrue: [^ aDelegateParser].
(cache includesKey: aDelegateParser) ifTrue: [
^ DelegateParser new parser: mappedSubparser].
^ DelegateParser new.
SutureCompacter >> resolveDelegates: aParser
PreOrderParserWalker
walk: aParser
doing: [:p |
(p isDelegate and: [p isResolved not])
ifTrue: [ | preimage |
preimage := cache keyAtValue: (cache values detect: [:sutr |
sutr reference = p]).
p parser: (self cached: preimage parser)]].
We’ll see that
sutr
in a bit. So
#resolveDelegates:
just walks the compacted parser, filling in the blanks left behind by the cycle-causing delegates. We have one final trick up our sleeves: while we’re looking up parsers in our cache, and we’re looking up a delegate, it does no harm to return the pointed-to parser (and might make the graph one node smaller):
SutureCompacter >> cached: aParser
"If a delegate doesn't form part of a cycle, it does no harm to point to
the delegate's reference instead of the delegate. This also makes the
parser one node smaller, potentially."
| p suture |
suture := cache at: aParser.
p := suture reference.
(p notNil and: [p isDelegate and: [(self formsACycle: p) not]]) ifTrue: [
^ cache at: p parser].
^ suture.
SutureCompacter >> formsACycle: aDelegateParser
"If aDelegateParser is not yet resolved, treat it as though it forms a cycle."
[aDelegateParser parser] on: UnresolvedDelegate do: [^ true].
^ aDelegateParser isChildOf: aDelegateParser parser.
So with this we can now compact parsers with or without cycles. But what about that
sutr
bit? What’s up with that?
sutr
is short for
Suture
. Hopefully the name suggests “ties things together” and “it disappears by itself after surgery”:
DerivingParser subclass: #Suture
instanceVariableNames: 'reference'
classVariableNames: ''
poolDictionaries: ''
category: 'Parsing-Derivatives'.
SutureCompacter class >> pointingAt: anObject
^ self new pointingAt: anObject.
SutureCompacter >> pointingAt: anObject
self reference: anObject.
SutureCompacter >> reference
^ reference.
SutureCompacter >> reference: anObject
reference := anObject.
SutureCompacter >> subParsers
^ reference
ifNil: [#()]
ifNotNil: [{reference}].
So far there’s nothing really to differentiate between a
Suture
and a
DelegateParser
. That’s fine.
Suture
s are just markers. We’ll change the basic algorithm ever so slightly:
SutureCompacter >> value: aParser
| visited |
cache at: aParser ifPresent: [:compacted | ^ compacted].
visited := IdentitySet new.
self compact: aParser.
self resolveDelegates: (cache at: aParser).
self tieSutures: aParser. "<-- This line's new."
^ cache at: aParser.
Before we go any further, I should mention the magic trick we’re about to use. Smalltalk has a feature called
#become:
:
a become: b
means that, across the entire image, everything that points to
b
now points to
a
and vice versa. This is implemented using
#becomeForward:
, a “this thing becomes that thing (in one atomic operation)”. So how do we use this? As we walk the parser in a pre-order traversal, we populate the parser -> compacter parser mapping (cunningly misnamed as
cache
in the code):
SutureCompacter >> compact: aParser
"Pre-order mark the parser as visited; post-order compact the parser (so
that all its subparsers (except in a cycle) have been compacted)."
cache at: aParser ifPresent: [:p | ^ p].
cache at: aParser put: Suture new.
aParser subParsers do: [:p | self compact: p].
(cache at: aParser) reference: (aParser class
caseOf: {
"Optimise case testing by putting the most common cases first"
[Empty] -> [aParser].
[EmptyString] -> [aParser].
[EpsStar] -> [aParser].
[Literal] -> [aParser].
[Cat] -> [self compactCat: aParser].
[Union] -> [self compactUnion: aParser].
[Red] -> [self compactRed: aParser].
[DelayedParser] -> [self compactDelayed: aParser].
[DelegateParser] -> [self compactDelegate: aParser]}
otherwise: [aParser]).
You’ve seen me switch on classes before: functional decomposition is sometimes rather useful.
SutureCompacter >> cached: aParser
"If a delegate doesn't form part of a cycle, it does no harm to point to
the delegate's reference instead of the delegate. This also makes the
parser one node smaller, potentially."
| p suture |
suture := cache at: aParser.
p := suture reference.
(p notNil and: [p isDelegate and: [(self formsACycle: p) not]]) ifTrue: [
^ cache at: p parser].
^ suture.
So we walk the parser pre-order, stashing
Suture
s in our map as we go. We compact the parsers in a post-order fashion (so that composite parsers always (except for cycles) have their subparsers compacted). During compaction we update the compaction mapping (
(cache at: aParser) reference: bigCaseStatement
in
#compact:
’s definition above). And finally, we pull a great big switcheroo, and make the
Suture
s vanish, leaving a pristine, cyclic parser:
SutureCompacter >> tieSutures: aParser
| sutures references |
sutures := cache values.
references := sutures collect: #reference.
"Put the pointed-to objects in the cache rather than the sutures."
sutures do: [:s | cache at: s put: s reference].
"Magically make everything that points to a Suture point to the Suture's
reference, tying the cyclic knots."
sutures elementsForwardIdentityTo: references.
So after all that, what was the point? A neater API! Now we don’t have to add artificialities to describe cyclic parsers;
| ones |
"This line just shuts the Compiler up, which would otherwise
moan about the undefined use of ones in the second line."
ones := nil.
ones := DerivingParser emptyString or: ($1 asParser then: [ones] asParser).
