Unger Parsing

I am busy with holiday stuff so this post is a little bit half-baked. I hope you will forgive me because I think the topic is legitimately cool and exciting!
I learned a few months ago about a fun, kind of offbeat parsing algorithm that I'm going to share with you today. I learned about this from Parsing Techniques: A Practical Guide, which if you're interested in goofy parsing algorithms I highly recommend (although I think for the most part, it's not really "practical" these days to need to parse ambiguous grammars, which this book cares a lot about).
When we have a context-free grammar we want to parse, conceptually what we're trying to do is to find a tree of productions from some root nonterminal symbol that would produce that string. There are two popular ways of doing this: you either start from that root symbol and try to produce new symbols from left-to-right that line up with the input string, or you start smushing together symbols from the input string until you create the root symbol. These are, respectively "top-down" and "bottom-up" parsing. The former having a popular conception in the case of LL(1) grammars as "recursive descent."
Unger's algorithm, surprisingly, does neither of these things!
Let's begin with a simple expression grammar:
const grammar = {
'E': [
['E', '+', 'T'],
['T'],
],
'T': [
['T', '*', 'F'],
['F'],
],
'F': [
['(', 'E', ')'],
['i'],
]
}
And, say we want to parse the string i+(i*i) for the root symbol E.
There are two ways that E can be parsed: E+T or T. Starting with E+T, there must be three strings, a, b, c, where abc=i+(i*i), a can be parsed as an E, b can be parsed as a +, and c can be parsed as a T. We can easily enumerate all such strings:
E + T
[ 'i', '+', '(i*i)' ]
[ 'i', '+(', 'i*i)' ]
[ 'i', '+(i', '*i)' ]
[ 'i', '+(i*', 'i)' ]
[ 'i', '+(i*i', ')' ]
[ 'i+', '(', 'i*i)' ]
[ 'i+', '(i', '*i)' ]
[ 'i+', '(i*', 'i)' ]
[ 'i+', '(i*i', ')' ]
[ 'i+(', 'i', '*i)' ]
[ 'i+(', 'i*', 'i)' ]
[ 'i+(', 'i*i', ')' ]
[ 'i+(i', '*', 'i)' ]
[ 'i+(i', '*i', ')' ]
[ 'i+(i*', 'i', ')' ]
Now all that remains is to recurse, and try to parse each element of each array as the respective next symbol.
function *parse(str, grammar, symbol) {
if (isNonterminal(symbol)) {
let options = grammar[symbol];
for (let opt of options) {
let numTerms = opt.length;
for (let split of splits(str.length, numTerms)) {
let strs = substrs(str.length, split).map(([a, b]) => str.substring(a, b));
let parsed = [];
for (let i = 0; i < strs.length; i++) {
parsed.push(parse(strs[i], grammar, opt[i]));
}
if (parsed.length === 1) {
// This makes the result a little more readable.
yield* parsed[0];
} else {
yield* product(parsed);
}
}
}
} else {
if (str === symbol) {
yield str;
}
}
}
And we get a successful parse out of the algorithm:
["i","+",["(",["i","*","i"],")"]]
Something interesting about this algorithm is that it can handle ambiguous grammars just fine. If we give it a grammar and a string with multiple parses, it will find all of them:
let grammar2 = {
'E': [
['E', '+', 'E'],
['i'],
],
}
//console.log([...splits(5, 3)]);
for (let tree of parse('i+i+i+i', grammar2, 'E')) {
console.log('result:', JSON.stringify([...tree]));
}
result: ["i","+",["i","+",["i","+","i"]]]
result: ["i","+",[["i","+","i"],"+","i"]]
result: [["i","+","i"],"+",["i","+","i"]]
result: [["i","+",["i","+","i"]],"+","i"]
result: [[["i","+","i"],"+","i"],"+","i"]
So, this is a fully-functioning parsing algorithm, but it's a little lame, it's not really that different from just parsing from left-to-right like in a traditional top-down algorithm.
The first observation to improve this we can make is that we can very quickly dispose of split attempts by checking if the nonterminals in a given grammar match or do not.
This lets us change our initial split checks from:
E = E + T
[ 'i', '*', '(i+i)' ]
[ 'i', '*(', 'i+i)' ]
[ 'i', '*(i', '+i)' ]
[ 'i', '*(i+', 'i)' ]
[ 'i', '*(i+i', ')' ]
[ 'i*', '(', 'i+i)' ]
[ 'i*', '(i', '+i)' ]
[ 'i*', '(i+', 'i)' ]
[ 'i*', '(i+i', ')' ]
[ 'i*(', 'i', '+i)' ]
[ 'i*(', 'i+', 'i)' ]
[ 'i*(', 'i+i', ')' ]
[ 'i*(i', '+', 'i)' ]
[ 'i*(i', '+i', ')' ]
[ 'i*(i+', 'i', ')' ]
E = T
[ 'i*(i+i)' ]
to
E = E + T
[ 'i*(i', '+', 'i)' ]
E = T
[ 'i*(i+i)' ]
But then, we can go one step further. We can just enumerate all of the nonterminals in a given rule and then find all the possible assignments of those to positions in the input string to jump straight to the possible places we could split our string. For example, say our production rule was:
E = A + B + C
and our input string was
1+2+3+4
According to the nonterminals in the production rule, the only assignments are:
1 + 2 + 3 + 4
A + B + C
A + B + C
A + B + C
From which we can just recurse and continue. We can implement this by just enumerating the symbols in the input string and traversing that and our terminals in lockstep.
I think this algorithm is very strange! It really does not depend on reading in the input from left-to-right, as most other parsing algorithms do, and legitimately uses the power it has to read arbitrary positions to do something interesting. Again, I don't think that in the world of systems programming (even when you're dealing with grammars regularly) there is all that much use for parsing algorithms with this level of power; you can generally just come up with a grammar that is LL(1). But it is cool to know that we can do this sort of thing!
Add a comment: