index.html

<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8"/>
    <title>A Pint-sized Earley Parser</title>
    <script src="earley.js"></script>
    <script src="forest.js"></script>
    <link rel="stylesheet" href="style.css">
</head>
<body>

<div id="content">

<h1>A Pint-sized Earley Parser</h1>

<p>Many people have said that convenient parser generation is a
game-changing technology.  In his talk <em>To Trap a Better Mouse</em>,
Ian Piumarta <a href="https://youtu.be/EGeN2IC7N0Q?t=41m">suggested</a>
that the Earley algorithm is a good place to start because it handles
full context-free grammars and is fairly trivial to implement.  But some
parts of the algorithm (particularly the construction of the parse
forest) don't seem to have good descriptions which are easily accessible
to non-experts.</p>

<p>So this is an attempt to fill that gap: a trivial realization of the
algorithm, suitable for implementation in an afternoon, and with an
annotated version for easy understanding and porting.  I'm deliberately
being fairly concise: for more information try Loup Vaillant-David's <a
href="http://loup-vaillant.fr/tutorials/earley-parsing/">Earley Parsing
Explained</a> which goes into far more detail.</p>

<p>This is annotated pseudocode for compactness and as an attempt at
language independence.  If you want to look at real code, the JavaScript
source is available at the <a
href="https://github.com/JoshuaGrams/pep">git repository for this
project</a>.</p>

<ul>
    <li><a href="#recognizer">The Recognizer</a></li>
    <ul>
        <li><a href="#data-structures">Data Structures</a></li>
        <li><a href="#algorithm">The Algorithm</a></li>
    </ul>
    <li><a href="#parser">The Parser</a></li>
    <li><a href="#forest">Walking the Forest</a></li>
	<li><a href="#tree">Choosing a Parse Tree</a></li>
    <li><a href="#actions">Adding Parse Actions</a></li>
    <li><a href="#future">Future Work</a></li>
</ul>

<h2><a name="recognizer">The Recognizer</a></h2>

<p class="note">Note: This part of the algorithm is well-covered: the
presentation in Aycock and Horspool's <a
href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.4254">Practical
Earley Parsing</a> is excellent, and <a
href="https://en.wikipedia.org/wiki/Earley_parser">Wikipedia's entry</a>
isn't too bad.  But I am using a modified formulation which builds only
one set at once, adds additional items which will be useful for the
parser, and does not make a distinction between terminal and
non-terminal symbols, so you probably want to read this section
anyway.</p>


<h3><a name="data-structures">Data Structures</a></h3>

<p>Let's start with some basic data structures.  A <strong>Rule</strong>
has a symbol on the left hand side, and an array of symbols on the
right.  A <strong>Grammar</strong> has a start symbol and a set of rules
indexed by symbol.  Note that there may be multiple rules for the same
symbol: this is the only way that we support alternatives in rules.</p>

<pre><code>Rule : { String symbol, Array production }

Grammar : {
    String start,
    Dictionary([Rule]) rules
        // { "S": [S ::= X Y, S ::= A A b], ...}
}</code></pre>

<p>An <strong>LR(0) item</strong> represents a partially matched rule
and is generally written with a dot showing the end of the match, so
<code class="no-wrap">S ::= Expression "+" &bull; Term</code> would
indicate that an <code>Expression</code> and <code>"+"</code> have been
matched.</p>

<code><pre>LR0 : { Rule rule, Integer dot }
next_symbol(lr0) =
    if(lr0.dot &lt; lr0.rule.production.length) lr0.rule.production[lr0.dot]</pre></code>

<p>In this implementation we never use complete LR(0) items: we always
promote them to a symbol so that all parse trees for a symbol will be
combined under one Earley item no matter which production they derive
from.</p>

<pre><code>advance(lr0) =
    if(lr0.dot == lr0.rule.production.length) lr0.rule.symbol
    else new LR0(lr0.rule, lr0.dot+1)</code></pre>

<p>Earley's algorithm works by keeping a table of all possible matches,
both partial and complete.  These are organized into <strong>Earley
sets</strong> by the input position where the match ends.  My sets also
contain a couple of indices and a reference to the grammar so we can
quickly find the items and rules we need to look at.</p>

<p>An <strong>Earley item</strong> consists of a symbol (for complete
matches) or an LR(0) item (for partial matches) plus references to the
sets (and hence input positions) where the match starts and ends.</p>

<pre><code>Item : { Symbol|LR0 tag, Set start, Set end }

Set : {
    Grammar grammar,
    Integer position,  // index into the input sequence
    Array items,       // items for iteration/processing
    Dictionary idx,    // items by tag and start for uniqueness
    Dictionary wants,  // incomplete items by next symbol
} 

append_item(tag, start, end) =
    item = new Item(tag, start, end)
    end.items.push(item)
    end.idx[tag][start] = item
    if(tag.is_lr0) end.wants[tag.rule.symbol].push(item)
    item  // return item</code></pre>

<p>To preserve uniqueness we only call <code>append_item</code> if the
item doesn't already exist.</p>

<pre><code>add_item(tag, start, end) =
    end.idx[tag][start] || append_item(tag, start, end)</code></pre>


<h3><a name="algorithm">The Algorithm</a></h3>

<p>We start with an initial set containing rules for the start symbol
and any items derived from those.  Then we build a set for each
successive input symbol.  When we reach the end of the string, if we
have a complete item for the start symbol matching the entire string
then the parse has succeeded.</p>

<pre><code>parse(grammar, input) =
    s0 = process(predict(new Set(grammar, 0), grammar.start))
    sN = foldl(parse_symbol, s0, input)
    sN.idx[grammar.start][0]

parse_symbol(set, sym) = process(scan(set, sym))</code></pre>

<p>Like most parsers, Earley's works from both ends. It
<em>predicts</em> which rules to match from the top down, but
<em>completes</em> matches from the bottom up.</p>

<p>Prediction makes new LR(0) items (with the dot at the beginning) from
each of a symbol's rules.</p>

<pre><code>predict(set, sym) =
    for rule in set.grammar.rules[sym]
        add_item(new LR0(rule, 0), set, set)
</code></pre>

<p>When an item is complete, we look back to the set where it started
and advance any partial items there which were looking for the completed
symbol.</p>

<pre><code>complete(set, complete) =
    for item in complete.start.wants[complete.tag]
        add_item(advance(item.tag), item.start, complete.end)</code></pre>

<p>To initialize a new set, we <em>scan</em> an input symbol, adding an
item for that symbol.  Note that the traditional Earley algorithm does
<em>not</em> have items for terminals, and the scan operation advances
all items which are looking for the input symbol.  But it's convenient
to have an item for the input symbol when creating the parse forest.
And since we don't distinguish between terminal and non-terminal
symbols, our scan can just add a single item for the input symbol and
let the completer do all the advancing.</p>

<pre><code>scan(set1, sym) =
    set2 = new Set(set1.grammar, set1.position + 1)
    add_item(sym, set1, set2)
    set2  // return the new set</code></pre>

<p>Then we process items recursively with <code>predict()</code> and
<code>complete()</code>.  We predict rules for the next symbol of any
partial matches, and to look back and advance items which match any
newly completed items.</p>

<pre><code>process_once(set) =
    for(item in set.items)  // including newly-added ones
        if(item.tag.is_lr0) predict(set, next_symbol(item.tag))
        else complete(item)</code></pre>

<p>If there are rules which produce the empty string, this may fail to
produce all the necessary items (see Aycock and Horspool's <a
href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.4254">Practical
Earley Parsing</a> for more info).  There are better ways to deal with
this, but we're just going to keep trying again until it fails to add
any items.</p>

<pre><code>process(set) =
    old = set.items.length
    do  process_once(set)  while(set.items.length &gt; old)
    set  // return the set for foldl's convenience</code></pre>


<h2><a name="parser">The Parser</a></h2>

<p>To extend the recognizer to a parser, we add a shared packed parse
forest (SPPF) node to each Earley item, as described in Elizabeth
Scott's paper <a
href="http://www.sciencedirect.com/science/article/pii/S1571066108001497">SPPF-style
Parsing from Earley Recognizers</a>.  She uses a pointer to a separate
object, but I am using the Earley items themselves as parse forest
nodes.</p>

<p>Scanning produces input symbol nodes which are their own derivation.
Predictions start with no derivations.  The completer appends the
derivation of the newly completed symbol to the derivation of the item
it is advancing.  So we just need to augment the Earley items with two
pointers: a <em>left</em> pointer to the previous Earley item (if any)
and a <em>right</em> pointer to the new item.</p>

<p>The <code>scan</code> and <code>predict</code> routines are
unchanged, since they add nodes with no derivation.  But
<code>complete</code> needs to (potentially) add a new derivation,
regardless of whether it adds a new item.</p>

<pre><code>complete(set, complete) =
    for item in complete.start.wants[complete.tag]
        <span class="inserted">a = </span>add_item(advance(item.tag), item.start, complete.end)
        <span class="inserted">add_derivation(a, item, complete)</span></code></pre>

<p>The contents of the parse pointers fall into three classes:</p>

<table class="center">
    <tr><th>Links</th><th>Meaning</th></tr>
    <tr><td>null/null</td><td>no derivations</td></tr>
    <tr><td>null/item<br>item/item</td><td>single derivation</td></tr>
    <tr><td>list/null</td><td>multiple derivations</td></tr>
</table>

<p>Items with ambiguous parses have a linked list of objects, each of
which represents one derivation.</p>

<pre><code>Derivation : { Item left, Item right, Derivation next }</code></pre>

<p><code>add_derivation</code> needs to handle each of the three
classes.  It also removes some unnecessary nodes on the left edge of the
tree.  If the left child is an LR(0) item with one or no children, then
we skip it and use its child (if any) directly.</p>

<pre><code>add_derivation(item, left, right) {
    if(!(left || right)) return  // empty derivation

    // remove trivial nodes on the left
    if(left.tag.is_lr0 and left.tag.dot &lt;= 1) left = left.right

    if(!(item.left || item.right)) set_derivation(item, left, right)
    else if(item.right) add_second_derivation(item, left, right)
    else add_another_derivation(item, left, right)</code></pre>

<p>If there aren't any derivations yet, we just set the new one.  But if
there <em>are</em> existing derivations we'll need to check whether the
new one is already present.</p>

<pre><code>set_derivation(item, left, right) =
    item.left = left
    item.right = right

same_derivation(item, left, right) =
    item.left == left &amp;&amp; item.right == right</code></pre>

<p>If there is one existing derivation, we need to turn it into a list
of one item and then add the new one.</p>

<pre><code>add_second_derivation(item, left, right) =
    if(!same_derivation(item, left, right))
        old = new Derivation(item.left, item.right, null)
        item.left = new Derivation(left, right, old)
        item.right = null</code></pre>

<p>Finally, if there are multiple existing derivations, we need to
search the list to see if the derivation is present.</p>

<pre><code>add_another_derivation(item, left, right) =
    d = item.left;  while(d)
        if(same_derivation(d, left, right)) return
        d = d.next
    item.left = new Derivation(left, right, item.left)</code></pre>


<h2><a name="forest">Walking the Forest</a></h2>

<p>It's a little bit awkward to walk the forest, since the links go back
to the left.  But collecting the children into their proper order isn't
<em>too</em> bad.  You basically just do a post-order traversal,
accumulating the children before performing the node's action, and for
intermediate nodes (those tagged with an LR(0) item) you do nothing
<em>but</em> accumulate the children.</p>

<p>So we can provide a forest walker which takes a set of handler
functions to process a derivation (array of derivations for each
symbol), a set of choices (array of derivations), and a completed symbol
(with its derivation).</p>

<pre><code>WalkerFns : {
    Function(Array) derivation,
    Function(Array) choices,
    Function(Symbol, Array) symbol
}

process_forest(f, fns) =
    if(!f) return

    result = null
    if(f is ambiguous)
        c = map(fns.derivation(collect_children(_, fns)),
            choices)
        result = fns.choices(c)
    else if(f is intermediate)
        result = collect_children(f, fns)
    else
        d = fns.derivation(collect_children(f, fns));
        result = fns.symbol(f.tag, d)
    result</code></pre>

<p>The helper function pretty much just puts the children into an array,
except that if the left child is an intermediate node, it uses that
array directly rather than creating a new level of nesting.</p>

<code><pre>collect_children(f, fns) =
    left = process_forest(f.left, fns)
    right = process_forest(f.right, fns)

    if(left)
        result = (f.left is intermediate) ? left : [left]
        if(right) result.push(right)
    else if(right) result = [right]
    else result = null
    result</code></pre>


<p>Then we can easily visualize a parse forest as nested HTML
tables.</p>

<pre><code>html_derivation(d) = tr(map(td, d))
html_choices(c) = table(map(tr . td, c))
html_symbol(name, derivation) =
    if(derivation)
        symbol = tr(td(name, colSpan = derivation.length))
        table(symbol, derivation)
    else name</code></pre>

<p>And here it is in action.</p>

<pre><code>g = Grammar('S',
    S ::= a X X c
    X ::= X b
    X ::=
)

parse(g, 'abbc')</code></pre>

<div id="forest-demo"></div>
<script>
var g = new Grammar('S', [
        new Rule('S', ['a', 'X', 'X', 'c']),
        new Rule('X', ['X', 'b']),
        new Rule('X', [])
        ]);
var f = g.parse('abbc');
var hf = forest_to_html(f);
hf.className = 'center';
document.getElementById('forest-demo').appendChild(hf);
</script>


<h2><a name="tree">Choosing a Parse Tree</a></h2>

<p>It's also fairly easy to walk the forest and reduce it to a single
tree.  I'm using rule priorities (earlier rules have higher priority) to
decide.  This gives you the nice ordered choice behavior of PEGs without
the <a href="http://www.romanredz.se/papers/FI2008.pdf">strange
behaviors</a> PEGs can exhibit (e.g. &ldquo;<span class="no-wrap">S
::= a S a / a a</span>&rdquo; matching only strings of a's whose length is
a power of two, instead of all even length ones).</p>

<p>I think this ordered choice behavior gives you enough control for
most practical use.  For instance, applying it to our example above
gives you longest-match behavior, while reversing the order of the rules
for X would give you shortest-match behavior instead.</p>

<div id="tree-demo"></div>
<script>
var ht = forest_to_html(prioritized_tree(f));
ht.className = 'center';
document.getElementById('tree-demo').appendChild(ht);
</script>

<p>To implement this, we need to modify several things.  Obviously we
need to modify our grammar initialization code to assign priority
numbers.</p>

<p>We need to make sure we can tell which rule an item derives from.
This isn't directly accessible when we have a complete item: they just
have the symbol.  You could probably look back to the left to find an
LR(0) item, or look it up in the grammar for null derivations.  But I
have chosen to add a rule reference to each item and derivation, and to
update <code>add_second_derivation</code> so that it moves the rule from
the (now ambiguous) item to the derivation.</p>

<p>Then we simply walk the tree, choosing one option for each ambiguous
derivation.</p>

<pre><code>disambiguate(f, gt) =
    if(f &amp;&amp; is_ambiguous(f))
        best = f.left,  d = f.left
        while(d = d.next)
            best = (cmp(d, best, gt) &gt; 0) ? d : best
        f = best
    // handle our children's derivations
    if(f.left) disambiguate(f.left, choose)
    if(f.right) disambiguate(f.right, choose)

is_ambiguous(f) = f &amp;&amp; f.left &amp;&amp; !f.right</code></pre>

<p>The tricky stuff is in the actual comparison function.  We
make the decision based on the rule.  Only if the rules are equal do we
need to recurse deeper to make a decision.  So we know that we are only
ever comparing two items of the same type (partial, complete, input
symbol).  So:</p>

<ul>

    <li>If they're null, then they're equal</li>

    <li>Otherwise we disambiguate them so we're dealing with single
    parse trees on each side. Then:</li>

    <ul>

        <li>If their rules are equal, then they're equal (in JavaScript,
        this handles the case of input symbols: their rules are both
        <code>undefined</code>, so they're equal.</li>

        <li>If their rules are not equal, we pass them to a user-defined
        rule comparison function which tells us which one is
        preferred.</li>

    </ul>

</ul>

<pre><code>cmp(a, b, gt) =
    if(!(a || b)) result = 0
    else
        disambiguate(a, gt)
        disambiguate(b, gt)
        if(a.rule == b.rule)
            result = cmp(a.left, b.left, gt)
                || cmp(a.right, b.right, gt)
        else if(gt(a.rule, b.rule)) result = 1
        else result = -1  // must be less - we know they're not equal
    result</code></pre>

<p>Then disambiguating by priority is trivial.</p>

<pre><code>higher_priority(a, b) = a.priority &gt; b.priority

prioritized_tree(f) = disambiguate(f, higher_priority)</code></pre>


<h2><a name="actions">Adding Parse Actions</a></h2>

<p>Then adding parse actions turns out to be trivial: just add an action
to each rule and call them in the tree walker.</p>

<pre><code>eval_fns = {
    choice: error,
    derivation: function(d) = d,
    symbol: function(item, derivation) =
        if(item.rule) item.rule.action(derivation)
        else if(derivation) derivation
        else item.tag  // input symbol; just return it.
}</code></pre>

<p>Let's try it.</p>

<p class="note">(Note the way the priority is "backwards": since all the
rules are going to be part of the parse, the priority just chooses which
ones are closer to the top of the tree.  But operator precedence works
the other way: higher precedence operators bind more tightly to the
leaves.)</p>

<pre><code>expr = new Grammar('E',
    E ::= E + E &#8594; d[0] + d[2]
    E ::= E * E &#8594; d[0] * d[2]
    E ::= 2 &#8594; 2
    E ::= 3 &#8594; 3
    E ::= 5 &#8594; 5
    E ::= 7 &#8594; 7

tree = disambiguate(parse(expr, '2*3+5*7'))
process_forest(tree, eval_fns)</code></pre>

<table class="center"><tr id="action-demo"></tr></table>
<script>
function is_digit(d) { return d >= '0' && d <= '9'; }
var list = document.getElementById('action-demo');
function log(msg) { list.appendChild(td(text(msg))); }
var expr = new Grammar('E', [
    new Rule('E', ['E', '+', 'E'], function(d) {
        log('adding ' + d[0] + ' and ' + d[2]);
        return d[0] + d[2];
        }),
    new Rule('E', ['E', '*', 'E'], function(d) {
        log('multiplying ' + d[0] + ' and ' + d[2]);
        return d[0] * d[2];
        }),
    new Rule('E', ['digit'], function(d) { return d[0].charCodeAt(0) - '0'.charCodeAt(0); },
        is_digit)
    ]);
var ef = expr.parse('2*3+5*7');
log(evaluate_forest(ef));
</script>


<h2><a name="future">Future Work</a></h2>

<p>I'm excited about finally getting this algorithm figured out and
running, so I'm writing it up and posting it right away.  Since I first
posted it I've been working on it quite a bit.  I think that to be
useful it mostly just needs a means to enter grammars in a convenient
notation.  I got bogged down deciding how far to go with making BNF
syntax, so I've set it aside until I have some use cases and know what
I want...</p>

</div>

</body>
</html>