-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathindex.html
544 lines (426 loc) · 19.7 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<title>A Pint-sized Earley Parser</title>
<script src="earley.js"></script>
<script src="forest.js"></script>
<link rel="stylesheet" href="style.css">
</head>
<body>
<div id="content">
<h1>A Pint-sized Earley Parser</h1>
<p>Many people have said that convenient parser generation is a
game-changing technology. In his talk <em>To Trap a Better Mouse</em>,
Ian Piumarta <a href="https://youtu.be/EGeN2IC7N0Q?t=41m">suggested</a>
that the Earley algorithm is a good place to start because it handles
full context-free grammars and is fairly trivial to implement. But some
parts of the algorithm (particularly the construction of the parse
forest) don't seem to have good descriptions which are easily accessible
to non-experts.</p>
<p>So this is an attempt to fill that gap: a trivial realization of the
algorithm, suitable for implementation in an afternoon, and with an
annotated version for easy understanding and porting. I'm deliberately
being fairly concise: for more information try Loup Vaillant-David's <a
href="http://loup-vaillant.fr/tutorials/earley-parsing/">Earley Parsing
Explained</a> which goes into far more detail.</p>
<p>This is annotated pseudocode for compactness and as an attempt at
language independence. If you want to look at real code, the JavaScript
source is available at the <a
href="https://github.com/JoshuaGrams/pep">git repository for this
project</a>.</p>
<ul>
<li><a href="#recognizer">The Recognizer</a></li>
<ul>
<li><a href="#data-structures">Data Structures</a></li>
<li><a href="#algorithm">The Algorithm</a></li>
</ul>
<li><a href="#parser">The Parser</a></li>
<li><a href="#forest">Walking the Forest</a></li>
<li><a href="#tree">Choosing a Parse Tree</a></li>
<li><a href="#actions">Adding Parse Actions</a></li>
<li><a href="#future">Future Work</a></li>
</ul>
<h2><a name="recognizer">The Recognizer</a></h2>
<p class="note">Note: This part of the algorithm is well-covered: the
presentation in Aycock and Horspool's <a
href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.4254">Practical
Earley Parsing</a> is excellent, and <a
href="https://en.wikipedia.org/wiki/Earley_parser">Wikipedia's entry</a>
isn't too bad. But I am using a modified formulation which builds only
one set at once, adds additional items which will be useful for the
parser, and does not make a distinction between terminal and
non-terminal symbols, so you probably want to read this section
anyway.</p>
<h3><a name="data-structures">Data Structures</a></h3>
<p>Let's start with some basic data structures. A <strong>Rule</strong>
has a symbol on the left hand side, and an array of symbols on the
right. A <strong>Grammar</strong> has a start symbol and a set of rules
indexed by symbol. Note that there may be multiple rules for the same
symbol: this is the only way that we support alternatives in rules.</p>
<pre><code>Rule : { String symbol, Array production }
Grammar : {
String start,
Dictionary([Rule]) rules
// { "S": [S ::= X Y, S ::= A A b], ...}
}</code></pre>
<p>An <strong>LR(0) item</strong> represents a partially matched rule
and is generally written with a dot showing the end of the match, so
<code class="no-wrap">S ::= Expression "+" • Term</code> would
indicate that an <code>Expression</code> and <code>"+"</code> have been
matched.</p>
<code><pre>LR0 : { Rule rule, Integer dot }
next_symbol(lr0) =
if(lr0.dot < lr0.rule.production.length) lr0.rule.production[lr0.dot]</pre></code>
<p>In this implementation we never use complete LR(0) items: we always
promote them to a symbol so that all parse trees for a symbol will be
combined under one Earley item no matter which production they derive
from.</p>
<pre><code>advance(lr0) =
if(lr0.dot == lr0.rule.production.length) lr0.rule.symbol
else new LR0(lr0.rule, lr0.dot+1)</code></pre>
<p>Earley's algorithm works by keeping a table of all possible matches,
both partial and complete. These are organized into <strong>Earley
sets</strong> by the input position where the match ends. My sets also
contain a couple of indices and a reference to the grammar so we can
quickly find the items and rules we need to look at.</p>
<p>An <strong>Earley item</strong> consists of a symbol (for complete
matches) or an LR(0) item (for partial matches) plus references to the
sets (and hence input positions) where the match starts and ends.</p>
<pre><code>Item : { Symbol|LR0 tag, Set start, Set end }
Set : {
Grammar grammar,
Integer position, // index into the input sequence
Array items, // items for iteration/processing
Dictionary idx, // items by tag and start for uniqueness
Dictionary wants, // incomplete items by next symbol
}
append_item(tag, start, end) =
item = new Item(tag, start, end)
end.items.push(item)
end.idx[tag][start] = item
if(tag.is_lr0) end.wants[tag.rule.symbol].push(item)
item // return item</code></pre>
<p>To preserve uniqueness we only call <code>append_item</code> if the
item doesn't already exist.</p>
<pre><code>add_item(tag, start, end) =
end.idx[tag][start] || append_item(tag, start, end)</code></pre>
<h3><a name="algorithm">The Algorithm</a></h3>
<p>We start with an initial set containing rules for the start symbol
and any items derived from those. Then we build a set for each
successive input symbol. When we reach the end of the string, if we
have a complete item for the start symbol matching the entire string
then the parse has succeeded.</p>
<pre><code>parse(grammar, input) =
s0 = process(predict(new Set(grammar, 0), grammar.start))
sN = foldl(parse_symbol, s0, input)
sN.idx[grammar.start][0]
parse_symbol(set, sym) = process(scan(set, sym))</code></pre>
<p>Like most parsers, Earley's works from both ends. It
<em>predicts</em> which rules to match from the top down, but
<em>completes</em> matches from the bottom up.</p>
<p>Prediction makes new LR(0) items (with the dot at the beginning) from
each of a symbol's rules.</p>
<pre><code>predict(set, sym) =
for rule in set.grammar.rules[sym]
add_item(new LR0(rule, 0), set, set)
</code></pre>
<p>When an item is complete, we look back to the set where it started
and advance any partial items there which were looking for the completed
symbol.</p>
<pre><code>complete(set, complete) =
for item in complete.start.wants[complete.tag]
add_item(advance(item.tag), item.start, complete.end)</code></pre>
<p>To initialize a new set, we <em>scan</em> an input symbol, adding an
item for that symbol. Note that the traditional Earley algorithm does
<em>not</em> have items for terminals, and the scan operation advances
all items which are looking for the input symbol. But it's convenient
to have an item for the input symbol when creating the parse forest.
And since we don't distinguish between terminal and non-terminal
symbols, our scan can just add a single item for the input symbol and
let the completer do all the advancing.</p>
<pre><code>scan(set1, sym) =
set2 = new Set(set1.grammar, set1.position + 1)
add_item(sym, set1, set2)
set2 // return the new set</code></pre>
<p>Then we process items recursively with <code>predict()</code> and
<code>complete()</code>. We predict rules for the next symbol of any
partial matches, and to look back and advance items which match any
newly completed items.</p>
<pre><code>process_once(set) =
for(item in set.items) // including newly-added ones
if(item.tag.is_lr0) predict(set, next_symbol(item.tag))
else complete(item)</code></pre>
<p>If there are rules which produce the empty string, this may fail to
produce all the necessary items (see Aycock and Horspool's <a
href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.4254">Practical
Earley Parsing</a> for more info). There are better ways to deal with
this, but we're just going to keep trying again until it fails to add
any items.</p>
<pre><code>process(set) =
old = set.items.length
do process_once(set) while(set.items.length > old)
set // return the set for foldl's convenience</code></pre>
<h2><a name="parser">The Parser</a></h2>
<p>To extend the recognizer to a parser, we add a shared packed parse
forest (SPPF) node to each Earley item, as described in Elizabeth
Scott's paper <a
href="http://www.sciencedirect.com/science/article/pii/S1571066108001497">SPPF-style
Parsing from Earley Recognizers</a>. She uses a pointer to a separate
object, but I am using the Earley items themselves as parse forest
nodes.</p>
<p>Scanning produces input symbol nodes which are their own derivation.
Predictions start with no derivations. The completer appends the
derivation of the newly completed symbol to the derivation of the item
it is advancing. So we just need to augment the Earley items with two
pointers: a <em>left</em> pointer to the previous Earley item (if any)
and a <em>right</em> pointer to the new item.</p>
<p>The <code>scan</code> and <code>predict</code> routines are
unchanged, since they add nodes with no derivation. But
<code>complete</code> needs to (potentially) add a new derivation,
regardless of whether it adds a new item.</p>
<pre><code>complete(set, complete) =
for item in complete.start.wants[complete.tag]
<span class="inserted">a = </span>add_item(advance(item.tag), item.start, complete.end)
<span class="inserted">add_derivation(a, item, complete)</span></code></pre>
<p>The contents of the parse pointers fall into three classes:</p>
<table class="center">
<tr><th>Links</th><th>Meaning</th></tr>
<tr><td>null/null</td><td>no derivations</td></tr>
<tr><td>null/item<br>item/item</td><td>single derivation</td></tr>
<tr><td>list/null</td><td>multiple derivations</td></tr>
</table>
<p>Items with ambiguous parses have a linked list of objects, each of
which represents one derivation.</p>
<pre><code>Derivation : { Item left, Item right, Derivation next }</code></pre>
<p><code>add_derivation</code> needs to handle each of the three
classes. It also removes some unnecessary nodes on the left edge of the
tree. If the left child is an LR(0) item with one or no children, then
we skip it and use its child (if any) directly.</p>
<pre><code>add_derivation(item, left, right) {
if(!(left || right)) return // empty derivation
// remove trivial nodes on the left
if(left.tag.is_lr0 and left.tag.dot <= 1) left = left.right
if(!(item.left || item.right)) set_derivation(item, left, right)
else if(item.right) add_second_derivation(item, left, right)
else add_another_derivation(item, left, right)</code></pre>
<p>If there aren't any derivations yet, we just set the new one. But if
there <em>are</em> existing derivations we'll need to check whether the
new one is already present.</p>
<pre><code>set_derivation(item, left, right) =
item.left = left
item.right = right
same_derivation(item, left, right) =
item.left == left && item.right == right</code></pre>
<p>If there is one existing derivation, we need to turn it into a list
of one item and then add the new one.</p>
<pre><code>add_second_derivation(item, left, right) =
if(!same_derivation(item, left, right))
old = new Derivation(item.left, item.right, null)
item.left = new Derivation(left, right, old)
item.right = null</code></pre>
<p>Finally, if there are multiple existing derivations, we need to
search the list to see if the derivation is present.</p>
<pre><code>add_another_derivation(item, left, right) =
d = item.left; while(d)
if(same_derivation(d, left, right)) return
d = d.next
item.left = new Derivation(left, right, item.left)</code></pre>
<h2><a name="forest">Walking the Forest</a></h2>
<p>It's a little bit awkward to walk the forest, since the links go back
to the left. But collecting the children into their proper order isn't
<em>too</em> bad. You basically just do a post-order traversal,
accumulating the children before performing the node's action, and for
intermediate nodes (those tagged with an LR(0) item) you do nothing
<em>but</em> accumulate the children.</p>
<p>So we can provide a forest walker which takes a set of handler
functions to process a derivation (array of derivations for each
symbol), a set of choices (array of derivations), and a completed symbol
(with its derivation).</p>
<pre><code>WalkerFns : {
Function(Array) derivation,
Function(Array) choices,
Function(Symbol, Array) symbol
}
process_forest(f, fns) =
if(!f) return
result = null
if(f is ambiguous)
c = map(fns.derivation(collect_children(_, fns)),
choices)
result = fns.choices(c)
else if(f is intermediate)
result = collect_children(f, fns)
else
d = fns.derivation(collect_children(f, fns));
result = fns.symbol(f.tag, d)
result</code></pre>
<p>The helper function pretty much just puts the children into an array,
except that if the left child is an intermediate node, it uses that
array directly rather than creating a new level of nesting.</p>
<code><pre>collect_children(f, fns) =
left = process_forest(f.left, fns)
right = process_forest(f.right, fns)
if(left)
result = (f.left is intermediate) ? left : [left]
if(right) result.push(right)
else if(right) result = [right]
else result = null
result</code></pre>
<p>Then we can easily visualize a parse forest as nested HTML
tables.</p>
<pre><code>html_derivation(d) = tr(map(td, d))
html_choices(c) = table(map(tr . td, c))
html_symbol(name, derivation) =
if(derivation)
symbol = tr(td(name, colSpan = derivation.length))
table(symbol, derivation)
else name</code></pre>
<p>And here it is in action.</p>
<pre><code>g = Grammar('S',
S ::= a X X c
X ::= X b
X ::=
)
parse(g, 'abbc')</code></pre>
<div id="forest-demo"></div>
<script>
var g = new Grammar('S', [
new Rule('S', ['a', 'X', 'X', 'c']),
new Rule('X', ['X', 'b']),
new Rule('X', [])
]);
var f = g.parse('abbc');
var hf = forest_to_html(f);
hf.className = 'center';
document.getElementById('forest-demo').appendChild(hf);
</script>
<h2><a name="tree">Choosing a Parse Tree</a></h2>
<p>It's also fairly easy to walk the forest and reduce it to a single
tree. I'm using rule priorities (earlier rules have higher priority) to
decide. This gives you the nice ordered choice behavior of PEGs without
the <a href="http://www.romanredz.se/papers/FI2008.pdf">strange
behaviors</a> PEGs can exhibit (e.g. “<span class="no-wrap">S
::= a S a / a a</span>” matching only strings of a's whose length is
a power of two, instead of all even length ones).</p>
<p>I think this ordered choice behavior gives you enough control for
most practical use. For instance, applying it to our example above
gives you longest-match behavior, while reversing the order of the rules
for X would give you shortest-match behavior instead.</p>
<div id="tree-demo"></div>
<script>
var ht = forest_to_html(prioritized_tree(f));
ht.className = 'center';
document.getElementById('tree-demo').appendChild(ht);
</script>
<p>To implement this, we need to modify several things. Obviously we
need to modify our grammar initialization code to assign priority
numbers.</p>
<p>We need to make sure we can tell which rule an item derives from.
This isn't directly accessible when we have a complete item: they just
have the symbol. You could probably look back to the left to find an
LR(0) item, or look it up in the grammar for null derivations. But I
have chosen to add a rule reference to each item and derivation, and to
update <code>add_second_derivation</code> so that it moves the rule from
the (now ambiguous) item to the derivation.</p>
<p>Then we simply walk the tree, choosing one option for each ambiguous
derivation.</p>
<pre><code>disambiguate(f, gt) =
if(f && is_ambiguous(f))
best = f.left, d = f.left
while(d = d.next)
best = (cmp(d, best, gt) > 0) ? d : best
f = best
// handle our children's derivations
if(f.left) disambiguate(f.left, choose)
if(f.right) disambiguate(f.right, choose)
is_ambiguous(f) = f && f.left && !f.right</code></pre>
<p>The tricky stuff is in the actual comparison function. We
make the decision based on the rule. Only if the rules are equal do we
need to recurse deeper to make a decision. So we know that we are only
ever comparing two items of the same type (partial, complete, input
symbol). So:</p>
<ul>
<li>If they're null, then they're equal</li>
<li>Otherwise we disambiguate them so we're dealing with single
parse trees on each side. Then:</li>
<ul>
<li>If their rules are equal, then they're equal (in JavaScript,
this handles the case of input symbols: their rules are both
<code>undefined</code>, so they're equal.</li>
<li>If their rules are not equal, we pass them to a user-defined
rule comparison function which tells us which one is
preferred.</li>
</ul>
</ul>
<pre><code>cmp(a, b, gt) =
if(!(a || b)) result = 0
else
disambiguate(a, gt)
disambiguate(b, gt)
if(a.rule == b.rule)
result = cmp(a.left, b.left, gt)
|| cmp(a.right, b.right, gt)
else if(gt(a.rule, b.rule)) result = 1
else result = -1 // must be less - we know they're not equal
result</code></pre>
<p>Then disambiguating by priority is trivial.</p>
<pre><code>higher_priority(a, b) = a.priority > b.priority
prioritized_tree(f) = disambiguate(f, higher_priority)</code></pre>
<h2><a name="actions">Adding Parse Actions</a></h2>
<p>Then adding parse actions turns out to be trivial: just add an action
to each rule and call them in the tree walker.</p>
<pre><code>eval_fns = {
choice: error,
derivation: function(d) = d,
symbol: function(item, derivation) =
if(item.rule) item.rule.action(derivation)
else if(derivation) derivation
else item.tag // input symbol; just return it.
}</code></pre>
<p>Let's try it.</p>
<p class="note">(Note the way the priority is "backwards": since all the
rules are going to be part of the parse, the priority just chooses which
ones are closer to the top of the tree. But operator precedence works
the other way: higher precedence operators bind more tightly to the
leaves.)</p>
<pre><code>expr = new Grammar('E',
E ::= E + E → d[0] + d[2]
E ::= E * E → d[0] * d[2]
E ::= 2 → 2
E ::= 3 → 3
E ::= 5 → 5
E ::= 7 → 7
tree = disambiguate(parse(expr, '2*3+5*7'))
process_forest(tree, eval_fns)</code></pre>
<table class="center"><tr id="action-demo"></tr></table>
<script>
function is_digit(d) { return d >= '0' && d <= '9'; }
var list = document.getElementById('action-demo');
function log(msg) { list.appendChild(td(text(msg))); }
var expr = new Grammar('E', [
new Rule('E', ['E', '+', 'E'], function(d) {
log('adding ' + d[0] + ' and ' + d[2]);
return d[0] + d[2];
}),
new Rule('E', ['E', '*', 'E'], function(d) {
log('multiplying ' + d[0] + ' and ' + d[2]);
return d[0] * d[2];
}),
new Rule('E', ['digit'], function(d) { return d[0].charCodeAt(0) - '0'.charCodeAt(0); },
is_digit)
]);
var ef = expr.parse('2*3+5*7');
log(evaluate_forest(ef));
</script>
<h2><a name="future">Future Work</a></h2>
<p>I'm excited about finally getting this algorithm figured out and
running, so I'm writing it up and posting it right away. Since I first
posted it I've been working on it quite a bit. I think that to be
useful it mostly just needs a means to enter grammars in a convenient
notation. I got bogged down deciding how far to go with making BNF
syntax, so I've set it aside until I have some use cases and know what
I want...</p>
</div>
</body>
</html>