Revise inline parsing #549

colinodell · 2020-09-26T21:21:12Z

This PR implements a new approach to inline parsing which is more flexible (and also 8-10% faster) than before.

Under the hood

In 1.x, an inline parser needed to define each character it was interested in parsing, and the engine only attempted to parse inlines (via more expensive, time-consuming regular expressions) when one of those characters was encountered. This optimization drastically reduced the number of regular expressions we needed to run and worked well for most of our use cases.

The core part of this code looked something like this over-simplified example:

$cursor = new Cursor('The quick brown fox jumps over the lazy dog');

// Iterate through the string one character at a time
while (($currentCharacter = $cursor->getCharacter()) !== null) {
    $parsed = false;

    // Try each inline parser registered for the current character
    foreach ($this->environment->getInlineParsersByCharacter($currentCharacter) as $parser) {
        if ($parser->parse($cursor)) {
            $parsed = true;
            break;
        }
    }

    // No inline parser handled it, so try the delimiter processors
    if (! $parsed) {
        $parsed = $this->parseDelimiters($this->environment->getDelimiterProcessors()->getByCharacter($currentCharacter));
    }

    // Still nothing: skip ahead past the non-special characters
    if (! $parsed) {
        $cursor->match($this->environment->getRegexOfNonSpecialChars());
        $this->skip($cursor->getPreviousMatch());
    }
}

However, #514 and #492 (comment) showed us that this approach was not feasible in many cases where we need to match on more than just a single character. The "optimized" code above, while better than checking every single character with every single inline parser, still resulted in the inline parsers running many regexes, including many cases where they failed to match (and thus wasted valuable time).

So in 2.x we're taking a different approach. Instead of running preg_match for each parser every time we come across an interesting character, we're now running preg_match_all for each parser exactly once per line. This provides the engine with a list of all positions in the Cursor that the parsers are interested in, which we can then iterate over.
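As a rough illustration of the "one preg_match_all per parser per line" idea, here is a minimal sketch. It is not the actual InlineParserEngine code: the function name findPositionsOfInterest(), the array keys, and the example patterns are all hypothetical. Note also that PREG_OFFSET_CAPTURE reports byte offsets, so a real engine working with multibyte text would need to convert them to character positions.

```php
<?php
// Hypothetical helper for illustration only; not part of league/commonmark.
// $patterns maps a parser name to the regex that parser is interested in.
function findPositionsOfInterest(string $line, array $patterns): array
{
    $positions = [];

    foreach ($patterns as $parserName => $pattern) {
        // Exactly one preg_match_all call per parser for the whole line
        if (preg_match_all($pattern, $line, $matches, PREG_OFFSET_CAPTURE) > 0) {
            foreach ($matches[0] as [$text, $offset]) {
                // $offset is a byte offset into $line
                $positions[] = ['offset' => $offset, 'match' => $text, 'parser' => $parserName];
            }
        }
    }

    // Sort by offset so the line can be walked from left to right
    usort($positions, fn (array $a, array $b): int => $a['offset'] <=> $b['offset']);

    return $positions;
}

$positions = findPositionsOfInterest('a - b [x] c.', [
    'punct' => '/[-.]/',
    'task'  => '/\[ \]|\[x\]/i',
]);
```

Each entry records where the match starts, the text that matched, and which parser asked for it, which is all the information needed to visit only the interesting positions afterwards.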

This approach has a few nice benefits:

We're no longer limited to checking single characters - parsers can define longer strings or even regular expressions they're interested in
The parsers that only care about single characters or simple strings don't need to continually run regex matches at each position (since we've already checked all the positions). We can even provide the matched text directly to the parser without them needing to match text themselves
Finding all "positions of interest" in advance means we can skip over uninteresting characters without performing additional regex matches there too
We've now united inline and delimiter parsing under the hood (parsing delimiters is now done via a special, automatically registered InlineParserInterface implementation - it's no longer a completely separate step)
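The "skip over uninteresting characters" benefit can be sketched as follows. This is a simplified, hypothetical example (tokenizeLine() and its token format are made up for illustration): once all match offsets are known, the text between them is copied with substr() and no further regex calls. In the real engine a parser may still decline a match, in which case that text would be treated as plain text; the sketch ignores that case.

```php
<?php
// Hypothetical sketch; not the library's implementation.
// Splits a line into plain-text runs and matches of interest.
function tokenizeLine(string $line, string $pattern): array
{
    // Find every position of interest up front
    preg_match_all($pattern, $line, $matches, PREG_OFFSET_CAPTURE);

    $tokens = [];
    $pos    = 0;

    foreach ($matches[0] as [$text, $offset]) {
        // Everything between the previous match and this one is plain text
        if ($offset > $pos) {
            $tokens[] = ['text', substr($line, $pos, $offset - $pos)];
        }

        $tokens[] = ['match', $text];
        $pos      = $offset + strlen($text);
    }

    // Trailing plain text after the last match
    if ($pos < strlen($line)) {
        $tokens[] = ['text', substr($line, $pos)];
    }

    return $tokens;
}

$tokens = tokenizeLine('Hello *world*!', '/\*/');
```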

From the developer's perspective

This new approach does involve tweaking implementations of InlineParserInterface but those tweaks are very straightforward. At a minimum, you must:

Change the getCharacters() method to getMatchDefinition() and change the return value
Add a new $match argument to the parse() method

For example:

 final class PunctuationParser implements InlineParserInterface
 {
     /**
      * {@inheritdoc}
      */
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return ['-', '.'];
+        return InlineParserMatch::oneOf('-', '.');
     }

-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool

That's the bare minimum you'd need to do. However, that new $match argument will contain the text matched by whatever you defined in getMatchDefinition(), meaning you could further optimize your parsing method. Take this parser for example:

 final class TaskListItemMarkerParser implements InlineParserInterface
 {
     /**
      * {@inheritdoc}
      */
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return ['['];
+        return InlineParserMatch::oneOf('[ ]', '[x]');
     }
 
-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool
     {
         $container = $inlineContext->getContainer();

         // Checkbox must come at the beginning of the first paragraph of the list item
         if ($container->hasChildren() || ! ($container instanceof Paragraph && $container->parent() && $container->parent() instanceof ListItem)) {
             return false;
         }

         $cursor   = $inlineContext->getCursor();
         $oldState = $cursor->saveState();
-        $m = $cursor->match('/\[[ xX]\]/');
-        if ($m === null) {
-            return false;
-        }
+        $cursor->advanceBy(3);

         if ($cursor->getNextNonSpaceCharacter() === null) {
             $cursor->restoreState($oldState);
             return false;
         }

-        $isChecked = $m !== '[ ]';
+        $isChecked = $match !== '[ ]';
         $container->appendChild(new TaskListItemMarker($isChecked));

         return true;
     }
 }

With the new approach, parse() is only going to be called if the InlineParserEngine finds one of those two strings (matched case-insensitively, so '[X]' also qualifies), so $match always contains the text that was found - no need to try and re-match it ourselves.
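To see why comparing $match against '[ ]' is the safe check, here is a small sketch. It assumes that oneOf() behaves roughly like a case-insensitive alternation of quoted strings; oneOfPattern() is a hypothetical stand-in written for this example, and the real InlineParserMatch class may build its regex differently.

```php
<?php
// Hypothetical stand-in for InlineParserMatch::oneOf(), for illustration only.
function oneOfPattern(string ...$needles): string
{
    // Quote each literal so characters like '[' lose their regex meaning
    $quoted = array_map(fn (string $s): string => preg_quote($s, '/'), $needles);

    // The trailing "i" makes the alternation case-insensitive
    return '/' . implode('|', $quoted) . '/i';
}

$pattern = oneOfPattern('[ ]', '[x]');
preg_match_all($pattern, '[ ] todo [x] done [X] also done', $m);

// $m[0] holds the exact text found in the input, in its original case
$checked = array_map(fn (string $match): bool => $match !== '[ ]', $m[0]);
```

Under that assumption the matched text keeps its original case, so a check like $match === '[x]' would miss an uppercase '[X]', while $match !== '[ ]' treats both as checked.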

New possibilities

This new approach opens up some new possibilities like #514, and it also resolves #492.

src/Extension/SmartPunct/PunctuationParser.php

src/Extension/TaskList/TaskListItemMarkerParser.php

src/Parser/Inline/InlineParserMatch.php

colinodell added the enhancement and performance labels Sep 26, 2020

colinodell added this to the v2.0 milestone Sep 26, 2020

colinodell self-assigned this Sep 26, 2020

colinodell added 9 commits September 26, 2020 17:39

Allow inline parsers to match on more than just single characters

38ea1e0

Simplify handing of inline parsers within the Environment

5ae1602

Optimization: provide already-matched text to the inline parser

d135ba8

Require the cursor to be injected into the context

f5ac044

Implement delimiter parsing as a special type of inline parser

225c357

Only search for delimiters if any were given

55b911b

Make regular expressions case-insensitive

4a40802

Rewrite the InlineParserEngine to use the new approach

ee66a64

Re-implement the GFM Autolink extension using the new inline parser a…

00649fb

…pproach Fixes #492

colinodell force-pushed the revise-inline-parsing branch from 0e5ed0d to 00649fb Compare September 26, 2020 21:41

colinodell merged commit 995567a into latest Sep 26, 2020

colinodell deleted the revise-inline-parsing branch September 26, 2020 21:46

colinodell mentioned this pull request Sep 26, 2020

Allow more than one character to be used as a symbol for MentionExtension #550

Merged
