Revise inline parsing #549

Merged 9 commits on Sep 26, 2020

@colinodell commented Sep 26, 2020

This PR implements a new approach to inline parsing which is more flexible (and also 8-10% faster) than before.

Under the hood

In 1.x, an inline parser needed to define each character it was interested in parsing, and the engine would only attempt to parse inlines (via more expensive, time-consuming regular expressions) when those characters were encountered. This optimization drastically reduced the number of regular expressions we needed to run and worked well for most of our use cases.

The core part of this code looked something like this over-simplified example:

$cursor = new Cursor('The quick brown fox jumps over the lazy dog');

// Iterate through the string one character at a time
while (($currentCharacter = $cursor->getCharacter()) !== null) {
    $parsed = false;

    // First, try every inline parser registered for this character
    foreach ($this->environment->getInlineParsersByCharacter($currentCharacter) as $parser) {
        if ($parser->parse($cursor)) {
            $parsed = true;
            break;
        }
    }

    // Next, try the delimiter processors registered for this character
    if (! $parsed) {
        $parsed = $this->parseDelimiters($this->environment->getDelimiterProcessors()->getByCharacter($currentCharacter));
    }

    // Nothing matched: consume the run of non-special characters that follows
    if (! $parsed) {
        $cursor->match($this->environment->getRegexOfNonSpecialChars());
        $this->skip($cursor->getPreviousMatch());
    }
}

However, #514 and #492 (comment) showed us that this approach was not feasible in many cases where we need to match on more than just a single character. The "optimized" code above, while better than checking every single character with every single inline parser, still resulted in the inline parsers running many regexes, including many cases where they failed to match (and thus wasted valuable time).

So in 2.x we're taking a different approach. Instead of running preg_match for each parser every time we come across an interesting character, we're now running preg_match_all for each parser exactly once per line. This provides the engine with a list of all positions in the Cursor that the parsers are interested in, which we can then iterate over.

This approach has a few nice benefits:

  • We're no longer limited to checking single characters - parsers can define longer strings or even regular expressions they're interested in
  • The parsers that only care about single characters or simple strings don't need to continually run regex matches at each position (since we've already checked all the positions). We can even provide the matched text directly to the parser without them needing to match text themselves
  • Finding all "positions of interest" in advance means we can skip over uninteresting characters without performing additional regex matches there too
  • We've now united inline and delimiter parsing under the hood (parsing delimiters is now done via a special, automatically registered InlineParserInterface implementation, so it's no longer a completely separate step)
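
To make the "positions of interest" idea more concrete, here is a minimal standalone sketch. It is not the engine's actual code: the function findPositionsOfInterest(), the parser names, and the patterns are all hypothetical and chosen only for illustration. It runs preg_match_all() once per pattern over the whole line and collects every match along with its byte offset, so a caller could jump straight from one position to the next.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): runs each parser's pattern
 * once over the whole line and returns every match, sorted by byte offset.
 *
 * @param array<string, string> $patterns Parser name => PCRE pattern
 *
 * @return array<int, array{offset: int, parser: string, match: string}>
 */
function findPositionsOfInterest(string $line, array $patterns): array
{
    $positions = [];

    foreach ($patterns as $parserName => $pattern) {
        // A single preg_match_all() call per parser per line
        if (! preg_match_all($pattern, $line, $matches, PREG_OFFSET_CAPTURE)) {
            continue;
        }

        // With PREG_OFFSET_CAPTURE, each full match is a [text, offset] pair
        foreach ($matches[0] as [$text, $offset]) {
            $positions[] = ['offset' => $offset, 'parser' => $parserName, 'match' => $text];
        }
    }

    // Visit the positions from left to right
    usort($positions, static function (array $a, array $b): int {
        return $a['offset'] <=> $b['offset'];
    });

    return $positions;
}

// Illustrative patterns only; real parsers would supply their own definitions
$positions = findPositionsOfInterest('a -- b ... c', [
    'ellipsis' => '/\.{3}/',
    'dashes'   => '/-+/',
]);

foreach ($positions as $position) {
    echo $position['offset'], ' ', $position['parser'], ' ', $position['match'], PHP_EOL;
}
```

Everything between two reported offsets is known to be uninteresting, so it can be treated as plain text without running any further regexes, and the matched text is already available to hand to the parser.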

From the developer's perspective

This new approach does involve tweaking implementations of InlineParserInterface but those tweaks are very straightforward. At a minimum, you must:

  1. Change the getCharacters() method to getMatchDefinition() and change the return value
  2. Add a new $match argument to the parse() method

For example:

 final class PunctuationParser implements InlineParserInterface
 {
     /**
      * {@inheritdoc}
      */
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return ['-', '.'];
+        return InlineParserMatch::oneOf('-', '.');
     }

-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool

That's the bare minimum you'd need to do. However, that new $match argument will contain the text matched by whatever you defined in getMatchDefinition(), meaning you could further optimize your parsing method. Take this parser for example:

 final class TaskListItemMarkerParser implements InlineParserInterface
 {
     /**
      * {@inheritdoc}
      */
-    public function getCharacters(): array
+    public function getMatchDefinition(): InlineParserMatch
     {
-        return ['['];
+        return InlineParserMatch::oneOf('[ ]', '[x]');
     }
 
-    public function parse(InlineParserContext $inlineContext): bool
+    public function parse(string $match, InlineParserContext $inlineContext): bool
     {
         $container = $inlineContext->getContainer();

         // Checkbox must come at the beginning of the first paragraph of the list item
         if ($container->hasChildren() || ! ($container instanceof Paragraph && $container->parent() && $container->parent() instanceof ListItem)) {
             return false;
         }

         $cursor   = $inlineContext->getCursor();
         $oldState = $cursor->saveState();
-        $m = $cursor->match('/\[[ xX]\]/');
-        if ($m === null) {
-            return false;
-        }
+        $cursor->advanceBy(3);

         if ($cursor->getNextNonSpaceCharacter() === null) {
             $cursor->restoreState($oldState);
             return false;
         }

-        $isChecked = $m !== '[ ]';
+        $isChecked = $match !== '[ ]';
         $container->appendChild(new TaskListItemMarker($isChecked));

         return true;
     }
 }

With the new approach, parse() is only called when the InlineParserEngine finds one of those two strings (matched case-insensitively), so $match always contains the text that was matched and there is no need to re-match it ourselves.
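
As a rough sketch of why the case-insensitive matching matters here, the snippet below builds a "one of these strings" regex by hand and derives the checked state from the matched text alone. The buildOneOfRegex() function is a hypothetical stand-in written for this example; it is an assumption about how such a definition could be compiled, not the implementation of InlineParserMatch::oneOf().

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a "one of these strings" match definition.
 * This is an assumption for illustration, not the library's implementation.
 */
function buildOneOfRegex(string ...$strings): string
{
    // Escape each literal so characters like "[" lose their regex meaning
    $alternatives = array_map(
        static function (string $literal): string {
            return preg_quote($literal, '/');
        },
        $strings
    );

    // The "i" flag makes the whole alternation case-insensitive
    return '/' . implode('|', $alternatives) . '/i';
}

$regex = buildOneOfRegex('[ ]', '[x]');

$checkedStates = [];
foreach (['[ ] todo', '[x] done', '[X] also done'] as $line) {
    if (preg_match($regex, $line, $m) !== 1) {
        continue;
    }

    // Same rule as the diff above: anything other than "[ ]" counts as checked
    $match                = $m[0];
    $checkedStates[$line] = $match !== '[ ]';
}

var_dump($checkedStates);
```

Because the uppercase "[X]" is also found by the case-insensitive pattern, the old /\[[ xX]\]/ check is covered without the parser running its own regex.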

New possibilities

This new approach opens up some new possibilities like #514, and it also resolves #492.

@colinodell added the enhancement (New functionality or behavior) and performance (Something could be made faster or more efficient) labels Sep 26, 2020
@colinodell added this to the v2.0 milestone Sep 26, 2020
@colinodell self-assigned this Sep 26, 2020
Successfully merging this pull request may close these issues.

Autolink extension breaks some URLs too early