Revise inline parsing #549

Merged
merged 9 commits into from
Sep 26, 2020
16 changes: 16 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,17 +22,20 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `BlockStartParserInterface`
- `ChildNodeRendererInterface`
- `CursorState`
- `DelimiterParser`
- `DocumentBlockParser`
- `DocumentRenderedEvent`
- `HtmlRendererInterface`
- `InlineParserEngineInterface`
- `InlineParserMatch`
- `MarkdownParserState`
- `MarkdownParserStateInterface`
- `ReferenceableInterface`
- `RenderedContent`
- `RenderedContentInterface`
- Added several new methods:
- `Environment::setEventDispatcher()`
- `EnvironmentInterface::getInlineParsers()`
- `FencedCode::setInfo()`
- `Heading::setLevel()`
- `HtmlRenderer::renderDocument()`
Expand All @@ -58,10 +61,18 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `ConfigurableEnvironmentInterface::addBlockParser()` is now `ConfigurableEnvironmentInterface::addBlockParserFactory()`
- `ReferenceParser` was re-implemented and works completely differently than before
- The paragraph parser no longer needs to be added manually to the environment
- Implemented a new approach to inline parsing where parsers can now specify longer strings or regular expressions they want to parse (instead of just single characters):
- `InlineParserInterface::getCharacters()` is now `getMatchDefinition()` and returns an instance of `InlineParserMatch`
- `InlineParserInterface::parse()` has a new parameter containing the pre-matched text
- `InlineParserContext::__construct()` now requires the contents to be provided as a `Cursor` instead of a `string`
- Implemented delimiter parsing as a special type of inline parser (via the new `DelimiterParser` class)
- Changed block and inline rendering to use common methods and interfaces
- `BlockRendererInterface` and `InlineRendererInterface` were replaced by `NodeRendererInterface` with slightly different parameters. All core renderers now implement this interface.
- `ConfigurableEnvironmentInterface::addBlockRenderer()` and `addInlineRenderer()` are now just `addRenderer()`
- `EnvironmentInterface::getBlockRenderersForClass()` and `getInlineRenderersForClass()` are now just `getRenderersForClass()`
- Re-implemented the GFM Autolink extension using the new inline parser approach instead of document processors
- `EmailAutolinkProcessor` is now `EmailAutolinkParser`
- `UrlAutolinkProcessor` is now `UrlAutolinkParser`
- Combined separate classes/interfaces into one:
- `DisallowedRawHtmlRenderer` replaces `DisallowedRawHtmlBlockRenderer` and `DisallowedRawHtmlInlineRenderer`
- `NodeRendererInterface` replaces `BlockRendererInterface` and `InlineRendererInterface`
Expand Down Expand Up @@ -106,11 +117,14 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- Footnote event listeners now have numbered priorities (but still execute in the same order)
- Footnotes must now be separated from previous content by a blank line
- The line numbers (keys) returned via `MarkdownInput::getLines()` now start at 1 instead of 0
- `DelimiterProcessorCollectionInterface` now extends `Countable`
- `RegexHelper::PARTIAL_` constants must always be used in case-insensitive contexts

### Fixed

- Fixed parsing of footnotes without content
- Fixed rendering of orphaned footnotes and footnote refs
- Fixed some URL autolinks breaking too early (#492)

### Removed

Expand Down Expand Up @@ -159,6 +173,8 @@ See <https://commonmark.thephpleague.com/2.0/upgrading/> for detailed informatio
- `AbstractBlock::finalize()`
- `ConfigurableEnvironmentInterface::addBlockParser()`
- `Delimiter::setCanClose()`
- `EnvironmentInterface::getInlineParsersForCharacter()`
- `EnvironmentInterface::getInlineParserCharacterRegex()`
- `HtmlRenderer::renderBlock()`
- `HtmlRenderer::renderBlocks()`
- `HtmlRenderer::renderInline()`
59 changes: 36 additions & 23 deletions docs/2.0/customization/inline-parsing.md
Expand Up @@ -29,28 +29,43 @@ If your syntax looks like that, consider using a [delimiter processor](/2.0/cust

Inline parsers should implement `InlineParserInterface` and the following two methods:

### getCharacters()
### getMatchDefinition()

This method should return an array of single characters which the inline parser engine should stop on. When it does find a match in the current line the `parse()` method below may be called.
This method should return an instance of `InlineParserMatch` which defines the text the parser is looking for. Examples of this might be something like:

```php
use League\CommonMark\Parser\Inline\InlineParserMatch;

InlineParserMatch::string('@'); // Match any '@' characters found in the text
InlineParserMatch::string('foo'); // Match the text 'foo' (case insensitive)

InlineParserMatch::oneOf('@', '!'); // Match either character
InlineParserMatch::oneOf('http://', 'https://'); // Match either string

InlineParserMatch::regex('\d+'); // Match the regular expression (omit the regex delimiters and any flags)
```

Once a match is found, the `parse()` method below may be called.
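As a rough illustration of what a match definition boils down to, the sketch below turns a list of literal strings into a single case-insensitive regular expression and reports where each hit starts. It is not the library's implementation: `buildOneOfRegex()` and `findMatches()` are hypothetical helpers written only for this example.

```php
<?php

declare(strict_types=1);

// Hypothetical helper (not part of league/commonmark): combine literal
// strings into one case-insensitive alternation.
function buildOneOfRegex(string ...$needles): string
{
    $quoted = \array_map(static function (string $needle): string {
        return \preg_quote($needle, '/');
    }, $needles);

    return '/' . \implode('|', $quoted) . '/i';
}

// Hypothetical helper: returns [byte offset => matched text] for one line.
function findMatches(string $regex, string $line): array
{
    \preg_match_all($regex, $line, $matches, PREG_OFFSET_CAPTURE);

    $result = [];
    foreach ($matches[0] as [$text, $offset]) {
        $result[$offset] = $text;
    }

    return $result;
}

$regex = buildOneOfRegex('http://', 'https://');
$hits  = findMatches($regex, 'See https://example.com or HTTP://example.org');
// $hits === [4 => 'https://', 27 => 'HTTP://']
```

The matched text is kept exactly as it appears in the line (`HTTP://`, not `http://`), which is why a parser would want to receive it rather than assume its own spelling.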

### parse()

This method will be called if both conditions are met:

1. The engine has stopped at a matching character; and,
2. No other inline parsers have successfully parsed the character
1. The engine has found a matching string in the current line; and,
2. No other inline parsers with a [higher priority](/2.0/customization/environment/#addinlineparser) have successfully parsed the text at this point in the line
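The two conditions above can be pictured as a simple priority loop. The following is a minimal sketch, assuming that a higher number means a higher priority and that each priority holds one parser. `dispatch()` is a hypothetical function for illustration, not the engine's actual code.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical dispatcher: offers the matched text to each parser in
 * descending priority order and stops at the first one that accepts it.
 *
 * @param array<int, callable> $parsersByPriority priority => fn(string): bool
 * @param string[]             $plainText         collects text no parser handled
 *
 * @return int|null priority of the parser that handled the match, if any
 */
function dispatch(string $match, array $parsersByPriority, array &$plainText): ?int
{
    \krsort($parsersByPriority); // highest priority first

    foreach ($parsersByPriority as $priority => $parser) {
        if ($parser($match)) {
            return $priority;
        }
    }

    // Nobody handled it, so the text falls through as plain text
    $plainText[] = $match;

    return null;
}

$plain   = [];
$parsers = [
    0   => static function (string $match): bool {
        return $match === ':)';
    },
    100 => static function (string $match): bool {
        return false; // always declines
    },
];

$handledBy = dispatch(':)', $parsers, $plain); // 0
$fallback  = dispatch(':(', $parsers, $plain); // null, and $plain === [':(']
```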

#### Parameters

* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line.
* `string $match` - Contains the text that matches the start pattern from `getMatchDefinition()`
* `InlineParserContext $inlineContext` - Encapsulates the current state of the inline parser, including the [`Cursor`](/2.0/customization/cursor/) used to parse the current line. (Note that the cursor will be positioned **before** the matching text, so you must advance it yourself if you determine it's a valid match)

#### Return value

`parse()` should return `false` if it's unable to handle the current line/character for any reason. (The [`Cursor`](/2.0/customization/cursor/) state should be restored before returning false if modified). Other parsers will then have a chance to try parsing the line. If all registered parsers return false, the character will be added as plain text.
`parse()` should return `false` if it's unable to handle the text at the current position for any reason. Other parsers will then have a chance to try parsing that text. If all registered parsers return false, the text will be added as plain text.

Returning `true` tells the engine that you've successfully parsed the character (and related ones after it). It is your responsibility to:

1. Advance the cursor to the end of the parsed text
1. Advance the cursor to the end of the parsed/matched text
2. Add the parsed inline to the container (`$inlineContext->getContainer()->appendChild(...)`)
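The return-value contract can be sketched without the library at all. In the example below, `SketchCursor` is a stand-in written for this illustration (it is not `League\CommonMark\Parser\Cursor`), and the `$output` array stands in for `appendChild()`. The point is the ordering: validate first, and only move the cursor once the match is accepted.

```php
<?php

declare(strict_types=1);

// Stand-in cursor for illustration only
final class SketchCursor
{
    /** @var string */
    private $text;

    /** @var int */
    private $position;

    public function __construct(string $text, int $position = 0)
    {
        $this->text     = $text;
        $this->position = $position;
    }

    public function peek(int $offset = 0): ?string
    {
        $index = $this->position + $offset;

        return $index >= 0 && $index < \strlen($this->text) ? $this->text[$index] : null;
    }

    public function advanceBy(int $characters): void
    {
        $this->position += $characters;
    }

    public function getPosition(): int
    {
        return $this->position;
    }

    public function getRemainder(): string
    {
        return \substr($this->text, $this->position);
    }
}

/**
 * @param string[] $output collects parsed handles
 */
function parseHandle(string $match, SketchCursor $cursor, array &$output): bool
{
    // Bail out early: '@' must start the text or follow a space
    $previous = $cursor->peek(-1);
    if ($previous !== null && $previous !== ' ') {
        return false;
    }

    // Validate the text after the match without moving the cursor
    $rest = \substr($cursor->getRemainder(), \strlen($match));
    if (\preg_match('/^[A-Za-z0-9_]{1,15}/', $rest, $handle) !== 1) {
        return false;
    }

    // Success: advance past the match and the handle, then record the result
    $cursor->advanceBy(\strlen($match) + \strlen($handle[0]));
    $output[] = $handle[0];

    return true;
}

$nodes = [];

$good     = new SketchCursor('hi @colinodell!', 3);
$accepted = parseHandle('@', $good, $nodes); // true, cursor now at 14

$bad      = new SketchCursor('me@example.com', 2);
$rejected = parseHandle('@', $bad, $nodes);  // false, cursor still at 2
```

Because every rejection happens before `advanceBy()` is called, there is no cursor state to restore on the `false` paths.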

## Inline Parser Examples
Expand All @@ -65,15 +80,17 @@ Let's say you wanted to autolink Twitter handles without using the link syntax.
use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\Node\Inline\Link;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;

class TwitterHandleParser implements InlineParserInterface
{
public function getCharacters(): array
public function getMatchDefinition(): InlineParserMatch
{
return ['@'];
// Note that you could match the entire regex here instead of in parse() if you wish
return InlineParserMatch::string('@');
}
public function parse(InlineParserContext $inlineContext): bool
public function parse(string $match, InlineParserContext $inlineContext): bool
{
$cursor = $inlineContext->getCursor();
// The @ symbol must not have any other characters immediately prior
Expand Down Expand Up @@ -113,33 +130,27 @@ Let's say you want to automatically convert smilies (or "frownies") to emoticon
use League\CommonMark\Environment\Environment;
use League\CommonMark\Extension\CommonMark\Node\Inline\Image;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;

class SmilieParser implements InlineParserInterface
{
public function getCharacters(): array
public function getMatchDefinition(): InlineParserMatch
{
return [':'];
return InlineParserMatch::oneOf(':)', ':(');
}

public function parse(InlineParserContext $inlineContext): bool
public function parse(string $match, InlineParserContext $inlineContext): bool
{
$cursor = $inlineContext->getCursor();

// The next character must be a paren; if not, then bail
// We use peek() to quickly check without affecting the cursor
$nextChar = $cursor->peek();
if ($nextChar !== '(' && $nextChar !== ')') {
return false;
}

// Advance the cursor past the 2 matched chars since we're able to parse them successfully
$cursor->advanceBy(2);

// Add the corresponding image
if ($nextChar === ')') {
if ($match === ':)') {
$inlineContext->getContainer()->appendChild(new Image('/img/happy.png'));
} elseif ($nextChar === '(') {
} elseif ($match === ':(') {
$inlineContext->getContainer()->appendChild(new Image('/img/sad.png'));
}

Expand All @@ -153,6 +164,8 @@ $environment->addInlineParser(new SmilieParserParser());

## Tips

* For best performance, `return false` **as soon as possible**.
* For best performance:
* Avoid using overly-complex regular expressions in `getMatchDefinition()` - use the simplest regex you can and have `parse()` do the heavier validation
* Have your `parse()` method `return false` **as soon as possible**.
* You can `peek()` without modifying the cursor state. This makes it useful for validating nearby characters as it's quick and you can bail without needing to restore state.
* You can look at (and modify) any part of the AST if needed (via `$inlineContext->getContainer()`).
105 changes: 105 additions & 0 deletions src/Delimiter/DelimiterParser.php
@@ -0,0 +1,105 @@
<?php

declare(strict_types=1);

namespace League\CommonMark\Delimiter;

use League\CommonMark\Delimiter\Processor\DelimiterProcessorCollection;
use League\CommonMark\Delimiter\Processor\DelimiterProcessorInterface;
use League\CommonMark\Node\Inline\Text;
use League\CommonMark\Parser\Inline\InlineParserInterface;
use League\CommonMark\Parser\Inline\InlineParserMatch;
use League\CommonMark\Parser\InlineParserContext;
use League\CommonMark\Util\RegexHelper;

/**
* Delimiter parsing is implemented as an Inline Parser with the lowest-possible priority
*
* @internal
*/
final class DelimiterParser implements InlineParserInterface
{
/** @var DelimiterProcessorCollection */
private $collection;

public function __construct(DelimiterProcessorCollection $collection)
{
$this->collection = $collection;
}

public function getMatchDefinition(): InlineParserMatch
{
return InlineParserMatch::oneOf(...$this->collection->getDelimiterCharacters());
}

public function parse(string $match, InlineParserContext $inlineContext): bool
{
$character = $match;
$numDelims = 0;
$cursor = $inlineContext->getCursor();
$processor = $this->collection->getDelimiterProcessor($character);

if ($processor === null) {
throw new \LogicException('Delimiter processor should never be null here');
}

$charBefore = $cursor->peek(-1);
if ($charBefore === null) {
$charBefore = "\n";
}

while ($cursor->peek($numDelims) === $character) {
++$numDelims;
}

if ($numDelims < $processor->getMinLength()) {
return false;
}

$cursor->advanceBy($numDelims);

$charAfter = $cursor->getCharacter();
if ($charAfter === null) {
$charAfter = "\n";
}

[$canOpen, $canClose] = self::determineCanOpenOrClose($charBefore, $charAfter, $character, $processor);

$node = new Text(\str_repeat($character, $numDelims), [
'delim' => true,
]);
$inlineContext->getContainer()->appendChild($node);

// Add entry to stack for this opener
if ($canOpen || $canClose) {
$delimiter = new Delimiter($character, $numDelims, $node, $canOpen, $canClose);
$inlineContext->getDelimiterStack()->push($delimiter);
}

return true;
}

/**
* @return bool[]
*/
private static function determineCanOpenOrClose(string $charBefore, string $charAfter, string $character, DelimiterProcessorInterface $delimiterProcessor): array
{
$afterIsWhitespace = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charAfter);
$afterIsPunctuation = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charAfter);
$beforeIsWhitespace = \preg_match(RegexHelper::REGEX_UNICODE_WHITESPACE_CHAR, $charBefore);
$beforeIsPunctuation = \preg_match(RegexHelper::REGEX_PUNCTUATION, $charBefore);

$leftFlanking = ! $afterIsWhitespace && (! $afterIsPunctuation || $beforeIsWhitespace || $beforeIsPunctuation);
$rightFlanking = ! $beforeIsWhitespace && (! $beforeIsPunctuation || $afterIsWhitespace || $afterIsPunctuation);

if ($character === '_') {
$canOpen = $leftFlanking && (! $rightFlanking || $beforeIsPunctuation);
$canClose = $rightFlanking && (! $leftFlanking || $afterIsPunctuation);
} else {
$canOpen = $leftFlanking && $character === $delimiterProcessor->getOpeningCharacter();
$canClose = $rightFlanking && $character === $delimiterProcessor->getClosingCharacter();
}

return [$canOpen, $canClose];
}
}
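The flanking logic in `determineCanOpenOrClose()` can be exercised in isolation. The sketch below is a simplified, standalone approximation: `sketchFlanking()` is a hypothetical function, and its whitespace and punctuation patterns are assumptions standing in for the `RegexHelper` constants used above.

```php
<?php

declare(strict_types=1);

/**
 * Returns [leftFlanking, rightFlanking] for a delimiter run, given the
 * characters on either side of it. The character classes here are
 * simplified approximations, not the library's own patterns.
 *
 * @return bool[]
 */
function sketchFlanking(string $charBefore, string $charAfter): array
{
    $isWhitespace = static function (string $char): bool {
        return \preg_match('/^\s\z/u', $char) === 1;
    };
    $isPunctuation = static function (string $char): bool {
        return \preg_match('/^[\p{P}\p{S}]\z/u', $char) === 1;
    };

    $leftFlanking = ! $isWhitespace($charAfter)
        && (! $isPunctuation($charAfter) || $isWhitespace($charBefore) || $isPunctuation($charBefore));
    $rightFlanking = ! $isWhitespace($charBefore)
        && (! $isPunctuation($charBefore) || $isWhitespace($charAfter) || $isPunctuation($charAfter));

    return [$leftFlanking, $rightFlanking];
}

// "\n" stands in for the start or end of the line, as in parse() above
$opening = sketchFlanking("\n", 'f'); // `**foo`: [true, false]
$closing = sketchFlanking('o', "\n"); // `foo**`: [false, true]
$inner   = sketchFlanking('a', 'b');  // `a*b`:   [true, true]
$lonely  = sketchFlanking(' ', ' ');  // ` * `:   [false, false]
```

The `$inner` case shows why `_` gets its own branch: a run that is both left- and right-flanking may open or close for `*`, but for `_` it may only do so when the neighbouring character is punctuation.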
5 changes: 5 additions & 0 deletions src/Delimiter/Processor/DelimiterProcessorCollection.php
Expand Up @@ -79,4 +79,9 @@ private function addStaggeredDelimiterProcessorForChar(string $opening, Delimite
$s->add($new);
$this->processorsByChar[$opening] = $s;
}

public function count(): int
{
return \count($this->processorsByChar);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,7 @@

namespace League\CommonMark\Delimiter\Processor;

interface DelimiterProcessorCollectionInterface
interface DelimiterProcessorCollectionInterface extends \Countable
{
/**
* Add the given delim processor to the collection