Keyword extraction doesn't work when optional regexp precedes a keyword #1404

jonatanklosko · 2021-09-22T19:14:54Z

Let's assume a word operator as in 1 or 1. We don't want 1 or1 to be valid, so we use keyword extraction so that or1 is consumed at once. Here's the stripped down grammar:

module.exports = grammar({
  name: "test",

  word: ($) => $.identifier,

  rules: {
    source: ($) => seq(choice($.integer, $.sigil), "or", $.integer),

    identifier: ($) => /[a-zA-Z]\w+/,
    integer: ($) => /\d+/,
    sigil: ($) => seq("~", "()", optional(/[a-zA-Z]+/)),
  },
});

The meaning of sigil or its exact content is not particularly relevant, the noteworthy fact is the optional regexp at the end. Also, given the ~ and () this path should be quickly rejected when parsing 1 or1.

Even with keyword extraction, this grammar doesn't produce an error for 1 or1. The problem seems to be in the possible $.sigil left operand, more specifically in the optional regexp. As soon as we make the regexp not optional, the grammar properly errors for 1 or1.

(tree-sitter-cli: 0.20.0)

The text was updated successfully, but these errors were encountered:

maxbrunsfeld · 2021-09-22T19:29:08Z

I think it’s because your optional regex conflicts with ‘identifier’, which is your word token.

Is there a reason you don’t want to use ‘identifier’ as the optional part of ‘sigil’?

jonatanklosko · 2021-09-22T19:32:15Z

I actually simplified the identifier, the actual version is more like /[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C][\p{ID_Continue}]*[?!]?/u, while for sigil it can just be a sequence of letters. Using this more complex regexp the behaviour is the same, that's why I simplified.

jonatanklosko · 2021-09-22T19:35:14Z

On top of that, if we replace the optional regexp with optional(/[a]/) or even optional("a"), it still causes the same issue 🤔

maxbrunsfeld · 2021-09-22T19:59:46Z

Yeah, that makes sense, since all of those alphabetical tokens overlap with your identifier rule (which allows a sequence of letters and numbers).

The problem is when parsing this:

~()hello or 2

Without keyword extraction, there are no lexical conflicts. But suppose that Tree-sitter did perform keyword extraction on the or operator. Keyword extraction means that in any state where or is considered valid, identifier would now also be considered valid, for the purposes of the lexer. It would cause the hello to get tokenized as an identifier, which would result in a parse error. In order to avoid introducing this parse error, the keyword extraction does not apply to or.

You can probably tweak your grammar to make keyword extraction possible though. If you list identifier last in your grammar (after all other regexes), then (as per the lexical conflict docs) it would not take precedence over your sigil regex. So that might fix the problem.

Alternatively, as I mentioned above, I would just generalize sigil so that it uses identifier instead of that separate regex. If you want to check for validity of the suffix (because it is more restrictive than identifier), it would probably be better to do that in your own code so that you can produce a useful error message ("sigil suffixes must be alphabetical") instead of a syntax error at parse time.

jonatanklosko · 2021-09-22T21:19:37Z

Thanks for explaining, this makes sense :)

In the example above, moving identifier to the end does indeed solve the issue. There's another interesting bit, it still works when using an extended identifier like this:

module.exports = grammar({
  name: "test",

  word: ($) => $.identifier,

  rules: {
    source: ($) => seq(choice($.integer, $.sigil), "or", $.integer),

    integer: ($) => /\d+/,
    sigil: ($) => seq("~", "()", optional(/[a-zA-Z]+/)),
    identifier: ($) => /[\p{ID_Start}][\p{ID_Continue}]*[?!]?/u,
  },
});

However it no longer works when using /[\p{Ll}][\p{ID_Continue}]*[?!]?/u, which is curious because Ll is a subset of ID_Start. I realise this may be some weird edge case, but if you have any ideas please let me know :)

jonatanklosko · 2021-09-22T21:27:27Z

Ah \p{Ll}\p{Lu} works, so I think it's because with Ll alone it no longer overlaps with /[a-zA-Z]+/. Does Tree-sitter check if regexes overlap like that?

jonatanklosko · 2021-09-23T23:42:52Z

Either way, gonna close the issue since it's been answered :) Thanks for all the help!

jonatanklosko closed this as completed Sep 23, 2021

ahlinc added the question label Sep 24, 2021

jonatanklosko mentioned this issue Jun 2, 2023

"when" is treated as an operator even without a space after elixir-lang/tree-sitter-elixir#53

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Keyword extraction doesn't work when optional regexp precedes a keyword #1404

Keyword extraction doesn't work when optional regexp precedes a keyword #1404

jonatanklosko commented Sep 22, 2021 •

edited

Loading

maxbrunsfeld commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021 •

edited

Loading

jonatanklosko commented Sep 22, 2021

maxbrunsfeld commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021

jonatanklosko commented Sep 23, 2021

Keyword extraction doesn't work when optional regexp precedes a keyword #1404

Keyword extraction doesn't work when optional regexp precedes a keyword #1404

Comments

jonatanklosko commented Sep 22, 2021 • edited Loading

maxbrunsfeld commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021 • edited Loading

jonatanklosko commented Sep 22, 2021

maxbrunsfeld commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021

jonatanklosko commented Sep 22, 2021

jonatanklosko commented Sep 23, 2021

jonatanklosko commented Sep 22, 2021 •

edited

Loading

jonatanklosko commented Sep 22, 2021 •

edited

Loading