Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keyword extraction doesn't work when optional regexp precedes a keyword #1404

Closed
jonatanklosko opened this issue Sep 22, 2021 · 7 comments
Closed
Labels

Comments

@jonatanklosko
Copy link

jonatanklosko commented Sep 22, 2021

Let's assume a word operator as in 1 or 1. We don't want 1 or1 to be valid, so we use keyword extraction so that or1 is consumed at once. Here's the stripped down grammar:

module.exports = grammar({
  name: "test",

  word: ($) => $.identifier,

  rules: {
    source: ($) => seq(choice($.integer, $.sigil), "or", $.integer),

    identifier: ($) => /[a-zA-Z]\w+/,
    integer: ($) => /\d+/,
    sigil: ($) => seq("~", "()", optional(/[a-zA-Z]+/)),
  },
});

The meaning of sigil or its exact content is not particularly relevant, the noteworthy fact is the optional regexp at the end. Also, given the ~ and () this path should be quickly rejected when parsing 1 or1.

Even with keyword extraction, this grammar doesn't produce an error for 1 or1. The problem seems to be in the possible $.sigil left operand, more specifically in the optional regexp. As soon as we make the regexp not optional, the grammar properly errors for 1 or1.

(tree-sitter-cli: 0.20.0)

@maxbrunsfeld
Copy link
Contributor

I think it’s because your optional regex conflicts with ‘identifier’, which is your word token.

Is there a reason you don’t want to use ‘identifier’ as the optional part of ‘sigil’?

@jonatanklosko
Copy link
Author

jonatanklosko commented Sep 22, 2021

I actually simplified the identifier, the actual version is more like /[\p{Ll}\p{Lm}\p{Lo}\p{Nl}\u1885\u1886\u2118\u212E\u309B\u309C][\p{ID_Continue}]*[?!]?/u, while for sigil it can just be a sequence of letters. Using this more complex regexp the behaviour is the same, that's why I simplified.

@jonatanklosko
Copy link
Author

On top of that, if we replace the optional regexp with optional(/[a]/) or even optional("a"), it still causes the same issue 🤔

@maxbrunsfeld
Copy link
Contributor

Yeah, that makes sense, since all of those alphabetical tokens overlap with your identifier rule (which allows a sequence of letters and numbers).

The problem is when parsing this:

~()hello or 2

Without keyword extraction, there are no lexical conflicts. But suppose that Tree-sitter did perform keyword extraction on the or operator. Keyword extraction means that in any state where or is considered valid, identifier would now also be considered valid, for the purposes of the lexer. It would cause the hello to get tokenized as an identifier, which would result in a parse error. In order to avoid introducing this parse error, the keyword extraction does not apply to or.

You can probably tweak your grammar to make keyword extraction possible though. If you list identifier last in your grammar (after all other regexes), then (as per the lexical conflict docs) it would not take precedence over your sigil regex. So that might fix the problem.

Alternatively, as I mentioned above, I would just generalize sigil so that it uses identifier instead of that separate regex. If you want to check for validity of the suffix (because it is more restrictive than identifier), it would probably be better to do that in your own code so that you can produce a useful error message ("sigil suffixes must be alphabetical") instead of a syntax error at parse time.

@jonatanklosko
Copy link
Author

Thanks for explaining, this makes sense :)

In the example above, moving identifier to the end does indeed solve the issue. There's another interesting bit, it still works when using an extended identifier like this:

module.exports = grammar({
  name: "test",

  word: ($) => $.identifier,

  rules: {
    source: ($) => seq(choice($.integer, $.sigil), "or", $.integer),

    integer: ($) => /\d+/,
    sigil: ($) => seq("~", "()", optional(/[a-zA-Z]+/)),
    identifier: ($) => /[\p{ID_Start}][\p{ID_Continue}]*[?!]?/u,
  },
});

However it no longer works when using /[\p{Ll}][\p{ID_Continue}]*[?!]?/u, which is curious because Ll is a subset of ID_Start. I realise this may be some weird edge case, but if you have any ideas please let me know :)

@jonatanklosko
Copy link
Author

Ah \p{Ll}\p{Lu} works, so I think it's because with Ll alone it no longer overlaps with /[a-zA-Z]+/. Does Tree-sitter check if regexes overlap like that?

@jonatanklosko
Copy link
Author

Either way, gonna close the issue since it's been answered :) Thanks for all the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants