Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Precomputing JS regexes for perf #878

Closed
slevithan opened this issue Dec 28, 2024 · 2 comments · Fixed by #880
Closed

Precomputing JS regexes for perf #878

slevithan opened this issue Dec 28, 2024 · 2 comments · Fixed by #880

Comments

@slevithan
Copy link
Collaborator

slevithan commented Dec 28, 2024

Do you think it would be a good idea to precompute regexes used by the JS engine so that the regex transpilation step isn't needed at runtime? Perhaps this would be optional. Presumably, it would cause some challenges for the fine-grained bundle approach, but maybe that could be avoided/simplified in a new major Shiki version.

With some extra, somewhat-complex work in the build script for grammars, this could also make the JS regexes used by some grammars faster, by avoiding the use of a RegExp subclass when it's not needed (details follow).


oniguruma-to-es's toRegExp outputs a RegExp, which might be a regexp literal or might be an instance of its EmulatedRegExp subclass. The subclass has some minor perf overhead, which presumably can add up for large grammars applied to large source files.

oniguruma-to-es uses EmulatedRegExp for two kinds of cases:

  1. Regexes that wouldn't be emulatable otherwise. This applies to many but not all uses of \G.
  2. Regexes that would match the same strings without the subclass, but might then have differences compared to Oniguruma for subpattern details on the result array (visible from outside the regex on subpattern values by index, groups, indices, and indices.groups). This applies when using atomic groups, possessive quantifiers, subroutines, and some uses of recursion. In some cases the subclass simply hides injected "emulation groups" from results, and in other cases it's more sophisticated, transferring subpattern match values/indices in order to accurately emulate complex edge-case effects of Oniguruma subroutines.

oniguruma-to-es includes an option (avoidSubclass) that avoids the use of a subclass (and any changes to the pattern that would rely on the subclass to be interpreted correctly). This is relevant to perf because, although enabling avoidSubclass for regexes in case 1 (above) results in regexes not being emulatable (and throwing), for case-2 regexes it results in faster regexes that still match the same strings.

The trick is that, if a particular regex in a grammar doesn't reference subpattern matches from outside the regex to do any additional handling, you can get a safe perf increase by not using the subclass. The build step that precomputes regexes could determine this based on whether the grammar refers to subpatterns for a given regex (using properties captures, beginCaptures, etc. in the JSON). If it doesn't, you could either:

  • Check whether toRegExp throws with avoidSubclass enabled. Or...
  • Use toDetails instead of toRegExp and check whether the result includes .options.strategy. If so, EmulatedRegExp needs to be used as a constructor to handle case 1, but otherwise you can run it again with avoidSubclass and then just use a regexp literal or RegExp as the constructor. Or...
  • As of oniguruma-to-es v0.10.0, you can check for .rawArgs.options.strategy on EmulatedRegExp instances. If present, you can use .toString() (or .source and .flags) and use those to construct a RegExp without the subclass.

This extra handling when precomputing regexes would be totally optional, but should provide a meaningful perf increase for some grammars.

Perhaps this perf optimization could even be built into the current JS engine (without the need to precompile regexes), if the patterns passed to its JavaScriptScanner could include data about whether the patterns will have their subpatterns referred to from outside the regexes.


Feel free to close this if you think it's not the right road to go down for whatever reason.

Aside: Just let me know @antfu if you're ever interested in being added as a collaborator for oniguruma-to-es. I'd be happy for you to make even fundamental changes that you think would be helpful.

@antfu
Copy link
Member

antfu commented Dec 30, 2024

Yeah, that's something I have been thinking about for a while but haven't come up with good ideas to handle it well. The main challenge is that it will introduce two versions of incompatible grammars, and if we could communicate well with the usage. I'll keep an eye on that.

if you're ever interested in being added as a collaborator for oniguruma-to-es

Definitely! Thank you! Tho I am not sure if I would ever be able to help :P Seeing how you work I won't say I know anything about RegExp 🤣

@slevithan
Copy link
Collaborator Author

Very cool to see this landing. 🎉 Re: the potential perf optimization described above, see the review comment I added here #880 (review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants