Precomputing JS regexes for perf #878

slevithan · 2024-12-28T17:13:52Z

Do you think it would be a good idea to precompute regexes used by the JS engine so that the regex transpilation step isn't needed at runtime? Perhaps this would be optional. Presumably, it would cause some challenges for the fine-grained bundle approach, but maybe that could be avoided/simplified in a new major Shiki version.

With some extra, somewhat-complex work in the build script for grammars, this could also make the JS regexes used by some grammars faster, by avoiding the use of a RegExp subclass when it's not needed (details follow).

oniguruma-to-es's toRegExp outputs a RegExp, which might be a regexp literal or might be an instance of its EmulatedRegExp subclass. The subclass has some minor perf overhead, which presumably can add up for large grammars applied to large source files.

oniguruma-to-es uses EmulatedRegExp for two kinds of cases:

Regexes that wouldn't be emulatable otherwise. This applies to many but not all uses of \G.
Regexes that would match the same strings without the subclass, but might then have differences compared to Oniguruma for subpattern details on the result array (visible from outside the regex on subpattern values by index, groups, indices, and indices.groups). This applies when using atomic groups, possessive quantifiers, subroutines, and some uses of recursion. In some cases the subclass simply hides injected "emulation groups" from results, and in other cases it's more sophisticated, transferring subpattern match values/indices in order to accurately emulate complex edge-case effects of Oniguruma subroutines.

oniguruma-to-es includes an option (avoidSubclass) that avoids the use of a subclass (and any changes to the pattern that would rely on the subclass to be interpreted correctly). This is relevant to perf because, although enabling avoidSubclass for regexes in case 1 (above) results in regexes not being emulatable (and throwing), for case-2 regexes it results in faster regexes that still match the same strings.

The trick is that, if a particular regex in a grammar doesn't reference subpattern matches from outside the regex to do any additional handling, you can get a safe perf increase by not using the subclass. The build step that precomputes regexes could determine this based on whether the grammar refers to subpatterns for a given regex (using properties captures, beginCaptures, etc. in the JSON). If it doesn't, you could either:

Check whether toRegExp throws with avoidSubclass enabled. Or...
Use toDetails instead of toRegExp and check whether the result includes .options.strategy. If so, EmulatedRegExp needs to be used as a constructor to handle case 1, but otherwise you can run it again with avoidSubclass and then just use a regexp literal or RegExp as the constructor. Or...
As of oniguruma-to-es v0.10.0, you can check for .rawArgs.options.strategy on EmulatedRegExp instances. If present, you can use .toString() (or .source and .flags) and use those to construct a RegExp without the subclass.

This extra handling when precomputing regexes would be totally optional, but should provide a meaningful perf increase for some grammars.

Perhaps this perf optimization could even be built into the current JS engine (without the need to precompile regexes), if the patterns passed to its JavaScriptScanner could include data about whether the patterns will have their subpatterns referred to from outside the regexes.

Feel free to close this if you think it's not the right road to go down for whatever reason.

Aside: Just let me know @antfu if you're ever interested in being added as a collaborator for oniguruma-to-es. I'd be happy for you to make even fundamental changes that you think would be helpful.

The text was updated successfully, but these errors were encountered:

antfu · 2024-12-30T02:27:48Z

Yeah, that's something I have been thinking about for a while but haven't come up with good ideas to handle it well. The main challenge is that it will introduce two versions of incompatible grammars, and if we could communicate well with the usage. I'll keep an eye on that.

if you're ever interested in being added as a collaborator for oniguruma-to-es

Definitely! Thank you! Tho I am not sure if I would ever be able to help :P Seeing how you work I won't say I know anything about RegExp 🤣

slevithan · 2025-01-02T10:50:55Z

Very cool to see this landing. 🎉 Re: the potential perf optimization described above, see the review comment I added here #880 (review)

antfu closed this as completed in #880 Jan 2, 2025

This was referenced Jan 4, 2025

Potential features based on oniguruma-to-es RedCMD/TmLanguage-Syntax-Highlighter#18

Open

Run regexes for TM grammars in native JS for perf microsoft/vscode#237537

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Precomputing JS regexes for perf #878

Precomputing JS regexes for perf #878

slevithan commented Dec 28, 2024 •

edited

Loading

antfu commented Dec 30, 2024

slevithan commented Jan 2, 2025

Precomputing JS regexes for perf #878

Precomputing JS regexes for perf #878

Comments

slevithan commented Dec 28, 2024 • edited Loading

antfu commented Dec 30, 2024

slevithan commented Jan 2, 2025

slevithan commented Dec 28, 2024 •

edited

Loading