More Regex perf improvements #1348
Conversation
(force-pushed from aefaf7e to ba8a5c7)
@wfurt, several legs are failing here with a stack overflow in the NetworkInformation tests. Can you please take a look?
NetworkInformation seems busted across the board. I will take a look. As for the OSX OpenSSL failures, this may be the version mismatch discussed in #1129.
@stephentoub there are several independent optimizations here -- would it make sense to merge-commit this rather than squash? If an issue occurs, it would be helpful to be able to bisect, assuming each commit is functional.
Up to you. My default would be to squash, but we can merge if you like. I can try to clean up the commits to squash some of them individually and make it so that they're actually bisectable, but I'm not sure how much work that'll be.
It's large enough that I think it would be worth not squashing, and I think that still has value even if you don't feel like spending more time making them bisectable. (It would just mean we would have to make them bisectable at that point, but we'd have a start on it.)
2bf275e
to
ee0ea7c
Compare
Just a couple comments.
Resolved review threads:
- ...ibraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCharClass.cs
- src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCode.cs
- src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexFCD.cs
- src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
- src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs
(force-pushed from 429d659 to ec7f4d3)
Just as a final example to highlight the differences in code gen... Here's one of the regular expressions in the regex redux test. With master, the generated code is:

```C#
string runtext = base.runtext;
int runtextstart = runtextstart;
int runtextbeg = runtextbeg;
int runtextend = base.runtextend;
int num = runtextpos;
int[] runtrack = base.runtrack;
int runtrackpos = base.runtrackpos;
int[] runstack = base.runstack;
int runstackpos = base.runstackpos;
runtrack[--runtrackpos] = num;
runtrack[--runtrackpos] = 0;
runstack[--runstackpos] = num;
runtrack[--runtrackpos] = 1;
int num4;
if (num < runtextend && runtext[num++] == '<')
{
    int num3;
    num4 = (num3 = runtextend - num) + 1;
    while (--num4 > 0)
    {
        if (runtext[num++] == '>')
        {
            num--;
            break;
        }
    }
    if (num3 > num4)
    {
        runtrack[--runtrackpos] = num3 - num4 - 1;
        runtrack[--runtrackpos] = num - 1;
        runtrack[--runtrackpos] = 2;
    }
    goto IL_00f2;
}
goto IL_0147;
IL_00f2:
if (num < runtextend && runtext[num++] == '>')
{
    num4 = runstack[runstackpos];
    Capture(0, num4, num);
    runtrack[--runtrackpos] = num4;
    runtrack[runtrackpos - 1] = 3;
    goto IL_013e;
}
goto IL_0147;
IL_0147:
while (true)
{
    base.runtrackpos = runtrackpos;
    base.runstackpos = runstackpos;
    EnsureStorage();
    runtrackpos = base.runtrackpos;
    runstackpos = base.runstackpos;
    runtrack = base.runtrack;
    runstack = base.runstack;
    switch (runtrack[runtrackpos++])
    {
        case 1:
            runstackpos++;
            continue;
        case 2:
            goto IL_01b8;
        case 3:
            runstack[--runstackpos] = runtrack[runtrackpos++];
            Uncapture();
            continue;
    }
    break;
}
num = runtrack[runtrackpos];
goto IL_013e;
IL_013e:
runtextpos = num;
return;
IL_01b8:
num = runtrack[runtrackpos++];
num4 = runtrack[runtrackpos++];
if (num4 > 0)
{
    runtrack[--runtrackpos] = num4 - 1;
    runtrack[--runtrackpos] = num - 1;
    runtrack[--runtrackpos] = 2;
}
goto IL_00f2;
```

With this PR, the generated code is:

```C#
string runtext = base.runtext;
int runtextpos;
int num = runtextpos = base.runtextpos;
ReadOnlySpan<char> readOnlySpan = runtext.AsSpan(num, runtextend - num);
if (0u < (uint)readOnlySpan.Length && readOnlySpan[0] == '<')
{
    int num2 = readOnlySpan.Slice(1).IndexOf<char>('>');
    if (num2 == -1)
    {
        num2 = readOnlySpan.Length - 1;
    }
    readOnlySpan = readOnlySpan.Slice(num2);
    num += num2;
    if (1u < (uint)readOnlySpan.Length && readOnlySpan[1] == '>')
    {
        base.runtextpos = num + 2;
        Capture(0, runtextpos, base.runtextpos);
    }
}
```
That's way more readable... I can actually read it.
I'd previously removed multiple delegate allocations that were being incurred every time we needed to run a compiled regex concurrently. In doing so, however, I inadvertently forced JIT'ing to occur when the Regex was created rather than on first use; that can in turn lead to more start-up overhead for regexes that are created on startup (e.g. to initialize a static field) but that aren't used at start-up, as well as reduce potential parallelism that could otherwise occur for distinct regexes being run for the first time concurrently. This addresses the issue by still caching delegates, but in addition to, rather than instead of, the DynamicMethod objects.
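A minimal sketch of the pattern described, not the PR's actual code (the type, field, and delegate names here are illustrative): keep the eagerly-built DynamicMethod, but create the delegate from it lazily so that compilation costs land on first use.

```C#
using System;
using System.Reflection.Emit;
using System.Threading;

// Illustrative sketch: cache both the DynamicMethod and the delegate created
// from it. Creating/invoking the delegate is what triggers JIT'ing, so
// deferring it to first use keeps Regex construction cheap.
internal sealed class CompiledRunnerSketch
{
    private readonly DynamicMethod _goMethod; // built eagerly when the Regex is constructed
    private Action<string>? _go;              // delegate created lazily on first use

    public CompiledRunnerSketch(DynamicMethod goMethod) => _goMethod = goMethod;

    public void Go(string input)
    {
        Action<string>? go = Volatile.Read(ref _go);
        if (go is null)
        {
            go = (Action<string>)_goMethod.CreateDelegate(typeof(Action<string>));
            // Benign race: if another thread won, both delegates are equivalent.
            Interlocked.CompareExchange(ref _go, go, null);
        }
        go(input);
    }
}
```

Because the DynamicMethod is still cached, the delegate can be recreated cheaply per runner while distinct regexes JIT in parallel on their respective first uses.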
I had removed it because it seemed illogical that anyone would write such character classes. But it turns out the implementation/parser itself generates them for certain forms, e.g. "hello|hithere" will create a character class containing just 'h' in order to find the start of the alternation.
- Only use ilg from "macro" helpers
- Move some blocks of code around
Takes advantage of vectorization in string.IndexOf.
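To illustrate (not the PR's code) the kind of scan this enables: finding the next delimiter with the vectorized `IndexOf` rather than a per-character loop, as the generated code above does for `'>'`.

```C#
using System;

class IndexOfScan
{
    // Illustrative helper: advance pos to the next '>' in text.
    // Span-based IndexOf uses vectorized (SIMD) comparisons internally,
    // unlike the char-by-char loop the old codegen emitted.
    static int FindClose(ReadOnlySpan<char> text, int pos)
    {
        int i = text.Slice(pos).IndexOf('>');
        return i >= 0 ? pos + i : -1;
    }

    static void Main()
    {
        Console.WriteLine(FindClose("<div class=x>", 1)); // prints 12
    }
}
```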
The JIT generates slightly better code for:

```C#
text[pos] = ...;
pos++;
```

instead of:

```C#
text[pos++] = ...;
```

so use the IL equivalent of the former.
Only spill if we have to grow the relevant arrays.
In an expression like A*B, backtracking may be involved even if it's impossible that it'll do any good, namely if there's no overlap between the A and B sets. This backtracking can be avoided by using a greedy subexpression, e.g. (?>A*)B, but a) few developers think to do that, and b) even when they do, the implementation has more overhead than is desirable. This change does three things:
1. It introduces primitives for "oneloopgreedy" and "setloopgreedy" that correspond to the existing "oneloop" (a single character in a loop) and "setloop" (a char class in a loop) primitives.
2. It reduces a Greedy node that contains a one/setloop to instead just be a one/setloopgreedy node.
3. It finds cases of one/setloop concatenated with other things, and automatically updates those to be one/setloopgreedy when it can prove that the backtracking would have no benefit.
(2) removes a small amount of overhead, in particular in compiled, because it avoids needing to emit additional tracking state for backtracking. (3) can likely be expanded in the future to catch more cases, but it handles the common cases now. This has a very measurable impact on non-matches that would otherwise trigger backtracking into the subexpression.
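The semantics involved can be seen through the public pattern syntax (the internal one/setloopgreedy nodes aren't exposed): an atomic loop gives nothing back, so it only behaves identically to a greedy loop when what follows can't overlap with the loop, which is exactly the condition the reduction checks.

```C#
using System;
using System.Text.RegularExpressions;

class AtomicDemo
{
    static void Main()
    {
        // Greedy \d* consumes "123", then backtracks one char so the final '3' can match.
        Console.WriteLine(Regex.IsMatch("123", @"^\d*3$"));          // True

        // Atomic (?>\d*) consumes "123" and gives nothing back, so '3' fails.
        Console.WriteLine(Regex.IsMatch("123", @"^(?>\d*)3$"));      // False

        // With no overlap between the loop set and what follows, atomicity is
        // harmless -- the case the reduction detects and applies automatically.
        Console.WriteLine(Regex.IsMatch("123abc", @"^(?>\d*)abc$")); // True
    }
}
```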
When written as a*, these are already transformed appropriately by the parser, but other reductions can expose them in a way that the parser won't see. For example, (?:a)* ends up being a much more expensive regex even though it's functionally identical. This reduces the latter to the former, for cases where a {Lazy}loop is found to directly wrap a Set/One/Notone.
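A small illustration of the equivalence this reduction exploits (results are identical on every input; the reduction only changes the internal representation):

```C#
using System;
using System.Text.RegularExpressions;

class LoopReduction
{
    // Functionally identical patterns; after the reduction, (?:a)* is
    // represented with the same loop node that a* already gets.
    static void Main()
    {
        foreach (string input in new[] { "", "a", "aaa", "b" })
        {
            string direct  = Regex.Match(input, "^a*").Value;
            string wrapped = Regex.Match(input, "^(?:a)*").Value;
            Console.WriteLine($"'{input}' -> '{direct}' == '{wrapped}': {direct == wrapped}");
        }
    }
}
```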
When we can determine from the regex that backtracking isn't needed, we can generate much smaller / better code in RegexCompiler.GenerateGo.
I initially mistakenly thought this wouldn't be useful, but it is. For example, the expression ".*\n" can be made non-backtracking.
With all the changes being made in the regex implementation in this release, we want to make sure we don't regress, and a good way to help with that is to add a bunch of tests and ensure they pass on netfx as well, where appropriate.
The Boolean logic was previously wrong, leading to characters being included in the set even if they weren't meant to be, e.g. given the set [^bc], this should include everything other than b and c, but if invariant culture was being used, it incorrectly included B and C in the set.
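Shown via the public API, with the expected results per the corrected behavior described above (with IgnoreCase, the negated set must also exclude the other-cased characters):

```C#
using System;
using System.Text.RegularExpressions;

class NegatedSetCasing
{
    static void Main()
    {
        const RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.CultureInvariant;

        // With IgnoreCase, [^bc] must reject 'B' and 'C' as well; the bug
        // made them match when CultureInvariant was also specified.
        Console.WriteLine(Regex.IsMatch("B", "^[^bc]$", opts)); // False (after the fix)
        Console.WriteLine(Regex.IsMatch("a", "^[^bc]$", opts)); // True
    }
}
```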
The greedy naming is really confusing (we should fix it in the framework documentation as well). In general regular expression nomenclature, "greedy" is used as the opposite of "lazy", indicating how much a loop initially consumes, e.g. does `a*` first try consuming as many 'a's as possible or does it first try consuming as few 'a's as possible... it's "greedy" (denoted as `a*`) if it consumes as many as possible and "lazy" (denoted as `a*?`) if it consumes as few. How aggressively it consumes, however, is orthogonal to whether it backtracks. Whereas `a*` is greedy and backtracking and `a*?` is lazy and backtracking, `(?>a*)` is greedy and non-backtracking and `(?>a*?)` is lazy and non-backtracking. Unfortunately, the nomenclature in the implementation and the documentation describes the `(?> ... )` as being a "greedy subexpression", which then conflates the meaning of "greedy". The rest of the industry refers to these instead as "atomic", so I've changed it to that in the implementation.
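The two orthogonal axes can be demonstrated directly:

```C#
using System;
using System.Text.RegularExpressions;

class Nomenclature
{
    static void Main()
    {
        // Greedy vs lazy controls how much the loop initially consumes...
        Console.WriteLine(Regex.Match("aaa", "a*").Value);        // "aaa" (greedy)
        Console.WriteLine(Regex.Match("aaa", "a*?").Value);       // ""    (lazy)

        // ...while atomic (?> ... ) controls whether it backtracks.
        Console.WriteLine(Regex.Match("aaab", "(?>a*)b").Value);  // "aaab"
        Console.WriteLine(Regex.Match("aaa", "(?>a*)a").Success); // False: no give-back
    }
}
```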
- In Go, improve code generation for several constructs to reduce code size and increase bounds reduction in the non-backtracking compiler.
- In FindFirstChars, use IndexOf instead of Boyer-Moore if the found prefix is only one character in length.
- Hide some debug-only functions from code coverage.
(force-pushed from ec7f4d3 to 95965c9)
Merging with a merge commit per Dan's request.
Found some time over my vacation to play more with our regex implementation.
Primary changes here:
- Consider `\d{1,2}\/\d{1,2}\/\d{2,4}` as an approach to parsing a date... each of those repeaters will backtrack when a match fails, even though such backtracking is useless: since a `/` can never match `\d`, there's no way here that backtracking will help. A developer could instead write this using atomic groups, e.g. `(?>\d{1,2})\/(?>\d{1,2})\/(?>\d{2,4})`, but few developers think to do that. Instead, we can detect these conditions and automatically change the regex to make these atomic. For `\d{1,2}\/\d{1,2}\/\d{2,4}`: with the code in master, the JIT ends up generating 2137 bytes of asm for the Go method, and with this PR, that shrinks to 758 bytes. As another example, consider a regex that just matches the string "abcdefghijklmnopqrstuvwxyz". With master, that results in a Go method that's 1204 bytes of asm, and the unrolled per-character checks for the first six characters show why it's so much larger. In contrast, with this PR, it's only 384 bytes, with much tighter matching code for those first six characters.
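Since this rewrite must preserve semantics, the plain and atomic forms of the date pattern behave identically on matching and non-matching inputs alike (anchored here for clarity):

```C#
using System;
using System.Text.RegularExpressions;

class DateAtomic
{
    static void Main()
    {
        string plain  = @"^\d{1,2}\/\d{1,2}\/\d{2,4}$";
        string atomic = @"^(?>\d{1,2})\/(?>\d{1,2})\/(?>\d{2,4})$";

        // The first two inputs match under both forms; the last two match under neither,
        // because '/' can never match \d, so giving back digits can never help.
        foreach (string s in new[] { "1/2/2020", "12/31/99", "1/2/3", "12-31-99" })
        {
            Console.WriteLine($"{s,-10} plain={Regex.IsMatch(s, plain)} atomic={Regex.IsMatch(s, atomic)}");
        }
    }
}
```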
- An expression like `.*` will end up searching for `\n`, and in the non-backtracking implementation we now use `Span.IndexOf` for that.
- In general regex nomenclature, "greedy" refers to a loop like `.*` as opposed to a lazy one like `.*?`. I've renamed all of the relevant usage in the implementation from greedy to atomic, leaving the use of greedy in cases where it was referring to the greedy vs lazy case. I think we should also update the docs to refer to atomic groups / subexpressions rather than greedy groups / subexpressions, as the former is established terminology.

Other minor changes:
- The JIT generates slightly better code for `span[i] = x; i++;` than it does for `span[i++] = x;`; since we're manually writing out the IL, we can just do it for the former instead of the latter.

Running the perf tests from dotnet/performance#1117 on .NET Core 3.1, master, and this PR:
I also tried out the regex redux code from the perf repo, but on the full input5000000.txt input rather than the much reduced one it uses. Results:
There is still a lot more that can / should be done. I've opened an issue (#1349) to track that work, and to hopefully enlist help in working through the list, and also finding additional improvement opportunities.
cc: @danmosemsft, @ViktorHofer, @eerhardt, @davidwrighton