Fix inadvertently case sensitive Boyer-Moore #39420
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.
We need to get this into Preview 8.
Separate from this PR, it would probably be good to add a test that searches for a random string in a text that may or may not contain it, and compares the compiled result with the non-compiled one. Such a test could have found this bug quickly and might protect us against others. The comparison with non-compiled is interesting because the two implementations are so different.
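A rough sketch of that kind of differential test, written in Python rather than against the real regex engine. The names `reference_contains` and `differential_check` are hypothetical; in the actual test the two sides would be the same pattern run with and without RegexOptions.Compiled. The alphabet is kept small on purpose so that random patterns have a realistic chance of appearing in the text.

```python
import random


def reference_contains(text: str, pattern: str) -> bool:
    """Trusted but simple implementation: plain case-insensitive search."""
    return pattern.lower() in text.lower()


def differential_check(candidate, trials=2000, seed=0, alphabet="aAbB#"):
    """Return the first (text, pattern) where `candidate` disagrees with
    the reference, or None if no disagreement is found."""
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        pattern = "".join(rng.choice(alphabet) for _ in range(rng.randint(1, 3)))
        if candidate(text, pattern) != reference_contains(text, pattern):
            return text, pattern
    return None
```

Passing an implementation that forgets to ignore case, for example `lambda text, pattern: pattern in text`, should make `differential_check` return a small counterexample instead of None.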
...libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs
Nice! LGTM!
Fix #39390
In this case the pattern "H#" would fail to match "#H#" if and only if both options were specified: RegexOptions.IgnoreCase | RegexOptions.Compiled.
Because the pattern contains a literal prefix (indeed it is the entire pattern) we use Boyer-Moore to find the first instance of it. (One could imagine a more efficient way to search for a 2-character prefix.) Because IgnoreCase was passed, we lowercase the pattern immediately to "h#", and when we match against a character in the text, we must lowercase that character before comparing it.
As a performance optimization, in the Compiled path, we avoid calling ToLower on the text candidate if we can cheaply verify that the character we are searching for is not affected by case conversion. In this case, for example, we need not bother to lowercase the text candidate character when we are searching for "#" because it is in a UnicodeCategory ("OtherPunctuation") which we know is not affected by case conversion. This optimization, like many others, does not exist in the non-Compiled path.
The bug was that when deciding whether to lowercase the text candidate, we examined the last character of the prefix rather than the character we were actually searching for. In this repro case the last character is "#", so when comparing the text character "H" against the lowercased pattern character "h" we skipped the lowercasing and the comparison failed.
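To make the failure concrete, here is a simplified model of the comparison in Python. It is only a sketch under stated assumptions, not the RegexCompiler code: it scans linearly and omits the Boyer-Moore skip tables, the functions `participates_in_case_conversion`, `matches_at` and `find` are hypothetical names, and the category set is an illustrative subset, not the real list.

```python
import unicodedata

# Assumption: a small, illustrative set of Unicode categories treated as
# unaffected by case conversion ("Po" is OtherPunctuation, which covers "#").
CASE_INVARIANT_CATEGORIES = {"Po", "Nd", "Zs"}


def participates_in_case_conversion(ch: str) -> bool:
    """Return True if lowercasing might change how `ch` compares."""
    return unicodedata.category(ch) not in CASE_INVARIANT_CATEGORIES


def matches_at(text: str, start: int, prefix: str, buggy: bool) -> bool:
    """Compare an already-lowercased `prefix` against `text` at `start`,
    right to left, as Boyer-Moore does."""
    for i in range(len(prefix) - 1, -1, -1):
        # The character whose category decides whether to lowercase.
        # The bug: always the last prefix character, not the current one.
        probe = prefix[-1] if buggy else prefix[i]
        candidate = text[start + i]
        if participates_in_case_conversion(probe):
            candidate = candidate.lower()
        if candidate != prefix[i]:
            return False
    return True


def find(text: str, pattern: str, buggy: bool = False) -> int:
    """Hypothetical stand-in for the prefix scan; returns an index or -1."""
    prefix = pattern.lower()  # IgnoreCase lowercases the pattern up front
    for start in range(len(text) - len(prefix) + 1):
        if matches_at(text, start, prefix, buggy):
            return start
    return -1
```

With `buggy=True`, `find("#H#", "H#")` misses the match because "H" is never lowercased before being compared with "h"; with `buggy=False` the match is found at index 1.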
I added a test that fails without this fix.