-
Notifications
You must be signed in to change notification settings - Fork 450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
panicked at 'not a char boundary' when using Regex::replace #1006
Comments
It looks like this is a new bug introduced by
This is fixed by #978 more generally by allowing the regex to compile but not reporting a match anywhere. In older versions of this crate, things like |
This regex failed to compile in `regex <1.8`, but the migration to regex-automata tweaked the rules in a subtle way that permitted it to compile despite the fact that the old/status-quo matching engines can't handle it correctly. By that, I mean that they may permit the \B to match between code units. That in turn results in panicking when slicing a &str. In `regex 1.9`, this regex will actually be able to be compiled, but the matching engines will correctly and robustly never report matches that split UTF-8 code units. For now, we just add code that causes `regex 1.8` to have the same behavior as previous releases. Fixes #1006
I put up #1007 to make |
This regex failed to compile in `regex <1.8`, but the migration to regex-automata tweaked the rules in a subtle way that permitted it to compile despite the fact that the old/status-quo matching engines can't handle it correctly. By that, I mean that they may permit the \B to match between code units. That in turn results in panicking when slicing a &str. In `regex 1.9`, this regex will actually be able to be compiled, but the matching engines will correctly and robustly never report matches that split UTF-8 code units. For now, we just add code that causes `regex 1.8` to have the same behavior as previous releases. Fixes #1006
Can we add |
Probably just making sure that I'd prefer is we waited until #978 was merged to do more work on the fuzzers. |
This is fixed in |
This PR contains the following updates: | Package | Type | Update | Change | |---|---|---|---| | [regex](https://github.com/rust-lang/regex) | dependencies | patch | `1.8.1` -> `1.8.4` | --- ### Release Notes <details> <summary>rust-lang/regex</summary> ### [`v1.8.4`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#​184-2023-06-05) [Compare Source](rust-lang/regex@1.8.3...1.8.4) \================== This is a patch release that fixes a bug where `(?-u:\B)` was allowed in Unicode regexes, despite the fact that the current matching engines can report match offsets between the code units of a single UTF-8 encoded codepoint. That in turn means that match offsets that split a codepoint could be reported, which in turn results in panicking when one uses them to slice a `&str`. This bug occurred in the transition to `regex 1.8` because the underlying syntactical error that prevented this regex from compiling was intentionally removed. That's because `(?-u:\B)` will be permitted in Unicode regexes in `regex 1.9`, but the matching engines will guarantee to never report match offsets that split a codepoint. When the underlying syntactical error was removed, no code was added to ensure that `(?-u:\B)` didn't compile in the `regex 1.8` transition release. This release, `regex 1.8.4`, adds that code such that `Regex::new(r"(?-u:\B)")` returns to the `regex <1.8` behavior of not compiling. (A `bytes::Regex` can still of course compile it.) Bug fixes: - [BUG #​1006](rust-lang/regex#1006): Fix a bug where `(?-u:\B)` was allowed in Unicode regexes, and in turn could lead to match offsets that split a codepoint in `&str`. ### [`v1.8.3`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#​183-2023-05-25) [Compare Source](rust-lang/regex@1.8.2...1.8.3) \================== This is a patch release that fixes a bug where the regex would report a match at every position even when it shouldn't. This could occur in a very small subset of regexes, usually an alternation of simple literals that have particular properties. (See the issue linked below for a more precise description.) Bug fixes: - [BUG #​999](rust-lang/regex#999): Fix a bug where a match at every position is erroneously reported. ### [`v1.8.2`](https://github.com/rust-lang/regex/blob/HEAD/CHANGELOG.md#​182-2023-05-22) [Compare Source](rust-lang/regex@1.8.1...1.8.2) \================== This is a patch release that fixes a bug where regex compilation could panic in debug mode for regexes with large counted repetitions. For example, `a{2147483516}{2147483416}{5}` resulted in an integer overflow that wrapped in release mode but panicking in debug mode. Despite the unintended wrapping arithmetic in release mode, it didn't cause any other logical bugs since the errant code was for new analysis that wasn't used yet. Bug fixes: - [BUG #​995](rust-lang/regex#995): Fix a bug where regex compilation with large counted repetitions could panic. </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNS4xMTEuMCIsInVwZGF0ZWRJblZlciI6IjM1LjExMS4wIiwidGFyZ2V0QnJhbmNoIjoiZGV2ZWxvcCJ9--> Co-authored-by: cabr2-bot <cabr2.help@gmail.com> Co-authored-by: crapStone <crapstone01@gmail.com> Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1912 Reviewed-by: crapStone <crapstone@noreply.codeberg.org> Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org> Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
What version of regex are you using?
v1.8.3
Describe the bug at a high level.
I have fuzzed the regex package with afl.rs, and afl reports some cases panicked at "char boundary":
It seems replace function have some problems with utf-8 character. I have found such panic in several fuzz drivers.
What are the steps to reproduce the behavior?
I also found that if I replace "replace_all" with "replace", the last case will not panic. The rest of these will panic after being replaced. It seems weird.
What is the actual behavior?
Here is an error message sample; the error messages in these cases are similar.
What is the expected behavior?
Regex::replace should deal with utf-8 character correctly.
The text was updated successfully, but these errors were encountered: