ensure the path regexes will accept all valid paths #48686

vtjnash · 2023-02-15T17:32:52Z

Previously, we might try to interpret the random bytes in a path as UTF-8 and excluding \n, causing the regex match to fail or be incomplete in some cases. But those are valid in a path, so we want PCRE2 to treat them as transparent bytes. Accordingly, change r""a to specify all flags needed to interpret the values simply as ASCII.

Note, this would be breaking if someone was previously trying to match a Unicode character by \u while also disabling UCP matching of \w and \s, but that seems an odd specific choice to need.

julia> match(r"\u03b1"a, "α")
ERROR: PCRE compilation error: character code point value in \u.... sequence is too large at offset 6

(this would have previously worked). Note that explicitly starting the regex with (*UTF) or using a literal α in the regex would continue to work as before however.

Note that s (DOTALL) is a more efficient matcher (if the pattern contains .), as is a, so it is often preferable to set both when in doubt: http://man.he.net/man3/pcre2perform

Refs: #48648

Previously, we might try to interpret the random bytes in a path as UTF-8 and excluding \n, causing the regex match to fail or be incomplete in some cases. But those are valid in a path, so we want PCRE2 to treat them as transparent bytes. Accordingly, change r""a to specify all flags needed to interpret the values simply as ASCII. Note, this would be breaking if someone was previously trying to match a Unicode character by `\u` while also disabling UCP matching of \w and \s, but that seems an odd specific choice to need. julia> match(r"\u03b1"a, "α") ERROR: PCRE compilation error: character code point value in \u.... sequence is too large at offset 6 (this would have previously worked). Note that explicitly starting the regex with (*UTF) or using a literal α in the regex would continue to work as before however. Note that `s` (DOTALL) is a more efficient matcher (if the pattern contains `.`), as is `a`, so it is often preferable to set both when in doubt: http://man.he.net/man3/pcre2perform Refs: #48648

vtjnash · 2023-02-27T16:43:23Z

test/path.jl

@@ -171,6 +171,9 @@
        @test string(splitdrive(S(homedir()))...) == homedir()
        @test splitdrive("a\nb") == ("", "a\nb")

+        @test splitdir("a/\xfe/\n/b/c.ext") == ("a/\xfe/\n/b", "c.ext")
+        @test splitext("a/\xfe/\n/b/c.ext") == ("a/\xfe/\n/b/c", ".ext")


Thought someone might be interested to note that this had previously been encountered as a bug and fixed only for splitdrive and only for windows by #32543

vtjnash force-pushed the jn/ascii-regex++ branch from 38fbd36 to b5ce7ca Compare February 16, 2023 20:58

vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Feb 16, 2023

vtjnash merged commit 892cd4f into master Feb 17, 2023

vtjnash deleted the jn/ascii-regex++ branch February 17, 2023 15:58

DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Feb 21, 2023

vtjnash commented Feb 27, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ensure the path regexes will accept all valid paths #48686

ensure the path regexes will accept all valid paths #48686

vtjnash commented Feb 15, 2023

vtjnash Feb 27, 2023

ensure the path regexes will accept all valid paths #48686

ensure the path regexes will accept all valid paths #48686

Conversation

vtjnash commented Feb 15, 2023

vtjnash Feb 27, 2023

Choose a reason for hiding this comment