“basename” does not support gbk encoding #48648

Closed
zpcpi opened this issue Feb 11, 2023 · 3 comments
Labels
bug: Indicates an unexpected problem or unintended behavior
filesystem: Underlying file system and functions that use it

Comments

@zpcpi

zpcpi commented Feb 11, 2023

Julia Version 1.8.5
Commit 17cfb8e (2023-01-08 06:45 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 8 × Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, haswell)
Threads: 1 on 8 virtual cores
Environment:
JULIA_EDITOR = code
JULIA_NUM_THREADS =

Code:
```julia
path = "d:\\AppData\\Local\\Programs\\Julia-1.8.5\\bin\\libLLVM-13jl.dll"
basename(path) # "libLLVM-13jl.dll"

path = String(Base.libllvm_path()) # "d:\\Users\\\xc5\xf4\xb3\xcc\\AppData\\Local\\Programs\\Julia-1.8.5\\bin\\libLLVM-13jl.dll"
basename(path) # ERROR: type Nothing has no field captures
```

Base.libllvm_path()
GBK: "d:\\Users\\鹏程\\AppData\\Local\\Programs\\Julia-1.8.5\\bin\\libLLVM-13jl.dll"
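
The failing path is not valid UTF-8, which is why a UTF-aware regex can refuse to match it. As a minimal sketch of a byte-level workaround (the helper name `basename_bytes` is hypothetical and is not part of Julia's `Base`), one could split on the last separator byte instead of using a regex. This assumes no multi-byte character contains the byte `0x5C`; GBK trail bytes can legally be `0x5C`, so this is not a general solution for GBK paths.

```julia
# The path from the report, with the GBK-encoded user name as raw bytes.
path = "d:\\Users\\\xc5\xf4\xb3\xcc\\AppData\\Local\\Programs\\Julia-1.8.5\\bin\\libLLVM-13jl.dll"

# Hypothetical helper: return everything after the last '\\' or '/' byte.
# It works on raw code units, so invalid UTF-8 in the directory part is irrelevant.
function basename_bytes(path::AbstractString)
    bytes = codeunits(path)
    i = findlast(b -> b == UInt8('\\') || b == UInt8('/'), bytes)
    return i === nothing ? String(path) : String(bytes[i+1:end])
end

println(isvalid(path))         # the GBK bytes do not form valid UTF-8
println(basename_bytes(path))
```

Working on code units sidesteps the string's encoding entirely, which is close in spirit to asking the regex engine to treat the path as opaque bytes.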

@melonedo
Contributor

Probably caused by #45126, since basename needn't support illegal encodings.

@vtjnash
Member

vtjnash commented Feb 12, 2023

We may need to ensure utf8 is disabled for path regexes (or all ascii-only regexes which don't use character classes?)

@vtjnash added the bug and filesystem labels Feb 12, 2023
@vtjnash
Member

vtjnash commented Feb 14, 2023

fixed by #45127

@vtjnash vtjnash closed this as completed Feb 14, 2023
vtjnash added a commit that referenced this issue Feb 15, 2023
Previously, we might try to interpret the arbitrary bytes in a path as
UTF-8, with `.` excluding \n, causing the regex match to fail or be incomplete
in some cases. But those bytes are valid in a path, so we want PCRE2 to treat
them as transparent bytes. Accordingly, change r""a to specify all flags
needed to interpret the values simply as ASCII.

Note, this would be breaking if someone was previously trying to match a
Unicode character by `\u` while also disabling UCP matching of \w and
\s, but that seems an odd specific choice to need.

    julia> match(r"\u03b1"a, "α")
    ERROR: PCRE compilation error: character code point value in \u.... sequence is too large at offset 6

(this would have previously worked). Note that explicitly starting the
regex with (*UTF) or using a literal α in the regex would continue to
work as before however.

Note that `s` (DOTALL) is a more efficient matcher (if the pattern
contains `.`), as is `a`, so it is often preferable to set both when in
doubt: http://man.he.net/man3/pcre2perform

Refs: #48648
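
The point about `.` excluding `\n` can be illustrated with a short sketch. The file name below is made up for illustration; the example only uses the documented `s` regex flag and does not depend on the behavior change to `a` described in the commit message.

```julia
# A made-up path-like string containing a newline, which is legal in
# file names on many systems.
name = "dir/line1\nline2.txt"

# Without `s`, `.` stops at the newline, so an anchored whole-string
# pattern cannot match.
m_default = match(r"^.*$", name)

# With `s` (DOTALL), `.` also matches the newline and the whole string
# is captured.
m_dotall = match(r"^.*$"s, name)

println(m_default === nothing)
println(m_dotall.match == name)
```

A path regex that relies on `.` therefore needs DOTALL to see the whole path, in addition to whatever is done about UTF interpretation.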
vtjnash added a commit that referenced this issue Feb 16, 2023
vtjnash added a commit that referenced this issue Feb 17, 2023