Move `OsStr::slice_encoded_bytes` validation to platform modules (#118569)
Conversation
(rustbot has picked a reviewer for you, use r? to override)
imo this doesn't feel worth it unless we do move in the direction of relaxing bounds. This creates more code paths that are effectively meant to stay in sync, creating more of a burden for people trying to understand or maintain the code. In my mind, to justify this we'd need benchmarks showing the improvement, and it would have to be compelling enough.
I felt that the general wasm improvement and better error messages tipped it over the edge. But I'm also OK with shelving it.

I don't have a strong opinion either, but I'm not very familiar with this code or the direction we're headed in here. I think hearing @BurntSushi's thoughts or someone who is actively maintaining this area could convince me either way. When you say "increased performance" -- is that a difference between "did nothing" vs. "have to scan the string", or more minor, like "compared 10 bytes" vs. "compared 15 bytes"?
On wasm it's the difference between "do nothing" and "scan the whole string" (not for this operation, which becomes checking a single byte, but for others). On Windows it's the difference between "call a full UTF-8 validation function to check a short substring" and "check 1 or 2 bytes" if you're splitting in the middle of non-ASCII text, which might be uncommon or might be vanishingly rare depending on how the method gets used in the wild.
☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.
(force-pushed from 694a789 to 7504473)
```diff
-check_valid_boundary(encoded_bytes, start);
-check_valid_boundary(encoded_bytes, end);
+self.inner.check_public_boundary(start);
```
What makes this a public boundary? Is there a private boundary?
I think it would be good to add some docs here if we're adding an extension point -- perhaps a couple of lines of comment describing what the function should (roughly) do?
The public boundaries are where we let users split without panicking. The private boundaries would depend on the safety invariants for the internal encoding. For example if you split in the middle of a WTF-8 codepoint you can cause out-of-bounds reads, so that's neither a public nor a private boundary. But if you split between surrogate codepoints then that's fine as far as the implementation is concerned, we just don't allow users to do that, so that's a private boundary but not a public boundary.
I've added a comment, good call.
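To make the surrogate example concrete, here is a small standalone sketch (my own illustration, not the std code; `encode_surrogate` and `is_code_point_boundary` are hypothetical helpers). A WTF-8 surrogate occupies three bytes with the same shape UTF-8 would use, so a split between two surrogates lands on a code point boundary of the internal encoding even though the public API rejects it:

```rust
// Hypothetical sketch: WTF-8 encodes surrogate code points
// U+D800..=U+DFFF with the 3-byte shape UTF-8 would use if it
// allowed them.
fn encode_surrogate(cp: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

// Same shape as str::is_char_boundary: a position is a code point
// boundary if it is the start of the slice, the end of the slice, or
// the byte there is not a continuation byte (0b10xx_xxxx).
fn is_code_point_boundary(slice: &[u8], index: usize) -> bool {
    if index == 0 {
        return true;
    }
    match slice.get(index) {
        None => index == slice.len(),
        Some(&b) => (b as i8) >= -0x40,
    }
}

fn main() {
    // A surrogate pair U+D83D U+DE00 stored as two raw WTF-8 sequences.
    let mut buf = Vec::new();
    buf.extend_from_slice(&encode_surrogate(0xD83D));
    buf.extend_from_slice(&encode_surrogate(0xDE00));
    // Index 3 sits between the two surrogates: a boundary of the
    // internal encoding (a "private" boundary) ...
    assert!(is_code_point_boundary(&buf, 3));
    // ... while index 1 is inside a sequence, so it is neither a
    // public nor a private boundary.
    assert!(!is_code_point_boundary(&buf, 1));
    println!("ok");
}
```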
```diff
-None => false,
-Some(&b) => b < 128 || b >= 192,
+None => index == slice.len(),
+Some(&b) => (b as i8) >= -0x40,
```
Why did this implementation change?
(In particular it seems like the behavior is no longer an unconditional true for index = 0, and also doesn't correspond with the str::is_char_boundary impl?)
The `(b as i8) >= -0x40` is probably clearer as `b.is_utf8_char_boundary()`.
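For what it's worth, the two spellings can be compared exhaustively. A quick standalone check (not part of the PR) confirming that `b < 128 || b >= 192` and `(b as i8) >= -0x40` agree for every byte value:

```rust
fn main() {
    // Both expressions mean "this byte is not a UTF-8 continuation
    // byte": continuation bytes are 0b10xx_xxxx, i.e. 0x80..=0xBF,
    // which reinterpreted as i8 is exactly the range -0x80..=-0x41.
    for b in 0u8..=255 {
        let unsigned_form = b < 128 || b >= 192;
        let signed_form = (b as i8) >= -0x40;
        assert_eq!(unsigned_form, signed_form);
    }
    println!("equivalent for all 256 byte values");
}
```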
This is identical to the `str::is_char_boundary` impl. The point was to bring it in line with that and to be consistent with the new function. I can't use `b.is_utf8_char_boundary()` because it's private to `core`. (Unless there are workarounds for that?)
Ah, so it is. We should think about exposing that as a public API, it seems consistent with the is_ascii_ functions we already expose that have similar bit-twiddling internally.
```rust
/// implementation detail.
#[track_caller]
#[inline]
pub fn check_utf8_boundary(slice: &Wtf8, index: usize) {
```
I'm confused how this is different from `is_code_point_boundary` (from just the method name/comments)? `str::is_char_boundary` is documented as "is the first byte in a UTF-8 code point sequence or the end of the string", which sounds very similar to "at the edge of either a valid UTF-8 codepoint or of the whole string". Is this needed separately due to WTF-8 details perhaps?
I'm worried that it'll be easy to use the wrong function so I think some detail on when we should use each in comments would be good. It's also a bit worrying to me that we want a new function since that feels like it implies we're changing behavior rather than just optimizing here?
I'm posting a longer explanation below, but the gist of it is that there are some WTF-8 codepoints that are not UTF-8 codepoints. If the string is pure UTF-8 then the boundaries are the same. I've tweaked the comment a little.
It's not a behavioral change, the old function wasn't used for this functionality to begin with. The cases in which this implementation panics should be the same as those in which the old one does.
Apologies for the delay in reviewing.
I didn't look too closely yet, but I guess I'm confused by this framing. Unless this PR is changing what we expose to users (based on the description it doesn't sound like it), these two operations should be basically equivalent, no? I.e., we need to check just as many bytes. It's possible that the check itself can be cheaper if we know the bytes are WTF-8...?

In general my current inclination is that this additional complexity (~500 LOC) is not worth it for what seem like either tiny or non-existent gains on Windows. For wasm, my suspicion is that avoiding the UTF-8 checks isn't materially impactful to almost any programs, so I definitely don't think it's worth the complexity.
On Windows and UEFI this improves performance and error messaging. On other platforms we optimize the fast path a bit more. This also prepares for later relaxing the checks on certain platforms.
(force-pushed from 7504473 to 51a7396)
Thanks for the review! No worries about the delay.
Yes, much cheaper. WTF-8 is only a slight superset of UTF-8, so most of the work is already done. Validating a UTF-8 codepoint sequence looks something like this:
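Spelled out, the checks are roughly: (1) the leading byte must be valid and determines the sequence length, (2) the following bytes must be continuation bytes, (3) the encoding must not be overlong, and (4) the result must not be a surrogate codepoint. A hedged sketch of those checks (an illustrative re-implementation, not the code in `core`):

```rust
// Illustrative sketch of what validating a single UTF-8 codepoint
// sequence involves; this is NOT the implementation in core.
fn validate_one_sequence(bytes: &[u8]) -> Result<char, &'static str> {
    let first = *bytes.first().ok_or("empty input")?;
    // 1. The leading byte determines the sequence length and
    //    contributes the top bits of the codepoint.
    let (len, mut cp) = match first {
        0x00..=0x7F => (1, first as u32),
        0xC0..=0xDF => (2, (first & 0x1F) as u32),
        0xE0..=0xEF => (3, (first & 0x0F) as u32),
        0xF0..=0xF7 => (4, (first & 0x07) as u32),
        _ => return Err("invalid leading byte"),
    };
    if bytes.len() < len {
        return Err("truncated sequence");
    }
    // 2. Every following byte must be a continuation byte (0b10xx_xxxx).
    for &b in &bytes[1..len] {
        if b & 0xC0 != 0x80 {
            return Err("expected continuation byte");
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    // 3. The encoding must be the shortest possible (no overlong forms).
    let min: u32 = [0, 0x80, 0x800, 0x10000][len - 1];
    if len > 1 && cp < min {
        return Err("overlong encoding");
    }
    // 4. Surrogate codepoints are forbidden in UTF-8. This is the
    //    requirement that WTF-8 relaxes.
    if (0xD800..=0xDFFF).contains(&cp) {
        return Err("surrogate code point");
    }
    char::from_u32(cp).ok_or("out of range")
}

fn main() {
    assert_eq!(validate_one_sequence("é".as_bytes()), Ok('é'));
    assert_eq!(validate_one_sequence(&[0xC0, 0x80]), Err("overlong encoding"));
    // Valid WTF-8 for an unpaired surrogate, but invalid UTF-8:
    assert_eq!(
        validate_one_sequence(&[0xED, 0xA0, 0x80]),
        Err("surrogate code point")
    );
    println!("ok");
}
```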
This is somewhat involved even with optimizations. And the API used for the current implementation is designed to validate whole strings starting at the front, so we need to hold it in a weird way to perform the checks we want (validating just a single codepoint, sometimes starting at the back). But if you already know you have valid UTF-8 it's possible to check a boundary with a single comparison of a single byte.

WTF-8 partially relaxes requirement 4: it's shaped exactly like UTF-8, but it can contain surrogate codepoints. So to satisfy the public boundary check we additionally have to look out for surrogates. This is still very cheap.
I've removed the wasm module for this PR. The Windows optimization doesn't require it. A future semantic change may require something like it, but if we go through with that we can either bring it back or do it in a way that requires less code.
@bors r+

Thanks for the explanations. That makes sense to me, so let's go ahead and merge this in.
…iaskrgr Rollup of 7 pull requests

Successful merges:

- rust-lang#118569 (Move `OsStr::slice_encoded_bytes` validation to platform modules)
- rust-lang#121067 (make "invalid fragment specifier" translatable)
- rust-lang#121224 (Remove unnecessary unit binding)
- rust-lang#121247 (Add help to `hir_analysis_unrecognized_intrinsic_function`)
- rust-lang#121257 (remove extraneous text from example config)
- rust-lang#121260 (Remove const_prop.rs)
- rust-lang#121266 (Add uncontroversial syscall doc aliases to std docs)

r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage` `@BurntSushi`