Move `OsStr::slice_encoded_bytes` validation to platform modules (#118569)
Conversation
(rustbot has picked a reviewer for you, use r? to override)
imo this doesn't feel worth it unless we do move in the direction of relaxing bounds. This creates more code paths that are effectively meant to stay in sync, creating more of a burden for people trying to understand or maintain the code. In my mind, to justify this we'd need benchmarks showing the improvement, and it would have to be compelling enough.
I felt that the general wasm improvement and better error messages tipped it over the edge. But I'm also OK with shelving it.

I don't have a strong opinion either, but I'm not very familiar with this code or the direction we're headed in here. I think hearing @BurntSushi's thoughts or someone who is actively maintaining this area could convince me either way. When you say "increased performance" -- is that a difference between "did nothing" vs. "have to scan the string", or more minor, like "compared 10 bytes" vs. "compared 15 bytes"?
On wasm it's the difference between "do nothing" and "scan the whole string" (not for this operation, which becomes checking a single byte, but for others). On Windows it's the difference between "call a full UTF-8 validation function to check a short substring" and "check 1 or 2 bytes" if you're splitting in the middle of non-ASCII text, which might be uncommon or might be vanishingly rare depending on how the method gets used in the wild.
☔ The latest upstream changes (presumably #117285) made this pull request unmergeable. Please resolve the merge conflicts.
(force-pushed from 694a789 to 7504473)
```diff
-check_valid_boundary(encoded_bytes, start);
-check_valid_boundary(encoded_bytes, end);
+self.inner.check_public_boundary(start);
```
What makes this a public boundary? Is there a private boundary?
I think it would be good to add some docs here if we're adding an extension point -- perhaps a couple of lines of comment describing what the function should (roughly) do?
The public boundaries are where we let users split without panicking. The private boundaries would depend on the safety invariants for the internal encoding. For example if you split in the middle of a WTF-8 codepoint you can cause out-of-bounds reads, so that's neither a public nor a private boundary. But if you split between surrogate codepoints then that's fine as far as the implementation is concerned, we just don't allow users to do that, so that's a private boundary but not a public boundary.
I've added a comment, good call.
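To make the surrogate example concrete, here is a small standalone sketch (my own illustration, not the std code; `encode_surrogate` and `is_code_point_boundary` are hypothetical helpers). A WTF-8 surrogate occupies three bytes with the same shape UTF-8 would use, so a split between two surrogates lands on a code point boundary of the internal encoding even though the public API rejects it:

```rust
// Hypothetical sketch: WTF-8 encodes surrogate code points
// U+D800..=U+DFFF with the 3-byte shape UTF-8 would use if it
// allowed them.
fn encode_surrogate(cp: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}

// Same shape as str::is_char_boundary: a position is a code point
// boundary if it is the start of the slice, the end of the slice, or
// the byte there is not a continuation byte (0b10xx_xxxx).
fn is_code_point_boundary(slice: &[u8], index: usize) -> bool {
    if index == 0 {
        return true;
    }
    match slice.get(index) {
        None => index == slice.len(),
        Some(&b) => (b as i8) >= -0x40,
    }
}

fn main() {
    // A surrogate pair U+D83D U+DE00 stored as two raw WTF-8 sequences.
    let mut buf = Vec::new();
    buf.extend_from_slice(&encode_surrogate(0xD83D));
    buf.extend_from_slice(&encode_surrogate(0xDE00));
    // Index 3 sits between the two surrogates: a boundary of the
    // internal encoding (a "private" boundary) ...
    assert!(is_code_point_boundary(&buf, 3));
    // ... while index 1 is inside a sequence, so it is neither a
    // public nor a private boundary.
    assert!(!is_code_point_boundary(&buf, 1));
    println!("ok");
}
```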
```diff
-None => false,
-Some(&b) => b < 128 || b >= 192,
+None => index == slice.len(),
+Some(&b) => (b as i8) >= -0x40,
```
Why did this implementation change?
(In particular it seems like the behavior is no longer an unconditional true for index = 0, and also doesn't correspond with the str::is_char_boundary impl?)
The `(b as i8) >= -0x40` is probably clearer as `b.is_utf8_char_boundary()`.
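For what it's worth, the two spellings can be compared exhaustively. A quick standalone check (not part of the PR) confirming that `b < 128 || b >= 192` and `(b as i8) >= -0x40` agree for every byte value:

```rust
fn main() {
    // Both expressions mean "this byte is not a UTF-8 continuation
    // byte": continuation bytes are 0b10xx_xxxx, i.e. 0x80..=0xBF,
    // which reinterpreted as i8 is exactly the range -0x80..=-0x41.
    for b in 0u8..=255 {
        let unsigned_form = b < 128 || b >= 192;
        let signed_form = (b as i8) >= -0x40;
        assert_eq!(unsigned_form, signed_form);
    }
    println!("equivalent for all 256 byte values");
}
```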
This is identical to the `str::is_char_boundary` impl. The point was to bring it in line with that and to be consistent with the new function. I can't use `b.is_utf8_char_boundary()` because it's private to `core`. (Unless there are workarounds for that?)
Ah, so it is. We should think about exposing that as a public API, it seems consistent with the is_ascii_ functions we already expose that have similar bit-twiddling internally.
```rust
/// implementation detail.
#[track_caller]
#[inline]
pub fn check_utf8_boundary(slice: &Wtf8, index: usize) {
```
I'm confused how this is different from `is_code_point_boundary` (from just the method name/comments)? `str::is_char_boundary` is documented as "is the first byte in a UTF-8 code point sequence or the end of the string", which sounds very similar to "at the edge of either a valid UTF-8 codepoint or of the whole string". Is this needed separately due to WTF-8 details perhaps?
I'm worried that it'll be easy to use the wrong function so I think some detail on when we should use each in comments would be good. It's also a bit worrying to me that we want a new function since that feels like it implies we're changing behavior rather than just optimizing here?
I'm posting a longer explanation below, but the gist of it is that there are some WTF-8 codepoints that are not UTF-8 codepoints. If the string is pure UTF-8 then the boundaries are the same. I've tweaked the comment a little.
It's not a behavioral change, the old function wasn't used for this functionality to begin with. The cases in which this implementation panics should be the same as those in which the old one does.
Apologies for the delay in reviewing.
I didn't look too closely yet, but I guess I'm confused by this framing. Unless this PR is changing what we expose to users (based on the description it doesn't sound like it), these two operations should be basically equivalent, no? I.e., we need to check just as many bytes. It's possible that the check itself can be cheaper if we know the bytes are WTF-8...?

In general my current inclination is that this additional complexity (~500 LOC) is not worth it for what seem like either tiny or non-existent gains on Windows. For wasm, my suspicion is that avoiding the UTF-8 checks isn't materially impactful to almost any programs, so I definitely don't think it's worth the complexity.
On Windows and UEFI this improves performance and error messaging. On other platforms we optimize the fast path a bit more. This also prepares for later relaxing the checks on certain platforms.
(force-pushed from 7504473 to 51a7396)
Thanks for the review! No worries about the delay.
Yes, much cheaper. WTF-8 is only a slight superset of UTF-8, so most of the work is already done. Validating a UTF-8 codepoint sequence looks something like this:
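Spelled out, the checks are roughly: (1) the leading byte must be valid and determines the sequence length, (2) the following bytes must be continuation bytes, (3) the encoding must not be overlong, and (4) the result must not be a surrogate codepoint. A hedged sketch of those checks (an illustrative re-implementation, not the code in `core`):

```rust
// Illustrative sketch of what validating a single UTF-8 codepoint
// sequence involves; this is NOT the implementation in core.
fn validate_one_sequence(bytes: &[u8]) -> Result<char, &'static str> {
    let first = *bytes.first().ok_or("empty input")?;
    // 1. The leading byte determines the sequence length and
    //    contributes the top bits of the codepoint.
    let (len, mut cp) = match first {
        0x00..=0x7F => (1, first as u32),
        0xC0..=0xDF => (2, (first & 0x1F) as u32),
        0xE0..=0xEF => (3, (first & 0x0F) as u32),
        0xF0..=0xF7 => (4, (first & 0x07) as u32),
        _ => return Err("invalid leading byte"),
    };
    if bytes.len() < len {
        return Err("truncated sequence");
    }
    // 2. Every following byte must be a continuation byte (0b10xx_xxxx).
    for &b in &bytes[1..len] {
        if b & 0xC0 != 0x80 {
            return Err("expected continuation byte");
        }
        cp = (cp << 6) | (b & 0x3F) as u32;
    }
    // 3. The encoding must be the shortest possible (no overlong forms).
    let min: u32 = [0, 0x80, 0x800, 0x10000][len - 1];
    if len > 1 && cp < min {
        return Err("overlong encoding");
    }
    // 4. Surrogate codepoints are forbidden in UTF-8. This is the
    //    requirement that WTF-8 relaxes.
    if (0xD800..=0xDFFF).contains(&cp) {
        return Err("surrogate code point");
    }
    char::from_u32(cp).ok_or("out of range")
}

fn main() {
    assert_eq!(validate_one_sequence("é".as_bytes()), Ok('é'));
    assert_eq!(validate_one_sequence(&[0xC0, 0x80]), Err("overlong encoding"));
    // Valid WTF-8 for an unpaired surrogate, but invalid UTF-8:
    assert_eq!(
        validate_one_sequence(&[0xED, 0xA0, 0x80]),
        Err("surrogate code point")
    );
    println!("ok");
}
```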
This is somewhat involved even with optimizations. And the API used for the current implementation is designed to validate whole strings starting at the front, so we need to hold it in a weird way to perform the checks we want (validating just a single codepoint, sometimes starting at the back). But if you already know you have valid UTF-8 it's possible to check a boundary with a single comparison of a single byte.

WTF-8 partially relaxes requirement 4: it's shaped exactly like UTF-8, but it can contain surrogate codepoints. So to satisfy the public boundary check we additionally have to look out for surrogates. This is still very cheap.
I've removed the wasm module for this PR. The Windows optimization doesn't require it. A future semantic change may require something like it, but if we go through with that we can either bring it back or do it in a way that requires less code.
@bors r+

Thanks for the explanations. That makes sense to me, so let's go ahead and merge this in.
…iaskrgr Rollup of 7 pull requests

Successful merges:

- rust-lang#118569 (Move `OsStr::slice_encoded_bytes` validation to platform modules)
- rust-lang#121067 (make "invalid fragment specifier" translatable)
- rust-lang#121224 (Remove unnecessary unit binding)
- rust-lang#121247 (Add help to `hir_analysis_unrecognized_intrinsic_function`)
- rust-lang#121257 (remove extraneous text from example config)
- rust-lang#121260 (Remove const_prop.rs)
- rust-lang#121266 (Add uncontroversial syscall doc aliases to std docs)

r? `@ghost` `@rustbot` modify labels: rollup
Rollup merge of rust-lang#118569 - blyxxyz:platform-os-str-slice, r=Mark-Simulacrum

Move `OsStr::slice_encoded_bytes` validation to platform modules

This delegates OS string slicing (`OsStr::slice_encoded_bytes`) validation to the underlying platform implementation. For now that results in increased performance and better error messages on Windows without any changes to semantics. In the future we may want to provide different semantics for different platforms.

The existing implementation is still used on Unix and most other platforms and is now optimized a little better.

Tracking issue: rust-lang#118485

cc `@epage` `@BurntSushi`