Implement String::make_(upp|low)ercase #135888

krtab · 2025-01-22T16:54:13Z

Tracking issue: #135885

My plan is to first add both implementations and their tests without worrying about performance or allocations, and then improve performance.

rustbot · 2025-01-22T16:54:22Z

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

clarfonthey · 2025-01-25T00:52:42Z

library/core/src/str/validations.rs

 ///
 /// # Safety
 ///
 /// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
 #[unstable(feature = "str_internals", issue = "none")]
 #[inline]
-pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
+#[allow(dead_code)]
+pub unsafe fn next_code_point_with_width<'a, I: Iterator<Item = &'a u8>>(


This isn't required for this PR, but for completeness, it would be nice to make this change to the next_code_point_reverse function as well, also offering a next_code_point_reverse_with_width function that provides the width. I have a version that I use in the implementation I proposed on Zulip, so it would likely be used in the final implementation.

That said, you don't need to make this change now since it isn't strictly required for the first pass. Just figured I'd mention it.
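
For readers following along, here is a minimal sketch of what "code point plus width" decoding looks like in both directions. It is not the code from the PR or from Zulip: `decode_with_width` and `decode_reverse_with_width` are hypothetical names, they are safe functions rather than `unsafe` ones, and they assume the input is valid UTF-8 (they panic on truncated input).

```rust
/// Hypothetical forward decoder: returns the next code point and the number
/// of bytes it occupied. Assumes `bytes` yields valid UTF-8.
fn decode_with_width<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<(u32, usize)> {
    let x = *bytes.next()?;
    if x < 0x80 {
        return Some((x as u32, 1));
    }
    // Reads one continuation byte and keeps its six payload bits.
    let mut cont = || (*bytes.next().expect("truncated UTF-8 sequence") & 0x3F) as u32;
    let y = cont();
    if x < 0xE0 {
        return Some(((((x & 0x1F) as u32) << 6) | y, 2));
    }
    let z = cont();
    if x < 0xF0 {
        return Some(((((x & 0x0F) as u32) << 12) | (y << 6) | z, 3));
    }
    let w = cont();
    Some(((((x & 0x07) as u32) << 18) | (y << 12) | (z << 6) | w, 4))
}

/// Hypothetical reverse decoder: same result, reading from the back.
fn decode_reverse_with_width<'a, I: DoubleEndedIterator<Item = &'a u8>>(
    bytes: &mut I,
) -> Option<(u32, usize)> {
    let last = *bytes.next_back()?;
    if last < 0x80 {
        return Some((last as u32, 1));
    }
    let mut code = (last & 0x3F) as u32;
    let mut width = 1usize;
    loop {
        let b = *bytes.next_back().expect("truncated UTF-8 sequence");
        width += 1;
        let shift = 6 * (width - 1);
        if b & 0xC0 == 0x80 {
            // Another continuation byte: six more payload bits.
            code |= ((b & 0x3F) as u32) << shift;
        } else {
            // Leading byte: its payload mask depends on the total width
            // (0x1F for width 2, 0x0F for width 3, 0x07 for width 4).
            let mask = 0x7Fu8 >> width;
            code |= ((b & mask) as u32) << shift;
            return Some((code, width));
        }
    }
}
```

Returning the width alongside the code point lets a caller advance its read offset without re-deriving the length from the decoded `char`.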

krtab · 2025-02-18T13:43:24Z

I think this is ready for a first review and the API can be unstably exposed.

The current implementation has not been optimized for performance beyond guaranteeing that no allocation happens in the happy path, where no code point needs more bytes to have its case changed.
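
To illustrate the happy path described above, here is a rough sketch that is not the code in this PR. `make_uppercase_sketch` is a hypothetical free function: it reuses the string's buffer for as long as the written output never overtakes the read position, and otherwise falls back to building a new buffer with `str::to_uppercase` (the PR uses a queue instead of this fallback). It only covers uppercasing, which needs no surrounding context.

```rust
/// Hypothetical helper, not the PR's implementation: uppercases `s` in place
/// while the output fits into the bytes that have already been read.
fn make_uppercase_sketch(s: &mut String) {
    // Take the bytes out so that `s` is simply empty if anything below panics.
    let mut bytes = std::mem::take(s).into_bytes();
    let (mut read, mut write) = (0usize, 0usize);

    while read < bytes.len() {
        // Width of the next code point, derived from its leading byte.
        let width = match bytes[read] {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            _ => 4,
        };
        let ch = std::str::from_utf8(&bytes[read..read + width])
            .expect("the unread tail is still valid UTF-8")
            .chars()
            .next()
            .expect("slice is non-empty");

        // An uppercase mapping is at most three `char`s, so 12 bytes suffice.
        let mut buf = [0u8; 12];
        let mut out_len = 0;
        for up in ch.to_uppercase() {
            out_len += up.encode_utf8(&mut buf[out_len..]).len();
        }

        if write + out_len > read + width {
            // Slow path: writing would clobber unread input, so allocate.
            let mut out = Vec::with_capacity(bytes.len() + out_len);
            out.extend_from_slice(&bytes[..write]);
            out.extend_from_slice(&buf[..out_len]);
            let tail = std::str::from_utf8(&bytes[read + width..])
                .expect("the unread tail is still valid UTF-8");
            out.extend_from_slice(tail.to_uppercase().as_bytes());
            *s = String::from_utf8(out).expect("output is valid UTF-8");
            return;
        }

        // Happy path: the output fits into bytes that were already consumed.
        bytes[write..write + out_len].copy_from_slice(&buf[..out_len]);
        read += width;
        write += out_len;
    }

    bytes.truncate(write);
    *s = String::from_utf8(bytes).expect("output is valid UTF-8");
}
```

The invariant is that `write <= read` holds after every step, so the unread tail is never modified and stays valid UTF-8.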

One question that may need discussion is what to do with the current strategy of implementing methods on [u8]. It may be better to simply put it all in String, but I find the current strategy easier to reason about in terms of panic safety.

@rustbot ready

jhpratt · 2025-02-19T00:30:55Z

My review capacity is a bit limited at the moment.

r? libs

krtab · 2025-02-19T08:48:04Z

library/alloc/src/slice/byte_slice_make_case.rs

+    let len = c.encode_utf8(&mut buffer).len();
+    let writable_slice = &mut slice[*write_offset..];
+    let direct_copy_length = core::cmp::min(len, writable_slice.len());
+    writable_slice[..direct_copy_length].copy_from_slice(&buffer[..direct_copy_length]);


Once performance is on the table, this may be better written as a loop than as copy_from_slice, which will likely call memcpy.
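
A sketch of the loop-based alternative, for illustration only. `copy_prefix` is a hypothetical helper, and whether it actually beats `copy_from_slice` for copies of at most four bytes is an assumption that would need benchmarking.

```rust
/// Hypothetical helper: copies as many bytes of `src` as fit into `dst`
/// with a plain loop, and returns the number of bytes copied.
fn copy_prefix(dst: &mut [u8], src: &[u8]) -> usize {
    let n = dst.len().min(src.len());
    for (d, s) in dst[..n].iter_mut().zip(&src[..n]) {
        *d = *s;
    }
    n
}
```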

joboet

I have some remarks about the code organization.

Aside from that, I'm not really a big fan of this algorithm, in particular the introduction of an additional allocation. In my opinion, a better algorithm would be to divide the string into chunks:

1. While the case-changed string is smaller or the same size as the original string, do the case change immediately.
2. Otherwise, scan forwards until either the strings have the same size again or the end of the original string is reached. If the end is reached, grow the allocation. Then, convert the chunk going backwards.
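
To make the size condition in these steps concrete, here is a small sketch that measures how the UTF-8 length of a single code point changes under uppercasing. `utf8_len_change` is a hypothetical helper used only for illustration.

```rust
/// Hypothetical helper: returns the UTF-8 length of `c` before and after
/// uppercasing, as `(before, after)`.
fn utf8_len_change(c: char) -> (usize, usize) {
    let after: usize = c.to_uppercase().map(char::len_utf8).sum();
    (c.len_utf8(), after)
}
```

For example, 'ß' keeps its length (2 bytes, uppercased to "SS"), 'ı' shrinks from 2 bytes to 1, and 'ŉ' grows from 2 bytes to 3. Only the growing case forces the scan-ahead step.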

Another question is the necessity of FinalSigmaAutomata. I don't know Unicode that much – so correct me if I'm wrong – but I'd expect that the Case_Ignorable and Cased properties stay the same after case transformation. If that is true, then you could get rid of FinalSigmaAutomata and do the case_ignorable_then_cased check on the already converted part of the string.
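
For context on why lowercasing needs any look-around at all, here is a small sketch comparing a per-character mapping with `str::to_lowercase`, which applies the final-sigma rule. `lowercase_per_char` is a hypothetical helper, and this says nothing about how FinalSigmaAutomata itself is implemented.

```rust
/// Hypothetical helper: lowercases each `char` independently, with no
/// knowledge of the surrounding characters.
fn lowercase_per_char(s: &str) -> String {
    s.chars().flat_map(char::to_lowercase).collect()
}
```

For the input "ΑΣ", the per-character version yields "ασ", while `str::to_lowercase` yields "ας" because the sigma ends a word. For "ΑΣΑ" both agree on "ασα".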

joboet · 2025-02-20T15:06:10Z

library/alloc/src/string.rs

@@ -1127,6 +1127,32 @@ impl String {
        self.vec.extend_from_slice(string.as_bytes())
    }

+    #[cfg(not(no_global_oom_handling))]
+    #[unstable(feature = "string_make_uplowercase", issue = "135885")]
+    #[allow(missing_docs)]


Please write documentation for these functions. We do not allow undocumented public items in std, even if they are unstable.

joboet · 2025-02-20T15:09:23Z

library/alloc/src/slice/byte_slice_make_case.rs

+    #[rustc_allow_incoherent_impl]
+    #[unstable(issue = "none", feature = "std_internals")]


Why add an incoherent, public and unsafe method when you could also implement this as a private, free-standing function?

joboet · 2025-02-20T15:11:13Z

library/core/src/str/validations.rs

 ///
 /// # Safety
 ///
 /// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
 #[unstable(feature = "str_internals", issue = "none")]
 #[inline]
-pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
+#[allow(dead_code)]


Why do you need this?

joboet · 2025-02-20T15:35:06Z

library/alloc/src/slice/byte_slice_make_case.rs

+        assert_eq!(read_offset, self.len());
+        return if write_offset < read_offset { Ok(write_offset) } else { Err(queue) };
+
+        // For now this is copy pasted from core::str, FIXME: DRY


It's in alloc, so it's in the same crate. I'd much prefer it if you moved this submodule to str and moved all the Unicode case-changing implementations there.

bors · 2025-03-08T16:10:11Z

☔ The latest upstream changes (presumably #138208) made this pull request unmergeable. Please resolve the merge conflicts.

rustbot assigned jhpratt Jan 22, 2025

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 22, 2025

krtab mentioned this pull request Jan 22, 2025

Tracking Issue for APC #316: In-place case change methods for String #135885

Open

3 tasks

krtab force-pushed the make_uppercase branch from 9bbd949 to 41aea71 Compare January 22, 2025 17:01

jhpratt added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 23, 2025

clarfonthey reviewed Jan 25, 2025

krtab added 2 commits February 17, 2025 19:05

First prototype of make_uppercase

c5b0e91

First prototype of make_lowercase

fffbb33

krtab force-pushed the make_uppercase branch from 41aea71 to fffbb33 Compare February 17, 2025 18:06

Bypass queue when possible in slice::make_*case

a00b4ef

krtab force-pushed the make_uppercase branch from eb5113b to a00b4ef Compare February 18, 2025 10:43

Add needed no_global_oom_handling cfg to make_case methods

6f1f32e

rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 18, 2025

krtab marked this pull request as ready for review February 18, 2025 13:43

rustbot assigned joboet and unassigned jhpratt Feb 19, 2025

krtab commented Feb 19, 2025

joboet requested changes Feb 20, 2025

rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2025
