Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement String::make_(upp|low)ercase #135888

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

krtab
Copy link
Contributor

@krtab krtab commented Jan 22, 2025

Tracking issue: #135885

My plan is to first add both implementations and their tests, without caring about performances and allocations and then improve performances.

@rustbot
Copy link
Collaborator

rustbot commented Jan 22, 2025

r? @jhpratt

rustbot has assigned @jhpratt.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 22, 2025
@rust-log-analyzer

This comment has been minimized.

@jhpratt jhpratt added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 23, 2025
///
/// # Safety
///
/// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
#[unstable(feature = "str_internals", issue = "none")]
#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
#[allow(dead_code)]
pub unsafe fn next_code_point_with_width<'a, I: Iterator<Item = &'a u8>>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't required for this PR, but for completeness, it would be nice to make this change to the next_code_point_reverse function as well, also offering a next_code_point_reverse_with_width function that provides the width. I do have a version I've written that I use in the implementation I proposed in Zulip, so, it would likely be used in the final implementation.

That said, you don't need to make this change now since it isn't strictly required for the first pass. Just figured I'd mention it.

@rust-log-analyzer

This comment has been minimized.

@krtab
Copy link
Contributor Author

krtab commented Feb 18, 2025

I think this is ready for a first review and the API can be unstably exposed.

The current implementation has not been optimized for performances outside of guaranteeing that not allocation happens in the happy path where no code point needs more bytes to have its case changed.

One question that may need discussion is what to do with the current strategy of implementing methods on [u8]. It may be better to simply put it all in String, but I find the current strategy easier to reason about in term of panic safety.

@rustbot ready

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Feb 18, 2025
@krtab krtab marked this pull request as ready for review February 18, 2025 13:43
@jhpratt
Copy link
Member

jhpratt commented Feb 19, 2025

My review capacity is a bit limited at the moment.

r? libs

@rustbot rustbot assigned joboet and unassigned jhpratt Feb 19, 2025
let len = c.encode_utf8(&mut buffer).len();
let writable_slice = &mut slice[*write_offset..];
let direct_copy_length = core::cmp::min(len, writable_slice.len());
writable_slice[..direct_copy_length].copy_from_slice(&buffer[..direct_copy_length]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once performance is on the table, this may better be replaced by a loop rather than copy_from_slice which will likely call memcpy

Copy link
Member

@joboet joboet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some remarks about the code organization.

Aside from that, I'm not really a big fan of this algorithm, in particular the introduction of an additional allocation. In my opinion, a better algorithm would be to divide the string into chunks:

  • While the case-changed string is smaller or the same size as the original string, do the case change immediately.
  • Otherwise, scan forwards until either the strings have the same size again or the end of the original string is reached. If the end is reached, grow the allocation. Then, convert the chunk going backwards.

Another question is the necessity of FinalSigmaAutomata. I don't know Unicode that much – so correct me if I'm wrong – but I'd expect that the Case_Ignorable and Cased properties stay the same after case transformation. If that is true, then you could get rid of FinalSigmaAutomata and do the case_ignorable_then_cased check on the already converted part of the string.

@@ -1127,6 +1127,32 @@ impl String {
self.vec.extend_from_slice(string.as_bytes())
}

#[cfg(not(no_global_oom_handling))]
#[unstable(feature = "string_make_uplowercase", issue = "135885")]
#[allow(missing_docs)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write documentation for these functions. We do not allow undocumented public items in std, even if they are unstable.

Comment on lines +5 to +6
#[rustc_allow_incoherent_impl]
#[unstable(issue = "none", feature = "std_internals")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add an incoherent, public and unsafe method when you could also implement this as a private, free-standing function?

///
/// # Safety
///
/// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
#[unstable(feature = "str_internals", issue = "none")]
#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
#[allow(dead_code)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this?

assert_eq!(read_offset, self.len());
return if write_offset < read_offset { Ok(write_offset) } else { Err(queue) };

// For now this is copy pasted from core::str, FIXME: DRY
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's in alloc, so it's in the same crate. I'd much prefer it if you moved this submodule to str and moved all the Unicode case-changing implementations there.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Feb 20, 2025
@bors
Copy link
Contributor

bors commented Mar 8, 2025

☔ The latest upstream changes (presumably #138208) made this pull request unmergeable. Please resolve the merge conflicts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants