Implement String::make_(upp|low)ercase #135888
base: master
///
/// # Safety
///
/// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
#[unstable(feature = "str_internals", issue = "none")]
#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
#[allow(dead_code)]
pub unsafe fn next_code_point_with_width<'a, I: Iterator<Item = &'a u8>>(
This isn't required for this PR, but for completeness, it would be nice to make this change to the `next_code_point_reverse` function as well, also offering a `next_code_point_reverse_with_width` function that provides the width. I do have a version I've written that I use in the implementation I proposed in Zulip, so it would likely be used in the final implementation.
That said, you don't need to make this change now since it isn't strictly required for the first pass. Just figured I'd mention it.
41aea71 to fffbb33
eb5113b to a00b4ef
I think this is ready for a first review and the API can be unstably exposed. The current implementation has not been optimized for performance, beyond guaranteeing that no allocation happens in the happy path where no code point needs more bytes to have its case changed. One question that may need discussion is what to do with the current strategy of implementing methods on
@rustbot ready
My review capacity is a bit limited at the moment. r? libs
let len = c.encode_utf8(&mut buffer).len();
let writable_slice = &mut slice[*write_offset..];
let direct_copy_length = core::cmp::min(len, writable_slice.len());
writable_slice[..direct_copy_length].copy_from_slice(&buffer[..direct_copy_length]);
Once performance is on the table, this may be better replaced by a loop rather than `copy_from_slice`, which will likely call `memcpy`.
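To make the suggestion concrete, here is a minimal sketch of a byte-by-byte copy for the at most 4 bytes of an encoded code point. The helper name `copy_short` is hypothetical and not part of the PR. Whether this actually beats `copy_from_slice` depends on what the optimizer emits, so it would need benchmarking.

```rust
/// Hypothetical helper (not from the PR): copies as many bytes of `src` as fit
/// into the front of `dst`, one byte at a time, and returns the number copied.
fn copy_short(dst: &mut [u8], src: &[u8]) -> usize {
    let n = dst.len().min(src.len());
    // `zip` stops at the shorter of the two slices, so no bounds checks can fail.
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
    n
}
```

The return value plays the role of `direct_copy_length` in the snippet under review.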
I have some remarks about the code organization.
Aside from that, I'm not really a big fan of this algorithm, in particular the introduction of an additional allocation. In my opinion, a better algorithm would be to divide the string into chunks:
- While the case-changed string is smaller or the same size as the original string, do the case change immediately.
- Otherwise, scan forwards until either the strings have the same size again or the end of the original string is reached. If the end is reached, grow the allocation. Then, convert the chunk going backwards.
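The two steps above can be sketched as follows. This is only an illustration of the chunking idea for the uppercase direction, written in safe Rust on a byte buffer. All names (`chunked_make_uppercase`, `char_at`, `prev_boundary`, `write_upper`) are hypothetical, and it is neither the PR's implementation nor tuned for performance.

```rust
/// Decodes the char starting at byte `i`; `buf[i..]` must start on a char boundary.
fn char_at(buf: &[u8], i: usize) -> char {
    let n = match buf[i] {
        b if b < 0x80 => 1,
        b if b < 0xE0 => 2,
        b if b < 0xF0 => 3,
        _ => 4,
    };
    std::str::from_utf8(&buf[i..i + n]).unwrap().chars().next().unwrap()
}

/// Returns the start of the char that ends at byte offset `end`.
fn prev_boundary(buf: &[u8], end: usize) -> usize {
    let mut i = end - 1;
    while buf[i] & 0xC0 == 0x80 {
        i -= 1; // skip UTF-8 continuation bytes
    }
    i
}

/// Writes the uppercase mapping of `c` at `at` and returns the new write offset.
fn write_upper(buf: &mut [u8], mut at: usize, c: char) -> usize {
    for u in c.to_uppercase() {
        at += u.encode_utf8(&mut buf[at..]).len();
    }
    at
}

fn upper_len(c: char) -> usize {
    c.to_uppercase().map(char::len_utf8).sum()
}

/// Hypothetical sketch of the chunked in-place algorithm (uppercase only).
fn chunked_make_uppercase(s: &mut String) {
    let mut buf = std::mem::take(s).into_bytes();
    let len = buf.len();
    let (mut r, mut w) = (0usize, 0usize); // read/write offsets, invariant: w <= r
    while r < len {
        let c = char_at(&buf, r);
        if w + upper_len(c) <= r + c.len_utf8() {
            // The output fits into already-consumed space: convert immediately.
            w = write_upper(&mut buf, w, c);
            r += c.len_utf8();
            continue;
        }
        // The output would overtake the read cursor: scan forwards until the
        // sizes match up again or the end of the original string is reached.
        let (mut e, mut out_end) = (r + c.len_utf8(), w + upper_len(c));
        while out_end > e && e < len {
            let c = char_at(&buf, e);
            e += c.len_utf8();
            out_end += upper_len(c);
        }
        if out_end > buf.len() {
            buf.resize(out_end, 0); // end reached with a deficit: grow once
        }
        // Convert the chunk `r..e` backwards so unread input is never clobbered.
        let (mut p, mut q) = (e, out_end);
        while p > r {
            let start = prev_boundary(&buf, p);
            let c = char_at(&buf, start);
            q -= upper_len(c);
            write_upper(&mut buf, q, c);
            p = start;
        }
        r = e;
        w = out_end;
    }
    buf.truncate(w);
    *s = String::from_utf8(buf).expect("case mapping yields valid UTF-8");
}
```

The backwards pass is safe because, at every interior boundary of a chunk, the output position is strictly ahead of the input position, which is exactly the condition that kept the forward scan going.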
Another question is the necessity of `FinalSigmaAutomata`. I don't know Unicode that much – so correct me if I'm wrong – but I'd expect that the `Case_Ignorable` and `Cased` properties stay the same after case transformation. If that is true, then you could get rid of `FinalSigmaAutomata` and do the `case_ignorable_then_cased` check on the already converted part of the string.
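For context, the check matters because lowercasing `Σ` depends on its surroundings. The sketch below only shows that behavior through the stable `str::to_lowercase`. The function name `final_sigma_demo` is hypothetical, and the code says nothing about how the PR implements the check.

```rust
/// Hypothetical demo: pairs each input with its lowercase form as computed by
/// the standard library, which applies the Final_Sigma rule.
fn final_sigma_demo() -> Vec<(&'static str, String)> {
    ["Σ", "ΣΑ", "ΑΣ"]
        .iter()
        .map(|&word| (word, word.to_lowercase()))
        .collect()
}
```

A lone `Σ` and a word-initial `Σ` become `σ`, while a `Σ` preceded by a cased letter and not followed by one becomes `ς`.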
@@ -1127,6 +1127,32 @@ impl String {
self.vec.extend_from_slice(string.as_bytes())
}

#[cfg(not(no_global_oom_handling))]
#[unstable(feature = "string_make_uplowercase", issue = "135885")]
#[allow(missing_docs)]
Please write documentation for these functions. We do not allow undocumented public items in `std`, even if they are unstable.
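As a rough idea of what such documentation could look like, here is a sketch on a stand-in extension trait. The trait `MakeCase` and the method name `make_uppercase_sketch` are hypothetical, and the body simply reallocates via `to_uppercase` instead of using the PR's in-place algorithm.

```rust
/// Hypothetical stand-in for the inherent methods proposed in the PR.
trait MakeCase {
    /// Converts this string to its uppercase equivalent in place.
    ///
    /// The mapping is the same as the one used by `str::to_uppercase`. Since
    /// some characters expand when their case changes (for example `ß`
    /// becomes `SS`), the length of the string in bytes may change.
    fn make_uppercase_sketch(&mut self);
}

impl MakeCase for String {
    fn make_uppercase_sketch(&mut self) {
        // Placeholder body: allocates a new string rather than working in place.
        *self = self.to_uppercase();
    }
}
```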
#[rustc_allow_incoherent_impl]
#[unstable(issue = "none", feature = "std_internals")]
Why add an incoherent, public and `unsafe` method when you could also implement this as a private, free-standing function?
///
/// # Safety
///
/// `bytes` must produce a valid UTF-8-like (UTF-8 or WTF-8) string
#[unstable(feature = "str_internals", issue = "none")]
#[inline]
pub unsafe fn next_code_point<'a, I: Iterator<Item = &'a u8>>(bytes: &mut I) -> Option<u32> {
#[allow(dead_code)]
Why do you need this?
assert_eq!(read_offset, self.len());
return if write_offset < read_offset { Ok(write_offset) } else { Err(queue) };

// For now this is copy pasted from core::str, FIXME: DRY
It's in `alloc`, so it's in the same crate. I'd much prefer it if you moved this submodule to `str` and moved all the Unicode case-changing implementations there.
☔ The latest upstream changes (presumably #138208) made this pull request unmergeable. Please resolve the merge conflicts.
Tracking issue: #135885
My plan is to first add both implementations and their tests, without worrying about performance or allocations, and then improve performance.