There was a post recently on /r/rust about microbenchmarks in different programming languages (https://github.com/kostya/benchmarks), and I was trying to understand how the Rust implementation works.
I noticed that in the Base64 benchmark the author used this code to construct a string with str_size copies of a character:
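(Roughly this pattern, reconstructed from the description; the exact variable names in the benchmark source may differ.)

    let mut buf = String::new();
    for _ in 0..str_size {
        buf.push_str("a");
    }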
I thought that push_str is a bit inefficient for a single character, so I looked at the source code. I started playing with String::push and found a trick in the implementation for ASCII characters (relevant PR: #20079).
The actual problem:
If the character is 1 byte long in UTF-8, it is treated as an ASCII character (pushed as a single byte, as in C), but if it's longer than 1 byte, the algorithm always reserves 4 bytes in the vector even if the pushed character is only 2 or 3 bytes long. So even if the vector has just enough space for one of these smaller characters, it reallocates its storage (and doubles the size of the buffer). As a result, the string ends up with an unnecessarily doubled capacity.
I know it doesn't seem like a big problem, but if someone builds up non-English text with push, it can become a significant memory overhead. The issue shows up when the string's length approaches a power of two (i.e., its current capacity), so shorter strings are affected more.
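For illustration, a small program that reproduces the effect described above (a sketch; the exact numbers depend on the standard library's growth strategy, so on an implementation with the unconditional 4-byte reserve the capacity doubles, while a fixed implementation keeps it at 16):

    fn main() {
        // Fill a 16-byte buffer so that exactly 2 spare bytes remain.
        let mut s = String::with_capacity(16);
        for _ in 0..14 {
            s.push('a');
        }
        println!("before: len = {}, capacity = {}", s.len(), s.capacity());

        // 'é' takes only 2 bytes in UTF-8, so the spare capacity is enough.
        // With the unconditional reserve(4) described above, the buffer is
        // reallocated anyway and the capacity doubles.
        s.push('é');
        println!("after:  len = {}, capacity = {}", s.len(), s.capacity());
    }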
A possible solution:
// This is a modified version of the original
pub fn push(&mut self, ch: char) {
    match ch.len_utf8() {
        1 => self.vec.push(ch as u8),
        ch_len => {
            let cur_len = self.len();
            self.vec.reserve(ch_len);
            unsafe {
                // Attempt to not use an intermediate buffer by just pushing bytes
                // directly onto this string.
                let slice = slice::from_raw_parts_mut(
                    self.vec.as_mut_ptr().offset(cur_len as isize),
                    ch_len,
                );
                let used = ch.encode_utf8(slice).unwrap_or(0);
                self.vec.set_len(cur_len + used);
            }
        }
    }
}
I benchmarked this version; there was no measurable difference.
There's one problem though: char::len_utf8 is not marked with #[inline], so without LTO, it's currently slower.
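For comparison, a standalone sketch of what the attribute change looks like; the real method lives in libcore's impl char block, and this free-function version is only meant to illustrate why #[inline] matters for cross-crate calls:

    // Hypothetical free-function version of char::len_utf8. Marking it #[inline]
    // makes the body available to other crates, so calls can be inlined without LTO.
    #[inline]
    pub fn len_utf8(ch: char) -> usize {
        let code = ch as u32;
        if code < 0x80 {
            1
        } else if code < 0x800 {
            2
        } else if code < 0x10000 {
            3
        } else {
            4
        }
    }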
Various methods in both libcore/char.rs and librustc_unicode/char.rs were previously marked with #[inline]; now every method in char's impl blocks is marked.
Partially fixes #26124.
EDIT: I'm not familiar with pull requests (yet); apparently GitHub added my second commit to this PR...
Fixes #26124