Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String::push reallocates too aggressively in corner cases #26124

Closed
pmarcelll opened this issue Jun 9, 2015 · 2 comments · Fixed by #26154
Closed

String::push reallocates too aggressively in corner cases #26124

pmarcelll opened this issue Jun 9, 2015 · 2 comments · Fixed by #26154
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@pmarcelll
Copy link
Contributor

There was a post recently on /r/rust about microbenchmarks in different programming languages (https://github.com/kostya/benchmarks), and I was trying to understand how the Rust implementation works.
I noticed that in the Base64 benchmark the author used this code to construct a srting with str_size copies of a character:

let str_size = 10000000;
let mut str: String = "".to_string();
for _ in 0..str_size { str.push_str("a"); }

I thought that push_str is a bit inefficient for a single character, so I looked at the source code. I started playing with String::push and found a trick in the implemetation for ASCII characters (relevant PR: #20079).

The actual problem:
If the character is 1 byte long in UTF-8, it is treated as an ASCII character (as in C), but if it's bigger than 1 byte, the algorythm always reserves 4 bytes in the vector even if the pushed character's length is only 2 or 3 bytes. So even if the vector has just enough space for one of these smaller characters, the vector reallocates its storage (and doubles the size of the buffer). As a result, the string now has an unnecessarily doubled capacity.
I know it doesn't seem like a big problem, but if someone uses push somehow with non-English text, it might become a significant memory overhead. The issue happens if the length of the string becomes a power of 2, so strings with smaller length are more affected.

A possible solution:

// This is a modified version of the original
pub fn push(&mut self, ch: char) {
    match ch.len_utf8() {
        1 => self.vec.push(ch as u8),
        ch_len => {
            let cur_len = self.len();

            self.vec.reserve(ch_len);

            unsafe {
                // Attempt to not use an intermediate buffer by just pushing bytes
                // directly onto this string.
                let slice = slice::from_raw_parts_mut (
                    self.vec.as_mut_ptr().offset(cur_len as isize),
                    ch_len
                );
                let used = ch.encode_utf8(slice).unwrap_or(0);
                self.vec.set_len(cur_len + used);
            }
        }
    }
}

I benchmarked this version, therer was no measurable difference.
There's one problem though: char::len_utf8 is not marked with #[inline], so without LTO, it's currently slower.

@Gankra
Copy link
Contributor

Gankra commented Jun 9, 2015

A PR that marks it inline and does this change would be happily merged 😄

@steveklabnik steveklabnik added I-slow Issue: Problems and improvements with respect to performance of generated code. A-libs labels Jun 9, 2015
@pmarcelll
Copy link
Contributor Author

I'm a bit unfamiliar with pull requests but it's done 😄

bors added a commit that referenced this issue Jun 11, 2015
Various methods in both libcore/char.rs and librustc_unicode/char.rs were previously marked with #[inline], now every method is marked in char's impl blocks.
Partially fixes #26124.
EDIT: I'm not familiar with pull reqests (yet), apparently Github added my second commit to thit PR...
Fixes #26124
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.
Projects
None yet
3 participants