String::push reallocates too aggressively in corner cases #26124

pmarcelll · 2015-06-09T16:09:41Z

There was a post recently on /r/rust about microbenchmarks in different programming languages (https://github.com/kostya/benchmarks), and I was trying to understand how the Rust implementation works.
I noticed that in the Base64 benchmark the author used this code to construct a string with str_size copies of a character:

let str_size = 10000000;
let mut str: String = "".to_string();
for _ in 0..str_size { str.push_str("a"); }

I thought that push_str is a bit inefficient for a single character, so I looked at the source code. I started playing with String::push and found a trick in the implementation for ASCII characters (relevant PR: #20079).
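For comparison, here is a rough sketch of two other ways to build a string of repeated characters. The helper names repeat_with_push and repeat_with_iter are made up for this illustration and are not part of the benchmark or the standard library; only String::with_capacity, String::push, and std::iter::repeat are real APIs. Whether either is actually faster than the push_str loop would need measuring.

```rust
/// Hypothetical helper: builds `n` copies of `ch` by pushing one char at a time.
/// Pre-sizing the buffer avoids any reallocation during the loop.
fn repeat_with_push(ch: char, n: usize) -> String {
    let mut s = String::with_capacity(n * ch.len_utf8());
    for _ in 0..n {
        s.push(ch);
    }
    s
}

/// Hypothetical helper: the same result, collected from an iterator of chars.
fn repeat_with_iter(ch: char, n: usize) -> String {
    std::iter::repeat(ch).take(n).collect()
}

fn main() {
    let n = 1000;

    // The loop from the benchmark, for reference.
    let mut from_push_str = String::new();
    for _ in 0..n {
        from_push_str.push_str("a");
    }

    // All three approaches produce the same string.
    assert_eq!(from_push_str, repeat_with_push('a', n));
    assert_eq!(from_push_str, repeat_with_iter('a', n));
    println!("built {} bytes three ways", from_push_str.len());
}
```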

The actual problem:
If the character is 1 byte long in UTF-8, it is treated as an ASCII character (as in C), but if it's longer than 1 byte, the algorithm always reserves 4 bytes in the vector even if the pushed character's length is only 2 or 3 bytes. So even if the vector has just enough space for one of these smaller characters, the vector reallocates its storage (and doubles the size of the buffer). As a result, the string now has an unnecessarily doubled capacity.
I know it doesn't seem like a big problem, but if someone uses push with non-English text, it might become a significant memory overhead. The issue happens when the push brings the string's length up to, or one byte short of, the current capacity (a power of 2 when the buffer grows by doubling), so shorter strings are proportionally more affected.
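The effect can be reproduced on a plain Vec<u8> without touching String internals. This is only a sketch of the reserve behaviour described above: the helper grows_after_reserve is a made-up name, and the starting capacity of 16 is an arbitrary choice for the demonstration.

```rust
/// Hypothetical helper: fills a byte vector until exactly `spare` bytes of
/// capacity remain, reserves `request` additional bytes, and reports whether
/// the capacity had to grow.
fn grows_after_reserve(spare: usize, request: usize) -> bool {
    let mut buf: Vec<u8> = Vec::with_capacity(16);
    // The allocator may hand back more than requested, so read the real value.
    let cap = buf.capacity();
    // Filling up to `cap - spare` stays within capacity, so nothing reallocates.
    buf.resize(cap - spare, b'a');
    assert_eq!(buf.capacity(), cap);

    buf.reserve(request);
    buf.capacity() > cap
}

fn main() {
    let ch = 'é'; // encodes to 2 bytes in UTF-8
    assert_eq!(ch.len_utf8(), 2);

    // Two bytes are free, which is enough for 'é'.
    // Reserving a fixed 4 bytes forces the buffer to grow anyway.
    println!("reserve(4) grew the buffer: {}", grows_after_reserve(2, 4));
    // Reserving only what the character needs leaves the buffer alone.
    println!(
        "reserve(len_utf8) grew the buffer: {}",
        grows_after_reserve(2, ch.len_utf8())
    );
}
```

The exact capacity after growth depends on Vec's growth strategy, which is why the sketch only checks whether the capacity changed.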

A possible solution:

// This is a modified version of the original
pub fn push(&mut self, ch: char) {
    match ch.len_utf8() {
        1 => self.vec.push(ch as u8),
        ch_len => {
            let cur_len = self.len();

            self.vec.reserve(ch_len);

            unsafe {
                // Attempt to not use an intermediate buffer by just pushing bytes
                // directly onto this string.
                let slice = slice::from_raw_parts_mut(
                    self.vec.as_mut_ptr().offset(cur_len as isize),
                    ch_len
                );
                let used = ch.encode_utf8(slice).unwrap_or(0);
                self.vec.set_len(cur_len + used);
            }
        }
    }
}

I benchmarked this version; there was no measurable difference.
There's one problem though: char::len_utf8 is not marked with #[inline], so without LTO, it's currently slower.
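To make the inlining point concrete: a non-generic function defined in another crate can generally only be inlined into callers if it carries #[inline] (or if LTO is enabled). Below is a standalone sketch of a length function with the attribute applied. The name utf8_len is hypothetical and this is not the standard library's source, just an equivalent computation based on the UTF-8 encoding ranges.

```rust
/// Hypothetical stand-in for char::len_utf8, marked #[inline] so that callers
/// in other crates can inline it without relying on LTO.
#[inline]
fn utf8_len(ch: char) -> usize {
    let code = ch as u32;
    if code < 0x80 {
        1 // ASCII
    } else if code < 0x800 {
        2
    } else if code < 0x1_0000 {
        3
    } else {
        4
    }
}

fn main() {
    // Cross-check the sketch against the real method for one char of each width.
    for ch in ['a', 'é', '€', '😄'] {
        assert_eq!(utf8_len(ch), ch.len_utf8());
        println!("{:?} takes {} byte(s)", ch, utf8_len(ch));
    }
}
```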

Gankra · 2015-06-09T16:11:18Z

A PR that marks it inline and does this change would be happily merged 😄

pmarcelll · 2015-06-10T01:15:54Z

I'm a bit unfamiliar with pull requests but it's done 😄

Various methods in both libcore/char.rs and librustc_unicode/char.rs were previously marked with #[inline]; now every method in char's impl blocks is marked. Partially fixes #26124. EDIT: I'm not familiar with pull requests (yet), apparently GitHub added my second commit to this PR... Fixes #26124

steveklabnik added the I-slow (Issue: Problems and improvements with respect to performance of generated code) and A-libs labels Jun 9, 2015

pmarcelll mentioned this issue Jun 9, 2015

Add missing #[inline] to methods related to char and fix related problem in String::push #26154

Merged

bors closed this as completed in #26154 Jun 11, 2015
