Optimize is_ascii for str and [u8]. #74066

Merged
merged 6 commits into from
Jul 12, 2020

Conversation

thomcc
Copy link
Member

@thomcc thomcc commented Jul 5, 2020

This optimizes the is_ascii function for [u8] and str. I've been surprised this wasn't done for a while, so I just did it.

Benchmarks comparing before/after look like:

test ascii::long_readonly::is_ascii_slice_iter_all              ... bench:         174 ns/iter (+/- 79) = 40172 MB/s
test ascii::long_readonly::is_ascii_slice_libcore               ... bench:          16 ns/iter (+/- 5) = 436875 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          12 ns/iter (+/- 3) = 2666 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           2 ns/iter (+/- 0) = 16000 MB/s
test ascii::short_readonly::is_ascii_slice_iter_all             ... bench:           3 ns/iter (+/- 0) = 2333 MB/s
test ascii::short_readonly::is_ascii_slice_libcore              ... bench:           4 ns/iter (+/- 0) = 1750 MB/s

(Taken on a x86_64 macbook 2.9 GHz Intel Core i9 with 6 cores)

Where is_ascii_slice_iter_all is the old version, and is_ascii_slice_libcore is the new.

I tried to document the code well, so hopefully it's understandable. It has fairly exhaustive tests ensuring size/align doesn't get violated. Because miri doesn't really help a lot for this sort of code right now, I tried to debug_assert all the safety invariants I'm depending on. (Of course, none of the asserts are required for correctness or soundness; they just allow us to test that this sort of pointer manipulation is sound.)

Anyway, thanks. Let me know if you have questions/desired changes.

@rust-highfive
Copy link
Collaborator

r? @shepmaster

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jul 5, 2020
@jonas-schievink
Copy link
Contributor

is_ascii_slice_libcore ... bench: 16 ns/iter (+/- 5) = 436875 MB/s

That's over 400 GB/s, are you sure that's measuring the correct thing?

@thomcc
Copy link
Member Author

thomcc commented Jul 5, 2020

Good point!

@ecstatic-morse
Copy link
Contributor

You should have a look at slice::align_to.

@thomcc
Copy link
Member Author

thomcc commented Jul 5, 2020

@ecstatic-morse: The algorithm I'm using is different (and more efficient on most machines) than what I'd get with slice::align_to.

@jonas-schievink: Not sure I know how to debug this really -- all the timings from that module seem dodgy. Well over 30GB/s for most of them, and those also contain to_vec() of a ~6000 byte slice.

test ascii::long::is_ascii_alphabetic                           ... bench:         186 ns/iter (+/- 17) = 37580 MB/s
test ascii::long::is_ascii_alphanumeric                         ... bench:         208 ns/iter (+/- 62) = 33605 MB/s
test ascii::long::is_ascii_control                              ... bench:         238 ns/iter (+/- 89) = 29369 MB/s
test ascii::long::is_ascii_digit                                ... bench:         301 ns/iter (+/- 199) = 23222 MB/s
test ascii::long::is_ascii_graphic                              ... bench:         252 ns/iter (+/- 242) = 27738 MB/s
test ascii::long::is_ascii_hexdigit                             ... bench:         213 ns/iter (+/- 87) = 32816 MB/s
test ascii::long::is_ascii_lowercase                            ... bench:         239 ns/iter (+/- 114) = 29246 MB/s
test ascii::long::is_ascii_punctuation                          ... bench:         197 ns/iter (+/- 103) = 35482 MB/s
test ascii::long::is_ascii_uppercase                            ... bench:         188 ns/iter (+/- 34) = 37180 MB/s
test ascii::long::is_ascii_whitespace                           ... bench:         207 ns/iter (+/- 195) = 33768 MB/s
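
The throughput column can be sanity-checked from the timings. A minimal sketch, assuming the harness derives MB/s as bytes × 1000 / ns per iteration (an assumption, though it reproduces the figures quoted in this thread) and assuming the long input is 6990 bytes (inferred from those figures, not stated anywhere in the thread). The helper names are made up for illustration:

```rust
/// Throughput in MB/s for `bytes` processed in `ns` nanoseconds.
/// Assumed formula: bytes * 1000 / ns, using integer division.
fn mb_per_s(bytes: u64, ns: u64) -> u64 {
    bytes * 1000 / ns
}

/// Bytes handled per clock cycle at a given clock rate.
/// A clock of `ghz` GHz executes `ghz` cycles per nanosecond.
fn bytes_per_cycle(bytes: u64, ns: u64, ghz: f64) -> f64 {
    bytes as f64 / (ns as f64 * ghz)
}
```

Under those assumptions, 6990 bytes in 16 ns on a 2.9 GHz core works out to roughly 150 bytes per cycle, which is why the long-input numbers look suspicious.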

@ecstatic-morse
Copy link
Contributor

Simple code is good code, and people running ARM may also want to check whether their strings are ASCII-only. If your more complex version is meaningfully more efficient, then fine, but I suspect it's not.

@thomcc
Copy link
Member Author

thomcc commented Jul 5, 2020

test ascii::long_readonly::is_ascii_slice_align_to              ... bench:          23 ns/iter (+/- 5) = 303913 MB/s
test ascii::long_readonly::is_ascii_slice_iter_all              ... bench:         161 ns/iter (+/- 24) = 43416 MB/s
test ascii::long_readonly::is_ascii_slice_libcore               ... bench:          16 ns/iter (+/- 4) = 436875 MB/s
test ascii::medium_readonly::is_ascii_slice_align_to            ... bench:           6 ns/iter (+/- 1) = 5333 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          14 ns/iter (+/- 6) = 2285 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           3 ns/iter (+/- 3) = 10666 MB/s
test ascii::short_readonly::is_ascii_slice_align_to             ... bench:           8 ns/iter (+/- 2) = 875 MB/s
test ascii::short_readonly::is_ascii_slice_iter_all             ... bench:           3 ns/iter (+/- 3) = 2333 MB/s
test ascii::short_readonly::is_ascii_slice_libcore              ... bench:           4 ns/iter (+/- 5) = 1750 MB/s

for

    fn is_ascii_slice_align_to(bytes: &[u8]) -> bool {
        // True if any byte of `v` has its high bit set.
        fn contains_nonascii(v: usize) -> bool {
            const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
            (NONASCII_MASK & v) != 0
        }
        // SAFETY: every bit pattern is a valid `usize`.
        let (head, body, tail) = unsafe { bytes.align_to::<usize>() };
        head.iter().all(|b| b.is_ascii()) &&
        body.iter().all(|w| !contains_nonascii(*w)) &&
        tail.iter().all(|b| b.is_ascii())
    }

Of course with the standard caveat of "I spent far longer on my version than this version".

Also: On modern ARM (armv7+) read_unaligned is only slow if the address is actually unaligned, and even then it's really not that bad in my experience. Like, I don't have numbers for this but it seems around the same as just reading the bytes.

There are platforms where this is worse though, but I'd expect those to compile read_unaligned down to load each byte. Which is what the align_to version has to do (I'll admit, I find align_to a little annoying for this reason).
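
To make the head/body/tail split concrete, here is a small sketch that reports how many bytes land in each part when a byte slice is split with `align_to::<usize>()`. The helper name `split_lengths` is hypothetical and only serves the illustration:

```rust
use std::mem::size_of;

/// Returns (head length in bytes, body length in words, tail length in bytes).
fn split_lengths(bytes: &[u8]) -> (usize, usize, usize) {
    // SAFETY: every bit pattern is a valid `usize`, so viewing initialized
    // bytes as `usize` words is sound.
    let (head, body, tail) = unsafe { bytes.align_to::<usize>() };
    (head.len(), body.len(), tail.len())
}

/// Total number of bytes covered by a split; always equals the input length.
fn covered_bytes(split: (usize, usize, usize)) -> usize {
    split.0 + split.1 * size_of::<usize>() + split.2
}
```

For a 32-byte slice that starts one byte past a word boundary, a typical 64-bit target would give a 7-byte head, three words, and a 1-byte tail, so a quarter of the input goes through the per-byte loops. The exact split is not guaranteed by `align_to`; only the total is.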

Anyway, I actually noticed is_ascii was bad because I was optimizing firefox for android's (the "preview") autocomplete search implementation, so I do care very much about ARM performance for this code.

@ecstatic-morse
Copy link
Contributor

I would have added a fast path for when slice.len() < mem::size_of::<usize>() to the align_to version since your version has a similar check that makes it fast on the small benchmark. I will note that the margin of error for both versions overlaps, but I would call your version meaningfully more efficient on x86 at least. Thanks for doing the work.

@shepmaster
Copy link
Member

Beyond my knowledge, let's try...

r? @sfackler

/cc @BurntSushi as I know your interest in these matters.

@rust-highfive rust-highfive assigned sfackler and unassigned shepmaster Jul 5, 2020
Copy link
Member

@nagisa nagisa left a comment

Consider slice::align_to instead of align_offset? It also takes care of various other asserts you have in the rest of this function as well as removes the necessity for the "thoroughly" test case and significantly reduce the amount of unsafe in the code...

EDIT: wrote this before I saw the other comments...

The algorithm I'm using is different (and more efficient on most machines) than what I'd get with slice::align_to.

That likely means slice::align_to can be improved instead. I’ll experiment with it.

@thomcc
Copy link
Member Author

thomcc commented Jul 5, 2020

Consider slice::align_to instead of align_offset? It also takes care of various other asserts you have in the rest of this function as well as removes the necessity for the "thoroughly" test case and significantly reduce the amount of unsafe in the code...

This is discussed in other comments above, it's measurably less performant (on my machine, anyway).

I suspect in practice it will do a little worse than it does on the benchmark I linked in #74066 (comment) (at least for the case that caused me to file this PR -- mostly medium-length strings). That is, the benchmarks don't really hit the cases where align_to is bad -- And I specifically wrote my function the way I did to avoid the cases where align_to is bad (concretely: to avoid a per-byte loop on the unaligned case).

To demonstrate this, if I modify the code* so that first and last bytes of the test string are skipped (to force unaligned on both ends), you see the following for medium (small is smaller than a usize, and long is too large for something like this to matter):

test ascii::medium_readonly::is_ascii_slice_align_to            ... bench:          12 ns/iter (+/- 4) = 2500 MB/s
test ascii::medium_readonly::is_ascii_slice_iter_all            ... bench:          10 ns/iter (+/- 2) = 3000 MB/s
test ascii::medium_readonly::is_ascii_slice_libcore             ... bench:           2 ns/iter (+/- 1) = 15000 MB/s

Note that previously medium would have been aligned to usize (came right out of an allocator), and was 32 bytes -- e.g. it would totally avoid the problem cases for align_to.
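
The idea of avoiding a per-byte loop at the unaligned ends can be sketched with overlapping word reads. This is not the code from the PR: it is a simplified, safe illustration that copies each word out with `from_ne_bytes`, treats every load as potentially unaligned, and uses made-up names (`word_at`, `is_ascii_overlapping`). The PR, per this thread, works with `align_offset` and `read_unaligned` instead.

```rust
use std::mem::size_of;

const WORD: usize = size_of::<usize>();
// 0x80 repeated in every byte of a `usize`.
const NONASCII_MASK: usize = usize::MAX / 0xFF * 0x80;

/// Loads one word starting at byte offset `pos`, with no alignment requirement.
fn word_at(bytes: &[u8], pos: usize) -> usize {
    let mut buf = [0u8; WORD];
    buf.copy_from_slice(&bytes[pos..pos + WORD]);
    usize::from_ne_bytes(buf)
}

fn is_ascii_overlapping(bytes: &[u8]) -> bool {
    let len = bytes.len();
    // Inputs shorter than one word fall back to the per-byte check.
    if len < WORD {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // Check whole words from the front.
    let mut pos = 0;
    while pos + WORD <= len {
        if word_at(bytes, pos) & NONASCII_MASK != 0 {
            return false;
        }
        pos += WORD;
    }
    // Everything was covered by whole words; skip the redundant last load.
    if pos == len {
        return true;
    }
    // Otherwise read the final word so that it ends exactly at `len`. It
    // overlaps bytes that were already checked, which is harmless here.
    word_at(bytes, len - WORD) & NONASCII_MASK == 0
}
```

The overlapping final read replaces a loop of up to `WORD - 1` single-byte checks with one load and one mask test.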

In practice a case like this bit me before (it was with SIMD vectors, which exacerbates the issue, since they have longer tail/head sequences, but...). Which is... admittedly why I hold a little bit of a grudge against align_to >_>

That said, I do feel like I'm getting a signal of "This is not worth the complexity", and I'd rather have an is_ascii that uses align_to than the current one, and I'm totally willing to change it if desired... I also have a few ideas on how to make it perform closer to my implementation, although IDK if there's any way to avoid bad performance in its problem cases.

That said, IMO the stdlib is a better-than-average place to make a complexity/performance trade-off like this, so I'm going to probably wait for the reviewer to tell me to switch to align_to if desired.


* Uh, I'll have the is_ascii_slice_align_to bench up in a few minutes after sending this comment, but it's basically what's #74066 (comment) + the suggestion from #74066 (comment).

I probably won't add the hacky code I used to force misalignment (replacing vec[..] with vec[1..vec.len()-1]...) and generate the benchmark above, but I'll try to add a non-hacky benchmark that measures the misaligned case soon. I've added this in a not-entirely-hacky way.

EDIT: (My comment was posted before I saw your response too)

That likely means slice::align_to can be improved instead. I’ll experiment with it.

@nagisa IMO it's kind of a problem with the API rather than something like a missed optimization (although, to be clear, align_to compiles weird: LLVM often aggressively unrolls the loops on the head/tail, e.g. https://godbolt.org/z/N7t96u).

In particular a big issue for me is that coming from &[u8] is common, but you often really don't want to be doing operations on each u8 in the lead/trail. In practice it's a little annoying since it feels like 80% of what I want, but the remaining 20%... If I'm reaching for it at all.

That said, I can't deny it's much more readable and maintainable than what I came up with.

@nagisa
Copy link
Member

nagisa commented Jul 5, 2020

I extracted the implementations into a local crate and wrote a criterion-based benchmark… and tested the three implementations on both x86_64 and an aarch64. From what I can see, the proposed implementation is (noticeably) better than the align_to implementation at input sizes ~8 bytes to ~1kB, after which point the align_to begins overtaking the proposed implementation.

This makes sense to me given that align_to computes the slices ahead of time and the align_offset version checks its ability to stay in the fast path on every iteration.

So ultimately it's probably a question of what kind of input we want to optimize for (and I think the medium length case is the most interesting one, favouring the currently proposed implementation).

That said, I also see extreme (>100%) run-to-run variance in some of the benchmark results, even though I did make an effort to avoid most of the well known pitfalls in benchmarking, so whatever data my benchmark is generating is gibberish anyway… This also makes it super difficult to just experiment with the minor changes to the code...

@thomcc
Copy link
Member Author

thomcc commented Jul 5, 2020

One idea I had was something like:

// Helper from the earlier `align_to` snippet: true if any byte of `v` has its high bit set.
fn contains_nonascii(v: usize) -> bool {
    const NONASCII_MASK: usize = 0x80808080_80808080u64 as usize;
    (NONASCII_MASK & v) != 0
}

fn is_ascii_align_to_unrolled(bytes: &[u8]) -> bool {
    // Not really clear if this should be testing size_of [usize; 2] or size_of usize still...
    if bytes.len() < core::mem::size_of::<[usize; 2]>() {
        return bytes.iter().all(|b| b.is_ascii());
    }
    // SAFETY: transmuting a sequence of `u8` to `[usize; 2]` is always fine
    let (head, body, tail) = unsafe { bytes.align_to::<[usize; 2]>() };
    head.iter().all(|b| b.is_ascii())
        && body.iter().all(|w| !contains_nonascii(w[0] | w[1]))
        && tail.iter().all(|b| b.is_ascii())
}

And this actually is as fast or almost as fast as the main impl in the PR... So long as your string's length is a multiple of size_of::<usize>() (Actually, I guess it would be size_of::<[usize; 2]>(), since align_to's tail is based on the size of the type, not its alignment...).

Which seems like a pretty common case.
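
The `w[0] | w[1]` step in the loop body works because OR-ing two words preserves any high bit set in either of them, so a single mask test covers both words. A minimal sketch of just that step, with `pair_contains_nonascii` as a hypothetical helper name and the mask written in a width-independent form:

```rust
// 0x80 repeated in every byte of a `usize`.
const NONASCII_MASK: usize = usize::MAX / 0xFF * 0x80;

/// True if any byte of `v` has its high bit set.
fn contains_nonascii(v: usize) -> bool {
    (v & NONASCII_MASK) != 0
}

/// Tests two words with a single mask operation by OR-ing them first.
fn pair_contains_nonascii(pair: [usize; 2]) -> bool {
    contains_nonascii(pair[0] | pair[1])
}
```

The combined test gives the same answer as testing each word separately, at the cost of not knowing which of the two words held the offending byte, which `is_ascii` does not need.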

Anyway my most recent push adds a bench for that, separates out benches for different alignment issues that align_to can have problems with (head vs tail vs both), and moves the benchmarks to a different file, which cleaned a lot of things up. I probably should have been a bit better about making it clear which cases I was trying to optimize (or avoid pessimizing) in my initial PR.

A run of the benchmark results is here.

This makes sense to me given that align_to computes the slices ahead of time and the align_offset version checks its ability to stay in the fast path on every iteration.

Hm. I'm going to think about this since I feel like there's no reason we shouldn't be doing the same. I'm not fully sure I follow though...

That said, there are really a few things I probably should do to improve the implementation of is_ascii in the PR (it probably should unroll the inner loop once), but it might also be complex enough as-is, given that its complexity already caused minor controversy.

@nagisa
Copy link
Member

nagisa commented Jul 11, 2020

I’m going to approve this. I think optimizing for performance with smaller strings here makes sense and the algorithm as implemented here will do better with said smaller strings specifically because it just does less work.

I also don't see a reason to block this on benchmarks that are super fuzzy anyway...

@bors r+

@bors
Copy link
Contributor

bors commented Jul 11, 2020

📌 Commit a150dcc has been approved by nagisa

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 11, 2020
bors added a commit to rust-lang-ci/rust that referenced this pull request Jul 12, 2020
…arth

Rollup of 10 pull requests

Successful merges:

 - rust-lang#72920 (Stabilize `transmute` in constants and statics but not const fn)
 - rust-lang#73715 (debuginfo:  Mangle tuples to be natvis friendly, typedef basic types)
 - rust-lang#74066 (Optimize is_ascii for str and [u8].)
 - rust-lang#74116 (Fix cross compilation of LLVM to aarch64 Windows targets)
 - rust-lang#74167 (linker: illumos ld does not support --eh-frame-hdr)
 - rust-lang#74168 (Add a help to use `in_band_lifetimes` in nightly)
 - rust-lang#74197 (Reword incorrect `self` token suggestion)
 - rust-lang#74213 (Minor refactor for rustc_resolve diagnostics match)
 - rust-lang#74240 (Fix rust-lang#74081 and add the test case from rust-lang#74236)
 - rust-lang#74241 (update miri)

Failed merges:

r? @ghost
@bors bors merged commit 1979fa8 into rust-lang:master Jul 12, 2020
Comment on lines +2932 to +2934
if byte_pos == len {
    return true;
}
Copy link
Contributor

@pickfire pickfire Jul 20, 2020

Is this necessary? What happens if we always check last_word? As in making it branchless.

Copy link
Member Author

It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.

Copy link
Contributor

@pickfire pickfire Jul 20, 2020

It's an optimization to avoid an extra (redundant) load. It's not necessary for correctness.

Avoid extra load? But what happens if you remove it? Along with the debug assert.

Copy link
Member Author

Then we perform an extra load of the last word, which we've already checked. What's your point?

Copy link
Contributor

@pickfire pickfire Jul 20, 2020

Yes, that's what I mean, perform the extra load of the last word to remove the branch. Or we could rework the logic a bit and remove this branch, I wonder would it be faster.

Copy link
Member Author

Yes, that doesn't seem worth it to me at all. If you can show it's faster in benchmarks (which there are many), I suggest you submit a PR.

Keep in mind on some platforms a read_unaligned is comparatively expensive, so it's worth avoiding if possible.

Copy link
Contributor

Do you have a minimal testing crate to test this easily without having to build rust?

Copy link
Member Author

No, sorry.

bors added a commit to rust-lang-ci/rust that referenced this pull request Aug 16, 2020
Remove branch in optimized is_ascii

Performs slightly better on short or medium inputs by eliminating
the last branch check on `byte_pos == len` and always checking the
trailing bytes, which span at most one `usize`.

Benchmark: before is `libcore`, after is `libcore_new`. It improves
medium and short by 1ns but regresses unaligned_tail by 2ns.
Either way, the unaligned_tail case has only a small chance of 1/8
on a 64-bit machine. I don't think we should bet on that; the
probability is worse than dice.

```
test long::case00_libcore                     ... bench:          38 ns/iter (+/- 1) = 183947 MB/s
test long::case00_libcore_new                 ... bench:          38 ns/iter (+/- 1) = 183947 MB/s
test long::case01_iter_all                    ... bench:         227 ns/iter (+/- 6) = 30792 MB/s
test long::case02_align_to                    ... bench:          40 ns/iter (+/- 1) = 174750 MB/s
test long::case03_align_to_unrolled           ... bench:          19 ns/iter (+/- 1) = 367894 MB/s
test medium::case00_libcore                   ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test medium::case00_libcore_new               ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test medium::case01_iter_all                  ... bench:          20 ns/iter (+/- 1) = 1600 MB/s
test medium::case02_align_to                  ... bench:           6 ns/iter (+/- 0) = 5333 MB/s
test medium::case03_align_to_unrolled         ... bench:           5 ns/iter (+/- 0) = 6400 MB/s
test short::case00_libcore                    ... bench:           7 ns/iter (+/- 0) = 1000 MB/s
test short::case00_libcore_new                ... bench:           6 ns/iter (+/- 0) = 1166 MB/s
test short::case01_iter_all                   ... bench:           5 ns/iter (+/- 0) = 1400 MB/s
test short::case02_align_to                   ... bench:           5 ns/iter (+/- 0) = 1400 MB/s
test short::case03_align_to_unrolled          ... bench:           5 ns/iter (+/- 1) = 1400 MB/s
test unaligned_both::case00_libcore           ... bench:           4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case00_libcore_new       ... bench:           4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case01_iter_all          ... bench:          26 ns/iter (+/- 0) = 1153 MB/s
test unaligned_both::case02_align_to          ... bench:          13 ns/iter (+/- 2) = 2307 MB/s
test unaligned_both::case03_align_to_unrolled ... bench:          11 ns/iter (+/- 0) = 2727 MB/s
test unaligned_head::case00_libcore           ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case00_libcore_new       ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case01_iter_all          ... bench:          19 ns/iter (+/- 1) = 1631 MB/s
test unaligned_head::case02_align_to          ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_head::case03_align_to_unrolled ... bench:          14 ns/iter (+/- 0) = 2214 MB/s
test unaligned_tail::case00_libcore           ... bench:           3 ns/iter (+/- 0) = 10333 MB/s
test unaligned_tail::case00_libcore_new       ... bench:           5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_tail::case01_iter_all          ... bench:          19 ns/iter (+/- 0) = 1631 MB/s
test unaligned_tail::case02_align_to          ... bench:          10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_tail::case03_align_to_unrolled ... bench:          13 ns/iter (+/- 0) = 2384 MB/s
```

Rough (unfair) maths on improvements for fun: 1ns * 7/8 - 2ns * 1/8 = 0.625ns

Inspired by fish and zsh clever trick to highlight missing linefeeds (⏎)
and branchless implementation of binary_search in rust.

cc @thomcc rust-lang#74066
r? @nagisa
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 6, 2022
Optimize `core::str::Chars::count`

I wrote this a while ago after seeing this function as a bottleneck in a profile, but never got around to contributing it. I saw it again, and so here it is. The implementation is fairly complex, but I tried to explain what's happening at both a high level (in the header comment for the file), and in line comments in the impl. Hopefully it's clear enough.

This implementation (`case00_cur_libcore` in the benchmarks below) is somewhat consistently around 4x to 5x faster than the old implementation (`case01_old_libcore` in the benchmarks below), for a wide variety of workloads, without regressing performance on any of the workload sizes I've tried.

I also improved the benchmarks for this code, so that they explicitly check text in different languages and of different sizes (err, the cross product of language x size). The results of the benchmarks are here:

<details>
<summary>Benchmark results</summary>
<pre>
test str::char_count::emoji_huge::case00_cur_libcore       ... bench:      20,216 ns/iter (+/- 3,673) = 17931 MB/s
test str::char_count::emoji_huge::case01_old_libcore       ... bench:     108,851 ns/iter (+/- 12,777) = 3330 MB/s
test str::char_count::emoji_huge::case02_iter_increment    ... bench:     329,502 ns/iter (+/- 4,163) = 1100 MB/s
test str::char_count::emoji_huge::case03_manual_char_len   ... bench:     223,333 ns/iter (+/- 14,167) = 1623 MB/s
test str::char_count::emoji_large::case00_cur_libcore      ... bench:         293 ns/iter (+/- 6) = 19331 MB/s
test str::char_count::emoji_large::case01_old_libcore      ... bench:       1,681 ns/iter (+/- 28) = 3369 MB/s
test str::char_count::emoji_large::case02_iter_increment   ... bench:       5,166 ns/iter (+/- 85) = 1096 MB/s
test str::char_count::emoji_large::case03_manual_char_len  ... bench:       3,476 ns/iter (+/- 62) = 1629 MB/s
test str::char_count::emoji_medium::case00_cur_libcore     ... bench:          48 ns/iter (+/- 0) = 14750 MB/s
test str::char_count::emoji_medium::case01_old_libcore     ... bench:         217 ns/iter (+/- 4) = 3262 MB/s
test str::char_count::emoji_medium::case02_iter_increment  ... bench:         642 ns/iter (+/- 7) = 1102 MB/s
test str::char_count::emoji_medium::case03_manual_char_len ... bench:         445 ns/iter (+/- 3) = 1591 MB/s
test str::char_count::emoji_small::case00_cur_libcore      ... bench:          18 ns/iter (+/- 0) = 3777 MB/s
test str::char_count::emoji_small::case01_old_libcore      ... bench:          23 ns/iter (+/- 0) = 2956 MB/s
test str::char_count::emoji_small::case02_iter_increment   ... bench:          66 ns/iter (+/- 2) = 1030 MB/s
test str::char_count::emoji_small::case03_manual_char_len  ... bench:          29 ns/iter (+/- 1) = 2344 MB/s
test str::char_count::en_huge::case00_cur_libcore          ... bench:      25,909 ns/iter (+/- 39,260) = 13299 MB/s
test str::char_count::en_huge::case01_old_libcore          ... bench:     102,887 ns/iter (+/- 3,257) = 3349 MB/s
test str::char_count::en_huge::case02_iter_increment       ... bench:     166,370 ns/iter (+/- 12,439) = 2071 MB/s
test str::char_count::en_huge::case03_manual_char_len      ... bench:     166,332 ns/iter (+/- 4,262) = 2071 MB/s
test str::char_count::en_large::case00_cur_libcore         ... bench:         281 ns/iter (+/- 6) = 19160 MB/s
test str::char_count::en_large::case01_old_libcore         ... bench:       1,598 ns/iter (+/- 19) = 3369 MB/s
test str::char_count::en_large::case02_iter_increment      ... bench:       2,598 ns/iter (+/- 167) = 2072 MB/s
test str::char_count::en_large::case03_manual_char_len     ... bench:       2,578 ns/iter (+/- 55) = 2088 MB/s
test str::char_count::en_medium::case00_cur_libcore        ... bench:          44 ns/iter (+/- 1) = 15295 MB/s
test str::char_count::en_medium::case01_old_libcore        ... bench:         201 ns/iter (+/- 51) = 3348 MB/s
test str::char_count::en_medium::case02_iter_increment     ... bench:         322 ns/iter (+/- 40) = 2090 MB/s
test str::char_count::en_medium::case03_manual_char_len    ... bench:         319 ns/iter (+/- 5) = 2109 MB/s
test str::char_count::en_small::case00_cur_libcore         ... bench:          15 ns/iter (+/- 0) = 2333 MB/s
test str::char_count::en_small::case01_old_libcore         ... bench:          14 ns/iter (+/- 0) = 2500 MB/s
test str::char_count::en_small::case02_iter_increment      ... bench:          30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::en_small::case03_manual_char_len     ... bench:          30 ns/iter (+/- 1) = 1166 MB/s
test str::char_count::ru_huge::case00_cur_libcore          ... bench:      16,439 ns/iter (+/- 3,105) = 19777 MB/s
test str::char_count::ru_huge::case01_old_libcore          ... bench:      89,480 ns/iter (+/- 2,555) = 3633 MB/s
test str::char_count::ru_huge::case02_iter_increment       ... bench:     217,703 ns/iter (+/- 22,185) = 1493 MB/s
test str::char_count::ru_huge::case03_manual_char_len      ... bench:     157,330 ns/iter (+/- 19,188) = 2066 MB/s
test str::char_count::ru_large::case00_cur_libcore         ... bench:         243 ns/iter (+/- 6) = 20905 MB/s
test str::char_count::ru_large::case01_old_libcore         ... bench:       1,384 ns/iter (+/- 51) = 3670 MB/s
test str::char_count::ru_large::case02_iter_increment      ... bench:       3,381 ns/iter (+/- 543) = 1502 MB/s
test str::char_count::ru_large::case03_manual_char_len     ... bench:       2,423 ns/iter (+/- 429) = 2096 MB/s
test str::char_count::ru_medium::case00_cur_libcore        ... bench:          42 ns/iter (+/- 1) = 15119 MB/s
test str::char_count::ru_medium::case01_old_libcore        ... bench:         180 ns/iter (+/- 4) = 3527 MB/s
test str::char_count::ru_medium::case02_iter_increment     ... bench:         402 ns/iter (+/- 45) = 1579 MB/s
test str::char_count::ru_medium::case03_manual_char_len    ... bench:         280 ns/iter (+/- 29) = 2267 MB/s
test str::char_count::ru_small::case00_cur_libcore         ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case01_old_libcore         ... bench:          12 ns/iter (+/- 0) = 2666 MB/s
test str::char_count::ru_small::case02_iter_increment      ... bench:          19 ns/iter (+/- 0) = 1684 MB/s
test str::char_count::ru_small::case03_manual_char_len     ... bench:          14 ns/iter (+/- 1) = 2285 MB/s
test str::char_count::zh_huge::case00_cur_libcore          ... bench:      15,053 ns/iter (+/- 2,640) = 20067 MB/s
test str::char_count::zh_huge::case01_old_libcore          ... bench:      82,622 ns/iter (+/- 3,602) = 3656 MB/s
test str::char_count::zh_huge::case02_iter_increment       ... bench:     230,456 ns/iter (+/- 7,246) = 1310 MB/s
test str::char_count::zh_huge::case03_manual_char_len      ... bench:     220,595 ns/iter (+/- 11,624) = 1369 MB/s
test str::char_count::zh_large::case00_cur_libcore         ... bench:         227 ns/iter (+/- 65) = 20792 MB/s
test str::char_count::zh_large::case01_old_libcore         ... bench:       1,136 ns/iter (+/- 144) = 4154 MB/s
test str::char_count::zh_large::case02_iter_increment      ... bench:       3,147 ns/iter (+/- 253) = 1499 MB/s
test str::char_count::zh_large::case03_manual_char_len     ... bench:       2,993 ns/iter (+/- 400) = 1577 MB/s
test str::char_count::zh_medium::case00_cur_libcore        ... bench:          36 ns/iter (+/- 5) = 16388 MB/s
test str::char_count::zh_medium::case01_old_libcore        ... bench:         142 ns/iter (+/- 18) = 4154 MB/s
test str::char_count::zh_medium::case02_iter_increment     ... bench:         379 ns/iter (+/- 37) = 1556 MB/s
test str::char_count::zh_medium::case03_manual_char_len    ... bench:         364 ns/iter (+/- 51) = 1620 MB/s
test str::char_count::zh_small::case00_cur_libcore         ... bench:          11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case01_old_libcore         ... bench:          11 ns/iter (+/- 1) = 3000 MB/s
test str::char_count::zh_small::case02_iter_increment      ... bench:          20 ns/iter (+/- 3) = 1650 MB/s
</pre>
</details>

I also added fairly thorough tests for different sizes and alignments. This completes on my machine in 0.02s, which is surprising given how thorough they are, but it seems to detect bugs in the implementation. (I haven't run the tests on a 32 bit machine yet since before I reworked the code a little though, so... hopefully I'm not about to embarrass myself).

This uses similar SWAR-style techniques to the `is_ascii` impl I contributed in rust-lang#74066, so I'm going to request review from the same person who reviewed that one. That said, I am not particularly picky, and might not have the correct syntax for requesting a review from someone (so it goes).
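
For orientation, counting characters in UTF-8 reduces to a byte-level test: every character contributes exactly one byte that is not a continuation byte (continuation bytes have the form `10xxxxxx`). The sketch below is a plain byte-at-a-time reference with a made-up name, not the SWAR implementation described in this commit message; a word-at-a-time version would apply the same test to several bytes per step.

```rust
/// Counts the characters in `s` by counting bytes that are not UTF-8
/// continuation bytes (i.e. bytes whose top two bits are not `10`).
fn count_chars_bytewise(s: &str) -> usize {
    s.as_bytes()
        .iter()
        .filter(|&&b| (b & 0xC0) != 0x80)
        .count()
}
```

Because `&str` is guaranteed to hold valid UTF-8, the result agrees with `s.chars().count()`.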

r? `@nagisa`
@cuviper cuviper added this to the 1.46 milestone May 2, 2024
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 20, 2024
Optimize `is_ascii` for `str` and `[u8]` further

Replace the existing optimized function with one that enables auto-vectorization.

This is especially beneficial on x86-64 as `pmovmskb` can be emitted with careful structuring of the code. The instruction can detect non-ASCII characters one vector register width at a time instead of the current `usize` at a time check.

The resulting implementation is completely safe.
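
As an illustration of the kind of safe loop structure that a compiler may be able to auto-vectorize, here is a hedged sketch. It is not the code from this PR: the function name and the chunk size of 32 are arbitrary choices for the example, and whether any particular instruction (such as `pmovmskb`) is emitted depends on the compiler version and target.

```rust
/// Checks fixed-size chunks by OR-ing all their bytes together and testing
/// the high bit once per chunk; the remainder is checked byte by byte.
fn is_ascii_chunked(bytes: &[u8]) -> bool {
    const CHUNK: usize = 32; // arbitrary size chosen for this sketch
    let mut chunks = bytes.chunks_exact(CHUNK);
    for chunk in &mut chunks {
        // A branch-free inner loop over a fixed-length chunk.
        let mut acc = 0u8;
        for &b in chunk {
            acc |= b;
        }
        if acc & 0x80 != 0 {
            return false;
        }
    }
    chunks.remainder().iter().all(|b| b.is_ascii())
}
```

The inner loop has a fixed trip count and no early exit, which leaves the compiler free to process several bytes per instruction if it chooses to.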

`case00_libcore` is the current implementation, `case04_while_loop` is this PR.
```
benchmarks:
    ascii::is_ascii_slice::long::case00_libcore                             22.25/iter  +/- 1.09
    ascii::is_ascii_slice::long::case04_while_loop                           6.78/iter  +/- 0.92
    ascii::is_ascii_slice::medium::case00_libcore                            2.81/iter  +/- 0.39
    ascii::is_ascii_slice::medium::case04_while_loop                         1.56/iter  +/- 0.78
    ascii::is_ascii_slice::short::case00_libcore                             5.55/iter  +/- 0.85
    ascii::is_ascii_slice::short::case04_while_loop                          3.75/iter  +/- 0.22
    ascii::is_ascii_slice::unaligned_both_long::case00_libcore              26.59/iter  +/- 0.66
    ascii::is_ascii_slice::unaligned_both_long::case04_while_loop            5.78/iter  +/- 0.16
    ascii::is_ascii_slice::unaligned_both_medium::case00_libcore             2.97/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_both_medium::case04_while_loop          2.41/iter  +/- 0.10
    ascii::is_ascii_slice::unaligned_head_long::case00_libcore              23.71/iter  +/- 0.79
    ascii::is_ascii_slice::unaligned_head_long::case04_while_loop            7.83/iter  +/- 1.31
    ascii::is_ascii_slice::unaligned_head_medium::case00_libcore             3.69/iter  +/- 0.54
    ascii::is_ascii_slice::unaligned_head_medium::case04_while_loop          7.05/iter  +/- 0.32
    ascii::is_ascii_slice::unaligned_tail_long::case00_libcore              24.44/iter  +/- 1.41
    ascii::is_ascii_slice::unaligned_tail_long::case04_while_loop            5.12/iter  +/- 0.18
    ascii::is_ascii_slice::unaligned_tail_medium::case00_libcore             3.24/iter  +/- 0.40
    ascii::is_ascii_slice::unaligned_tail_medium::case04_while_loop          2.86/iter  +/- 0.14

```

`unaligned_head_medium` is the main regression in the benchmarks. Its input is a 32-byte string sliced as `bytes[1..]`.
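
To make the shape of that input concrete, here is a small sketch of how such a slice could be built. The helper `unaligned_head` is hypothetical and is not taken from the benchmark source.

```rust
/// Drops the first byte, so the slice starts one byte past the start of
/// the original buffer and is one byte shorter.
fn unaligned_head(bytes: &[u8]) -> &[u8] {
    &bytes[1..]
}

fn main() {
    let data = [b'a'; 32];
    let sliced = unaligned_head(&data);

    // The slice is 31 bytes long and begins at an offset of 1 from the
    // start of `data`, whatever alignment `data` itself happens to have.
    let offset = sliced.as_ptr() as usize - data.as_ptr() as usize;
    println!("len = {}, offset from start = {}", sliced.len(), offset);
    println!("is_ascii = {}", sliced.is_ascii());
}
```

The resulting 31-byte input is both shifted and shorter than the original 32 bytes, so any implementation that works in fixed-size steps may handle it differently from the unsliced string.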

The first commit can be used to run the benchmarks against the current core implementation.

The previous implementation was added in rust-lang#74066.

---

Two potential drawbacks of this implementation are that it increases instruction count and may regress other platforms/architectures. The benches here may also be too artificial to glean much insight from.
https://rust.godbolt.org/z/G9znGfY36
bors added a commit to rust-lang-ci/rust that referenced this pull request Dec 22, 2024
Optimize `is_ascii` for `str` and `[u8]` further