Add fast path for ASCII in UTF-8 validation #30740
r? @brson (rust_highfive has picked a reviewer for you, use r? to override)
Benchmarks using long texts are here: https://gist.github.com/bluss/bf45e07e711238e22b7a
There is a 2-3% slowdown on Japanese and Cyrillic texts that are mostly non-ASCII. I don't have a problem championing that regression, given the speedup on UTF-8 validation for predominantly ASCII input. The example texts are pretty arbitrary, the Wikipedia texts a /little/ less so.
@@ -468,6 +468,18 @@ fn test_is_utf8() {
assert!(from_utf8(&[0xEF, 0xBF, 0xBF]).is_ok());
assert!(from_utf8(&[0xF0, 0x90, 0x80, 0x80]).is_ok());
assert!(from_utf8(&[0xF4, 0x8F, 0xBF, 0xBF]).is_ok());

// deny embedded in long stretches of ascii
I don't really know what this specific set of tests is doing.
I always have a bit of a sad when there are these giant "test everything" tests; my personal pref would be another test like `is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii`. No need to add `test_`, no need to have a comment, a failed test tells you what failed. 😸
Entirely reasonable, no reason to share test name there, no common setup or anything. Fixed to have its own test function.
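A standalone test along the lines suggested above might look like the following. This is only a sketch: the helper `ascii_with_byte_at` and the chosen lengths are assumptions made for illustration and are not taken from the PR's actual test.

```rust
use std::str::from_utf8;

/// Hypothetical helper (not from the PR): `len` bytes of ASCII 'a'
/// with `byte` written at index `pos`.
fn ascii_with_byte_at(len: usize, pos: usize, byte: u8) -> Vec<u8> {
    let mut v = vec![b'a'; len];
    v[pos] = byte;
    v
}

#[test]
fn is_utf8_is_not_tricked_by_non_ascii_in_long_stretches_of_ascii() {
    // Try every position so the bad byte lands in each lane of a
    // word-sized chunk as well as in the bytewise head and tail.
    for pos in 0..64 {
        // 0xFF never occurs in well-formed UTF-8.
        assert!(from_utf8(&ascii_with_byte_at(64, pos, 0xFF)).is_err());
        // A lone continuation byte is invalid as well.
        assert!(from_utf8(&ascii_with_byte_at(64, pos, 0x80)).is_err());
    }
    // Pure ASCII of the same length must still be accepted.
    assert!(from_utf8(&vec![b'a'; 64]).is_ok());
}
```

Sweeping the position is one way to make sure a word-at-a-time fast path cannot skip over an invalid byte regardless of where it falls relative to word boundaries.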
7ddcac2 to 6fd108d
let ptr = v.as_ptr();

let mut offset = 0;
if len >= 2 * usize::BYTES {
Why the `2`?
The loop is unrolled by 2 (reads 2 usize per lap).
Apologies, I wasn't very clear. I guess it's a two-part question:
- Why unroll at all?
- Why only unroll by 2?
It's a bit arbitrary, I've only tried 1, 2, and 4 and compared performance, and it's a trade-off. In the memchr code, where this is taken from, it's to fill a 16-byte register on x86-64, but that doesn't happen here.
Can you extract the `2` to a const with a descriptive name about unrolling? Since I don't see any hand-unrolling here, I am guessing that the `if` statement allows the compiler to do the unrolling according to the unrolling factor. This is not obvious to me. Can you also add a comment explaining?
Edit: Oh, are the duplicated `contains_nonascii` calls the loop unrolling?
Yes, and the two `ptr.offset` and deref per iteration.
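To make the "unrolled by 2" point concrete, here is a minimal sketch of a word-wise ASCII skip that reads two `usize` words per lap. It is written with safe slice indexing rather than the raw `ptr.offset` reads discussed in the review, and the names `WORD`, `BLOCK`, `load_word` and `skip_ascii_blocks` are assumptions for illustration, not identifiers from the patch.

```rust
use std::mem::size_of;

/// Bytes in one machine word.
const WORD: usize = size_of::<usize>();
/// The loop is hand-unrolled by 2: each lap reads two words.
const BLOCK: usize = 2 * WORD;
/// 0x80 repeated in every byte of a usize (0x8080_8080_8080_8080 on 64-bit).
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

/// True if any byte of `word` has its high bit set.
fn contains_nonascii(word: usize) -> bool {
    word & NONASCII_MASK != 0
}

/// Hypothetical helper: read WORD bytes as a native-endian usize.
/// Panics if `bytes.len() != WORD`.
fn load_word(bytes: &[u8]) -> usize {
    let mut buf = [0u8; WORD];
    buf.copy_from_slice(bytes);
    usize::from_ne_bytes(buf)
}

/// Returns how many leading bytes were proven ASCII by whole blocks.
/// The result is always a multiple of BLOCK; the caller continues
/// bytewise from there.
fn skip_ascii_blocks(v: &[u8]) -> usize {
    let mut offset = 0;
    // Guard that a full block is available before reading it.
    while offset + BLOCK <= v.len() {
        let u = load_word(&v[offset..offset + WORD]);
        let w = load_word(&v[offset + WORD..offset + BLOCK]);
        // The two reads and two checks per lap are the unrolling.
        if contains_nonascii(u) || contains_nonascii(w) {
            break;
        }
        offset += BLOCK;
    }
    offset
}
```

The length guard plays the same role as the `len >= 2 * usize::BYTES` check in the diff: the block loop is only entered when two full words can be read.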
Pedantically, I'd say it should be ASCII (all caps) when in comments or prose as it's an acronym. Also
}
}

// find the byte after the point the loop stopped
If the result of (x & 0x80808080_80808080) is non-zero, you can "immediately" find which byte it is using leading_zeros() / 8
Depends on endianness, it works fine with `.trailing_zeros()` on x86-64. It deserved to be tried for sure, but I couldn't make it be an improvement.
What LLVM compiles the current code into, the beast it is, is actually `if contains_nonascii(u | v) { break; }`, which seems to make for a much simpler computation inside the loop, and a tight loop.
I'm not 100% happy with the code in find_nonascii, so any suggestion for improvement would be super welcome, feel free to take the code (from the benchmark link) and find something.
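For readers following the `leading_zeros()` / `trailing_zeros()` exchange, the idea can be sketched as below. This is an illustration only, not code from the PR or the gist: the function name `first_nonascii_in_word` is an assumption, and the word is fixed at 8 bytes and decoded as little-endian so that the result does not depend on the host's byte order.

```rust
/// 0x80 in each of the eight bytes of a u64.
const NONASCII_MASK: u64 = 0x8080_8080_8080_8080;

/// Hypothetical helper: index (in memory order) of the first byte with
/// its high bit set among eight bytes, or None if all are ASCII.
fn first_nonascii_in_word(bytes: [u8; 8]) -> Option<usize> {
    // Decoding as little-endian puts bytes[0] in the lowest 8 bits,
    // so the lowest set mask bit belongs to the earliest byte.
    let hits = u64::from_le_bytes(bytes) & NONASCII_MASK;
    if hits == 0 {
        None
    } else {
        // The high bit of byte i sits at bit 8*i + 7, so integer
        // division by 8 recovers i.
        Some(hits.trailing_zeros() as usize / 8)
    }
}
```

With a native-endian load on a big-endian machine the first byte would occupy the highest bits instead, which is where `leading_zeros()` would be the matching operation.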
I downloaded the gist, but I am having some trouble in getting the datasets you used. Specifically, I assumed that enwik8 should be http://mattmahoney.net/dc/enwik8.zip and that the specific version of the Japanese wiki should not matter much, but I have no idea about big10.
yes, maybe you can just skip those datasets you don't have though? I could have provided everything better.
big10 is the dataset in http://vaskir.blogspot.ru/2015/09/regular-expressions-rust-vs-f.html
so it's the first 10MB of the unzipped file from https://drive.google.com/open?id=0B8HLQUKik9VtUWlOaHJPdG0xbnM
jawik10 is the first 10MB from the unzip of http://dumps.wikimedia.org/archive/2006/2006-07/jawiki/20061016/jawiki-20061016-pages-articles.xml.bz2
6fd108d to 7037ef7
Sweet wins. r=me but please do extract
Ok, I'll look over if there's a neat way to write the unrolling factor
This speeds up the ASCII case (and long stretches of ASCII in otherwise mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve throughput (megabytes verified / second) by a factor of 13 to 14! On XML and mostly English language input (en.wikipedia XML dump), throughput increases by a factor of 7. On mostly non-ASCII input, performance increases slightly or is the same.

The UTF-8 validation is rewritten to use indexed access; since all access is preceded by a (mandatory for validation) length check, the bounds checks are statically elided by LLVM and this formulation is in fact the best for performance. A previous version had losses due to slice to iterator conversions.

A large credit to Björn Steinbrink who improved this patch immensely, writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3. Old code is `regular`, this PR is called `fast`.

Datasets:
- `ascii` is just ASCII (2.5 kB)
- `cyr` is Cyrillic script with ASCII spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia XML dump
- `enwik8` is 100MB of an en.wikipedia XML dump
- `jawik10` is 10MB of a ja.wikipedia XML dump

```
test from_utf8_ascii_fast      ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular   ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast        ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular     ... bench:      12,250 ns/iter (+/- 437) = 418 MB/s
test from_utf8_dewik10_fast    ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast     ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular  ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast    ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
7037ef7 to 11e3de3
I received an improved version by @dotdash (with permission to incorporate, of course!) and it's an improvement you wouldn't believe.
Updated PR description & benchmarks are in there.
@brson I addressed loop unrolling only by adding another comment for it, don't see a nice way to factor it out to a constant
Wow, awesome stuff!
Pushed a fix, there was a missing conditional, let's try this in travis. I can't measure any difference in perf. Oh and the fix actually has a
As a bit of "real world" performance information, I pulled this down and used it for SXD. Parsing a 16M XML file Valgrind reported that
And I measured a ~1.25% overall speedup in the program. Parsing a 111M XML file
And I measured a ~1.1% overall speedup in the program. Thanks for the awesome performance gains!
Thanks a lot @shepmaster! Always encouraging to get that kind of feedback! 😻 And thanks to @bluss for getting this started, I've been completely blind to the masking quick check when I initially looked into this a few weeks ago! 🍻
@shepmaster Awesome to see some numbers! I'm guessing your data files are almost purely ASCII (as a lot of the data in the world is).
@brson This is ready for re-review. It's the same algorithm, indexed access though, and the fast skip ahead loop is simpler, because it's only attempted at aligned locations. The main loop will progress to an aligned location quickly anyway, if the input is mostly ASCII.
Ah, yes, I meant to mention that. They indeed are pure-ASCII.
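The "only attempted at aligned locations" design mentioned above can be sketched roughly as follows. This is a simplified illustration that only searches for the first non-ASCII byte; the real validation loop also decodes and checks multi-byte sequences, which is omitted here. The names `first_nonascii` and `load_word` are assumptions, and safe indexing stands in for the pointer reads in the patch.

```rust
use std::mem::size_of;

const WORD: usize = size_of::<usize>();
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

/// Hypothetical helper: read WORD bytes as a native-endian usize.
fn load_word(bytes: &[u8]) -> usize {
    let mut buf = [0u8; WORD];
    buf.copy_from_slice(bytes);
    usize::from_ne_bytes(buf)
}

/// Index of the first non-ASCII byte in `v`, or `v.len()` if none.
fn first_nonascii(v: &[u8]) -> usize {
    let len = v.len();
    // Distance from the start of `v` to the first word-aligned address
    // (usize::MAX if the pointer cannot be aligned).
    let align = v.as_ptr().align_offset(WORD);
    let mut offset = 0;
    while offset < len {
        let aligned =
            align != usize::MAX && offset >= align && (offset - align) % WORD == 0;
        if aligned {
            // Fast skip: two words per lap, guarded by a length check.
            while offset + 2 * WORD <= len {
                let u = load_word(&v[offset..offset + WORD]);
                let w = load_word(&v[offset + WORD..offset + 2 * WORD]);
                if (u | w) & NONASCII_MASK != 0 {
                    break;
                }
                offset += 2 * WORD;
            }
        }
        // Step bytewise from the point where the wordwise loop stopped.
        if offset < len {
            if v[offset] >= 128 {
                return offset;
            }
            offset += 1;
        }
    }
    len
}
```

Because the bytewise step advances one byte at a time, an unaligned start reaches an aligned offset within at most `WORD - 1` steps on ASCII input, after which the word-wise loop takes over.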
}
}
// step from the point where the wordwise loop stopped
while offset < len && v[offset] < 128 {
Reading through this, I thought at first that `128` was another number relating to byte widths, then realized it is the ASCII cutoff value. Since this is also used above (`first >= 128`), perhaps another constant could be in order?
hm, I don't think it's needed
We need to guard that `len` is large enough for the fast skip loop.
4cc87ee to cadcd70
I updated the second commit to use a constant for `2 * usize::BYTES` instead, to follow shepmaster's suggestion roughly.
@bors r+
📌 Commit cadcd70 has been approved by
Add fast path for ASCII in UTF-8 validation

This speeds up the ASCII case (and long stretches of ASCII in otherwise mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve throughput (megabytes verified / second) by a factor of 13 to 14 (smallish input). On XML and mostly English language input (en.wikipedia XML dump), throughput improves by a factor 7 (large input). On mostly non-ASCII input, performance increases slightly or is the same.

The UTF-8 validation is rewritten to use indexed access; since all access is preceded by a (mandatory for validation) length check, bounds checks are statically elided by LLVM and this formulation is in fact the best for performance. A previous version had losses due to slice to iterator conversions.

A large credit to Björn Steinbrink who improved this patch immensely, writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3. Old code is `regular`, this PR is called `fast`.

Datasets:
- `ascii` is just ASCII (2.5 kB)
- `cyr` is cyrillic script with ascii spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia XML dump
- `enwik8` is 100MB of an en.wikipedia XML dump
- `jawik10` is 10MB of a ja.wikipedia XML dump

```
test from_utf8_ascii_fast      ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular   ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast        ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular     ... bench:      10,944 ns/iter (+/- 795) = 468 MB/s
test from_utf8_dewik10_fast    ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast     ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular  ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast    ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
Awesome. Thanks @brson and everyone.