-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix simd accumulator overflow #1
Conversation
The intrinsics version (`opt4_simd`) has a bug: >```rust >// Flush accumulator every 256 iterations to avoid overflow >if block_i % (u8::MAX as usize + 1) == u8::MAX as usize { >``` > >The function is logically the same as our previous algorithm, with the only change being that of flushing our vector accumulator every 256 iterations to prevent overflowing each u8 lane. Flushing every 256 iterations isn't enough, because you can potentially encounter 256 `'s'` in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article: ```rust fn main() { let input = "s".repeat(1024*1024); println!("{}", baseline(&input)); println!("{}", opt4_simd(&input)); } ``` Produces: ``` 1048576 -1040384 ``` The fix, performance untested: ```rust // Flush accumulator every 255 iterations to avoid overflow if block_i % u8::MAX as usize == (u8::MAX - 1) as usize { ``` Correctly produces: ``` 1048576 1048576 ``` https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv
The new test I added also caused the chunked versions to overflow, so I committed an additional fix changing the chunk size to 255. Unfortunately this regresses performance:
Test machine: MacBook Air, 15-inch, M2, 2023. |
The fix also regresses the original fastest (10x unrolled), but not by as much:
|
$ RUST_BACKTRACE=1 cargo test Finished test [unoptimized + debuginfo] target(s) in 0.03s Running unittests src/lib.rs (target/debug/deps/n_times_faster_than_c-39ece82b67ddb14c) running 3 tests test tests::test_simple ... ok test tests::test_all_s ... FAILED test tests::test_large ... ok v_acc~I = vaddq_u8(v_acc~I, v_eq_s~I); failures: ---- tests::test_all_s stdout ---- thread 'tests::test_all_s' panicked at 'attempt to add with overflow', /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:149:1 stack backtrace: 0: rust_begin_unwind at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/std/src/panicking.rs:593:5 1: core::panicking::panic_fmt at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:67:14 2: core::panicking::panic at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/panicking.rs:117:5 3: <u8 as core::iter::traits::accum::Sum>::sum::{{closure}} at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:53:28 4: core::iter::adapters::map::map_fold::{{closure}} at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:21 5: core::iter::traits::iterator::Iterator::fold at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21 6: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9 7: <u8 as core::iter::traits::accum::Sum>::sum at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17 8: core::iter::traits::iterator::Iterator::sum at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9 9: n_times_faster_than_c::opt6_chunk_count::{{closure}} at ./src/lib.rs:125:22 10: core::iter::adapters::map::map_fold::{{closure}} at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:84:28 11: core::iter::traits::iterator::Iterator::fold at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:2481:21 12: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9 13: <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/adapters/map.rs:124:9 14: <i64 as core::iter::traits::accum::Sum>::sum at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/accum.rs:50:17 15: core::iter::traits::iterator::Iterator::sum at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/iter/traits/iterator.rs:3476:9 16: n_times_faster_than_c::opt6_chunk_count at ./src/lib.rs:122:15 17: n_times_faster_than_c::tests::test_all_s at ./src/lib.rs:195:9 18: n_times_faster_than_c::tests::test_all_s::{{closure}} at ./src/lib.rs:192:21 19: core::ops::function::FnOnce::call_once at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5 20: core::ops::function::FnOnce::call_once at /rustc/8ede3aae28fe6e4d52b38157d7bfe0d3bceef225/library/core/src/ops/function.rs:250:5 note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. failures: tests::test_all_s test result: FAILED. 2 passed; 1 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.29s res = (2 * res) - n_simd_elems as i64; error: test failed, to rerun pass `--lib`
The large performance regression in the chunked versions was obviously because the chunks were no longer well-aligned. However the real solution is obvious: use something less than 256 that is well-aligned: 128. I amended my commit to use 128, recovering some of the losses. I then amended my commit again to use 192, which recovers all of the losses for
|
Flushing every 255 iterations regressed performance by ~5%, but this eliminates the regression. run_switches/opt5_simd_unrolled_10x time: [24.636 µs 24.641 µs 24.647 µs] thrpt: [37.786 GiB/s 37.795 GiB/s 37.804 GiB/s] change: time: [+0.2771% +0.5216% +0.7572%] (p = 0.00 < 0.05) thrpt: [-0.7516% -0.5189% -0.2763%] Change within noise threshold. Found 8 outliers among 100 measurements (8.00%) 2 (2.00%) high mild 6 (6.00%) high severe
Flushing the accumulator every 128 iterations (vs 255) in the intrinsics versions reversed the regressions there as well. |
Good catch! This only reduces the best version by 0.5% which is good news. Do you know why flushing at unaligned interval for the simd intrinsics version causes such a large slow down? |
Not sure. Looking at the simd intrinsics version without unrolling, doing a mod by 255 adds a lot of instructions to the inner loop: LBB2_17:
- add x10, x10, #1
add x2, x2, #16
- cmp x9, x10
- b.eq LBB2_4
+ sub x11, x11, #1
+ add x10, x10, #1
+ sub x9, x9, #1
+ cbz x9, LBB2_4
LBB2_21:
+ umulh x13, x10, x12
+ lsr x13, x13, #7
+ sub x14, x11, x13
+ add x13, x14, x13, lsl #8
ldr q2, [x0, x2]
and.16b v2, v2, v1
add.16b v0, v2, v0
- mvn w11, w10
- tst x11, #0x7f
- b.ne LBB2_17
+ cbnz x13, LBB2_17 I'd have thought it wouldn't matter as much for the 10x version, but I didn't expand the macro to look at that. |
The intrinsics version (
opt4_simd
) has a bug:Flushing every 256 iterations isn't enough, because you can potentially encounter 256
's'
in 256 iterations, thus overflowing the max of 255. Using the function definitions from the article:Produces:
The fix, performance untested:
Correctly produces:
https://lobste.rs/s/sqn7m0/n_times_faster_than_c_where_n_128#c_iooycv