Specialize iter::Chain<A, B>::next when A==B #107701
Conversation
r? @cuviper (rustbot has picked a reviewer for you, use r? to override)
@bors try @rust-timer queue
⌛ Trying commit 68ce47f with merge 6b2bec53c7983506be629a98e73fc510a2eb956f...
Fat-fingered Ctrl+Enter on my previous comment 😅. Very interesting that apparently, for slice iterators, the old approach was up to 4x slower than this swapping approach.
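For readers following along, here is a minimal sketch of what the swapping approach might look like. The `Chain2` type is hypothetical, purely for illustration; it is not the actual std implementation, only the general shape of the idea.

```rust
// Hypothetical sketch of the swapping idea for a chain whose halves
// have the same iterator type (NOT the actual std source). When the
// front half runs dry, the back half is moved into the front slot,
// so every later `next()` call takes the first, hot branch.
struct Chain2<I> {
    front: Option<I>,
    back: Option<I>,
}

impl<I: Iterator> Iterator for Chain2<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        loop {
            // Hot path: the front iterator yields an item.
            match self.front.as_mut()?.next() {
                Some(item) => return Some(item),
                // Front exhausted: promote the back iterator (if any)
                // into the front slot and retry.
                None => self.front = self.back.take(),
            }
        }
    }
}

fn main() {
    let chained = Chain2 { front: Some(1..3), back: Some(10..12) };
    let items: Vec<i32> = chained.collect();
    assert_eq!(items, vec![1, 2, 10, 11]);
}
```

Because both halves have the same type, the exhausted front can simply be replaced by the back iterator, instead of re-checking "which side am I on?" on every call.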
☀️ Try build successful - checks-actions
Finished benchmarking commit (6b2bec53c7983506be629a98e73fc510a2eb956f): comparison URL.

Overall result: ❌ regressions - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next steps: If you can justify the regressions found in this try perf run, please indicate this with @bors rollup=never

Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.

Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
It's disappointing that the huge benchmark improvement didn't bear out in the perf results -- maybe we're too good at using internal iteration for our chains. The regressed
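To illustrate the internal-vs-external distinction mentioned above (a self-contained sketch, not compiler code): adapters like `sum` are driven by `fold`/`try_fold`, which `Chain` forwards to each half in turn, while a hand-written loop calls `next()` and pays the which-side branch on every item.

```rust
fn main() {
    let a = [1, 2, 3];
    let b = [4, 5, 6];

    // Internal iteration: `sum` goes through `fold`/`try_fold`,
    // so Chain can run each half to completion back to back.
    let internal: i32 = a.iter().chain(b.iter()).sum();

    // External iteration: every element goes through `next()`,
    // which must branch on which half is currently active.
    let mut external = 0;
    let mut it = a.iter().chain(b.iter());
    while let Some(x) = it.next() {
        external += x;
    }

    assert_eq!(internal, 21);
    assert_eq!(internal, external);
}
```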
How about a similar
done
I tried to reproduce your benchmark results, but it doesn't look as good here:
Any idea why we would see it so differently? One possibility is that I have
I went through several combinations of flags and can't repro your result. I'm running stage0 benchmarks.

- incremental=false, lld, 1CGU, lto=fat, target-cpu=native (znver2), debuginfo=2
- incremental=false, lld, 1CGU, lto=fat, target-cpu=default, debuginfo=2
- incremental=false, lld, 1CGU, lto=default, target-cpu=default, debuginfo=2
- incremental=true, lld, 1CGU, lto=default, target-cpu=default, debuginfo=2
- incremental=true, lld, CGUs=default (256), lto=default, target-cpu=default, debuginfo=2

Maybe it's something CPU-specific; I'll run a single benchmark under perf stat tomorrow to see what makes it faster.
I'm running the default (stage 1) on Fedora 37, Ryzen 7 5800X. My config:

```toml
# Includes one of the default files in src/bootstrap/defaults
profile = "compiler"
changelog-seen = 2

[rust]
debug-assertions = true
codegen-units-std = 1
verbose-tests = true

[llvm]
assertions = true
```

I tried without assertions too, and it didn't look much different.
Here's what I got for the same config on
FWIW, when @seritools and I were working on the issue that spawned this idea, we were also doing some benchmarks (I have an Apple M1 Max, they have an Intel i7 12700k) and the same code had pretty drastic benchmark differences, so I would encourage investigating this avenue.
I've heard before that the AMD and Apple branch predictors aren't as good as Intel's. Given that this optimization is all about making the first branch as hot as possible, could it just be that CPUs with worse branch predictors can take less advantage of the better predictability?
For an apples-to-apples comparison, here are my binaries. And here are those exact binaries compared on a few machines that I have readily available:

- AMD Ryzen 7 5800X, Fedora 37
- Intel Core i7-7700K, Fedora 37
- Intel Core i7-9850H, CentOS Stream 9
- Intel Xeon Platinum 8370C, Ubuntu 22.04
This is on a Zen2 3960X, this time with stage1, 1CGU:

base: perf stat / perf report (cycles)

HEAD: perf stat / perf report (cycles)

Absolute numbers aren't comparable due to libtest's adaptive iteration count (a kingdom for a fixed-iteration option...), but it's clear that IPC is higher and the number of branch misses is lower.
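Since libtest's bencher picks its own iteration count, one workaround is a hand-rolled fixed-iteration loop. A minimal sketch (the `time_fixed` helper is made up for illustration, not part of libtest):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical fixed-iteration timing helper: running the exact same
// iteration count on both builds makes absolute numbers comparable,
// unlike libtest's adaptive bencher.
fn time_fixed<F: FnMut() -> u32>(iters: u32, mut f: F) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the result from being optimized away.
        black_box(f());
    }
    start.elapsed()
}

fn main() {
    let elapsed = time_fixed(1_000, || (0..100u32).chain(100..200).sum());
    println!("1000 iterations took {:?}", elapsed);
}
```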
Using your binaries: AMD 3960X, Arch Linux

So yeah, it's the CPU.
AMD did improve the branch predictor in Zen 3, so it seems we're seeing that to great effect.

base:

new:

So the branch-misses and stalled-cycles are negligible in either case. My i7-7700K is similar, with negligible branch-misses (and stalls aren't reported).

I'm not sure how to make that decision. I tend to think we should favor the future, especially since the code is simpler without specialization, but that appears biased toward my machines. :)
I'll try on an Intel laptop and check if the code can be tweaked further.
From T-libs meeting: We're unlikely to take this optimization as long as the results are mixed (benefiting some CPUs while regressing others) and the optimization is unsafe and tricky.
@rustbot author
☔ The latest upstream changes (presumably #109692) made this pull request unmergeable. Please resolve the merge conflicts.
@the8472
The by_ref versions used to exercise external iteration, but since the try_fold optimizations this is no longer the case.
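To illustrate that point (a hedged sketch, not the actual benchmark code): `iter.by_ref()` yields `&mut I`, and adapters on it still forward to `try_fold`, so a benchmark that wants to measure `next()` has to call it directly.

```rust
// Summing via explicit `next()` calls forces external iteration,
// which `by_ref()` alone no longer guarantees now that adapters
// forward to `try_fold`. Illustrative only; `sum_external` is a
// made-up helper name.
fn sum_external<I: Iterator<Item = u32>>(mut it: I) -> u32 {
    let mut acc = 0;
    while let Some(x) = it.next() {
        acc += x;
    }
    acc
}

fn main() {
    let total = sum_external((0..4u32).chain(10..12));
    assert_eq!(total, 27); // 0+1+2+3 + 10+11
}
```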
Managed to get some improvements on Intel, but it's still mixed compared to the Zen 2 results.
This improves external iteration for `Chain` where both sides have the same type. Similar to the optimization in `vec_deque::Iter::next`.