Codegen weirdness for sum of count_ones over an array #101060
Comments
Beta still trips on length 4. So does 1.61.0, however. 1.59.0 is the latest that produces 4 popcounts and 3 adds. And yes, it's still hugely faster not to autovectorize length 4 (at least on Zen 2).
WG-prioritization assigning priority (Zulip discussion). IIUC this and the related issues seem to be caused by an LLVM regression (see comment). @rustbot label -I-prioritize +P-high t-compiler
Reduced example:
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"
define i64 @test(ptr %arr) {
entry:
br label %loop
loop:
%accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
%iv.next = add nuw i64 %iv, 1
%gep = getelementptr inbounds i64, ptr %arr, i64 %iv
%value = load i64, ptr %gep, align 8
%ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
%accum.next = add i64 %accum, %ctpop
%exitcond = icmp eq i64 %iv.next, 2
br i1 %exitcond, label %exit, label %loop
exit:
%lcssa = phi i64 [ %accum.next, %loop ]
ret i64 %lcssa
}
declare i64 @llvm.ctpop.i64(i64)
The cost model for znver2 says that ctpop.i64 costs 1 and ctpop.v2i64 costs 3, which is why the vectorization is considered profitable.
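For readers who don't read IR: the reduced example is a fixed two-iteration loop that loads a 64-bit value, takes its popcount, and accumulates. A rough Rust rendering of the same shape is sketched below. This is an illustration only, not the source the IR was generated from; the `&[u64; 2]` parameter and the `u64` accumulator are assumptions inferred from the `i64` loads and the trip count of 2.

```rust
/// Hypothetical Rust mirror of the reduced IR (not the original source).
/// Assumes a two-element array of 64-bit values, matching the
/// `icmp eq i64 %iv.next, 2` exit condition.
pub fn test(arr: &[u64; 2]) -> u64 {
    let mut accum: u64 = 0; // %accum
    let mut iv: usize = 0; // %iv
    loop {
        // load + llvm.ctpop.i64 + add
        accum += u64::from(arr[iv].count_ones());
        iv += 1; // %iv.next
        if iv == 2 {
            break; // %exitcond
        }
    }
    accum
}
```

With a constant trip count of 2, an early full unroll would leave just two scalar popcounts and an add.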
Upstream issue: llvm/llvm-project#57476
Alternatively, this would also be fixed if we managed to unroll the loop early (during full unroll rather than runtime unroll). That's probably where the into_iter distinction comes from.
Still an issue with LLVM 16.
Godbolt: https://godbolt.org/z/MoYTvb9qW
Still an issue with LLVM 17.
The nightly version (but not stable or beta) now produces this output:
(Issue loosely owned by @wesleywiser and @pnkfelix monitoring llvm/llvm-project#57476 )
Original Description below
Before 1.62.0, this code correctly compiled to two popcounts and an addition on a modern x86-64 target.
Since 1.62.0 (up to latest nightly), the codegen is... baffling at best.
The assembly for the original function is now a terribly misguided autovectorization. And, just to make sure (even though it's pretty obvious), I did run a benchmark - the autovectorized function is ~8x slower on my Zen 2 system.
Calling that function from a different function brings back normal assembly.
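The code snippet from the original description did not survive in this copy. As a stand-in, here is a minimal sketch of the kind of pair being described: a function summing `count_ones` over a small array via `into_iter`, and a second function that only calls the first. The names `sum_popcount` and `passthrough`, the `u64` element type, and the return type are all assumptions, not the reporter's code.

```rust
/// Hypothetical reconstruction: sum of popcounts over a by-value array.
/// Element type and length 2 are assumed from the rest of the thread.
pub fn sum_popcount(arr: [u64; 2]) -> u64 {
    arr.into_iter().map(|x| u64::from(x.count_ones())).sum()
}

/// Hypothetical "different function" that simply forwards to the first.
/// The report says this call path compiles to the expected scalar code.
pub fn passthrough(arr: [u64; 2]) -> u64 {
    sum_popcount(arr)
}
```

Both functions return the same value; the report is only about the assembly each one compiles to.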
-Cno-vectorize-slp does nothing. I don't know exactly what -Cno-vectorize-loops does, but it's not good.
If you change the length of the array to 4, both functions get autovectorized. -Cno-vectorize-slp fixes the second function now. Adding -Cno-vectorize-loops causes the passthrough function to generate the worst assembly.
Changing into_iter to iter fixes length 2, but doesn't fix length 4.
I could go on, but in short it's a whole mess.
I found a workaround that consistently works for all lengths: iter and -Cno-vectorize-slp.
@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged +A-array +A-codegen +A-iterators +A-LLVM +A-simd +I-slow +O-x86_64 +perf-regression
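The workaround mentioned above (iter plus -Cno-vectorize-slp) might look roughly like the sketch below. The function name and the const-generic length are assumptions made for illustration; the thread only states that borrowing with `iter` and passing the SLP flag gave good code for every length the reporter tried.

```rust
/// Hypothetical workaround sketch: iterate by reference with `iter`
/// instead of consuming the array with `into_iter`.
/// Intended to be built with the flag named in the report, e.g.
/// `-C no-vectorize-slp` added to the rustc invocation or RUSTFLAGS.
pub fn sum_popcount_iter<const N: usize>(arr: &[u64; N]) -> u64 {
    arr.iter().map(|x| u64::from(x.count_ones())).sum()
}
```

The flag affects only code generation, not results, so comparing the assembly (for example on Godbolt) is the way to confirm whether it helps on a given compiler version.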