Codegen weirdness for sum of count_ones over an array #101060

Open
alion02 opened this issue Aug 26, 2022 · 8 comments
Labels
A-autovectorization - Area: Autovectorization, which can impact perf or code size
A-codegen - Area: Code generation
A-LLVM - Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug - Category: This is a bug.
I-slow - Issue: Problems and improvements with respect to performance of generated code.
O-x86_64 - Target: x86-64 processors (like x86_64-*)
P-high - High priority
regression-from-stable-to-stable - Performance or correctness regression from one stable version to another.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.
WG-llvm - Working group: LLVM backend code generation

Comments

alion02 commented Aug 26, 2022

(Issue loosely owned by @wesleywiser and @pnkfelix, monitoring llvm/llvm-project#57476)

Original Description below

pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}

Before 1.62.0, this code correctly compiled to two popcounts and an addition on a modern x86-64 target.

example::f:
        popcnt  rcx, qword ptr [rdi]
        popcnt  rax, qword ptr [rdi + 8]
        add     eax, ecx
        ret

Since 1.62.0 (up to the latest nightly), the codegen is... baffling at best.

.LCPI0_0:
        .zero   16,15
.LCPI0_1:
        .byte   0
        .byte   1
        .byte   1
        .byte   2
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   1
        .byte   2
        .byte   2
        .byte   3
        .byte   2
        .byte   3
        .byte   3
        .byte   4
example::f:
        sub     rsp, 40
        vmovups xmm0, xmmword ptr [rdi]
        vmovdqa xmm1, xmmword ptr [rip + .LCPI0_0]
        vmovdqa xmm3, xmmword ptr [rip + .LCPI0_1]
        vmovaps xmmword ptr [rsp], xmm0
        vmovdqa xmm0, xmmword ptr [rsp]
        vpand   xmm2, xmm0, xmm1
        vpsrlw  xmm0, xmm0, 4
        vpand   xmm0, xmm0, xmm1
        vpshufb xmm2, xmm3, xmm2
        vpxor   xmm1, xmm1, xmm1
        vpshufb xmm0, xmm3, xmm0
        vpaddb  xmm0, xmm0, xmm2
        vpsadbw xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 170
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        add     rsp, 40
        ret

The assembly for the original function is now a terribly misguided autovectorization. And, just to make sure (even though it's pretty obvious), I did run a benchmark - the autovectorized function is ~8x slower on my Zen 2 system.
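
A minimal sketch of the kind of benchmark that could reproduce such a measurement (the actual harness was not posted; the iteration count, input values, and use of black_box here are assumptions):

use std::hint::black_box;
use std::time::Instant;

pub fn f(arr: [u64; 2]) -> u32 {
    arr.into_iter().map(u64::count_ones).sum()
}

fn main() {
    // Arbitrary input values; any data works, since popcount throughput is data-independent.
    let data = [0x0123_4567_89ab_cdefu64, 0xfedc_ba98_7654_3210];
    let start = Instant::now();
    let mut acc = 0u64;
    for _ in 0..100_000_000u64 {
        // black_box keeps the compiler from hoisting or constant-folding the call.
        acc += u64::from(f(black_box(data)));
    }
    println!("sum = {acc}, elapsed = {:?}", start.elapsed());
}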

Calling that function from a different function brings back the normal assembly. -Cno-vectorize-slp does nothing. I don't know exactly what -Cno-vectorize-loops does, but the result is not good either.

If you change the length of the array to 4, both functions get autovectorized. -Cno-vectorize-slp now fixes the second function, and adding -Cno-vectorize-loops makes the passthrough function generate the worst assembly of all.

Changing into_iter to iter fixes length 2, but doesn't fix length 4.

I could go on, but in short it's a whole mess.

I found a workaround that consistently works for all lengths: iter combined with -Cno-vectorize-slp (sketched below).
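
A sketch of that workaround; the const-generic wrapper and the name popcount_sum are illustrative, not code from the thread:

// Build with -C no-vectorize-slp (plus optimizations), as described above.
pub fn popcount_sum<const N: usize>(arr: [u64; N]) -> u32 {
    // By-reference iteration; together with -C no-vectorize-slp this
    // reportedly avoids the bad autovectorization for every length tested.
    arr.iter().map(|x| x.count_ones()).sum()
}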

@rustbot modify labels: +regression-from-stable-to-stable -regression-untriaged +A-array +A-codegen +A-iterators +A-LLVM +A-simd +I-slow +O-x86_64 +perf-regression

@alion02 alion02 added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Aug 26, 2022
@rustbot rustbot added I-prioritize Issue: Indicates that prioritization has been requested for this issue. A-array Area: `[T; N]` A-codegen Area: Code generation A-iterators Area: Iterators A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. A-SIMD Area: SIMD (Single Instruction Multiple Data) I-slow Issue: Problems and improvements with respect to performance of generated code. O-x86_64 Target: x86-64 processors (like x86_64-*) perf-regression Performance regression. regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Aug 26, 2022
@the8472 the8472 removed perf-regression Performance regression. A-SIMD Area: SIMD (Single Instruction Multiple Data) A-iterators Area: Iterators A-array Area: `[T; N]` labels Aug 26, 2022

the8472 commented Aug 27, 2022

Looks bad on stable and nightly, but oddly the current beta looks fine.

#100220 and #100214 might be relevant


alion02 commented Aug 27, 2022

Beta still trips on length 4, but so does 1.61.0. 1.59.0 is the latest version that produces 4 popcounts and 3 adds. And yes, it's still hugely faster not to autovectorize length 4 (at least on Zen 2).
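
For concreteness, the length-4 variant presumably looks like this (a sketch, not code posted in the thread):

pub fn f(arr: [u64; 4]) -> u32 {
    // Same shape as the original reproducer, array length bumped to 4.
    // The desired codegen is four popcnts and three adds.
    arr.into_iter().map(u64::count_ones).sum()
}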

@apiraino

WG-prioritization assigning priority (Zulip discussion). IIUC this and the related issues seem to be caused by an LLVM regression (see comment).

@rustbot label -I-prioritize +P-high t-compiler

@rustbot rustbot added P-high High priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels Aug 31, 2022

nikic commented Aug 31, 2022

Reduced example:

target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define i64 @test(ptr %arr) {
entry:
  br label %loop

loop:
  %accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
  %iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
  %iv.next = add nuw i64 %iv, 1
  %gep = getelementptr inbounds i64, ptr %arr, i64 %iv
  %value = load i64, ptr %gep, align 8
  %ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
  %accum.next = add i64 %accum, %ctpop
  %exitcond = icmp eq i64 %iv.next, 2
  br i1 %exitcond, label %exit, label %loop

exit:
  %lcssa = phi i64 [ %accum.next, %loop ]
  ret i64 %lcssa
}

declare i64 @llvm.ctpop.i64(i64)

The cost model for znver2 says that ctpop.i64 costs 1 and ctpop.v2i64 costs 3, which is why the vectorization is considered profitable.


nikic commented Aug 31, 2022

Upstream issue: llvm/llvm-project#57476

Alternatively, this would also be fixed if we managed to unroll the loop early (during full unroll rather than runtime unroll). That's probably where the into_iter distinction comes from.
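
For illustration (not from the thread), a manually unrolled variant that expresses the same computation without a loop, which is roughly what early full unrolling would hand to the later passes:

pub fn f_unrolled(arr: [u64; 2]) -> u32 {
    // Straight-line form of the reduction: what the reproducer looks like
    // once the two-iteration loop has been fully unrolled.
    arr[0].count_ones() + arr[1].count_ones()
}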

@pnkfelix pnkfelix added the WG-none Indicates that no working group is assigned to an issue, but it *does* have an active owner label Sep 30, 2022

nikic commented Apr 3, 2023

Still an issue with LLVM 16.

@workingjubilee workingjubilee added the A-autovectorization Area: Autovectorization, which can impact perf or code size label Jul 22, 2023

nikic commented Aug 14, 2023

Godbolt: https://godbolt.org/z/MoYTvb9qW

Still an issue with LLVM 17.

@wesleywiser wesleywiser added WG-llvm Working group: LLVM backend code generation and removed WG-none Indicates that no working group is assigned to an issue, but it *does* have an active owner labels Nov 17, 2023

qarmin commented Dec 27, 2023

The nightly version (but not stable or beta) now produces the following output:

example::f:
        popcnt  rcx, qword ptr [rdi]
        popcnt  rax, qword ptr [rdi + 8]
        add     eax, ecx
        ret

example::g:
        popcnt  rcx, qword ptr [rdi]
        popcnt  rax, qword ptr [rdi + 8]
        add     eax, ecx
        ret
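
The comment does not include the source; presumably f and g are the into_iter and iter variants discussed earlier, roughly:

pub fn f(arr: [u64; 2]) -> u32 {
    // Original reproducer: by-value iteration.
    arr.into_iter().map(u64::count_ones).sum()
}

pub fn g(arr: [u64; 2]) -> u32 {
    // By-reference variant mentioned earlier as a partial workaround.
    arr.iter().map(|x| x.count_ones()).sum()
}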
