
Performance regression for BitMatrix multiplication in 1.11.2 #56954

Closed
jonocarroll opened this issue Jan 5, 2025 · 11 comments
Labels: arrays, linear algebra, performance (Must go faster), regression 1.11 (Regression in the 1.11 release)

Comments

jonocarroll commented Jan 5, 2025

I believe there might be a significant performance regression between 1.11.1 and 1.11.2. I encountered this after upgrading and have managed to pin it down as far as matrix multiplying two large BitMatrix objects.

I found the following (after running the multiplication a couple of times already):

1.11.0:

a = BitMatrix(undef, (3000, 3000));
@time a * a
  0.014537 seconds (3 allocations: 68.665 MiB, 13.85% gc time)

1.11.1:

a = BitMatrix(undef, (3000, 3000));
@time a * a
  0.018001 seconds (3 allocations: 68.665 MiB, 19.69% gc time)

1.11.2:

a = BitMatrix(undef, (3000, 3000));
@time a * a
 11.244051 seconds (3 allocations: 68.665 MiB, 0.02% gc time)

The gc time share drops, but that is vastly outweighed by the roughly 600× increase in runtime.
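For anyone reproducing this: the timings above were taken after warm-up calls, so compilation cost is excluded. A minimal script along those lines:

```julia
a = BitMatrix(undef, (3000, 3000))  # contents are arbitrary bits left in memory
a * a                               # warm-up call, so compilation is excluded below
@time a * a                         # steady-state timing, comparable to the figures above
```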

I also profiled a fuller example (where I first encountered this) and found a large increase in time attributed to a setindex! call.

1.11.0:

   ╎    ╎    ╎    ╎    ╎    ╎   10  …c/matmul.jl:114; *
   ╎    ╎    ╎    ╎    ╎    ╎    1   …c/matmul.jl:117; matprod_dest
   ╎    ╎    ╎    ╎    ╎    ╎     1   …bitarray.jl:375; similar
   ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1   …ase/boot.jl:599; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎  1   …ase/boot.jl:592; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   1   …ase/boot.jl:582; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    1   …ase/boot.jl:535; new_as_memoryref
  1╎    ╎    ╎    ╎    ╎    ╎    ╎     1   …ase/boot.jl:516; GenericMemory
   ╎    ╎    ╎    ╎    ╎    ╎    9   …c/matmul.jl:253; mul!
   ╎    ╎    ╎    ╎    ╎    ╎     9   …c/matmul.jl:285; mul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎ 9   …c/matmul.jl:287; _mul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎  9   …c/matmul.jl:868; generic_matmatmul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …c/matmul.jl:892; _generic_matmatmul!(…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    2   …actarray.jl:1312; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎     2   …actarray.jl:1341; _getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 2   …bitarray.jl:682; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  2   …bitarray.jl:676; unsafe_bitgetind…
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …sentials.jl:916; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   5   …c/matmul.jl:893; _generic_matmatmul!(…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    5   …simdloop.jl:77; macro expansion
   ╎    ╎    ╎    ╎    ╎    ╎    ╎     5   …c/matmul.jl:894; macro expansion
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1   …actarray.jl:1312; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  1   …actarray.jl:1341; _getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1   …actarray.jl:1347; _to_linear_ind…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1   …actarray.jl:3048; _sub2ind
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     1   …actarray.jl:98; axes
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +1 1   …bitarray.jl:105; size
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +2 1   …ase/Base.jl:49; getproperty
  4╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 4   …se/array.jl:983; setindex!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …c/matmul.jl:896; _generic_matmatmul!(…	
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    2   …se/range.jl:908; iterate

1.11.2:

     ╎    ╎    ╎    ╎    ╎    ╎   15355 …c/matmul.jl:114; *
     ╎    ╎    ╎    ╎    ╎    ╎    1     …c/matmul.jl:117; matprod_dest
     ╎    ╎    ╎    ╎    ╎    ╎     1     …bitarray.jl:375; similar
     ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1     …ase/boot.jl:599; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎  1     …ase/boot.jl:592; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   1     …ase/boot.jl:582; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    1     …ase/boot.jl:535; new_as_memoryref
    1╎    ╎    ╎    ╎    ╎    ╎    ╎     1     …ase/boot.jl:516; GenericMemory
     ╎    ╎    ╎    ╎    ╎    ╎    15354 …c/matmul.jl:253; mul!
     ╎    ╎    ╎    ╎    ╎    ╎     15354 …c/matmul.jl:285; mul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎ 15354 …c/matmul.jl:287; _mul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎  15354 …c/matmul.jl:868; generic_matmatmul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   15280 …c/matmul.jl:895; _generic_matmatm…
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    15280 …simdloop.jl:77; macro expansion
     ╎    ╎    ╎    ╎    ╎    ╎    ╎     15280 …c/matmul.jl:896; macro expansion
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 535   …actarray.jl:1312; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  535   …actarray.jl:1341; _getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   404   …actarray.jl:1347; _to_linear…
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    404   …actarray.jl:3048; _sub2ind
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     404   …actarray.jl:98; axes
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +1 404   …bitarray.jl:105; size
  404╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +2 404   …ase/Base.jl:49; getproperty
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   131   …bitarray.jl:682; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    131   …bitarray.jl:676; unsafe_bit…
  131╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     131   …sentials.jl:917; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 502   …se/array.jl:930; getindex
  502╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  502   …sentials.jl:917; getindex
14243╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 14243 …se/array.jl:994; setindex!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   74    …c/matmul.jl:898; _generic_matmatm…
   74╎    ╎    ╎    ╎    ╎    ╎    ╎    74    …se/range.jl:908; iterate

The difference doesn't seem to appear for a pair of similarly sized Matrix{Int} objects:

1.11.1:

a = Matrix{Int}(undef, 3000, 3000);
@time a * a'
  7.678765 seconds (4 allocations: 68.665 MiB, 0.45% gc time)

1.11.2:

a = Matrix{Int}(undef, 3000, 3000);
@time a * a'
  7.780284 seconds (4 allocations: 68.665 MiB, 0.49% gc time)

Hopefully someone can reproduce this.

My system:

Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
@jishnub jishnub added the performance (Must go faster) and regression 1.11 (Regression in the 1.11 release) labels Jan 5, 2025
@nsajko nsajko added the linear algebra, arrays, and bisect wanted labels, and removed the bisect wanted label, Jan 5, 2025

Zentrik commented Jan 5, 2025

I bisected to one of 0bd77f5...b28fbd0


Zentrik commented Jan 5, 2025

Bisected on master branch to 0af99e6


giordano commented Jan 5, 2025

CC @jishnub (author of #56089)


jishnub commented Jan 6, 2025

I'm afk at the moment, but does reverting the change resolve the performance regression?


jishnub commented Feb 12, 2025

This is a strange issue. If I change line 944 to use the type parameter ais1 from the MulAddMul object as

Balpha = ais1 ? B[k,n] : B[k,n] * _add.alpha

this introduces the regression as seen in this issue. However, changing it to the runtime check

Balpha = isone(_add.alpha) ? B[k,n] : B[k,n] * _add.alpha

gets rid of the regression and recovers the performance of v1.11.1. Yet the type parameter ais1 is defined as isone(_add.alpha), so the two forms should be equivalent: the former should merely be a compile-time variant of the latter. Given that, I don't understand why it is so much slower.
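To make the two forms concrete, here is a minimal stand-in (not the actual LinearAlgebra.MulAddMul, which carries more type parameters) illustrating the compile-time ais1 flag versus the runtime isone check:

```julia
# Simplified stand-in for LinearAlgebra.MulAddMul; `ais1` mirrors the real type parameter.
struct MulAddMul{ais1, T}
    alpha::T
end
MulAddMul(alpha) = MulAddMul{isone(alpha), typeof(alpha)}(alpha)

# Compile-time branch: `ais1` is a type parameter, so the branch should fold away.
balpha_static(_add::MulAddMul{ais1}, b) where {ais1} = ais1 ? b : b * _add.alpha

# Runtime branch: semantically identical, but decided on each call.
balpha_dynamic(_add::MulAddMul, b) = isone(_add.alpha) ? b : b * _add.alpha
```

Both forms should return identical values for the same inputs; the surprise in this issue is that the statically branched form compiles to much slower code in the BitMatrix hot loop.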


nsajko commented Feb 12, 2025

I think that implies ais1 isn't known at compile time, so it causes a run time dispatch? Cthulhu could help.


jishnub commented Feb 12, 2025

But how won't it be known at compile time? The MulAddMul instance is an argument to the method, so it'll be specialized for the type. And ais1 is a parameter from the type, so this should also be known at compile time. Cthulhu shows that ais1 is a compile-time constant, as expected.


jakobnissen commented Feb 12, 2025

I believe this hinges on a fickle optimization done by LLVM, which is enabled/disabled by even tiny code changes. E.g. these will make it go fast:

Balpha = (B[k,n]*_add.alpha)
Balpha = B[k,n]
Balpha *= _add.alpha
Balpha = B[k,n] & _add.alpha

But these will make it go slow:

Balpha = B[k,n] & true # the literal value of _add.alpha
Balpha = B[k, n]

The hot loop compiles to different code, but the difference is not stability, nor allocation, nor vectorization, nor does the code work on whole chunks of the bitarray. Unless there is something wrong with the code introspection tools, it really just seems to be an issue of which precise (scalar) code is generated.

To me this suggests that it's not worth trying to second guess LLVM here, and instead a special BitMatrix method could be provided.

Hm, come to think of it, it makes no sense that using somewhat different scalar instructions shows a 100x difference.
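As a rough illustration of what a dedicated BitMatrix method could exploit (a hypothetical sketch assuming Int accumulation, not the code from any actual PR): since the entries are Bool, a false B[k,n] contributes nothing, so the whole inner update can be skipped for that entry.

```julia
# Hypothetical specialized kernel for C = A*B with BitMatrix inputs.
# No aliasing checks; not the actual LinearAlgebra fix, just a sketch of the idea.
function bitmatmul(A::BitMatrix, B::BitMatrix)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("A has $k columns, B has $k2 rows"))
    C = zeros(Int, m, n)
    @inbounds for j in 1:n, p in 1:k
        B[p, j] || continue          # a false entry contributes nothing: skip the update
        @simd for i in 1:m
            C[i, j] += A[i, p]       # Bool promotes to Int in the accumulation
        end
    end
    return C
end
```

The column-major loop order (j outermost, i innermost) keeps the inner loop streaming over contiguous memory in both A and C.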


brenhinkeller commented Feb 12, 2025

One does sometimes wonder whether our code introspection is fully accurate, though I sure hope it is. Maybe there would be a useful difference in juliac-style compiled code, if @code_llvm/@code_native doesn't explain it?

@jakobnissen

Having looked into it, the performance difference comes from an optimization done by LLVM only in the fast version. Since we don't want performance of this magnitude to randomly appear or disappear according to the whims of the LLVM optimizer, I have manually implemented it in a PR to LinearAlgebra: JuliaLang/LinearAlgebra.jl#1202.

jishnub pushed a commit to JuliaLang/LinearAlgebra.jl that referenced this issue Feb 14, 2025
This manually adds the critical optimisation investigated in
JuliaLang/julia#56954. While we could rely on
LLVM to continue doing this optimisation, it's more robust to add it
manually.
@jakobnissen

Fixed in #57387 and JuliaLang/LinearAlgebra.jl#1202


7 participants