
Performance regression for BitMatrix multiplication in 1.11.2 #56954

Closed
jonocarroll opened this issue Jan 5, 2025 · 11 comments
Labels: arrays, linear algebra, performance (Must go faster), regression 1.11 (Regression in the 1.11 release)

Comments

jonocarroll commented Jan 5, 2025

I believe there might be a significant performance regression between 1.11.1 and 1.11.2. I encountered this after upgrading and have managed to pin it down as far as matrix multiplying two large BitMatrix objects.

I found the following (after running the multiplication a couple of times already):

1.11.0:

a = BitMatrix(undef, (3000, 3000));
@time a * a
  0.014537 seconds (3 allocations: 68.665 MiB, 13.85% gc time)

1.11.1:

a = BitMatrix(undef, (3000, 3000));
@time a * a
  0.018001 seconds (3 allocations: 68.665 MiB, 19.69% gc time)

1.11.2:

a = BitMatrix(undef, (3000, 3000));
@time a * a
 11.244051 seconds (3 allocations: 68.665 MiB, 0.02% gc time)

The gc time share drops, but that is vastly outweighed by the roughly 600× increase in runtime.
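For anyone reproducing this: the timings above were taken after warm-up calls, so compilation cost is excluded. A minimal script along those lines:

```julia
a = BitMatrix(undef, (3000, 3000))  # contents are arbitrary bits left in memory
a * a                               # warm-up call, so compilation is excluded below
@time a * a                         # steady-state timing, comparable to the figures above
```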

I also profiled a fuller example (where I first encountered this) and found a large increase in time attributed to a setindex! call.

1.11.0:

   ╎    ╎    ╎    ╎    ╎    ╎   10  …c/matmul.jl:114; *
   ╎    ╎    ╎    ╎    ╎    ╎    1   …c/matmul.jl:117; matprod_dest
   ╎    ╎    ╎    ╎    ╎    ╎     1   …bitarray.jl:375; similar
   ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1   …ase/boot.jl:599; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎  1   …ase/boot.jl:592; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   1   …ase/boot.jl:582; Array
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    1   …ase/boot.jl:535; new_as_memoryref
  1╎    ╎    ╎    ╎    ╎    ╎    ╎     1   …ase/boot.jl:516; GenericMemory
   ╎    ╎    ╎    ╎    ╎    ╎    9   …c/matmul.jl:253; mul!
   ╎    ╎    ╎    ╎    ╎    ╎     9   …c/matmul.jl:285; mul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎ 9   …c/matmul.jl:287; _mul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎  9   …c/matmul.jl:868; generic_matmatmul!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …c/matmul.jl:892; _generic_matmatmul!(…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    2   …actarray.jl:1312; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎     2   …actarray.jl:1341; _getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 2   …bitarray.jl:682; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  2   …bitarray.jl:676; unsafe_bitgetind…
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …sentials.jl:916; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   5   …c/matmul.jl:893; _generic_matmatmul!(…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    5   …simdloop.jl:77; macro expansion
   ╎    ╎    ╎    ╎    ╎    ╎    ╎     5   …c/matmul.jl:894; macro expansion
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1   …actarray.jl:1312; getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  1   …actarray.jl:1341; _getindex
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   1   …actarray.jl:1347; _to_linear_ind…
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    1   …actarray.jl:3048; _sub2ind
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     1   …actarray.jl:98; axes
   ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +1 1   …bitarray.jl:105; size
  1╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +2 1   …ase/Base.jl:49; getproperty
  4╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 4   …se/array.jl:983; setindex!
   ╎    ╎    ╎    ╎    ╎    ╎    ╎   2   …c/matmul.jl:896; _generic_matmatmul!(…	
  2╎    ╎    ╎    ╎    ╎    ╎    ╎    2   …se/range.jl:908; iterate

1.11.2:

     ╎    ╎    ╎    ╎    ╎    ╎   15355 …c/matmul.jl:114; *
     ╎    ╎    ╎    ╎    ╎    ╎    1     …c/matmul.jl:117; matprod_dest
     ╎    ╎    ╎    ╎    ╎    ╎     1     …bitarray.jl:375; similar
     ╎    ╎    ╎    ╎    ╎    ╎    ╎ 1     …ase/boot.jl:599; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎  1     …ase/boot.jl:592; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   1     …ase/boot.jl:582; Array
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    1     …ase/boot.jl:535; new_as_memoryref
    1╎    ╎    ╎    ╎    ╎    ╎    ╎     1     …ase/boot.jl:516; GenericMemory
     ╎    ╎    ╎    ╎    ╎    ╎    15354 …c/matmul.jl:253; mul!
     ╎    ╎    ╎    ╎    ╎    ╎     15354 …c/matmul.jl:285; mul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎ 15354 …c/matmul.jl:287; _mul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎  15354 …c/matmul.jl:868; generic_matmatmul!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   15280 …c/matmul.jl:895; _generic_matmatm…
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    15280 …simdloop.jl:77; macro expansion
     ╎    ╎    ╎    ╎    ╎    ╎    ╎     15280 …c/matmul.jl:896; macro expansion
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 535   …actarray.jl:1312; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  535   …actarray.jl:1341; _getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   404   …actarray.jl:1347; _to_linear…
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    404   …actarray.jl:3048; _sub2ind
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     404   …actarray.jl:98; axes
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +1 404   …bitarray.jl:105; size
  404╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  +2 404   …ase/Base.jl:49; getproperty
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎   131   …bitarray.jl:682; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎    131   …bitarray.jl:676; unsafe_bit…
  131╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎     131   …sentials.jl:917; getindex
     ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 502   …se/array.jl:930; getindex
  502╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎  502   …sentials.jl:917; getindex
14243╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 14243 …se/array.jl:994; setindex!
     ╎    ╎    ╎    ╎    ╎    ╎    ╎   74    …c/matmul.jl:898; _generic_matmatm…
   74╎    ╎    ╎    ╎    ╎    ╎    ╎    74    …se/range.jl:908; iterate

The difference doesn't seem to appear for a pair of similarly sized Matrix{Int} objects:

1.11.1:

a = Matrix{Int}(undef, 3000, 3000);
@time a * a'
  7.678765 seconds (4 allocations: 68.665 MiB, 0.45% gc time)

1.11.2:

a = Matrix{Int}(undef, 3000, 3000);
@time a * a'
  7.780284 seconds (4 allocations: 68.665 MiB, 0.49% gc time)

Hopefully someone can reproduce this.

My system:

Platform Info:
  OS: macOS (arm64-apple-darwin24.0.0)
  CPU: 11 × Apple M3 Pro
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, apple-m2)
@jishnub jishnub added the performance (Must go faster) and regression 1.11 (Regression in the 1.11 release) labels Jan 5, 2025
@nsajko nsajko added the linear algebra, arrays, and bisect wanted labels, and removed the bisect wanted label, Jan 5, 2025

Zentrik commented Jan 5, 2025

I bisected to one of 0bd77f5...b28fbd0


Zentrik commented Jan 5, 2025

Bisected on master branch to 0af99e6


giordano commented Jan 5, 2025

CC @jishnub (author of #56089)


jishnub commented Jan 6, 2025

I'm afk at the moment, but does reverting the change resolve the performance regression?


jishnub commented Feb 12, 2025

This is a strange issue. If I change line 944 to use the type parameter ais1 from the MulAddMul object as

Balpha = ais1 ? B[k,n] : B[k,n] * _add.alpha

this introduces the regression as seen in this issue. However, changing it to the runtime check

Balpha = isone(_add.alpha) ? B[k,n] : B[k,n] * _add.alpha

gets rid of the regression and recovers the performance of v1.11.1. Yet the type parameter ais1 is defined as isone(_add.alpha), so the two forms should be equivalent: the former should merely be a compile-time variant of the latter. Given that, I don't understand why it is so much slower.
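To make the two forms concrete, here is a minimal stand-in (not the actual LinearAlgebra.MulAddMul, which carries more type parameters) illustrating the compile-time ais1 flag versus the runtime isone check:

```julia
# Simplified stand-in for LinearAlgebra.MulAddMul; `ais1` mirrors the real type parameter.
struct MulAddMul{ais1, T}
    alpha::T
end
MulAddMul(alpha) = MulAddMul{isone(alpha), typeof(alpha)}(alpha)

# Compile-time branch: `ais1` is a type parameter, so the branch should fold away.
balpha_static(_add::MulAddMul{ais1}, b) where {ais1} = ais1 ? b : b * _add.alpha

# Runtime branch: semantically identical, but decided on each call.
balpha_dynamic(_add::MulAddMul, b) = isone(_add.alpha) ? b : b * _add.alpha
```

Both forms should return identical values for the same inputs; the surprise in this issue is that the statically branched form compiles to much slower code in the BitMatrix hot loop.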


nsajko commented Feb 12, 2025

I think that implies ais1 isn't known at compile time, so it causes a run time dispatch? Cthulhu could help.


jishnub commented Feb 12, 2025

But how won't it be known at compile time? The MulAddMul instance is an argument to the method, so it'll be specialized for the type. And ais1 is a parameter from the type, so this should also be known at compile time. Cthulhu shows that ais1 is a compile-time constant, as expected.


jakobnissen commented Feb 12, 2025

I believe this hinges on a fickle optimization done by LLVM, which is enabled/disabled by even tiny code changes. E.g. these will make it go fast:

Balpha = (B[k,n]*_add.alpha)
Balpha = B[k,n]
Balpha *= _add.alpha
Balpha = B[k,n] & _add.alpha

But these will make it go slow:

Balpha = B[k,n] & true # the literal value of _add.alpha
Balpha = B[k, n]

The hot loop compiles to different code, but the difference is not stability, nor allocation, nor vectorization, nor does the code work on whole chunks of the bitarray. Unless there is something wrong with the code introspection tools, it really just seems to be an issue of which precise (scalar) code is generated.

To me this suggests that it's not worth trying to second guess LLVM here, and instead a special BitMatrix method could be provided.

Hm, come to think of it, it makes no sense that using somewhat different scalar instructions shows a 100x difference.
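As a rough illustration of what a dedicated BitMatrix method could exploit (a hypothetical sketch assuming Int accumulation, not the code from any actual PR): since the entries are Bool, a false B[k,n] contributes nothing, so the whole inner update can be skipped for that entry.

```julia
# Hypothetical specialized kernel for C = A*B with BitMatrix inputs.
# No aliasing checks; not the actual LinearAlgebra fix, just a sketch of the idea.
function bitmatmul(A::BitMatrix, B::BitMatrix)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("A has $k columns, B has $k2 rows"))
    C = zeros(Int, m, n)
    @inbounds for j in 1:n, p in 1:k
        B[p, j] || continue          # a false entry contributes nothing: skip the update
        @simd for i in 1:m
            C[i, j] += A[i, p]       # Bool promotes to Int in the accumulation
        end
    end
    return C
end
```

The column-major loop order (j outermost, i innermost) keeps the inner loop streaming over contiguous memory in both A and C.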


brenhinkeller commented Feb 12, 2025

One does sometimes wonder whether our code introspection is fully accurate, though I sure hope it is. Maybe there would be a useful difference in juliac-style compiled code, if @code_llvm/@code_native doesn't explain it?

@jakobnissen

Having looked into it, the performance difference comes from an optimization done by LLVM only in the fast version. Since we don't want performance of this magnitude to randomly appear or disappear according to the whims of the LLVM optimizer, I have manually implemented it in a PR to LinearAlgebra: JuliaLang/LinearAlgebra.jl#1202.

jishnub pushed a commit to JuliaLang/LinearAlgebra.jl that referenced this issue Feb 14, 2025
This manually adds the critical optimisation investigated in
JuliaLang/julia#56954. While we could rely on
LLVM to continue doing this optimisation, it's more robust to add it
manually.
@jakobnissen

Fixed in #57387 and JuliaLang/LinearAlgebra.jl#1202


7 participants