Performance regression for BitMatrix multiplication in 1.11.2 #56954
I bisected it to one of 0bd77f5...b28fbd0.
Bisected on the master branch to 0af99e6.
I'm afk at the moment, but does reverting the change resolve the performance regression?
This is a strange issue. If I change line 944 to use the type parameter, `Balpha = ais1 ? B[k,n] : B[k,n] * _add.alpha`, this introduces the regression seen in this issue. However, changing it to the runtime check `Balpha = isone(_add.alpha) ? B[k,n] : B[k,n] * _add.alpha` gets rid of the regression and recovers the performance of v1.11.1. However, the type parameter …
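For reference, a hypothetical minimal model of the two spellings being compared. The function names and the standalone `alpha` argument are illustrative, not the actual LinearAlgebra source at line 944:

```julia
# Variant A: branch decided by a compile-time type parameter
# (the spelling that regressed in this issue).
function balpha_static(B, k, n, alpha, ::Val{ais1}) where {ais1}
    Balpha = ais1 ? B[k, n] : B[k, n] * alpha
    return Balpha
end

# Variant B: branch decided by a runtime check
# (the spelling that recovered the v1.11.1 performance).
function balpha_runtime(B, k, n, alpha)
    Balpha = isone(alpha) ? B[k, n] : B[k, n] * alpha
    return Balpha
end
```

Both compute the same value; the only difference is whether the branch is resolved during specialization or at run time.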
I think that implies …
But why wouldn't it be known at compile time? The …
I believe this hinges on a fickle optimization done by LLVM, which is enabled or disabled by even tiny code changes. E.g. these changes will make it go fast:
But these will make it go slow:
The hot loop compiles to different code, but the difference is not type stability, nor allocation, nor vectorization, nor does the code work on whole chunks of the bitarray. Unless there is something wrong with the code introspection tools, it really just seems to be a matter of which precise (scalar) code is generated. To me this suggests that it's not worth trying to second-guess LLVM here; instead, a special BitMatrix method could be provided. Hm, come to think of it, it makes no sense that using somewhat different scalar instructions shows a difference of 100x.
One does sometimes wonder whether our code introspection is fully accurate, though I sure hope it is. Maybe there would be a useful difference in …
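One way to compare the generated code between the two versions is with the standard introspection macros from InteractiveUtils; a sketch of that workflow, with an arbitrary matrix size:

```julia
using InteractiveUtils  # @code_typed, @code_llvm, @code_native
using Random

A = bitrand(64, 64)
B = bitrand(64, 64)

# Inspect the code generated for the BitMatrix product at each level;
# diffing this output between 1.11.1 and 1.11.2 should localise where
# the hot loop changes.
@code_typed A * B                    # typed Julia IR
@code_llvm debuginfo=:none A * B     # LLVM IR
@code_native debuginfo=:none A * B   # native assembly
```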
Having looked into it, the performance difference comes from an optimization done by LLVM only in the fast version. Since we don't want performance of this magnitude to randomly appear or disappear according to the whims of the LLVM optimizer, I have manually implemented it in a PR to LinearAlgebra: JuliaLang/LinearAlgebra.jl#1202.
This manually adds the critical optimisation investigated in JuliaLang/julia#56954. While we could rely on LLVM to continue doing this optimisation, it's more robust to add it manually.
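A hedged sketch of the general idea behind such a manual optimisation: hoist the `isone(alpha)` check out of the hot loop so each specialised loop body contains no per-element branch. The function and names below are illustrative, not the actual LinearAlgebra.jl code from the PR:

```julia
# Hoist the scaling branch out of the hot loop: LLVM no longer has to
# discover this transformation on its own, so the fast path can't be
# lost to a change in the optimizer's whims.
function scaled_copy!(C::AbstractMatrix, B::AbstractMatrix, alpha)
    if isone(alpha)
        @inbounds for i in eachindex(B, C)  # branch-free fast path
            C[i] = B[i]
        end
    else
        @inbounds for i in eachindex(B, C)  # general scaled path
            C[i] = B[i] * alpha
        end
    end
    return C
end
```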
Fixed in #57387 and JuliaLang/LinearAlgebra.jl#1202.
I believe there might be a significant performance regression between 1.11.1 and 1.11.2. I encountered this after upgrading and have managed to pin it down to multiplying two large `BitMatrix` objects. I found the following (after running the multiplication a couple of times already):
1.11.0:
1.11.1:
1.11.2:
A significant decrease in GC time, but vastly outweighed by the increase in runtime.
I did run some profiling on a fuller example (where I encountered this) and found a large increase associated with a `setindex!` call:
1.11.0:
1.11.2:
The difference doesn't seem to appear for a pair of similarly sized `Matrix{Int}` objects:
1.11.1:
1.11.2:
Hopefully someone can reproduce this.
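A minimal script along these lines should reproduce it; the matrix size here is illustrative (smaller than "large" so the snippet stays quick, larger sizes make the gap clearer):

```julia
using Random

n = 512                 # illustrative size; the regression grows with n
A = bitrand(n, n)
B = bitrand(n, n)

A * B                   # warm up compilation first
@time A * B             # compare this timing across 1.11.1 and 1.11.2
```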
My system: