
Add fast path in generic matmul #1202

Merged
merged 3 commits into master from fast_bitmatrix on Feb 14, 2025

Conversation

jakobnissen
Contributor

jakobnissen commented Feb 12, 2025

This manually adds the critical optimisation investigated in JuliaLang/julia#56954. While we could rely on LLVM to continue doing this optimisation, it's more robust to add it manually.
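
For illustration, a minimal sketch of what such a fast path looks like in a generic matmul kernel; the names and loop structure are schematic, not the exact diff:

# Sketch of the fast path (illustrative, not the exact change).
# The kernel accumulates C[m,n] += A[m,k] * (B[k,n] * alpha).
function matmul_kernel!(C, A, B, alpha)
    for n in axes(B, 2), k in axes(A, 2)
        Balpha = B[k, n] * alpha
        # Fast path: if the scalar is zero, the inner loop below would
        # leave C unchanged, so skip it entirely rather than relying on
        # LLVM to hoist a zero check out of the loop.
        iszero(Balpha) && continue
        @simd for m in axes(A, 1)
            C[m, n] = muladd(A[m, k], Balpha, C[m, n])
        end
    end
    return C
end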

@jishnub
Collaborator

jishnub commented Feb 13, 2025

The performance regression doesn't seem to exist on the Julia master branch, which this repo is kept in sync with; this is probably because of various other refactorings. The issue is seen on v1.11.

We may still add this here if it seems like a good idea; we'll just have to remember to make a separate PR against v1.11.


codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.88%. Comparing base (2a1696a) to head (cbe2f39).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1202      +/-   ##
==========================================
+ Coverage   91.86%   91.88%   +0.02%     
==========================================
  Files          34       34              
  Lines       15365    15366       +1     
==========================================
+ Hits        14115    14119       +4     
+ Misses       1250     1247       -3     


KristofferC added the backport 1.11 (Change should be backported to the 1.11 release) label on Feb 13, 2025
giordano added the backport 1.12 (Change should be backported to release-1.12) label and removed the backport 1.11 label on Feb 13, 2025
@jishnub
Collaborator

jishnub commented Feb 13, 2025

Also, would you mind elaborating on what the specific optimization is in this case, and how you tracked it down?

@jakobnissen
Contributor Author

Yes - the optimisation is that whenever Balpha == 0, the entire loop:

@simd for m in axes(A, 1)
    C[m,n] = muladd(A[m,k], Balpha, C[m,n])
end

has no effect and can be skipped. You can verify for yourself on Julia 1.11.1 that a large BitMatrix of all zeros is much faster to multiply than one of all ones (see the sketch below).
I still don't know why this optimisation is only applied sometimes.
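
A quick way to check this, as a sketch using BenchmarkTools (sizes are illustrative; the exact speedup varies by machine and Julia version):

using BenchmarkTools

Z = falses(3000, 3000)  # BitMatrix of all zeros
O = trues(3000, 3000)   # BitMatrix of all ones

@btime $Z * $Z;  # every inner loop is skippable: fast
@btime $O * $O;  # full work on every iteration: far slower on 1.11.1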

I tracked it down by noticing that

  • The difference in speed is 1000x for 3000x3000 matrices, which is far too large to be explained by low-level details like whether the loop vectorises, and instead points to an algorithmic difference
  • Using LinuxPerf (on a suggestion from Gabriel Baraldi) showed that the fast version executed several hundred times fewer instructions and cache accesses, which also points to less work actually being done, as opposed to the same work being done more efficiently (see the measurement sketch after this list)
  • Then, I carefully studied the generated native code until I found a jump instruction in the fast version that wasn't present in the slow one.
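
For reference, a sketch of that kind of counter comparison, assuming LinuxPerf.jl's @pstats macro (Linux only; hardware-event names follow perf conventions and availability is CPU-dependent):

using LinuxPerf

Z = falses(3000, 3000)
O = trues(3000, 3000)
Z * Z; O * O;  # warm up so compilation isn't measured

# The fast (all-zeros) case should retire hundreds of times fewer
# instructions and cache references than the slow (all-ones) case.
@pstats "instructions,cache-references,cache-misses" Z * Z
@pstats "instructions,cache-references,cache-misses" O * O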

@jishnub
Collaborator

jishnub commented Feb 13, 2025

Thanks a lot! I think this is good to merge.

jishnub merged commit ed35a37 into JuliaLang:master on Feb 14, 2025
4 checks passed
jakobnissen deleted the fast_bitmatrix branch on February 14, 2025
Labels
backport 1.12 (Change should be backported to release-1.12), performance (Must go faster)