
Add fast path in generic matmul #1202

Merged
merged 3 commits into master from fast_bitmatrix on Feb 14, 2025

Conversation

jakobnissen
Contributor

jakobnissen commented Feb 12, 2025

This manually adds the critical optimisation investigated in JuliaLang/julia#56954. While we could rely on LLVM to continue doing this optimisation, it's more robust to add it manually.
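
For illustration, a minimal sketch of what such a fast path looks like in a generic matmul kernel; the names and loop structure are schematic, not the exact diff:

# Sketch of the fast path (illustrative, not the exact change).
# The kernel accumulates C[m,n] += A[m,k] * (B[k,n] * alpha).
function matmul_kernel!(C, A, B, alpha)
    for n in axes(B, 2), k in axes(A, 2)
        Balpha = B[k, n] * alpha
        # Fast path: if the scalar is zero, the inner loop below would
        # leave C unchanged, so skip it entirely rather than relying on
        # LLVM to hoist a zero check out of the loop.
        iszero(Balpha) && continue
        @simd for m in axes(A, 1)
            C[m, n] = muladd(A[m, k], Balpha, C[m, n])
        end
    end
    return C
end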

@jishnub
Collaborator

jishnub commented Feb 13, 2025

The performance regression doesn't seem to exist on the Julia master branch, which this repo is kept in sync with; this is probably because of various other refactorings. The issue is seen on v1.11.

We may still add this here if it seems like a good idea; we'll just have to remember to make a separate PR against v1.11.


codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.88%. Comparing base (2a1696a) to head (cbe2f39).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1202      +/-   ##
==========================================
+ Coverage   91.86%   91.88%   +0.02%     
==========================================
  Files          34       34              
  Lines       15365    15366       +1     
==========================================
+ Hits        14115    14119       +4     
+ Misses       1250     1247       -3     


KristofferC added the backport 1.11 (Change should be backported to the 1.11 release) label on Feb 13, 2025
giordano added the backport 1.12 (Change should be backported to release-1.12) label and removed the backport 1.11 label on Feb 13, 2025
@jishnub
Collaborator

jishnub commented Feb 13, 2025

Also, would you mind elaborating on what the specific optimization is in this case, and how you tracked it down?

@jakobnissen
Contributor Author

Yes - the optimisation is that whenever Balpha == 0, the entire loop:

@simd for m in axes(A, 1)
    C[m,n] = muladd(A[m,k], Balpha, C[m,n])
end

has no effect and can be skipped. You can verify for yourself on Julia 1.11.1 that a large BitMatrix of all zeros is much faster to multiply than one of all ones (see the sketch below).
I still don't know why this optimisation is only applied sometimes.
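
A quick way to check this, as a sketch using BenchmarkTools (sizes are illustrative; the exact speedup varies by machine and Julia version):

using BenchmarkTools

Z = falses(3000, 3000)  # BitMatrix of all zeros
O = trues(3000, 3000)   # BitMatrix of all ones

@btime $Z * $Z;  # every inner loop is skippable: fast
@btime $O * $O;  # full work on every iteration: far slower on 1.11.1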

I tracked it down by noticing that

  • The difference in speed is 1000x for 3000x3000 matrices, which is far too large to be explained by low-level details like whether the loop vectorises, and instead points to an algorithmic difference
  • Using LinuxPerf (on a suggestion from Gabriel Baraldi) showed that the fast version executed several hundred times fewer instructions and cache accesses, which also points to less work actually being done, as opposed to the same work being done more efficiently (see the measurement sketch after this list)
  • Then, I carefully studied the generated native code until I found a jump instruction in the fast version that wasn't present in the slow one.
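
For reference, a sketch of that kind of counter comparison, assuming LinuxPerf.jl's @pstats macro (Linux only; hardware-event names follow perf conventions and availability is CPU-dependent):

using LinuxPerf

Z = falses(3000, 3000)
O = trues(3000, 3000)
Z * Z; O * O;  # warm up so compilation isn't measured

# The fast (all-zeros) case should retire hundreds of times fewer
# instructions and cache references than the slow (all-ones) case.
@pstats "instructions,cache-references,cache-misses" Z * Z
@pstats "instructions,cache-references,cache-misses" O * O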

@jishnub
Collaborator

jishnub commented Feb 13, 2025

Thanks a lot! I think this is good to merge.

jishnub merged commit ed35a37 into JuliaLang:master on Feb 14, 2025
4 checks passed
jakobnissen deleted the fast_bitmatrix branch on February 14, 2025
Labels
backport 1.12 (Change should be backported to release-1.12), performance (Must go faster)