[Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307
Conversation
// Currently we always do C promotion when the problem doesn't align with
// the intrinsic. TODO (jerryyin): Make this a tuning parameter when we are
// capable of not doing C promotion in unaligned cases.
bool doCPromotion = !mustBeAligned;
We are already capable of doing C promotion in the aligned case, see this test:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir#L1031
In the unaligned case, the previous dependent commit will add a pipeline test for it, but we have a C promotion test for that too:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_promote_matmul_operands.mlir#L136
I think what we want is to have `doCPromotion` as an input argument to `deduceMMASchedule`. Maybe that's what you meant by the TODO above, but it can already be done, and we shouldn't try to tie it implicitly to the alignment.
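A rough sketch of the kind of signature change being suggested; the surrounding parameter names and types here are assumptions for illustration, not copied from `GPUHeuristics.h`:

```cpp
// Hypothetical sketch only: thread doCPromotion through as an explicit
// parameter instead of deriving it from mustBeAligned inside the heuristic.
FailureOr<GPUMMASchedule>
deduceMMASchedule(const GPUMatmulShapeType &problem,
                  ArrayRef<GPUMatmulShapeType> intrinsics,
                  const GPUMMAHeuristicSeeds &seeds,
                  int64_t sharedMemLimitInBytes, int64_t subgroupSize,
                  bool transposedLhs, bool transposedRhs, bool canUpcastAcc,
                  bool mustBeAligned, bool doCPromotion);

// The caller then decides promotion explicitly (e.g. from a tuning
// parameter) rather than tying it to alignment:
//   bool doCPromotion = ...;
//   auto schedule = deduceMMASchedule(problem, intrinsics, seeds, limit,
//                                     subgroupSize, transposedLhs,
//                                     transposedRhs, canUpcastAcc,
//                                     mustBeAligned, doCPromotion);
```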
Agreed. I double checked the code base for scenarios where we'd `setPromotedOperandList(..., {0,1,2})` and confirmed that the new set you've added is the only place we care about. (The other place, set attention vector distribute, is less relevant and has its own heuristic derivation logic.)

I've also investigated the other alternative: instead of carrying `doCPromotion` as an argument through the stack while deriving the matmul lowering config, directly query the `promote_operands` field from the `#iree_gpu.lowering_config`. As it turns out, that field is only written after all the tuning parameters have been deduced, so ordering-wise this alternative would be no better than just passing `doCPromotion` as an argument, as currently implemented.

With this, I'll rebase and add `doCPromotion` through the calling stack in the following commit.
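For reference, a minimal sketch of how that plumbing could look at the point where the promoted operands are recorded; `setPromotedOperandList` is the helper named above, but its exact signature and the surrounding variable names are assumptions:

```cpp
// Hypothetical sketch: choose which operands to promote based on an explicit
// doCPromotion flag carried through the call stack. The helper name comes
// from the discussion above; its signature and the attrs variable are
// assumed for illustration.
SmallVector<int64_t> promotedOperands = {0, 1};  // always promote A and B
if (doCPromotion)
  promotedOperands.push_back(2);  // also promote the C/accumulator operand
setPromotedOperandList(context, loweringConfigAttrs, promotedOperands);
```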
Force-pushed from 6c9ffd2 to 7e8f9d3:
- This is a followup of iree-org#19271
- This takes gemmC promotion into consideration when computing shared memory consumption

Signed-off-by: jerryyin <zhuoryin@amd.com>
Force-pushed from 7e8f9d3 to dc0a924.
LGTM
… shapes (#19484)

This PR does two things:

1. Allow all GEMM shapes to use the padded TileAndFuse matmul configuration. This is still behind the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, PRs that have landed in the past month make it possible to relax the guards we originally had on this: #19196, #19307, llvm/llvm-project#117340.
2. Allow fused producers to use the padded TileAndFuse matmul configuration. The following PRs make this possible now: #19399, llvm/llvm-project#119039.

Together this allows us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we use by default. [Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0) is the performance difference observed in conv cases in iree-kernel-benchmark-module that utilize this change; a median speedup of 2.26x was observed.

The numeric changes I observed with enabling this path were the same between any aligned shape when comparing intrinsic vs no intrinsic use. Generally some differences are noticed for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute errors, we may have to change some test values to account for this change. The perf difference in CI seems to be within noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

Signed-off-by: Nirvedh <nirvedh@gmail.com>
This PR depends on #19271.

The existing implementation from #19271 doesn't take gemmC into consideration when computing the shared memory size. However, under the conditions of #19271, gemmC always gets promoted, so we always end up allocating the C tensor in shared memory. Ignoring the C tensor severely underestimates the amount of shared memory used and eventually causes `deduceMMASchedule()` to pick a tile size that is too large and exceeds the shared memory limit.

This PR addresses that by adding `calculateResultSharedMemoryUsedInBytes()` and applying it to the matmul tile size derivation process.
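For illustration, a minimal self-contained sketch of the accounting this adds. The real `calculateResultSharedMemoryUsedInBytes()` operates on the schedule and intrinsic types inside the config deduction, so the scalar tile-size and bitwidth parameters below are simplified stand-ins:

```cpp
#include <cstdint>

// Shared memory consumed by the promoted A and B tiles for one workgroup.
int64_t calculateOperandsSharedMemoryUsedInBytes(int64_t mTileSize,
                                                 int64_t nTileSize,
                                                 int64_t kTileSize,
                                                 int64_t lhsBitwidth,
                                                 int64_t rhsBitwidth) {
  return (mTileSize * kTileSize * lhsBitwidth +
          nTileSize * kTileSize * rhsBitwidth) / 8;
}

// Shared memory consumed by the promoted C tile; zero when C is not promoted.
int64_t calculateResultSharedMemoryUsedInBytes(int64_t mTileSize,
                                               int64_t nTileSize,
                                               int64_t resultBitwidth,
                                               bool doCPromotion) {
  return doCPromotion ? (mTileSize * nTileSize * resultBitwidth) / 8 : 0;
}

// Schedule validity then checks the sum against the per-workgroup limit, so
// ignoring the C tile no longer lets an oversized tile size slip through.
bool fitsInSharedMemory(int64_t mTile, int64_t nTile, int64_t kTile,
                        int64_t lhsBits, int64_t rhsBits, int64_t accBits,
                        bool doCPromotion, int64_t sharedMemLimitInBytes) {
  int64_t used = calculateOperandsSharedMemoryUsedInBytes(
                     mTile, nTile, kTile, lhsBits, rhsBits) +
                 calculateResultSharedMemoryUsedInBytes(mTile, nTile, accBits,
                                                        doCPromotion);
  return used <= sharedMemLimitInBytes;
}
```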