[Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307
Conversation
// Currently we always do C promotion when the problem doesn't align with
// the intrinsic. TODO (jerryyin): Make this a tuning parameter when we are
// capable of not doing C promotion in unaligned cases.
bool doCPromotion = !mustBeAligned;
We are already capable of doing C promotion in the aligned case, see this test:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir#L1031
In the unaligned case, the previous dependent commit will add a pipeline test for it, but we have a C promotion test for that too:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_promote_matmul_operands.mlir#L136
I think what we want is to have `doCPromotion` as an input argument to `deduceMMASchedule`. Maybe that's what you meant by the TODO above, but it can already be done, and we shouldn't try to tie it implicitly to the alignment.
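A rough sketch of the kind of signature change being suggested; the surrounding parameter names and types here are assumptions for illustration, not copied from `GPUHeuristics.h`:

```cpp
// Hypothetical sketch only: thread doCPromotion through as an explicit
// parameter instead of deriving it from mustBeAligned inside the heuristic.
FailureOr<GPUMMASchedule>
deduceMMASchedule(const GPUMatmulShapeType &problem,
                  ArrayRef<GPUMatmulShapeType> intrinsics,
                  const GPUMMAHeuristicSeeds &seeds,
                  int64_t sharedMemLimitInBytes, int64_t subgroupSize,
                  bool transposedLhs, bool transposedRhs, bool canUpcastAcc,
                  bool mustBeAligned, bool doCPromotion);

// The caller then decides promotion explicitly (e.g. from a tuning
// parameter) rather than tying it to alignment:
//   bool doCPromotion = ...;
//   auto schedule = deduceMMASchedule(problem, intrinsics, seeds, limit,
//                                     subgroupSize, transposedLhs,
//                                     transposedRhs, canUpcastAcc,
//                                     mustBeAligned, doCPromotion);
```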
Agreed. I double checked the code base for scenarios where we'd `setPromotedOperandList(..., {0,1,2})` and confirmed that the new set you've added is the only place we care about. (The other place, set attention vector distribute, is less relevant and has its own heuristic derivation logic.)

I've also investigated the other alternative: instead of carrying `doCPromotion` as an argument through the stack while deriving the matmul lowering config, directly query the `promote_operands` field from the `#iree_gpu.lowering_config`. As it turns out, that field is only written after all the tuning parameters have been deduced, so ordering-wise this alternative would be no better than just passing `doCPromotion` as an argument, as currently implemented.

With this, I'll rebase and add `doCPromotion` through the calling stack in the following commit.
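For reference, a minimal sketch of how that plumbing could look at the point where the promoted operands are recorded; `setPromotedOperandList` is the helper named above, but its exact signature and the surrounding variable names are assumptions:

```cpp
// Hypothetical sketch: choose which operands to promote based on an explicit
// doCPromotion flag carried through the call stack. The helper name comes
// from the discussion above; its signature and the attrs variable are
// assumed for illustration.
SmallVector<int64_t> promotedOperands = {0, 1};  // always promote A and B
if (doCPromotion)
  promotedOperands.push_back(2);  // also promote the C/accumulator operand
setPromotedOperandList(context, loweringConfigAttrs, promotedOperands);
```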
Force-pushed from 6c9ffd2 to 7e8f9d3:
- This is a followup of iree-org#19271
- This takes gemmC promotion into consideration when computing shared memory consumption

Signed-off-by: jerryyin <zhuoryin@amd.com>
Force-pushed from 7e8f9d3 to dc0a924.
LGTM
… shapes (#19484)

This PR does two things:

1. Allow all GEMM shapes to use the padded TileAndFuse matmul configuration. This is still behind the `iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default and does not change the default behavior. However, PRs that have landed in the past month make it possible to relax the guards we originally had on this: #19196, #19307, llvm/llvm-project#117340.
2. Allow fused producers to use the padded TileAndFuse matmul configuration. The following PRs make this possible now: #19399, llvm/llvm-project#119039.

Together this allows us to do padded IGEMM with intrinsics for shapes unaligned to the intrinsic, which we use by default. [Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0) is the performance difference observed in conv cases in iree-kernel-benchmark-module that utilize this change; a median speedup of 2.26x was observed.

The numeric changes I observed with enabling this path were the same between any aligned shape when comparing intrinsic vs no intrinsic use. Generally some differences are noticed for narrow types like f16, but they are within a relative error of 0.001; since our tests use absolute errors, we may have to change some test values to account for this change. The perf difference in CI seems to be within noise margin compared to main: https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

Signed-off-by: Nirvedh <nirvedh@gmail.com>
This PR depends on #19271.

The existing implementation from #19271 doesn't take gemmC into consideration when computing the shared memory size. However, under the conditions of #19271, gemmC always gets promoted, so we always end up allocating the C tensor in shared memory. Ignoring the C tensor severely underestimates the amount of shared memory used and eventually causes `deduceMMASchedule()` to pick a tile size that is too large and exceeds the shared memory limit.

This PR addresses that by adding `calculateResultSharedMemoryUsedInBytes()` and applying it to the matmul tile size derivation process.
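For illustration, a minimal self-contained sketch of the accounting this adds. The real `calculateResultSharedMemoryUsedInBytes()` operates on the schedule and intrinsic types inside the config deduction, so the scalar tile-size and bitwidth parameters below are simplified stand-ins:

```cpp
#include <cstdint>

// Shared memory consumed by the promoted A and B tiles for one workgroup.
int64_t calculateOperandsSharedMemoryUsedInBytes(int64_t mTileSize,
                                                 int64_t nTileSize,
                                                 int64_t kTileSize,
                                                 int64_t lhsBitwidth,
                                                 int64_t rhsBitwidth) {
  return (mTileSize * kTileSize * lhsBitwidth +
          nTileSize * kTileSize * rhsBitwidth) / 8;
}

// Shared memory consumed by the promoted C tile; zero when C is not promoted.
int64_t calculateResultSharedMemoryUsedInBytes(int64_t mTileSize,
                                               int64_t nTileSize,
                                               int64_t resultBitwidth,
                                               bool doCPromotion) {
  return doCPromotion ? (mTileSize * nTileSize * resultBitwidth) / 8 : 0;
}

// Schedule validity then checks the sum against the per-workgroup limit, so
// ignoring the C tile no longer lets an oversized tile size slip through.
bool fitsInSharedMemory(int64_t mTile, int64_t nTile, int64_t kTile,
                        int64_t lhsBits, int64_t rhsBits, int64_t accBits,
                        bool doCPromotion, int64_t sharedMemLimitInBytes) {
  int64_t used = calculateOperandsSharedMemoryUsedInBytes(
                     mTile, nTile, kTile, lhsBits, rhsBits) +
                 calculateResultSharedMemoryUsedInBytes(mTile, nTile, accBits,
                                                        doCPromotion);
  return used <= sharedMemLimitInBytes;
}
```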