
[Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul #19307

Merged 1 commit into iree-org:main from unaligned_mm on Dec 6, 2024

Conversation


@jerryyin jerryyin commented Nov 26, 2024

This PR depends on #19271.

The existing implementation from #19271 doesn't take gemmC into account when computing the shared memory size. However, under the conditions of #19271, gemmC is always promoted, so we always end up allocating the C tensor in shared memory. Ignoring the C tensor severely underestimates the amount of shared memory used, and eventually causes deduceMMASchedule() to pick a tile size that is too large and exceeds the shared memory limit.

This PR addresses that by adding calculateResultSharedMemoryUsedInBytes() and applying it in the matmul tile-size derivation process.
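To make the accounting concrete, here is a minimal sketch of the idea. calculateResultSharedMemoryUsedInBytes() is the name this PR adds, but the signature, the fitsInSharedMemory() helper, and all parameter names below are illustrative assumptions, not the actual IREE code:

```cpp
#include <cstdint>

// Bytes of shared memory needed for the promoted C (result) tile.
// mTile/nTile are the workgroup tile sizes along M and N; cBitwidth is the
// bitwidth of the C element type (e.g. 32 for f32 accumulators).
static int64_t calculateResultSharedMemoryUsedInBytes(int64_t mTile,
                                                      int64_t nTile,
                                                      int64_t cBitwidth) {
  return mTile * nTile * cBitwidth / 8;
}

// Hypothetical helper: when checking a candidate MMA schedule, the C tile is
// added to the LHS/RHS usage before comparing against the hardware limit.
static bool fitsInSharedMemory(int64_t lhsBytes, int64_t rhsBytes,
                               int64_t mTile, int64_t nTile, int64_t cBitwidth,
                               int64_t sharedMemLimitBytes, bool doCPromotion) {
  int64_t total = lhsBytes + rhsBytes;
  if (doCPromotion)
    total += calculateResultSharedMemoryUsedInBytes(mTile, nTile, cBitwidth);
  return total <= sharedMemLimitBytes;
}
```

Without the `doCPromotion` term, a schedule whose A and B tiles fit can still blow past the limit once the C tile is actually allocated, which is the failure mode described above.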

@jerryyin jerryyin changed the title Compute gemmC size when C promotion is done in padding matmul [Codegen][llvmgpu] Compute gemmC size when C promotion is done in padding matmul Nov 26, 2024
This review thread is attached to the following hunk from the change:

```cpp
// Currently we always do C promotion when the problem doesn't align with
// the intrinsic. TODO (jerryyin): Make this a tuning parameter when we are
// capable of not doing C promotion in unaligned cases.
bool doCPromotion = !mustBeAligned;
```
@nirvedhmeshram nirvedhmeshram commented Dec 2, 2024


We are already capable of doing C promotion in the aligned case; see this test:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/LLVMGPU/test/ROCDL/pipeline_tile_and_fuse.mlir#L1031
In the unaligned case, the previous dependent commit will add a pipeline test for it, but we have a C promotion test for that too:
https://github.com/iree-org/iree/blob/main/compiler/src/iree/compiler/Codegen/Common/GPU/test/gpu_promote_matmul_operands.mlir#L136

I think what we want is to have doCPromotion as an input argument to deduceMMASchedule. Maybe that's what you meant by the TODO above, but it can already be done, and we shouldn't try to tie it implicitly to alignment.
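A sketch of the suggested plumbing. deduceMMASchedule() exists in IREE, but every type, signature, and tile value below is a simplified stand-in for illustration, not IREE's actual API:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for IREE's GPUMMASchedule.
struct MMAScheduleSketch {
  int64_t mTile, nTile, kTile;
};

// Before (implicit): the deduction derived the promotion decision internally:
//   bool doCPromotion = !mustBeAligned;
// After (explicit): the caller passes doCPromotion, so aligned cases can opt
// in to C promotion as well, independent of alignment.
std::optional<MMAScheduleSketch>
deduceMMAScheduleSketch(int64_t sharedMemLimitBytes, bool mustBeAligned,
                        bool doCPromotion) {
  (void)sharedMemLimitBytes;
  (void)mustBeAligned;
  // ... search tile sizes, counting the C tile in shared memory only when
  // doCPromotion is set ...
  if (!doCPromotion)
    return MMAScheduleSketch{64, 64, 8}; // no C tile, more room for A/B
  return MMAScheduleSketch{32, 32, 8};   // smaller tiles fit alongside C
}
```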

@jerryyin jerryyin (Member, Author)

Agreed. I double-checked the code base for scenarios where we'd setPromotedOperandList(..., {0, 1, 2}); and confirmed that the new spot you've added is the only place we care about. (The other place, the attention vector-distribute setup, is less relevant and has its own heuristic derivation logic.)

I've also investigated the other alternative: instead of carrying doCPromotion as an argument through the stack while deriving the matmul lowering config, directly querying the promote_operands field of the #iree_gpu.lowering_config. As it turns out, that field is only written after all tuning parameters have been deduced, so ordering-wise this alternative is no better than simply passing doCPromotion as an argument, as currently implemented.

With this, I'll rebase and add doCPromotion through the calling stack in the following commit.

- This is a followup to iree-org#19271
- This takes gemmC promotion into consideration when computing shared
  memory consumption

Signed-off-by: jerryyin <zhuoryin@amd.com>
@nirvedhmeshram nirvedhmeshram left a comment


LGTM

@nirvedhmeshram nirvedhmeshram merged commit d2c8e5e into iree-org:main Dec 6, 2024
37 of 38 checks passed
nirvedhmeshram added a commit that referenced this pull request Dec 18, 2024
… shapes (#19484)

This PR does two things:
1. Allow all GEMM shapes to use the padded TileAndFuse matmul configuration.
This is still behind the
`iree-codegen-llvmgpu-test-tile-and-fuse-matmul=false` flag by default
and does not change the default behavior. However, the following PRs,
which have landed in the past month, make it possible to relax the
guards we originally had on this:
#19196
#19307
llvm/llvm-project#117340
2. Allow fused producers to use the padded TileAndFuse matmul
configuration. The following PRs make this possible now:
#19399
llvm/llvm-project#119039

Together, these allow us to do padded IGEMM with intrinsics for shapes
unaligned to the intrinsic, which we use by default.
[Here](https://docs.google.com/spreadsheets/d/1O-SdUZCn5pHsxx7JTGjIIdH6PWCFnvlfe4XBbjEBaIM/edit?gid=0#gid=0)
is the performance difference observed in conv cases in
iree-kernel-benchmark-module that utilize this change. A median speedup
of 2.26x was observed.

The numeric changes I observed with this path enabled were the same
for any aligned shape when comparing intrinsic vs. no-intrinsic use.
Some differences are generally noticed for narrow types like f16, but
they are within a relative error of 0.001; since our tests use absolute
errors, we may have to change some test values to account for this
change.
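For reference, the distinction between the two tolerance modes (a generic sketch, not IREE's actual test harness):

```cpp
#include <cmath>

// Absolute tolerance: |got - want| <= atol. Can fail for large magnitudes
// even when the values agree to many significant digits.
bool absClose(double got, double want, double atol) {
  return std::fabs(got - want) <= atol;
}

// Relative tolerance: |got - want| <= rtol * |want|. The f16 differences
// described above stay within rtol = 0.001 under this metric, but may exceed
// a fixed absolute tolerance when the values themselves are large.
bool relClose(double got, double want, double rtol) {
  return std::fabs(got - want) <= rtol * std::fabs(want);
}
```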

The perf differences in CI seem to be within the noise margin compared
to main:
https://github.com/iree-org/iree/actions/runs/12323399269/attempts/1#summary-34399247902

---------

Signed-off-by: Nirvedh <nirvedh@gmail.com>
@jerryyin jerryyin deleted the unaligned_mm branch January 3, 2025 20:01