Poor performance of horizontally fused contraction #19855
This was referenced Jan 30, 2025
nirvedhmeshram added a commit that referenced this issue on Feb 4, 2025:
…#19857) This PR matches the failure criteria to the criteria used in SetContractConfig for vector distribute when bailing out on skinny matmuls. With this change, the dispatch in #19855 takes 0.068 ms versus 1.64 ms on the default path: this skinny matmul with multiple dims cannot currently be supported by vector reduction or warp reduction, but tile and fuse can support it using padding. This needs the flag `--iree-codegen-llvmgpu-test-tile-and-fuse-matmul=true`.

Also, `GPUMatmulShapeType` was becoming too large because this PR adds batch sizes to it, which produced the following error:

```
error: static_assert failed due to requirement 'sizeof(mlir::iree_compiler::GPUMatmulShapeType) <= 256' "You are trying to use a default number of inlined elements for SmallVector<T> but sizeof(T) is really big! Please use an explicit number of inlined elements with SmallVector<T, N> to make sure you really want that much inline storage."
```

This PR fixes that issue by explicitly specifying the inline vector sizes of the struct members.

--------

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
ita9naiwa pushed a commit to ita9naiwa/iree that referenced this issue on Feb 4, 2025:
…iree-org#19884) Efforts such as iree-org#19854 and iree-org#19520 are ongoing to enable the Tile and Fuse matmul pipeline by default, but they are still WIP in reaching exact parity with the current default, Vector Distribute, in all use cases. In the meantime, this PR tries Tile and Fuse after Vector Distribute, so that we get the benefits of tile and fuse, such as handling shapes unaligned to the intrinsic, while leaving the shapes that Vector Distribute handles untouched.

Fixes: iree-org#19864
Fixes: iree-org#19855

--------

Signed-off-by: Nirvedh Meshram <nirvedh@gmail.com>
Signed-off-by: Hyunsung Lee <ita9naiwa@gmail.com>
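The fallback order described in that commit message can be sketched as a simple "try the default, fall back on bail-out" selection. This is an illustrative sketch only; the function and type names are hypothetical, not the actual IREE API:

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct MatmulShape { int64_t m, n, k; };
struct Config { std::string pipeline; };

// Stand-in: Vector Distribute bails out on shapes unaligned to the
// intrinsic (a simplified, assumed criterion for illustration).
std::optional<Config> setVectorDistributeConfig(const MatmulShape &s) {
  const int64_t intrinsic = 16;  // assumed intrinsic size
  if (s.m % intrinsic || s.n % intrinsic || s.k % intrinsic)
    return std::nullopt;  // unaligned: bail out
  return Config{"vector_distribute"};
}

// Stand-in: Tile and Fuse can pad, so it also accepts unaligned shapes.
std::optional<Config> setTileAndFuseConfig(const MatmulShape &s) {
  (void)s;
  return Config{"tile_and_fuse"};
}

// Try the current default first; fall back only when it bails out, so
// shapes Vector Distribute already handles are left untouched.
std::optional<Config> selectConfig(const MatmulShape &s) {
  if (auto c = setVectorDistributeConfig(s)) return c;
  return setTileAndFuseConfig(s);
}
```

With this ordering, an aligned shape keeps its current Vector Distribute configuration, while an unaligned one picks up Tile and Fuse instead of failing outright.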
What happened?
See #19847 for some context: the changes there had the side effect of allowing more horizontal fusion, which led to a slow dispatch.
I noticed that disabling horizontal fusion also improved UNet performance on the base commit (though by less), so this issue may be affecting main as well.
Steps to reproduce your issue
Running fp16 UNet at dd66e55 will reproduce the slow dispatch. It can also be obtained by downloading this gist.
Version information
The dispatch was created using dd66e55 from the PR, but the codegen issues are reproducible from main at 50a7087 (the PR's base commit).