
[GPU] Match Tile And Fuse skinny matmul bail-out to Vector Distribute #19857

Merged

Conversation


@nirvedhmeshram nirvedhmeshram commented Jan 30, 2025

This PR matches the Tile and Fuse bail-out criteria for skinny matmuls to what we have in SetContractConfig for Vector Distribute.

With this change, the dispatch in #19855 runs in 0.068 ms versus 1.64 ms on the default path: this skinny matmul with multiple dims is not currently supported by vector reduction or warp reduction, but Tile and Fuse can handle it using padding.

This needs the flag `--iree-codegen-llvmgpu-test-tile-and-fuse-matmul=true`.

Also, `GPUMatmulShapeType` was becoming too large because this PR adds batch sizes to it, which triggered the following error:

```
error: static_assert failed due to requirement 'sizeof(mlir::iree_compiler::GPUMatmulShapeType) <= 256' "You are trying to use a default number of inlined elements for SmallVector<T> but sizeof(T) is really big! Please use an explicit number of inlined elements with SmallVector<T, N> to make sure you really want that much inline storage."
```

This PR fixes the issue by explicitly specifying the number of inlined elements for the SmallVector struct members.
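
For illustration, here is a minimal sketch of that kind of fix, assuming the struct roughly holds per-dimension size lists and element types; the member names and the inline capacity of 2 are assumptions, not the actual IREE definition:

```cpp
#include <cstdint>

#include "llvm/ADT/SmallVector.h"
#include "mlir/IR/Types.h"

// Sketch only: giving each SmallVector member an explicit inline capacity
// (2 here) bounds sizeof(GPUMatmulShapeType), so code that stores these in a
// default-sized SmallVector<GPUMatmulShapeType> no longer trips the
// sizeof(T) <= 256 static_assert.
struct GPUMatmulShapeType {
  llvm::SmallVector<int64_t, 2> mSizes;
  llvm::SmallVector<int64_t, 2> nSizes;
  llvm::SmallVector<int64_t, 2> kSizes;
  llvm::SmallVector<int64_t, 2> batchSizes;  // hedged guess at the new member
  mlir::Type aType, bType, cType;
};
```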


@kuhar kuhar left a comment


I think we'd need a bit more evidence to change this heuristic. Can you run this over matvec-like shapes with, say, 1x to 8x columns in the 'vector' operand and report the numbers? We can add these to iree-kernel-benchmark.

When I benchmarked this in the past, the vector reduction pipeline did a very good job on matvecs with 4 columns.

@nirvedhmeshram
Contributor Author

@kuhar would something like this work?
```python
for i in (1, 2, 4, 8):
    NARROW = [
        (i, 128, 128),
        (i, 256, 256),
        (i, 512, 512),
        (i, 1024, 1024),
        (i, 2048, 2048),
        (i, 4096, 4096),
        (i, 8192, 8192),
        (128, i, 128),
        (256, i, 256),
        (512, i, 512),
        (1024, i, 1024),
        (2048, i, 2048),
        (4096, i, 4096),
        (8192, i, 8192),
    ]
```

@MaheshRavishankar
Contributor

Can we make the path that doesn't map directly to an intrinsic because of its shapes go down Tile and Fuse by default?

@kuhar
Member

kuhar commented Jan 30, 2025

> Can we make the path that doesn't map directly to an intrinsic because of its shapes go down Tile and Fuse by default?

I think we need this case split:

  1. Multiple of the intrinsic of supported type --> Default (vector distribution or tile and fuse)
  2. Matvec-like --> warp reduction
  3. Everything else --> tile and fuse

@nirvedhmeshram nirvedhmeshram force-pushed the relax_matvec_threshold_further branch from 09ff19c to 01a6ae6 Compare January 31, 2025 19:14
@nirvedhmeshram nirvedhmeshram changed the title [GPU] Only dont do padding for pure matvecs [GPU] Match Tile And Fuse skinny matmul bail-out to Vector Distribute Jan 31, 2025
@nirvedhmeshram
Contributor Author

> Can we make the path that doesn't map directly to an intrinsic because of its shapes go down Tile and Fuse by default?

> I think we need this case split:
>
>   1. Multiple of the intrinsic of supported type --> Default (vector distribution or tile and fuse)
>   2. Matvec-like --> warp reduction
>   3. Everything else --> tile and fuse

Points 2 and 3 are in scope for this PR. I have reworked this PR so that we bail out only on those cases that vector reduction would actually support.
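
To make the case split concrete, here is a minimal standalone sketch; `MatmulShape`, `alignsToIntrinsic`, and the multiple-of-16 placeholder check are assumptions for illustration, not the actual IREE heuristic:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical problem description, not the IREE data structure.
struct MatmulShape {
  std::vector<int64_t> m, n, k, batch;
};

enum class Pipeline { VectorDistribute, WarpReduction, TileAndFuse };

// Placeholder alignment check: treat M/N/K sizes that are all multiples of 16
// as intrinsic-aligned. The real heuristic consults the target's MMA intrinsics.
static bool alignsToIntrinsic(const MatmulShape &shape) {
  auto aligned = [](const std::vector<int64_t> &dims) {
    for (int64_t d : dims)
      if (d % 16 != 0)
        return false;
    return true;
  };
  return aligned(shape.m) && aligned(shape.n) && aligned(shape.k);
}

// Matvec-like: one of the parallel dimensions collapses to 1, which is the
// skinny case that vector/warp reduction handles well.
static bool isMatvecLike(const MatmulShape &shape) {
  auto allOnes = [](const std::vector<int64_t> &dims) {
    for (int64_t d : dims)
      if (d != 1)
        return false;
    return true;
  };
  return allOnes(shape.m) || allOnes(shape.n);
}

static Pipeline selectPipeline(const MatmulShape &shape) {
  if (alignsToIntrinsic(shape))
    return Pipeline::VectorDistribute;  // aligned shapes: default pipelines
  if (isMatvecLike(shape))
    return Pipeline::WarpReduction;     // skinny cases vector reduction supports
  return Pipeline::TileAndFuse;         // everything else, handled via padding
}
```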

@nirvedhmeshram nirvedhmeshram requested a review from kuhar January 31, 2025 19:20
@nirvedhmeshram
Contributor Author

@kuhar I did not end up doing the shape sweep since I am not changing the heuristic.

@jerryyin jerryyin self-requested a review February 3, 2025 21:05

@jerryyin jerryyin left a comment


LGTM


@kuhar kuhar left a comment


Seems fine

@nirvedhmeshram nirvedhmeshram force-pushed the relax_matvec_threshold_further branch 3 times, most recently from eb9609f to 5e66560 Compare February 3, 2025 21:42
@nirvedhmeshram nirvedhmeshram force-pushed the relax_matvec_threshold_further branch from a378bfb to f613634 Compare February 3, 2025 23:10
@nirvedhmeshram nirvedhmeshram force-pushed the relax_matvec_threshold_further branch from d3b37a9 to e1c3217 Compare February 3, 2025 23:49
@nirvedhmeshram
Contributor Author

The W7900 runner seems to be down, but since the other runners are green, I believe this is okay to merge, so landing it.

@nirvedhmeshram nirvedhmeshram merged commit d96a3f0 into iree-org:main Feb 4, 2025
42 checks passed
ita9naiwa pushed a commit to ita9naiwa/iree that referenced this pull request Feb 4, 2025