
Get GEMMs working without minimize_global_loads #167

Merged
harsh-nod merged 2 commits into iree-org:main from fix_gemm on Sep 26, 2024

Conversation

@harsh-nod (Contributor) commented Sep 25, 2024

This PR removes the need for propagating indices using
post expansion. The new approach propagates the MMA
indices to the MMA dimensions of all tensors (rather
than just MMA nodes) and then specializes them depending
on whether they lie within the backward slices of the
LHS and RHS or forward slices of the ACC.
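
For context, "backward slice" and "forward slice" here mean transitive producers and consumers in the traced graph. A rough, generic sketch over a torch.fx-style graph (illustrative only; the helper names are made up and this is not the PR's actual API):

    # Hypothetical helpers over fx nodes, not the PR's actual API.
    def backward_slice(node):
        """All nodes the given node transitively depends on (its producers)."""
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            for inp in n.all_input_nodes:
                if inp not in seen:
                    seen.add(inp)
                    stack.append(inp)
        return seen

    def forward_slice(node):
        """All nodes that transitively consume the given node (its users)."""
        seen, stack = set(), [node]
        while stack:
            n = stack.pop()
            for user in n.users:
                if user not in seen:
                    seen.add(user)
                    stack.append(user)
        return seen

    # Conceptually, mma_slices would then be:
    # {MMA_LHS: backward_slice(mma.lhs), MMA_RHS: backward_slice(mma.rhs),
    #  MMA_ACC: forward_slice(mma.acc)}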

@harsh-nod force-pushed the fix_gemm branch 2 times, most recently from d9dffbe to d6f9844 on September 25, 2024 02:47
# Mark which MMA operand slice (LHS/RHS/ACC), if any, this node belongs to.
operand_map = {MMA_LHS: 0, MMA_RHS: 0, MMA_ACC: 0}
for key in mma_slices:
    if custom.fx_node in mma_slices[key]:
        operand_map[key] = 1
Contributor (review comment on the snippet above)
IIUC, for each iteration of for key in mma_slices we want operand_map to be
{MMA_LHS: 1, MMA_RHS: 0, MMA_ACC: 0}, then
{MMA_LHS: 0, MMA_RHS: 1, MMA_ACC: 0}, and then
{MMA_LHS: 0, MMA_RHS: 0, MMA_ACC: 1}.

But in the current state wouldn't this be
{MMA_LHS: 1, MMA_RHS: 0, MMA_ACC: 0}
{MMA_LHS: 1, MMA_RHS: 1, MMA_ACC: 0}
{MMA_LHS: 1, MMA_RHS: 1, MMA_ACC: 1}

Although if that is indeed what we are going for, can you explain the intuition behind it? :)

Contributor Author

So if a node is determined to be in the backward slice of the LHS, then we want to specialize it by substituting {MMA_LHS = 1, all else 0}. For RHS, we want {MMA_RHS = 1, all else 0}. For ACC, {MMA_ACC = 1, all else 0}. And if it's not in the backward slices of the LHS and RHS or the forward slice of the ACC, then {all = 0}. You can think of this as an alternative to propagation. Because we set the full indices everywhere, we need to specialize them depending on some constraints, and for that we use the forward/backward slices of the MMA operands.
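
A minimal sketch of that specialization rule, assuming the index is a sympy expression and reusing the names from the snippet under review (the specialize_index helper itself is hypothetical):

    def specialize_index(node, index_expr, mma_slices):
        # Hypothetical helper; mma_slices maps MMA_LHS/MMA_RHS to backward slices
        # and MMA_ACC to the forward slice, as described above.
        operand_map = {MMA_LHS: 0, MMA_RHS: 0, MMA_ACC: 0}
        for key, slice_nodes in mma_slices.items():
            if node in slice_nodes:
                # Exactly one specialization variable is 1; everything else stays 0.
                return index_expr.subs({**operand_map, key: 1})
        # Node is unrelated to the MMA: substitute 0 for all specialization variables.
        return index_expr.subs(operand_map)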

@raikonenfnu (Contributor) commented Sep 25, 2024

Makes sense. In that case I think we need to move the operand_map = {MMA_LHS: 0, MMA_RHS: 0, MMA_ACC: 0} initialization inside the loop, above the if custom.fx_node in mma_slices[key]:. Otherwise the previous state carries over, i.e. we will get:

iter_0 setting MMA_LHS, {MMA_LHS: 1, MMA_RHS: 0, MMA_ACC: 0}
iter_1 setting MMA_RHS,  {MMA_LHS: 1, MMA_RHS: 1, MMA_ACC: 0}
iter_2 setting MMA_ACC, {MMA_LHS: 1, MMA_RHS: 1, MMA_ACC: 1}
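
A sketch of the reordering being suggested (same names as the snippet under review; not the actual diff, and the trailing specialization step is elided):

    for key in mma_slices:
        # Re-initialize each iteration so a match from a previous key cannot carry over.
        operand_map = {MMA_LHS: 0, MMA_RHS: 0, MMA_ACC: 0}
        if custom.fx_node in mma_slices[key]:
            operand_map[key] = 1
            # ... use operand_map to specialize this node's index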

Contributor

Realized I put the wrong state in the previous comment; updated it to make more sense haha

Contributor Author

Ah yes, we would get carry-over, except for the fact that we return as soon as we get a match. So that guarantees that our dictionary's values will always have only one non-zero entry (= 1).

Contributor

OK, that makes sense; that's why it's implicitly functionally equivalent. Can we still bring it down though, for better clarity/straightforwardness? :)

@raikonenfnu (Contributor) left a comment

Code looks good overall, left some comments.

Also quick conceptual Q, so in this PR are we just obtaining the slices to collect all ops that will be impacted by indexing of MMA, and then adjusting their index_seq appropriately S.T it will work with the MMA indexing?

@raikonenfnu (Contributor)
> Code looks good overall, left some comments.
>
> Also quick conceptual Q, so in this PR are we just obtaining the slices to collect all ops that will be impacted by indexing of MMA, and then adjusting their index_seq appropriately S.T it will work with the MMA indexing?

Seems like the "specialization"/"adjusting" of index is only related with subbing in the MMA_ACC/MMA_LHS/MMA_RHS to respective op's index_seq, if this is not happening before, wouldn't we still have a symbolic indexing and the program won't run at all?

@raikonenfnu (Contributor) commented Sep 25, 2024

> Code looks good overall, left some comments.
> Also quick conceptual Q, so in this PR are we just obtaining the slices to collect all ops that will be impacted by indexing of MMA, and then adjusting their index_seq appropriately S.T it will work with the MMA indexing?
>
> Seems like the "specialization"/"adjusting" of index is only related with subbing in the MMA_ACC/MMA_LHS/MMA_RHS to respective op's index_seq, if this is not happening before, wouldn't we still have a symbolic indexing and the program won't run at all?

Ohhh actually were we circumventing around that issue by using self.lhs.index = self.lhs_index (similarly for rhs and acc)?

@harsh-nod (Contributor Author) commented Sep 25, 2024

> Code looks good overall, left some comments.
> Also quick conceptual Q, so in this PR are we just obtaining the slices to collect all ops that will be impacted by indexing of MMA, and then adjusting their index_seq appropriately S.T it will work with the MMA indexing?
>
> Seems like the "specialization"/"adjusting" of index is only related with subbing in the MMA_ACC/MMA_LHS/MMA_RHS to respective op's index_seq, if this is not happening before, wouldn't we still have a symbolic indexing and the program won't run at all?
>
> Ohhh actually were we circumventing around that issue by using self.lhs.index = self.lhs_index (similarly for rhs and acc)?

So what was happening before is that we were setting the MMA indices only to the MMA op and then propagating this to the operands during post-expansion. The problem with this was that it required us to do even more propagation to get the shared memory IGEMM case working (since we had to propagate the indices all the way to the global read). As an alternative to propagation, we are now setting globally per dimension indices that include the effects of all the constraints. But the problem with this is you end up with a whole bunch of Piecewise functions that you can't reason about. So we use the slices to determine how to convert these Piecewise functions to indices. As I said in another comment, this will be useful when we deal with multiple MMAs because we will now have overlapping slices and can do some sort of "equality saturation" to determine how to resolve multiple indices per node.

This was not a problem before because we restricted these MMA_{LHS/RHS/ACC} variables only to MMA nodes and their neighbors.
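
As a loose illustration of what specializing one of these Piecewise indices looks like (plain sympy, with made-up index expressions; not the kernel's actual code):

    import sympy

    MMA_LHS, MMA_RHS, MMA_ACC, tid = sympy.symbols("MMA_LHS MMA_RHS MMA_ACC tid")

    # Illustrative per-dimension index folding every MMA constraint into one expression;
    # the real index expressions live in the kernel IR.
    index = sympy.Piecewise(
        (tid % 16, sympy.Eq(MMA_LHS, 1)),                   # LHS-style access
        (sympy.floor(tid / 16), sympy.Eq(MMA_RHS, 1)),      # RHS-style access
        (4 * sympy.floor(tid / 16), sympy.Eq(MMA_ACC, 1)),  # ACC-style access
        (tid, True),
    )

    # A node in the backward slice of the LHS gets {MMA_LHS: 1, all else 0},
    # which collapses the Piecewise into a concrete index.
    print(index.subs({MMA_LHS: 1, MMA_RHS: 0, MMA_ACC: 0}))  # Mod(tid, 16)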

@raikonenfnu (Contributor)
> (quoting the explanation above)

Makes sense, thanks! :)

Signed-off-by: Harsh Menon <harsh@nod-labs.com>
@raikonenfnu (Contributor) left a comment

LGTM!

@harsh-nod merged commit d37c6a4 into iree-org:main Sep 26, 2024
8 checks passed
@Hardcode84 mentioned this pull request Sep 26, 2024
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
stellaraccident pushed a commit that referenced this pull request Oct 13, 2024