Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[TKW] Implement support for multiple iter args on Reduction #166

Merged
merged 5 commits into from
Sep 27, 2024

Conversation

raikonenfnu
Copy link
Contributor

@raikonenfnu raikonenfnu commented Sep 25, 2024

The main motivation behind this PR is to enable multiple induction variable/iterArg on the same tiled "Reduction" loop. To enable above we did a couple things:

  1. Enable lowering/expansion on operator.getitem (the op that extract multiple results in python i.e res0, res1 = fn) by templating it onGetResult(CustomOp) since they have the same args and interface and can reuse most of the indexing/expansion helper.

  2. Introduce res_idx, a variable to represent which result index of an op we are referring to, during expansion and context map. This is useful for ops that has more than one results / variables as outputs.

  3. bug fix in expand_reduction, where we hoist out iterating and expanding of reduction.init_args out of the loop that iterates and expands over the yield/return_val of the reduction loop. It is expected that the size of init_args is the same as size of yield/return_val. Hence if we had N iter_args/yields, we ended up expanding the init_args N x N time instead of N times. We haven't seen it thus far because we have been only playing with 1 init_arg/iterArg, and 1x1 == 1.

  4. Introduce a canonicalization pattern to fold chains of GetResult. this is because GetResult by semantic/design is only expected to extract and have one result. Hence a chain of GetResult should just be replaced by itself. This help clean up the IR.

num.4 also helps circumvent issue where Reduction and GetResult is expanded completely by itself not following the DFS structure per dimension like the rest of the expansion code. This becomes especially problematic for multiple IterArg since Getitem is not expecting its' source value to be expanded without it.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Copy link
Contributor

@harsh-nod harsh-nod left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! modulo some comments

shark_turbine/kernel/wave/utils.py Outdated Show resolved Hide resolved
shark_turbine/kernel/wave/expansion.py Show resolved Hide resolved
shark_turbine/kernel/wave/expansion.py Outdated Show resolved Hide resolved
@@ -343,7 +351,7 @@ def repeat(
# Assert equal does cast to boolean on torch.Tensor
# which causes issues, hence we cast to numpy before
# checking.
assert_equal(c, ref.values.numpy())
assert_allclose(ref, c, atol=0.07)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That tolerance seems quite high for fp32?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lowered to 0.015 :)

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
@raikonenfnu raikonenfnu merged commit 0e16d54 into iree-org:main Sep 27, 2024
8 checks passed
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
…g#166)

The main motivation behind this PR is to enable multiple induction
variable/iterArg on the same tiled "Reduction" loop. To enable above we
did a couple things:

1. Enable lowering/expansion on `operator.getitem` (the op that extract
multiple results in python i.e `res0, res1 = fn`) by templating it
on`GetResult(CustomOp)` since they have the same args and interface and
can reuse most of the indexing/expansion helper.

2. Introduce `res_idx`, a variable to represent which result index of an
op we are referring to, during expansion and context map. This is useful
for ops that has more than one results / variables as outputs.

3. bug fix in expand_reduction, where we hoist out iterating and
expanding of `reduction.init_args` out of the loop that iterates and
expands over the `yield`/`return_val` of the reduction loop. It is
expected that the size of `init_args` is the same as size of
`yield`/`return_val`. Hence if we had N iter_args/yields, we ended up
expanding the `init_args` N x N time instead of N times. We haven't seen
it thus far because we have been only playing with 1 init_arg/iterArg,
and 1x1 == 1.

4. Introduce a canonicalization pattern to fold chains of GetResult.
this is because GetResult by semantic/design is only expected to extract
and have one result. Hence a chain of GetResult should just be replaced
by itself. This help clean up the IR.

num.4 also helps circumvent issue where Reduction and GetResult is
expanded completely by itself not following the DFS structure per
dimension like the rest of the expansion code. This becomes especially
problematic for multiple IterArg since Getitem is not expecting its'
source value to be expanded without it.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
…g#166)

The main motivation behind this PR is to enable multiple induction
variable/iterArg on the same tiled "Reduction" loop. To enable above we
did a couple things:

1. Enable lowering/expansion on `operator.getitem` (the op that extract
multiple results in python i.e `res0, res1 = fn`) by templating it
on`GetResult(CustomOp)` since they have the same args and interface and
can reuse most of the indexing/expansion helper.

2. Introduce `res_idx`, a variable to represent which result index of an
op we are referring to, during expansion and context map. This is useful
for ops that has more than one results / variables as outputs.

3. bug fix in expand_reduction, where we hoist out iterating and
expanding of `reduction.init_args` out of the loop that iterates and
expands over the `yield`/`return_val` of the reduction loop. It is
expected that the size of `init_args` is the same as size of
`yield`/`return_val`. Hence if we had N iter_args/yields, we ended up
expanding the `init_args` N x N time instead of N times. We haven't seen
it thus far because we have been only playing with 1 init_arg/iterArg,
and 1x1 == 1.

4. Introduce a canonicalization pattern to fold chains of GetResult.
this is because GetResult by semantic/design is only expected to extract
and have one result. Hence a chain of GetResult should just be replaced
by itself. This help clean up the IR.

num.4 also helps circumvent issue where Reduction and GetResult is
expanded completely by itself not following the DFS structure per
dimension like the rest of the expansion code. This becomes especially
problematic for multiple IterArg since Getitem is not expecting its'
source value to be expanded without it.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Ian <ian.nordeng@amd.com>
IanNod pushed a commit to IanNod/iree-turbine that referenced this pull request Sep 30, 2024
…g#166)

The main motivation behind this PR is to enable multiple induction
variable/iterArg on the same tiled "Reduction" loop. To enable above we
did a couple things:

1. Enable lowering/expansion on `operator.getitem` (the op that extract
multiple results in python i.e `res0, res1 = fn`) by templating it
on`GetResult(CustomOp)` since they have the same args and interface and
can reuse most of the indexing/expansion helper.

2. Introduce `res_idx`, a variable to represent which result index of an
op we are referring to, during expansion and context map. This is useful
for ops that has more than one results / variables as outputs.

3. bug fix in expand_reduction, where we hoist out iterating and
expanding of `reduction.init_args` out of the loop that iterates and
expands over the `yield`/`return_val` of the reduction loop. It is
expected that the size of `init_args` is the same as size of
`yield`/`return_val`. Hence if we had N iter_args/yields, we ended up
expanding the `init_args` N x N time instead of N times. We haven't seen
it thus far because we have been only playing with 1 init_arg/iterArg,
and 1x1 == 1.

4. Introduce a canonicalization pattern to fold chains of GetResult.
this is because GetResult by semantic/design is only expected to extract
and have one result. Hence a chain of GetResult should just be replaced
by itself. This help clean up the IR.

num.4 also helps circumvent issue where Reduction and GetResult is
expanded completely by itself not following the DFS structure per
dimension like the rest of the expansion code. This becomes especially
problematic for multiple IterArg since Getitem is not expecting its'
source value to be expanded without it.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Ian <ian.nordeng@amd.com>
stellaraccident pushed a commit that referenced this pull request Oct 13, 2024
The main motivation behind this PR is to enable multiple induction
variable/iterArg on the same tiled "Reduction" loop. To enable above we
did a couple things:

1. Enable lowering/expansion on `operator.getitem` (the op that extract
multiple results in python i.e `res0, res1 = fn`) by templating it
on`GetResult(CustomOp)` since they have the same args and interface and
can reuse most of the indexing/expansion helper.

2. Introduce `res_idx`, a variable to represent which result index of an
op we are referring to, during expansion and context map. This is useful
for ops that has more than one results / variables as outputs.

3. bug fix in expand_reduction, where we hoist out iterating and
expanding of `reduction.init_args` out of the loop that iterates and
expands over the `yield`/`return_val` of the reduction loop. It is
expected that the size of `init_args` is the same as size of
`yield`/`return_val`. Hence if we had N iter_args/yields, we ended up
expanding the `init_args` N x N time instead of N times. We haven't seen
it thus far because we have been only playing with 1 init_arg/iterArg,
and 1x1 == 1.

4. Introduce a canonicalization pattern to fold chains of GetResult.
this is because GetResult by semantic/design is only expected to extract
and have one result. Hence a chain of GetResult should just be replaced
by itself. This help clean up the IR.

num.4 also helps circumvent issue where Reduction and GetResult is
expanded completely by itself not following the DFS structure per
dimension like the rest of the expansion code. This becomes especially
problematic for multiple IterArg since Getitem is not expecting its'
source value to be expanded without it.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants