Tensor Iterator loop unrolling #17667
Conversation
Summary: Modified the Tensor Iterator GPU reduction kernel. Creating multiple accumulators during thread reduce removes the data dependency between unrolled loops and exposes instruction-level parallelism that benefits latency-bound kernels (e.g. Welford, used by `torch.std`). This approach increases register usage, so the unrolling factor needs to be tuned to prevent register spilling. The current implementation lowers the unrolling factor to 2 for Welford (a register-heavy kernel), while keeping it unchanged (4) for the rest of the reduction kernels.
This is the loop-unrolling PR that was dependent on #17428. Perf numbers are listed in #17428 (comment). A rough sketch of the multi-accumulator idea is shown below.
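For illustration, here is a minimal sketch of the multiple-accumulator thread reduce described in the summary. It is not the actual TensorIterator code: `acc_t`, `ops_t`, `ops.identity()`, `ops.combine()`, and `UNROLL` are hypothetical stand-ins for the real reduction machinery. Each unrolled lane accumulates into its own register, so the combine in one lane does not depend on the result of another:

```cuda
// Minimal sketch of a per-thread reduction with independent accumulators.
// acc_t / ops_t / UNROLL are hypothetical stand-ins, not PyTorch's real API.
template <typename acc_t, typename ops_t, int UNROLL>
__device__ acc_t thread_reduce(const acc_t* in, int n, ops_t ops) {
  // One independent partial accumulator per unrolled lane.
  acc_t acc[UNROLL];
  #pragma unroll
  for (int u = 0; u < UNROLL; ++u) {
    acc[u] = ops.identity();
  }
  // Coalesced strided loop: each unrolled combine writes to a different
  // register, so the iterations do not wait on each other.
  for (int base = threadIdx.x; base < n; base += blockDim.x * UNROLL) {
    #pragma unroll
    for (int u = 0; u < UNROLL; ++u) {
      int i = base + u * blockDim.x;
      if (i < n) {
        acc[u] = ops.combine(acc[u], in[i]);
      }
    }
  }
  // Fold the independent accumulators back into a single value.
  // (Block- and warp-level reduction would follow; omitted here.)
  acc_t result = acc[0];
  #pragma unroll
  for (int u = 1; u < UNROLL; ++u) {
    result = ops.combine(result, acc[u]);
  }
  return result;
}
```

With `UNROLL = 4` this keeps four accumulators' worth of registers live per thread; for a register-heavy op like Welford, which carries a mean, an M2, and a count in each accumulator, the PR dials the factor down to 2 to avoid spilling.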
Pinging @umanwizard @colesbury @ngimel for visibility.
@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot retest this please
Also, it looks like all tests are passing. I'll push another commit to fix the typo in the comments (that shouldn't break the tests). It should be good to merge then, and we don't have to wait for another CI run.
@jjsjann123 thanks for the numbers, they look great. I noticed that the bandwidth has a big jump between 3200X512 and 6400X512 for the factor=2 case. Do you happen to know the reason behind it?
I don't know the precise reason behind the jump. Most likely it means we need to check our launch configs. But we need to be careful with that, since the same launch configs are also shared with the simple reduction kernels, and the memory-instruction latency that hurts Welford perf is not relevant for those kernels. Adjusting the launch config to get the best out of both kinds of kernel might be tricky.
@umanwizard is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: Modified the Tensor Iterator GPU reduction kernel. Creating multiple accumulators during thread reduce removes the data dependency between unrolled loops and exposes instruction-level parallelism that benefits latency-bound kernels (e.g. Welford, used by `torch.std`). This approach increases register usage, so the unrolling factor needs to be tuned to prevent register spilling. The current implementation lowers the unrolling factor to 2 for Welford (a register-heavy kernel), while keeping it unchanged (4) for the rest of the reduction kernels.
Pull Request resolved: pytorch/pytorch#17667
Differential Revision: D14368325
Pulled By: umanwizard
fbshipit-source-id: 9d64c0dccabdb1b7c3922a6557224af704a1974e