
Tensor Iterator loop unrolling #17667

Closed

Conversation

jjsjann123 (Collaborator)

Summary:

Modified the Tensor Iterator GPU reduction kernel.
Multiple accumulators are now created during the thread reduce; this removes the data dependency between unrolled loop iterations and exposes instruction-level parallelism, which benefits latency-bound kernels (e.g. the Welford kernel used by `torch.std`).

This approach increases register usage, so the unrolling factor has to be tuned to prevent register spilling.
The current implementation lowers the unrolling factor to 2 for Welford (a register-heavy kernel) while keeping it unchanged (4) for the rest of the reduction kernels.
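To make the idea concrete, here is a minimal sketch of a thread-level sum reduction unrolled with independent accumulators. The names (`thread_sum`, `UNROLL`) are illustrative only; this is not the actual `Reduce.cuh` code.

```cuda
#include <cstdint>

// Sketch only: each unrolled slot gets its own accumulator register, so the
// adds inside one pass have no loop-carried dependency on each other and can
// be issued while earlier loads are still in flight.
template <int UNROLL, typename scalar_t>
__device__ scalar_t thread_sum(const scalar_t* data, int64_t n) {
  scalar_t acc[UNROLL];
  #pragma unroll
  for (int u = 0; u < UNROLL; ++u) {
    acc[u] = scalar_t(0);
  }

  const int64_t stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;

  // Main loop: consume UNROLL strided elements per pass, one per accumulator.
  for (; i + (UNROLL - 1) * stride < n; i += UNROLL * stride) {
    #pragma unroll
    for (int u = 0; u < UNROLL; ++u) {
      acc[u] += data[i + u * stride];
    }
  }
  // Tail: leftover elements go into the first accumulator.
  for (; i < n; i += stride) {
    acc[0] += data[i];
  }
  // Fold the independent partial sums back into a single result.
  #pragma unroll
  for (int u = 1; u < UNROLL; ++u) {
    acc[0] += acc[u];
  }
  return acc[0];
}
```

With a single accumulator, each add would have to wait for the previous add to retire; with `UNROLL` independent accumulators the adds can overlap, which is what helps a latency-bound kernel like Welford.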

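The register-pressure side of the change can be illustrated with a hypothetical per-op trait. `welford_ops_t` and `unroll_factor` are stand-in names for this sketch, not the actual ATen types; only the chosen factors (2 for Welford, 4 otherwise) come from the PR description.

```cuda
// Sketch of per-op unroll-factor selection. Welford carries more state per
// accumulator (mean, m2, count), so duplicating its accumulators costs more
// registers and it gets a smaller unroll factor.
struct welford_ops_t {};  // stand-in for the Welford reduction op

template <typename ops_t>
struct unroll_factor {
  static constexpr int value = 4;  // default for plain reductions (sum, max, ...)
};

template <>
struct unroll_factor<welford_ops_t> {
  static constexpr int value = 2;  // tuned down to avoid register spilling
};
```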
jjsjann123 (Collaborator, Author)

This is the loop unrolling PR that depended on #17428.

Perf numbers are listed in #17428 (comment).

jjsjann123 (Collaborator, Author)

Pinging @umanwizard @colesbury @ngimel for visibility.

facebook-github-bot (Contributor) left a comment:

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mrshenli (Contributor) left a comment:

@pytorchbot retest this please

Review comments on aten/src/ATen/native/cuda/Reduce.cuh (outdated, resolved) and aten/src/ATen/native/cuda/ReduceOpsKernel.cu (resolved).
mrshenli (Contributor) commented Mar 8, 2019

@pytorchbot retest this please

jjsjann123 (Collaborator, Author)

[Image: benchmark table with achieved bandwidth for each reduction size]
@mrshenli Here are the numbers with achieved bandwidth.

Also, it looks like all tests are passing. I'll push another commit to fix the typo in the comments (that shouldn't break the tests). It should be good to merge then, and we won't have to wait for another CI run.

mrshenli (Contributor) commented Mar 8, 2019

@jjsjann123 thanks for the numbers, they look great. I noticed that the bandwidth has a big jump between 3200X512 and 6400X512 for the factor=2 case. Do you happen to know the reason behind it?

jjsjann123 (Collaborator, Author)

I don't know the precise reason behind the jump. Most likely it means we need to revisit our launch configs.

But we need to be careful with that, as the same launch configs are also shared with the simple reduction kernels. The memory-instruction latency that hurts Welford perf is not relevant for those kernels, so adjusting the launch config to get the best out of both kinds of kernels might be tricky.

facebook-github-bot (Contributor) left a comment:

@umanwizard is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Mar 14, 2019
Summary: (same as the PR description above)
Pull Request resolved: pytorch/pytorch#17667

Differential Revision: D14368325

Pulled By: umanwizard

fbshipit-source-id: 9d64c0dccabdb1b7c3922a6557224af704a1974e