
Tensor Iterator loop unrolling #17667

Closed

Conversation

jjsjann123 (Collaborator)

Summary:

Modified the Tensor Iterator GPU reduction kernel.
Multiple accumulators are now created during the thread reduce; this removes the data dependency between unrolled loop iterations and exposes instruction-level parallelism, which benefits latency-bound kernels (e.g. the Welford kernel used by `torch.std`).

This approach increases register usage, so the unrolling factor has to be tuned to prevent register spilling.
The current implementation lowers the unrolling factor to 2 for Welford (a register-heavy kernel) while keeping it unchanged (4) for the rest of the reduction kernels.
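To make the idea concrete, here is a minimal sketch of a thread-level sum reduction unrolled with independent accumulators. The names (`thread_sum`, `UNROLL`) are illustrative only; this is not the actual `Reduce.cuh` code.

```cuda
#include <cstdint>

// Sketch only: each unrolled slot gets its own accumulator register, so the
// adds inside one pass have no loop-carried dependency on each other and can
// be issued while earlier loads are still in flight.
template <int UNROLL, typename scalar_t>
__device__ scalar_t thread_sum(const scalar_t* data, int64_t n) {
  scalar_t acc[UNROLL];
  #pragma unroll
  for (int u = 0; u < UNROLL; ++u) {
    acc[u] = scalar_t(0);
  }

  const int64_t stride = static_cast<int64_t>(blockDim.x) * gridDim.x;
  int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;

  // Main loop: consume UNROLL strided elements per pass, one per accumulator.
  for (; i + (UNROLL - 1) * stride < n; i += UNROLL * stride) {
    #pragma unroll
    for (int u = 0; u < UNROLL; ++u) {
      acc[u] += data[i + u * stride];
    }
  }
  // Tail: leftover elements go into the first accumulator.
  for (; i < n; i += stride) {
    acc[0] += data[i];
  }
  // Fold the independent partial sums back into a single result.
  #pragma unroll
  for (int u = 1; u < UNROLL; ++u) {
    acc[0] += acc[u];
  }
  return acc[0];
}
```

With a single accumulator, each add would have to wait for the previous add to retire; with `UNROLL` independent accumulators the adds can overlap, which is what helps a latency-bound kernel like Welford.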

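The register-pressure side of the change can be illustrated with a hypothetical per-op trait. `welford_ops_t` and `unroll_factor` are stand-in names for this sketch, not the actual ATen types; only the chosen factors (2 for Welford, 4 otherwise) come from the PR description.

```cuda
// Sketch of per-op unroll-factor selection. Welford carries more state per
// accumulator (mean, m2, count), so duplicating its accumulators costs more
// registers and it gets a smaller unroll factor.
struct welford_ops_t {};  // stand-in for the Welford reduction op

template <typename ops_t>
struct unroll_factor {
  static constexpr int value = 4;  // default for plain reductions (sum, max, ...)
};

template <>
struct unroll_factor<welford_ops_t> {
  static constexpr int value = 2;  // tuned down to avoid register spilling
};
```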
jjsjann123 (Collaborator, Author)

This is the loop unrolling PR that depended on #17428.

Perf numbers are listed in #17428 (comment).

jjsjann123 (Collaborator, Author)

Pinging @umanwizard @colesbury @ngimel for visibility.

facebook-github-bot (Contributor) left a comment:

@umanwizard has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

mrshenli (Contributor) left a comment:

@pytorchbot retest this please

Review comments on aten/src/ATen/native/cuda/Reduce.cuh (outdated, resolved) and aten/src/ATen/native/cuda/ReduceOpsKernel.cu (resolved).
mrshenli (Contributor) commented Mar 8, 2019

@pytorchbot retest this please

jjsjann123 (Collaborator, Author)

[Image: benchmark table with achieved bandwidth for each reduction size]
@mrshenli Here are the numbers with achieved bandwidth.

Also, it looks like all tests are passing. I'll push another commit to fix the typo in the comments (that shouldn't break the tests). It should be good to merge then, and we won't have to wait for another CI run.

mrshenli (Contributor) commented Mar 8, 2019

@jjsjann123 thanks for the numbers, they look great. I noticed that the bandwidth has a big jump between 3200X512 and 6400X512 for the factor=2 case. Do you happen to know the reason behind it?

jjsjann123 (Collaborator, Author)

I don't know the precise reason behind the jump. Most likely it means we need to revisit our launch configs.

But we need to be careful with that, as the same launch configs are also shared with the simple reduction kernels. The memory-instruction latency that hurts Welford perf is not relevant for those kernels, so adjusting the launch config to get the best out of both kinds of kernels might be tricky.

facebook-github-bot (Contributor) left a comment:

@umanwizard is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Mar 14, 2019
Summary: (same as the PR description above)
Pull Request resolved: pytorch/pytorch#17667

Differential Revision: D14368325

Pulled By: umanwizard

fbshipit-source-id: 9d64c0dccabdb1b7c3922a6557224af704a1974e