
inline PTX to support subgroup shuffle for Nvidia GPUs #297

Merged
merged 5 commits on Jul 23, 2018

Conversation

tyler-utah
Contributor

Hi Cedric,

We spoke briefly at IWOCL about adding custom support for subgroups across some architectures, as currently only Intel is supported. I had a go at providing support for Nvidia GPUs using inline PTX:

https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
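For context, the shfl family of PTX instructions lets the threads of a 32-wide warp exchange register values without going through memory, which is what makes warps usable as subgroups. A minimal host-side C model of the idx (broadcast-from-a-lane) mode, with illustrative names only, would be:

```c
#include <assert.h>

#define WARP_SIZE 32

/* Host-side model of PTX shfl in 'idx' mode with a uniform source lane:
   every lane of the 32-wide warp reads the value held by 'src_lane'.
   The real instruction moves data between hardware registers without
   touching memory; this array-based sketch only models the semantics. */
static void emulate_shfl_idx(const float *in, float *out, int src_lane) {
  for (int lane = 0; lane < WARP_SIZE; ++lane) {
    out[lane] = in[src_lane];
  }
}
```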

I've checked on my Intel HD5500 and Nvidia 940m and the code compiles and runs as expected.

I tried doing some sgemm tuning on Nvidia, but it is really difficult because the current constraints don't enable Nvidia subgroups. In xgemm_part1.opencl, shuffling is disabled whenever (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE). Because Nvidia has a subgroup size of 32, as far as I could tell shuffling is never triggered by any of the provided tunings.

If you look at my first commit, I modified some of the tuning parameters to allow a small number of experiments around the Nvidia subgroups. I'm able to see a reliable performance difference of 92.5 GFLOPS vs 77.8 GFLOPS when subgroups are enabled on my 940m.

I would be interested in hearing your thoughts!

@CNugteren
Owner

Thanks for your contribution! Looks good at first glance, I'll have a more detailed look either tomorrow evening or this weekend. I'll run some performance tests myself as well on NVIDIA hardware.

Also, @StreamHPC might be interested to see this, as they are working on something similar.

src/clpp11.hpp Outdated
@@ -44,6 +44,7 @@
#include <numeric> // std::accumulate
#include <cstring> // std::strlen
#include <cstdio> // fprintf, stderr
#include "assert.h"
Owner

---> <assert.h>

@@ -355,6 +356,12 @@ class Device {
std::string{"."} + std::to_string(GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV));
}

// Returns whether the Nvidia chip is a Volta or later architecture (sm_70 or higher)
bool IsPostNVIDIAVolta() const {
Owner

CLBlast also has a CUDA back-end, which works because every function in this file is also implemented in cupp11.h, so you'll have to mimic this behaviour with the same API in the other file as well to make the CUDA version still work.

Contributor Author

Ah, I didn't notice! I can do that.

#ifndef INTEL_SUBGROUP_EXTENSION
#define INTEL_SUBGROUP_EXTENSION 0
#endif
//#ifndef USE_SUBGROUP_SHUFFLING
Owner

Why is this commented out? Now USE_SUBGROUP_SHUFFLING is always 0, or am I seeing it wrong?

Contributor Author

Oops, I was doing that to get some performance numbers. I will add it back in. Good catch!

// For GPUs with subgroup support, use subgroup shuffling.
// Currently these are Intel via an extension and Nvidia using inline PTX (restricted to 32 bit)
if (device.IsGPU() && (device.HasExtension(kKhronosIntelSubgroups) ||
(device.IsNVIDIA() && static_cast<int>(precision) == 32))) {
Owner

Better to formulate static_cast<int>(precision) == 32 as precision == Precision::kSingle.

header_string += "#define NVIDIA_POST_VOLTA 1\n";
}
}
else if (device.HasExtension(kKhronosIntelSubgroups)) {
Owner

This logic is a bit duplicated now. What if we split it, i.e. keep the original code and just add something, e.g.:

  // For Intel GPUs with subgroup support, use subgroup shuffling.
  if (device.IsGPU() && device.HasExtension(kKhronosIntelSubgroups)) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define INTEL_SUBGROUP_EXTENSION 1\n";
  }

  // For NVIDIA GPUs, use subgroup shuffling.
  if (device.IsGPU() && device.IsNVIDIA()) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";
     ... // Volta stuff
  }

But now the USE_SUBGROUP_SHUFFLING is duplicated... oh well :-) Doesn't matter too much I guess, your solution is also fine.

// Intel subgroups (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt)
#ifndef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Optionally enables subgroup shuffling for Intel GPUs
#ifndef NVIDIA_WARPS_AS_SUBGROUPS
Owner

I'm afraid of getting lost in the subgroup defines... could we rename them with a common prefix, e.g.:

SUBGROUP_SHUFFLING_INTEL
SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA
SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA

And perhaps treat the two NVIDIA ones as separate things; only one of them will be set. I think this makes both the host code and the kernel code clearer.
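The proposed renaming could be wired up on the host side roughly like this (a sketch in plain C; the struct and function names are illustrative, not CLBlast's actual API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device properties, for illustration only; CLBlast's real
   Device class exposes methods such as IsGPU()/IsNVIDIA() instead. */
struct device {
  int is_gpu, is_nvidia, is_post_volta, has_intel_subgroups;
};

/* Appends exactly one of the proposed, mutually exclusive defines to
   the kernel header string, so either the Intel extension path or one
   of the two NVIDIA PTX paths is compiled in. */
static void append_subgroup_defines(const struct device *d,
                                    char *header, size_t cap) {
  const char *define = NULL;
  if (d->is_gpu && d->has_intel_subgroups) {
    define = "#define SUBGROUP_SHUFFLING_INTEL 1\n";
  } else if (d->is_gpu && d->is_nvidia) {
    define = d->is_post_volta
        ? "#define SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA 1\n"
        : "#define SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA 1\n";
  }
  if (define != NULL && strlen(header) + strlen(define) < cap) {
    strcat(header, define);
  }
}
```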

Contributor Author

Good idea. I'll work on this over the weekend. Thanks!

@CNugteren
Owner

Thanks for the update of the PR. I think everything is fine now. I'll also run some tests myself on NVIDIA hardware to get an idea of how well it works, and as a final correctness check.

@CNugteren
Owner

One more comment about the code: you have a backslash in one of the pre-processor macros to break the line. Perhaps better to not do that and just keep it on one (longer) line? I'm not sure all OpenCL compilers properly support it, and for some reason even g++ complains about it:

CLBlast-test/CLBlast/src/routines/level3/../../kernels/level3/xgemm_part1.opencl:137:76: warning: backslash and newline separated by space

I've then also done some tests on a V100 GPU. I changed the tuning parameters to match yours (I hope). Here is what I get without shuffling:

* Found 3 configuration(s)
* Parameters explored: GEMMK MWG NWG KWG MDIMC NDIMC MDIMA NDIMB KWI VWM VWN STRM STRN SA SB KREG 

|   ID | total |                                                                           param |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                                               - |             OK |      1.68 ms |      - |      reference OK |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|    1 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    1 |   OK      7 ms |      0.96 ms | 2237.7 |     results match |
|    2 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    2 |   OK      6 ms |      1.03 ms | 2094.9 |     results match |
|    3 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    4 |   OK      6 ms |      1.34 ms | 1599.6 |     results match |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x

And here is with your additions:

|   ID | total |                                                                           param |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                                               - |             OK |      1.66 ms |      - |      reference OK |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|    1 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    1 |   OK      8 ms |      0.90 ms | 2391.8 |     results match |
|    2 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    2 |   OK      7 ms |      0.88 ms | 2446.8 |     results match |
|    3 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    4 |   OK     10 ms |      0.89 ms | 2420.3 |     results match |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x

So quite some gain for the last case in particular. Can you confirm that I am running the experiments correctly, i.e. do you also get only 3 test cases with those settings?

@tyler-utah
Contributor Author

I did the backslash so that the line doesn't exceed 100 characters, but I'll put it back; you make a good point about compiler support.

And yes, I only got 3 test cases with the settings I played with. But I was going to ask: I'm not sure the current tuning configurations will ever trigger this feature. Do you think it would be worthwhile to add some new parameters to the gemm tuning? I don't understand all the options very well so I was wondering if you had some thoughts on this.

Thanks!

@CNugteren
Owner

CNugteren commented Jul 22, 2018

Sorry for the late reply. The code looks good to merge; thanks for your contribution.

About the constraints to enable subgroup shuffling, the following is the current constraint:

  • NVIDIA's subgroup size is 32.
  • The subgroup size has to: 1) match the NWI parameter (the amount of work per thread), thus NWI has to be 32, and 2) be at most MDIMC, thus MDIMC has to be at least 32.
  • NWI is defined as NWG / NDIMC. Here NDIMC is the number of threads per workgroup in the N-dimension, and NWG is the amount of 'work' done per workgroup in the N-dimension.
  • In summary, we thus need an NWG that is 32 times as large as NDIMC, and an MDIMC of at least 32.
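The constraint above can be condensed into a small check (a sketch; the helper name is illustrative, while the parameter names and the disable condition come from xgemm_part1.opencl):

```c
#include <assert.h>

#define SUBGROUP_SIZE 32  /* NVIDIA warp size */

/* Mirrors the kernel's guard: shuffling stays enabled only when NWI
   (= NWG / NDIMC) equals the subgroup size and MDIMC is at least the
   subgroup size; otherwise the condition
   (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE) disables it. */
static int shuffling_enabled(int nwg, int ndimc, int mdimc) {
  const int nwi = nwg / ndimc;
  return nwi == SUBGROUP_SIZE && mdimc >= SUBGROUP_SIZE;
}
```

For example, the tuned configurations above (NWG=64, NDIMC=2, MDIMC=64) satisfy the constraint, since NWI = 64/2 = 32 and 64 >= 32.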

In the current set of fixed tuning parameters this scenario does not seem possible, but in the larger set of parameters explored randomly it is. We could change the first set, but I'm not sure it is worth it. I also guess that the kernel with shuffling is not the fastest of all GEMM kernels in the library for those devices? Perhaps for smaller inputs?

I propose that we merge this in for now; perhaps later we'll get more benefit from this addition than we do now. Perhaps another kernel will use shuffling in the future as well, and this way it'll be there for both NVIDIA and Intel. Any thoughts?

@tyler-utah
Contributor Author

Sounds good to me; thanks for all the feedback!

If I have some time, maybe I'll look at the high-performing settings for Nvidia GPUs and see if shuffling can be incorporated.

I've also heard that shuffling for AMD can be incorporated in a similar way. I'm going to have a think about that too.

@CNugteren
Owner

I've also heard that shuffling for AMD can be incorporated in a similar way. I'm going to have a think about that too.

From what I understood, Vincent's team at @StreamHPC already has something ready. Perhaps worth asking them first?

@CNugteren CNugteren merged commit f8fb707 into CNugteren:master Jul 23, 2018