
inline PTX to support subgroup shuffle for Nvidia GPUs #297

Merged
merged 5 commits on Jul 23, 2018

Conversation

tyler-utah
Contributor

Hi Cedric,

We spoke briefly at IWOCL about adding custom support for subgroups across some architectures, as currently only Intel is supported. I had a go at providing support for Nvidia GPUs using inline PTX:

https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
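For context, the shfl family of PTX instructions lets the threads of a 32-wide warp exchange register values without going through memory, which is what makes warps usable as subgroups. A minimal host-side C model of the idx (broadcast-from-a-lane) mode, with illustrative names only, would be:

```c
#include <assert.h>

#define WARP_SIZE 32

/* Host-side model of PTX shfl in 'idx' mode with a uniform source lane:
   every lane of the 32-wide warp reads the value held by 'src_lane'.
   The real instruction moves data between hardware registers without
   touching memory; this array-based sketch only models the semantics. */
static void emulate_shfl_idx(const float *in, float *out, int src_lane) {
  for (int lane = 0; lane < WARP_SIZE; ++lane) {
    out[lane] = in[src_lane];
  }
}
```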

I've checked on my Intel HD5500 and Nvidia 940m and the code compiles and runs as expected.

I tried doing some sgemm tuning on Nvidia, but it is really difficult because the current constraints don't enable Nvidia subgroups. In xgemm_part1.opencl, shuffling is disabled whenever (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE). Because Nvidia has a subgroup size of 32, as far as I could tell shuffling is never triggered by any of the provided tunings.

If you look at my first commit, I modified some of the tuning parameters to allow a small number of experiments around the Nvidia subgroups. I'm able to see a reliable performance difference of 92.5 GFLOPS vs 77.8 GFLOPS when subgroups are enabled on my 940m.

I would be interested in hearing your thoughts!

@CNugteren
Owner

Thanks for your contribution! Looks good at first glance, I'll have a more detailed look either tomorrow evening or this weekend. I'll run some performance tests myself as well on NVIDIA hardware.

Also, @StreamHPC might be interested to see this, as they are working on something similar.

src/clpp11.hpp Outdated
@@ -44,6 +44,7 @@
#include <numeric> // std::accumulate
#include <cstring> // std::strlen
#include <cstdio> // fprintf, stderr
#include "assert.h"
Owner

---> <assert.h>

@@ -355,6 +356,12 @@ class Device {
std::string{"."} + std::to_string(GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV));
}

// Returns whether the Nvidia chip is a Volta or later architecture (sm_70 or higher)
bool IsPostNVIDIAVolta() const {
Owner

CLBlast also has a CUDA back-end, which works because every function in this file is also implemented in cupp11.h, so you'll have to mimic this behaviour with the same API in the other file as well to make the CUDA version still work.

Contributor Author

Ah, I didn't notice! I can do that.

#ifndef INTEL_SUBGROUP_EXTENSION
#define INTEL_SUBGROUP_EXTENSION 0
#endif
//#ifndef USE_SUBGROUP_SHUFFLING
Owner

Why is this commented out? Now USE_SUBGROUP_SHUFFLING is always 0, or am I seeing it wrong?

Contributor Author

Oops, I was doing that to get some performance numbers. I will add it back in. Good catch!

// For GPUs with subgroup support, use subgroup shuffling.
// Currently these are Intel via an extension and Nvidia using inline PTX (restricted to 32 bit)
if (device.IsGPU() && (device.HasExtension(kKhronosIntelSubgroups) ||
(device.IsNVIDIA() && static_cast<int>(precision) == 32))) {
Owner

Better to formulate static_cast<int>(precision) == 32 as precision == Precision::kSingle.

header_string += "#define NVIDIA_POST_VOLTA 1\n";
}
}
else if (device.HasExtension(kKhronosIntelSubgroups)) {
Owner

This logic is a bit duplicated now. What if we split it, i.e. keep the original code and just add something, e.g.:

  // For Intel GPUs with subgroup support, use subgroup shuffling.
  if (device.IsGPU() && device.HasExtension(kKhronosIntelSubgroups)) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define INTEL_SUBGROUP_EXTENSION 1\n";
  }

  // For NVIDIA GPUs, use subgroup shuffling.
  if (device.IsGPU() && device.IsNVIDIA()) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";
     ... // Volta stuff
  }

But now the USE_SUBGROUP_SHUFFLING is duplicated... oh well :-) Doesn't matter too much I guess, your solution is also fine.

// Intel subgroups (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt)
#ifndef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Optionally enables subgroup shuffling for Intel GPUs
#ifndef NVIDIA_WARPS_AS_SUBGROUPS
Owner

I'm afraid of getting lost in the subgroup defines... could we rename them with a common prefix, e.g.:

SUBGROUP_SHUFFLING_INTEL
SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA
SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA

And perhaps treat the two NVIDIA ones as separate things; only one of them will be set. I think this makes both the host code and the kernel code clearer.
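The proposed renaming could be wired up on the host side roughly like this (a sketch in plain C; the struct and function names are illustrative, not CLBlast's actual API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device properties, for illustration only; CLBlast's real
   Device class exposes methods such as IsGPU()/IsNVIDIA() instead. */
struct device {
  int is_gpu, is_nvidia, is_post_volta, has_intel_subgroups;
};

/* Appends exactly one of the proposed, mutually exclusive defines to
   the kernel header string, so either the Intel extension path or one
   of the two NVIDIA PTX paths is compiled in. */
static void append_subgroup_defines(const struct device *d,
                                    char *header, size_t cap) {
  const char *define = NULL;
  if (d->is_gpu && d->has_intel_subgroups) {
    define = "#define SUBGROUP_SHUFFLING_INTEL 1\n";
  } else if (d->is_gpu && d->is_nvidia) {
    define = d->is_post_volta
        ? "#define SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA 1\n"
        : "#define SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA 1\n";
  }
  if (define != NULL && strlen(header) + strlen(define) < cap) {
    strcat(header, define);
  }
}
```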

Contributor Author

Good idea. I'll work on this over the weekend. Thanks!

@CNugteren
Owner

Thanks for the update of the PR. I think everything is fine now. I'll also run some tests myself on NVIDIA hardware to get an idea of how well it works, and as a final correctness check.

@CNugteren
Owner

One more comment about the code: you have a backslash in one of the pre-processor macros to break the line. Perhaps better to not do that and just keep it on one (longer) line? I'm not sure all OpenCL compilers properly support it, and for some reason even g++ complains about it:

CLBlast-test/CLBlast/src/routines/level3/../../kernels/level3/xgemm_part1.opencl:137:76: warning: backslash and newline separated by space

I've then also done some tests on a V100 GPU. I changed the tuning parameters to match yours (I hope). Here is what I get without shuffling:

* Found 3 configuration(s)
* Parameters explored: GEMMK MWG NWG KWG MDIMC NDIMC MDIMA NDIMB KWI VWM VWN STRM STRN SA SB KREG 

|   ID | total |                                                                           param |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                                               - |             OK |      1.68 ms |      - |      reference OK |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|    1 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    1 |   OK      7 ms |      0.96 ms | 2237.7 |     results match |
|    2 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    2 |   OK      6 ms |      1.03 ms | 2094.9 |     results match |
|    3 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    4 |   OK      6 ms |      1.34 ms | 1599.6 |     results match |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x

And here is with your additions:

|   ID | total |                                                                           param |       compiles |         time | GFLOPS |            status |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|  ref |     - |                                                                               - |             OK |      1.66 ms |      - |      reference OK |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x
|    1 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    1 |   OK      8 ms |      0.90 ms | 2391.8 |     results match |
|    2 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    2 |   OK      7 ms |      0.88 ms | 2446.8 |     results match |
|    3 |     3 |    1   64   64    1   64    2   64    2    1    1    1    0    0    0    0    4 |   OK     10 ms |      0.89 ms | 2420.3 |     results match |
x------x-------x---------------------------------------------------------------------------------x----------------x--------------x--------x-------------------x

So quite some gain for the last case in particular. Can you confirm that I am running the experiments correctly, i.e. do you also get only 3 test cases with those settings?

@tyler-utah
Contributor Author

I did the backslash so that the line doesn't exceed 100 characters, but I'll put it back; you make a good point about compiler support.

And yes, I only got 3 test cases with the settings I played with. But I was going to ask: I'm not sure the current tuning configurations will ever trigger this feature. Do you think it would be worthwhile to add some new parameters to the gemm tuning? I don't understand all the options very well so I was wondering if you had some thoughts on this.

Thanks!

@CNugteren
Owner

CNugteren commented Jul 22, 2018

Sorry for the late reply. The code looks good to merge; thanks for your contribution.

About the constraints to enable subgroup shuffling, the following is the current constraint:

  • NVIDIA's subgroup size is 32.
  • The subgroup size has to: 1) match the NWI parameter (the amount of work per thread), thus NWI has to be 32, and 2) be at most MDIMC, thus MDIMC has to be at least 32.
  • NWI is defined as NWG / NDIMC. Here NDIMC is the number of threads per workgroup in the N-dimension, and NWG is the amount of 'work' done per workgroup in the N-dimension.
  • In summary, we thus need an NWG that is 32 times as large as NDIMC, and an MDIMC of at least 32.
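The constraint above can be condensed into a small check (a sketch; the helper name is illustrative, while the parameter names and the disable condition come from xgemm_part1.opencl):

```c
#include <assert.h>

#define SUBGROUP_SIZE 32  /* NVIDIA warp size */

/* Mirrors the kernel's guard: shuffling stays enabled only when NWI
   (= NWG / NDIMC) equals the subgroup size and MDIMC is at least the
   subgroup size; otherwise the condition
   (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE) disables it. */
static int shuffling_enabled(int nwg, int ndimc, int mdimc) {
  const int nwi = nwg / ndimc;
  return nwi == SUBGROUP_SIZE && mdimc >= SUBGROUP_SIZE;
}
```

For example, the tuned configurations above (NWG=64, NDIMC=2, MDIMC=64) satisfy the constraint, since NWI = 64/2 = 32 and 64 >= 32.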

In the current set of fixed tuning parameters this scenario does not seem possible, but in the larger set of parameters explored randomly it is. We could change the first set, but I'm not sure it is worth it. I also guess that the kernel with shuffling is not the fastest of all GEMM kernels in the library for those devices? Perhaps for smaller inputs?

I propose that we merge this in for now; perhaps later we'll get more benefit from this addition than we do now. Perhaps another kernel will use shuffling in the future as well, and this way it'll be there for both NVIDIA and Intel. Any thoughts?

@tyler-utah
Contributor Author

Sounds good to me; thanks for all the feedback!

If I have some time, maybe I'll look at the high-performing settings for Nvidia GPUs and see if shuffling can be incorporated.

I've also heard that shuffling for AMD can be incorporated in a similar way. I'm going to have a think about that too.

@CNugteren
Owner

I've also heard that shuffling for AMD can be incorporated in a similar way. I'm going to have a think about that too.

From what I understood, Vincent's team at @StreamHPC already has something ready. Perhaps worth asking them first?

@CNugteren CNugteren merged commit f8fb707 into CNugteren:master Jul 23, 2018