inline PTX to support subgroup shuffle for Nvidia GPUs #297
Conversation
Thanks for your contribution! Looks good at first glance, I'll have a more detailed look either tomorrow evening or this weekend. I'll run some performance tests myself as well on NVIDIA hardware. Also, @StreamHPC might be interested to see this, as they are working on something similar.
src/clpp11.hpp
Outdated
@@ -44,6 +44,7 @@
#include <numeric> // std::accumulate
#include <cstring> // std::strlen
#include <cstdio> // fprintf, stderr
#include "assert.h"
---> <assert.h>
@@ -355,6 +356,12 @@ class Device {
  std::string{"."} + std::to_string(GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV));
}

// Returns whether the Nvidia chip is a Volta or later architecture (sm_70 or higher)
bool IsPostNVIDIAVolta() const {
CLBlast also has a CUDA back-end, which works because every function in this file is also implemented in cupp11.h, so you'll have to mimic this behaviour with the same API in the other file as well to make the CUDA version still work.
Ah, I didn't notice! I can do that.
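For reference, a minimal sketch of what the paired implementations could look like, assuming the OpenCL back-end can query CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV (from the cl_nv_device_attribute_query extension, mirroring the MINOR_NV query already in the diff above) and that cupp11.h exposes a similar GetInfo helper over the CUDA driver API; the exact code in the PR may differ:

// clpp11.hpp (OpenCL back-end)
bool IsPostNVIDIAVolta() const {
  if (HasExtension("cl_nv_device_attribute_query")) {
    return GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV) >= 7;
  }
  return false;  // not an NVIDIA device or capability unknown
}

// cupp11.h (CUDA back-end), mirroring the same API
bool IsPostNVIDIAVolta() const {
  return GetInfo(CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR) >= 7;
}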
#ifndef INTEL_SUBGROUP_EXTENSION
#define INTEL_SUBGROUP_EXTENSION 0
#endif
//#ifndef USE_SUBGROUP_SHUFFLING
Why is this commented out? Now USE_SUBGROUP_SHUFFLING is always 0, or do I see it wrong?
Oops, I was doing that to get some performance numbers. I will add it back in. Good catch.
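Presumably the restored guard would simply mirror the other kernel-side defaults, e.g. (sketch only):

#ifndef USE_SUBGROUP_SHUFFLING
  #define USE_SUBGROUP_SHUFFLING 0  // Disabled unless the host code defines it at kernel-compile time
#endif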
src/utilities/compile.cpp
Outdated
// For GPUs with subgroup support, use subgroup shuffling.
// Currently these are Intel via an extension and Nvidia using inline PTX (restricted to 32 bit)
if (device.IsGPU() && (device.HasExtension(kKhronosIntelSubgroups) ||
    (device.IsNVIDIA() && static_cast<int>(precision) == 32))) {
Better to formulate static_cast<int>(precision) == 32 as precision == Precision::kSingle.
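Applied to the quoted diff, the condition would then read something like this (sketch only):

if (device.IsGPU() && (device.HasExtension(kKhronosIntelSubgroups) ||
    (device.IsNVIDIA() && precision == Precision::kSingle))) {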
src/utilities/compile.cpp
Outdated
header_string += "#define NVIDIA_POST_VOLTA 1\n"; | ||
} | ||
} | ||
else if (device.HasExtension(kKhronosIntelSubgroups)) { |
This logic is a bit duplicated now. What if we split it, i.e. keep the original code and just add something, e.g.:
// For Intel GPUs with subgroup support, use subgroup shuffling.
if (device.IsGPU() && device.HasExtension(kKhronosIntelSubgroups)) {
header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
header_string += "#define INTEL_SUBGROUP_EXTENSION 1\n";
}
// For NVIDIA GPUs, use subgroup shuffling.
if (device.IsGPU() && device.IsNVIDIA()) {
header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";
... // Volta stuff
}
But now the USE_SUBGROUP_SHUFFLING is duplicated... oh well :-) Doesn't matter too much I guess, your solution is also fine.
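For completeness, a sketch of what the elided "... // Volta stuff" part of that suggestion might look like, reusing the NVIDIA_POST_VOLTA define from the diff above and the IsPostNVIDIAVolta() helper; illustrative only, not the exact code from the PR:

// For NVIDIA GPUs, use subgroup shuffling via inline PTX.
if (device.IsGPU() && device.IsNVIDIA()) {
  header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
  header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";
  if (device.IsPostNVIDIAVolta()) {
    header_string += "#define NVIDIA_POST_VOLTA 1\n";  // selects the shfl.sync variant in the kernel
  }
}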
// Intel subgroups (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt)
#ifndef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Optionally enables subgroup shuffling for Intel GPUs
#ifndef NVIDIA_WARPS_AS_SUBGROUPS
I'm afraid of getting lost in all the subgroup defines... could we rename them with a common prefix, e.g.:
SUBGROUP_SHUFFLING_INTEL
SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA
SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA
And perhaps treat the two NVIDIA ones as separate things, so that only one of them is set. I think this makes both the host code and the kernel code clearer.
Good idea. I'll work on this over the weekend. Thanks!
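For illustration, the kernel-side defaults could then look something like this, using the names from the suggestion above (a sketch of the proposal, not the final code):

#ifndef USE_SUBGROUP_SHUFFLING
  #define USE_SUBGROUP_SHUFFLING 0                // Master switch, set by the host code
#endif
#ifndef SUBGROUP_SHUFFLING_INTEL
  #define SUBGROUP_SHUFFLING_INTEL 0              // Via the cl_intel_subgroups extension
#endif
#ifndef SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA
  #define SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA 0   // Via inline PTX shfl (sm < 70)
#endif
#ifndef SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA
  #define SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA 0  // Via inline PTX shfl.sync (sm >= 70)
#endif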
Thanks for the update of the PR. I think everything is fine now. I'll also run some tests myself on NVIDIA hardware to get an idea of how much it helps, and as a final correctness check.
One comment still about the code perhaps: you have a backslash in one of the pre-processor macros to break the line. Perhaps better not to do that and just keep it on one (longer) line? I'm not sure all OpenCL compilers properly support that, and for some reason even g++ complains about it.
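As an illustration of the two forms being compared (both the macro and the helper name here are made up, not the exact ones from the PR):

// With a backslash line continuation:
#define SUBGROUP_SHUFFLE(reg, src) \
        clblast_sub_group_shuffle(reg, src)

// Kept on a single, longer line instead:
#define SUBGROUP_SHUFFLE(reg, src) clblast_sub_group_shuffle(reg, src)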
I've then also done some tests on a V100 GPU. I changed the tuning parameters to match yours (I hope). Here is what I get without shuffling:
And here is with your additions:
So quite some gain for the last case in particular. Can you confirm that I am running the experiments correctly, i.e. do you also get only 3 test cases with those settings?
I did the backslash so that the line doesn't exceed 100 characters, but I'll put it back; you make a good point about compiler support. And yes, I only got 3 test cases with the settings I played with. But I was going to ask: I'm not sure the current tuning configurations will ever trigger this feature. Do you think it would be worthwhile to add some new parameters to the gemm tuning? I don't understand all the options very well, so I was wondering if you had some thoughts on this. Thanks!
Sorry for the late reply. The code looks good to merge in, thanks for your contribution. About the constraints to enable subgroup shuffling, the following is the current constraint:
In the current set of fixed tuning parameters this scenario doesn't seem possible, but in the larger set of parameters explored randomly it is. We could change the first set, but I'm not sure it is worth it. I also suspect the kernel with shuffling is not the fastest of all GEMM kernels in the library for those devices? Perhaps for smaller inputs? I propose that we merge this in for now; perhaps later we'll get more benefit from this addition than we do today. Perhaps another kernel will use shuffling in the future as well, and this way it'll support both NVIDIA and Intel. Any thoughts?
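As a concrete illustration of that constraint, the PR description below cites the disqualifying condition from xgemm_part1.opencl, (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE); a sketch of how such a kernel-side guard typically looks (illustrative only, not the exact code from the PR):

#if USE_SUBGROUP_SHUFFLING == 1
  #if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
    // Tuning parameters are incompatible with the subgroup size:
    // fall back to the regular, non-shuffling code path.
    #undef USE_SUBGROUP_SHUFFLING
    #define USE_SUBGROUP_SHUFFLING 0
  #endif
#endif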
Sounds good to me; thanks for all the feedback! If I have some time, maybe I'll look at the high-performing settings for Nvidia GPUs and see if shuffling can be incorporated. I've also heard that shuffling for AMD can be incorporated in a similar way. I'm going to have a think about that too.
From what I understood, Vincent's team at @StreamHPC already has something ready. Perhaps an idea to ask them first?
Hi Cedric,
We spoke briefly at IWOCL about adding custom support for subgroups across some architectures, as currently only Intel is supported. I had a go at providing support for Nvidia GPUs using inline PTX:
https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html
I've checked on my Intel HD5500 and Nvidia 940m and the code compiles and runs as expected.
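To give an idea of the approach, here is a minimal sketch of how a 32-bit warp shuffle can be expressed with inline PTX inside an OpenCL kernel; the helper names are made up for illustration and the PR's actual implementation may differ:

// Pre-Volta GPUs: the original shfl instruction (hypothetical helper name)
float nvidia_shuffle_pre_volta(float value, int src_lane) {
  float result;
  asm volatile("shfl.idx.b32 %0, %1, %2, 0x1f;"
               : "=f"(result) : "f"(value), "r"(src_lane));
  return result;
}

// Volta (sm_70) and later: the synchronizing variant with an explicit member mask
float nvidia_shuffle_post_volta(float value, int src_lane) {
  float result;
  asm volatile("shfl.sync.idx.b32 %0, %1, %2, 0x1f, 0xffffffff;"
               : "=f"(result) : "f"(value), "r"(src_lane));
  return result;
}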
I tried doing some sgemm tuning on Nvidia, but it is really difficult because the current constraints don't allow enabling Nvidia subgroups: xgemm_part1.opencl rules out shuffling when (NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE). Because Nvidia has a subgroup size of 32, shuffling isn't triggered, as far as I could tell, by any of the provided tunings.
If you look at my first commit, I modified some of the tuning parameters to allow a small number of experiments around the Nvidia subgroups. With those, I see a reliable performance difference on my 940m: 92.5 GFLOPS with subgroups enabled vs 77.8 GFLOPS without.
I would be interested in hearing your thoughts!