inline PTX to support subgroup shuffle for Nvidia GPUs #297

Merged — 5 commits, Jul 23, 2018
Changes from 2 commits
7 changes: 7 additions & 0 deletions src/clpp11.hpp
@@ -44,6 +44,7 @@
#include <numeric> // std::accumulate
#include <cstring> // std::strlen
#include <cstdio> // fprintf, stderr
#include "assert.h"
Owner: ---> `<assert.h>`


// OpenCL
#define CL_USE_DEPRECATED_OPENCL_1_1_APIS // to disable deprecation warnings
@@ -355,6 +356,12 @@ class Device {
std::string{"."} + std::to_string(GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV));
}

// Returns whether the Nvidia chip is a Volta or later architecture (sm_70 or higher)
bool IsPostNVIDIAVolta() const {
Owner: CLBlast also has a CUDA back-end, which works because every function in this file is also implemented in cupp11.h, so you'll have to mimic this behaviour with the same API in the other file as well to make the CUDA version still work.

Contributor Author: Ah, I didn't notice! I can do that.
assert(HasExtension("cl_nv_device_attribute_query"));
return GetInfo<cl_uint>(CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV) >= 7;
}

// Retrieves the above extra information (if present)
std::string GetExtraInfo() const {
if (HasExtension("cl_amd_device_attribute_query")) { return AMDBoardName(); }
24 changes: 20 additions & 4 deletions src/kernels/level3/xgemm_part1.opencl
@@ -114,13 +114,29 @@ R"(
#define GLOBAL_MEM_FENCE 0 // Global synchronisation barrier for potential better performance
#endif

// Intel subgroups (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt)
#ifndef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Optionally enables subgroup shuffling for Intel GPUs
#ifndef NVIDIA_WARPS_AS_SUBGROUPS
Owner: I'm afraid to get lost in the subgroup defines... could we rename them with a common start, e.g.:

SUBGROUP_SHUFFLING_INTEL
SUBGROUP_SHUFFLING_NVIDIA_PRE_VOLTA
SUBGROUP_SHUFFLING_NVIDIA_POST_VOLTA

And perhaps treat the two NVIDIA ones as separate things: either one of them will be set. I think this makes the host code clearer and also the kernel code.

Contributor Author: Good idea. I'll work on this over the weekend. Thanks!

#define NVIDIA_WARPS_AS_SUBGROUPS 0
#endif
#ifndef NVIDIA_POST_VOLTA
#define NVIDIA_POST_VOLTA 0
#endif
#if USE_SUBGROUP_SHUFFLING == 1
#ifndef INTEL_SUBGROUP_EXTENSION
#define INTEL_SUBGROUP_EXTENSION 0
#endif
//#ifndef USE_SUBGROUP_SHUFFLING
Owner: Why is this commented out? Now USE_SUBGROUP_SHUFFLING is always 0, or do I see it wrong?

Contributor Author: Oops, I was doing that to get some performance numbers. I will add it back in. Good catch!

#define USE_SUBGROUP_SHUFFLING 0 // Optionally enables subgroup shuffling for Intel GPUs
//#endif

// Intel subgroups (https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_subgroups.txt)
#if USE_SUBGROUP_SHUFFLING == 1 && INTEL_SUBGROUP_EXTENSION
#define SUBGROUP_SIZE 8 // Assumes subgroup size is always 8 on Intel GPUs
#endif

// NVIDIA warps as subgroups using inline PTX (https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html)
#if USE_SUBGROUP_SHUFFLING == 1 && NVIDIA_WARPS_AS_SUBGROUPS
#define SUBGROUP_SIZE 32 // Assumes subgroup size is always 32 on NVIDIA GPUs
#endif

#if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
#undef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Disables subgroups in case the assumptions don't hold
42 changes: 40 additions & 2 deletions src/kernels/level3/xgemm_part3.opencl
@@ -17,6 +17,44 @@ R"(

// =================================================================================================

// A common interface for subgroup functions

#if USE_SUBGROUP_SHUFFLING == 1

INLINE_FUNC int clblast_get_sub_group_local_id() {

// Intel extension
#if INTEL_SUBGROUP_EXTENSION == 1
return get_sub_group_local_id();

// Nvidia inline PTX
#elif NVIDIA_WARPS_AS_SUBGROUPS == 1
int ret;
asm volatile("mov.u32 %0, %%laneid;" : "=r"(ret) );
return ret;
#endif
}

INLINE_FUNC realN clblast_sub_group_shuffle(realN reg, int src) {

// Intel extension
#if INTEL_SUBGROUP_EXTENSION == 1
return intel_sub_group_shuffle(reg, src);

// Nvidia inline PTX
// Volta and later requires .sync shuffle instructions with an extra mask arg
#elif NVIDIA_WARPS_AS_SUBGROUPS == 1
realN ret;
#if NVIDIA_POST_VOLTA == 1
asm volatile("shfl.sync.idx.b32 %0, %1, %2, 0x1f, 0xffffffff;" : "=f"(ret): "f"(reg), "r"(src));
#else
asm volatile("shfl.idx.b32 %0, %1, %2, 0x1f;" : "=f"(ret): "f"(reg), "r"(src));
#endif
return ret;
#endif
}
#endif

// Main body of the matrix-multiplication algorithm. It calls various (inlined) functions.
INLINE_FUNC void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK,
const __global realM* restrict agm, const __global realN* restrict bgm,
@@ -130,7 +168,7 @@ INLINE_FUNC void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK,
#elif GEMMK == 1
// Loads data: 2D global --> 2D private (matrix A). Partly, shuffled later among subgroups
#if USE_SUBGROUP_SHUFFLING == 1
const int _ni = get_sub_group_local_id();
const int _ni = clblast_get_sub_group_local_id();
#pragma unroll
for (int _ki = 0; _ki < KREG/VWN; _ki += 1) {
apm[_ki] = GlobalToPrivateA2D(a_ptr, tid_y, _ni, kSizeK, idk, _ki);
@@ -202,7 +240,7 @@ INLINE_FUNC void XgemmBody(const int kSizeM, const int kSizeN, const int kSizeK,
for (int _ki = 0; _ki < KREG/VWN; _ki += 1) {
const int index = _ni * (MWI/VWM) + _mi;
#if USE_SUBGROUP_SHUFFLING == 1
const realN aval = intel_sub_group_shuffle(apm[_ki], _ni);
const realN aval = clblast_sub_group_shuffle(apm[_ki], _ni);
#else
const realN aval = apm[_ni * (KREG/VWN) + _ki];
#endif
22 changes: 19 additions & 3 deletions src/utilities/compile.cpp
@@ -58,11 +58,27 @@ std::shared_ptr<Program> CompileFromSource(
header_string += "#define GLOBAL_MEM_FENCE 1\n";
}

// For Intel GPUs with subgroup support, use subgroup shuffling.
if (device.IsGPU() && device.HasExtension(kKhronosIntelSubgroups)) {
// For GPUs with subgroup support, use subgroup shuffling.
// Currently these are Intel via an extension and Nvidia using inline PTX (restricted to 32 bit)
if (device.IsGPU() && (device.HasExtension(kKhronosIntelSubgroups) ||
(device.IsNVIDIA() && static_cast<int>(precision) == 32))) {
Owner: Better to formulate static_cast<int>(precision) == 32 as precision == Precision::kSingle.

header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
}

// Define the flavor of subgroup
if (device.IsNVIDIA()) {
header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";

// Nvidia additionally needs to check pre or post volta due to new
// shuffle commands
if (device.IsPostNVIDIAVolta()) {
header_string += "#define NVIDIA_POST_VOLTA 1\n";
}
}
else if (device.HasExtension(kKhronosIntelSubgroups)) {
Owner:
This logic is a bit doubled now. What if we split it, i.e. keep the original code and just add something, e.g.:

  // For Intel GPUs with subgroup support, use subgroup shuffling.
  if (device.IsGPU() && device.HasExtension(kKhronosIntelSubgroups)) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define INTEL_SUBGROUP_EXTENSION 1\n";
  }

  // For NVIDIA GPUs, use subgroup shuffling.
  if (device.IsGPU() && device.IsNVIDIA()) {
    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
    header_string += "#define NVIDIA_WARPS_AS_SUBGROUPS 1\n";
     ... // Volta stuff
  }

But now the USE_SUBGROUP_SHUFFLING is duplicated... oh well :-) Doesn't matter too much I guess, your solution is also fine.

header_string += "#define INTEL_SUBGROUP_EXTENSION 1\n";
}
}

// Optionally adds a translation header from OpenCL kernels to CUDA kernels
#ifdef CUDA_API
header_string +=