
vulkan: Dynamic subgroup size support for Q6_K mat_vec #10536

Merged: 6 commits merged into ggerganov:master on Nov 30, 2024

Conversation

@netrunnereve (Collaborator)

As promised in #10206 here's a K-quant mul_mat_vec shader which supports variable subgroup sizes. This one has been tested to work with subgroup sizes between 16 and 128.

I'm leaving this as a draft for two reasons:

  1. The existing optimized algorithm requires that the subgroup size be a multiple of 16 to work properly. Ideally I would like to be able to set a fallback local_size_x to handle weird subgroup sizes, but it seems like Vulkan requires it to be either hardcoded or grabbed from the specialization constant. I guess I could have a check in ggml-vulkan.cpp for this, or maybe there's a better way?
  2. I didn't bother supporting K_QUANTS_PER_ITERATION=1 as inference ran 40% slower for me with that flag on. I'm curious if anyone is seeing an improvement with this set to 1 versus the default of 2.

On a RX 570:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 8.21 ± 0.08 |
| llama 7B Q6_K (PR) | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 9.34 ± 0.03 |

Honestly I was hoping for a bigger improvement but at least Q6_K is properly optimized for GCN now. I'm new to Vulkan so bear with me if I make some stupid mistakes 😉.

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 27, 2024
@jeffbolznv (Collaborator)

> The existing optimized algorithm requires that the subgroup size be a multiple of 16 to work properly. Ideally I would like to be able to set a fallback local_size_x to handle weird subgroup sizes, but it seems like Vulkan requires it to be either hardcoded or grabbed from the specialization constant. I guess I could have a check in ggml-vulkan.cpp for this, or maybe there's a better way?

Really, this algorithm just requires that the workgroup size be a multiple of 16. ggml-vulkan.cpp just happens to be setting the workgroup size (via the spec constant) to equal the subgroup size. We should at least change it to std::max(device->subgroup_size, 16u) to handle implementations with a smaller subgroup size.
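A minimal sketch of that clamp, assuming the workgroup size still comes from device->subgroup_size when the local_size_x spec constant is filled in (the helper name here is made up, not actual ggml-vulkan.cpp code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: choose the workgroup size to bake into the shader's
// local_size_x specialization constant. The mul_mat_vec algorithm needs a
// multiple of 16, so devices reporting a smaller subgroup size get clamped up.
static uint32_t pick_mul_mat_vec_wg_size(uint32_t device_subgroup_size) {
    return std::max(device_subgroup_size, 16u);
}
```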

> I didn't bother supporting K_QUANTS_PER_ITERATION=1 as inference ran 40% slower for me with that flag on. I'm curious if anyone is seeing an improvement with this set to 1 versus the default of 2.

Agreed, I'd be surprised if this were to be faster.

@sorasoras

it would be interesting to mix group size on RDNA2/3.

@0cc4m (Collaborator) commented Nov 28, 2024

> it would be interesting to mix group size on RDNA2/3.

Mix? You mean trying both 32 and 64? I think by default RDNA uses group size 64 just like GCN.

@sorasoras commented Nov 28, 2024

> Mix? You mean trying both 32 and 64? I think by default RDNA uses group size 64 just like GCN.

RDNA2 prefers 32 but can do 64; RDNA3 can do both well enough. RDNA2 is basically a dual 32-wide CU design: https://chipsandcheese.com/p/amds-rdna-2-shooting-for-the-top

In fact, RDNA has native wave32 hardware:

> Each WGP, or workgroup processor, features four SIMDs. Each SIMD has a 32-wide execution unit for the most common operations. RDNA 2 gets a few extra instructions for dot product operations to help accelerate machine learning. For example, V_DOT2_F32_F16 multiplies pairs of FP16 values, adds them, and adds a FP32 accumulator.

@0cc4m (Collaborator) commented Nov 28, 2024

> RDNA2 prefers 32 but can do 64; RDNA3 can do both well enough.

I should have been more precise: Vulkan can tell you the subgroupSize, which should correspond to wave size, and it reports 64 on all AMD GPUs.
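For reference, the number being referred to comes from the standard Vulkan 1.1 subgroup properties query. A minimal sketch using the plain C API (ggml-vulkan.cpp itself goes through the vulkan.hpp wrappers):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Query VkPhysicalDeviceSubgroupProperties::subgroupSize through the
// vkGetPhysicalDeviceProperties2 pNext chain (core since Vulkan 1.1).
// This is the value reported as 64 on AMD GPUs in the discussion above.
static uint32_t query_subgroup_size(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceSubgroupProperties subgroup_props = {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(phys_dev, &props2);
    return subgroup_props.subgroupSize;
}
```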

@sorasoras

> I should have been more precise: Vulkan can tell you the subgroupSize, which should correspond to wave size, and it reports 64 on all AMD GPUs.

Yes, but that doesn't necessarily mean it's optimal. RDNA improves gaming performance by letting shaders run in wave32 where that avoids stalls, while GCN has to run in wave64.

@0cc4m (Collaborator) commented Nov 28, 2024

> Yes, but that doesn't necessarily mean it's optimal. RDNA improves gaming performance by letting shaders run in wave32 where that avoids stalls, while GCN has to run in wave64.

You're right, testing with both 32 and 64 would be interesting. I'm just not sure whether forcing the size down is enough, or whether we would need to implement support for VK_EXT_subgroup_size_control.
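For context, a rough sketch of what VK_EXT_subgroup_size_control involves, using the plain C API rather than the vulkan.hpp wrappers the backend uses: with the extension's subgroupSizeControl feature enabled, a required subgroup size (e.g. 32 on RDNA) is chained into the compute stage at pipeline creation. This illustrates the extension mechanism only; it is not code from this PR.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Chain a required subgroup size into a compute shader stage. The value must
// lie between minSubgroupSize and maxSubgroupSize from
// VkPhysicalDeviceSubgroupSizeControlPropertiesEXT, and the subgroupSizeControl
// feature has to be enabled on the device. req_info must stay alive until the
// pipeline is created.
static void fill_compute_stage(
        VkPipelineShaderStageCreateInfo * stage_info,
        VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT * req_info,
        VkShaderModule shader_module,
        uint32_t required_subgroup_size) {
    *req_info = {};
    req_info->sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    req_info->requiredSubgroupSize = required_subgroup_size;  // e.g. 32 on RDNA2/3

    *stage_info = {};
    stage_info->sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage_info->pNext  = req_info;                 // extension struct chained here
    stage_info->stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage_info->module = shader_module;
    stage_info->pName  = "main";
}
```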

@jeffbolznv (Collaborator) left a comment

LGTM, and I touch tested it on RTX 4070.

@netrunnereve (Collaborator, Author)

> You're right, testing with both 32 and 64 would be interesting. I'm just not sure whether forcing the size down is enough, or whether we would need to implement support for VK_EXT_subgroup_size_control.

@sorasoras If you want, you should be able to try this on your RDNA card using this PR. Just force device->subgroup_size to 32 or 64 in ggml-vulkan.cpp and see if there's any difference in performance.

As suggested in the review, the shader now works with workgroup sizes larger than the subgroup size, and a fixed 16 threads are used to calculate each superblock. This makes inference an additional 20% faster on my card.

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 11.21 ± 0.04 |

Increasing the workgroup size to 256 or so slows things down a couple percent, and if I force it to 32 I get like 8.5 t/s. So yeah GCN likes it much more when the workgroup size is a multiple of 64.
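To make the 16-threads-per-superblock layout above concrete, here is a loose host-side C++ sketch of the index math as I read the description; the real logic lives in the GLSL shader and the names here are illustrative only.

```cpp
#include <cstdint>

// Each run of 16 consecutive threads in the workgroup handles one Q6_K
// superblock, so any workgroup size that is a multiple of 16 maps cleanly
// onto whole superblocks regardless of the device's subgroup size.
struct block_assignment {
    uint32_t block;  // which superblock this thread contributes to
    uint32_t lane;   // 0..15, the thread's slice within that superblock
};

static block_assignment assign_thread(uint32_t local_id, uint32_t first_block_of_workgroup) {
    block_assignment a;
    a.block = first_block_of_workgroup + local_id / 16;  // 16 sequential threads per block
    a.lane  = local_id % 16;
    return a;
}
```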

@netrunnereve marked this pull request as ready for review on November 29, 2024
@0cc4m (Collaborator) commented Nov 29, 2024

Performance looks good on AMD. No difference on Nvidia or Intel, as expected. RDNA2 seems to do well with either 32 or 64. GCN likes 64.

Radeon Pro VII (which is also GCN):

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | tg128 | 59.73 ± 1.55 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | tg128 | 13.94 ± 0.01 |

With this PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | tg128 | 77.74 ± 0.07 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | tg128 | 21.58 ± 1.77 |

ROCm for comparison:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | ROCm | 99 | tg128 | 113.24 ± 0.21 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | ROCm | 99 | tg128 | 37.06 ± 0.09 |

AMD Radeon RX 6800 XT:

Master:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 143.98 ± 0.05 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.20 ± 0.01 |

With this PR:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 144.21 ± 0.06 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.84 ± 0.37 |

With this PR and Subgroup 32:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 144.81 ± 0.06 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.40 ± 0.01 |

ROCm for comparison:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | ROCm | 99 | 8 | tg128 | 110.44 ± 0.11 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | ROCm | 99 | 8 | tg128 | 39.64 ± 0.02 |

@0cc4m (Collaborator) left a comment

Small nitpick, otherwise looks good. If you don't have the time to update the code, I can merge as is. Let me know.

Review thread on ggml/src/ggml-vulkan/ggml-vulkan.cpp (outdated, resolved)
@0cc4m (Collaborator) left a comment

Thank you @netrunnereve. It's really nice to see more people working on the backend.

@0cc4m merged commit 0533e7f into ggerganov:master on Nov 30, 2024 (50 checks passed)

@netrunnereve deleted the vulkan branch on November 30, 2024
@netrunnereve (Collaborator, Author) commented Nov 30, 2024

> It's really nice to see more people working on the backend.

I'll probably be back for more at some point 😄. On the CPU side I've pretty much hit diminishing returns, as I only get maybe 10-15% more performance from tuning the mat_vec routines, and at this point I'm practically the only dev working on AVX1. IMO Vulkan has a lot more potential, and I feel there are more things that could be optimized in the future. For example, for Q6_K we could try doing multiple rows at a time, reading a full 32-bit word per thread, and so forth. And it would be interesting to see tensor parallelism working on those Bitcoin mining systems that no one wants anymore...

The problem here, though, is that it's so hard to debug and profile shaders, and it's mostly just tweaking and testing to see if things end up being faster or not. Mesa supports the RGP profiler, but it doesn't work for compute apps, and the only other option is to basically look through the actual GPU assembly to see what's happening. Being a high-level language, GLSL is pretty easy to write but hard to estimate performance with.

@0cc4m (Collaborator) commented Dec 1, 2024

> The problem here, though, is that it's so hard to debug and profile shaders, and it's mostly just tweaking and testing to see if things end up being faster or not. Mesa supports the RGP profiler, but it doesn't work for compute apps, and the only other option is to basically look through the actual GPU assembly to see what's happening. Being a high-level language, GLSL is pretty easy to write but hard to estimate performance with.

Oh, believe me I know. I haven't managed to get RGP to work at all. Maybe with the proprietary driver? But that's also annoying to set up.

The only traces I managed to generate have been with Nvidia Nsight, and the usefulness has also been limited.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45)

* force 16 sequential threads per block

* make 16 subgroup size a constant