
vulkan: Dynamic subgroup size support for Q6_K mat_vec #10536

Merged: 6 commits merged into ggerganov:master on Nov 30, 2024

Conversation

@netrunnereve (Collaborator)

As promised in #10206 here's a K-quant mul_mat_vec shader which supports variable subgroup sizes. This one has been tested to work with subgroup sizes between 16 and 128.

I'm leaving this as a draft for two reasons:

  1. The existing optimized algorithm requires that the subgroup size be a multiple of 16 to work properly. Ideally I would like to be able to set a fallback local_size_x to handle weird subgroup sizes, but it seems like Vulkan requires it to be either hardcoded or grabbed from the specialization constant. I guess I could have a check in ggml-vulkan.cpp for this, or maybe there's a better way?
  2. I didn't bother supporting K_QUANTS_PER_ITERATION=1 as inference ran 40% slower for me with that flag on. I'm curious if anyone is seeing an improvement with this set to 1 versus the default of 2.

On a RX 570:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 8.21 ± 0.08 |
| llama 7B Q6_K (PR) | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 9.34 ± 0.03 |

Honestly I was hoping for a bigger improvement but at least Q6_K is properly optimized for GCN now. I'm new to Vulkan so bear with me if I make some stupid mistakes 😉.

@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Nov 27, 2024
@jeffbolznv (Collaborator)

> The existing optimized algorithm requires that the subgroup size be a multiple of 16 to work properly. Ideally I would like to be able to set a fallback local_size_x to handle weird subgroup sizes, but it seems like Vulkan requires it to be either hardcoded or grabbed from the specialization constant. I guess I could have a check in ggml-vulkan.cpp for this, or maybe there's a better way?

Really, this algorithm just requires that the workgroup size be a multiple of 16. ggml-vulkan.cpp just happens to be setting the workgroup size (via the spec constant) to equal the subgroup size. We should at least change it to std::max(device->subgroup_size, 16u) to handle implementations with a smaller subgroup size.
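A minimal sketch of that clamp, assuming the workgroup size still comes from device->subgroup_size when the local_size_x spec constant is filled in (the helper name here is made up, not actual ggml-vulkan.cpp code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: choose the workgroup size to bake into the shader's
// local_size_x specialization constant. The mul_mat_vec algorithm needs a
// multiple of 16, so devices reporting a smaller subgroup size get clamped up.
static uint32_t pick_mul_mat_vec_wg_size(uint32_t device_subgroup_size) {
    return std::max(device_subgroup_size, 16u);
}
```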

> I didn't bother supporting K_QUANTS_PER_ITERATION=1 as inference ran 40% slower for me with that flag on. I'm curious if anyone is seeing an improvement with this set to 1 versus the default of 2.

Agreed, I'd be surprised if this were to be faster.

@sorasoras

it would be interesting to mix group size on RDNA2/3.

@0cc4m (Collaborator) commented Nov 28, 2024

> it would be interesting to mix group size on RDNA2/3.

Mix? You mean trying both 32 and 64? I think by default RDNA uses group size 64 just like GCN.

@sorasoras commented Nov 28, 2024

> Mix? You mean trying both 32 and 64? I think by default RDNA uses group size 64 just like GCN.

RDNA2 prefers 32 but can do 64; RDNA3 can do both well enough. RDNA2 is basically a dual 32-wide CU design: https://chipsandcheese.com/p/amds-rdna-2-shooting-for-the-top

In fact, RDNA has native wave32 hardware:

> Each WGP, or workgroup processor, features four SIMDs. Each SIMD has a 32-wide execution unit for the most common operations. RDNA 2 gets a few extra instructions for dot product operations to help accelerate machine learning. For example, V_DOT2_F32_F16 multiplies pairs of FP16 values, adds them, and adds a FP32 accumulator.

@0cc4m (Collaborator) commented Nov 28, 2024

> RDNA2 prefers 32 but can do 64; RDNA3 can do both well enough.

I should have been more precise: Vulkan can tell you the subgroupSize, which should correspond to wave size, and it reports 64 on all AMD GPUs.
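For reference, the number being referred to comes from the standard Vulkan 1.1 subgroup properties query. A minimal sketch using the plain C API (ggml-vulkan.cpp itself goes through the vulkan.hpp wrappers):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Query VkPhysicalDeviceSubgroupProperties::subgroupSize through the
// vkGetPhysicalDeviceProperties2 pNext chain (core since Vulkan 1.1).
// This is the value reported as 64 on AMD GPUs in the discussion above.
static uint32_t query_subgroup_size(VkPhysicalDevice phys_dev) {
    VkPhysicalDeviceSubgroupProperties subgroup_props = {};
    subgroup_props.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroup_props;

    vkGetPhysicalDeviceProperties2(phys_dev, &props2);
    return subgroup_props.subgroupSize;
}
```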

@sorasoras

> I should have been more precise: Vulkan can tell you the subgroupSize, which should correspond to wave size, and it reports 64 on all AMD GPUs.

Yes, but that doesn't necessarily mean it's optimal. RDNA improves gaming performance by letting shaders run in wave32 where that avoids stalls, while GCN has to run in wave64.

@0cc4m (Collaborator) commented Nov 28, 2024

> Yes, but that doesn't necessarily mean it's optimal. RDNA improves gaming performance by letting shaders run in wave32 where that avoids stalls, while GCN has to run in wave64.

You're right, testing with both 32 and 64 would be interesting. I'm just not sure whether forcing the size down is enough, or whether we would need to implement support for VK_EXT_subgroup_size_control.
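For context, a rough sketch of what VK_EXT_subgroup_size_control involves, using the plain C API rather than the vulkan.hpp wrappers the backend uses: with the extension's subgroupSizeControl feature enabled, a required subgroup size (e.g. 32 on RDNA) is chained into the compute stage at pipeline creation. This illustrates the extension mechanism only; it is not code from this PR.

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Chain a required subgroup size into a compute shader stage. The value must
// lie between minSubgroupSize and maxSubgroupSize from
// VkPhysicalDeviceSubgroupSizeControlPropertiesEXT, and the subgroupSizeControl
// feature has to be enabled on the device. req_info must stay alive until the
// pipeline is created.
static void fill_compute_stage(
        VkPipelineShaderStageCreateInfo * stage_info,
        VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT * req_info,
        VkShaderModule shader_module,
        uint32_t required_subgroup_size) {
    *req_info = {};
    req_info->sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
    req_info->requiredSubgroupSize = required_subgroup_size;  // e.g. 32 on RDNA2/3

    *stage_info = {};
    stage_info->sType  = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
    stage_info->pNext  = req_info;                 // extension struct chained here
    stage_info->stage  = VK_SHADER_STAGE_COMPUTE_BIT;
    stage_info->module = shader_module;
    stage_info->pName  = "main";
}
```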

@jeffbolznv (Collaborator) left a comment

LGTM, and I touch tested it on RTX 4070.

@netrunnereve (Collaborator, Author)

> You're right, testing with both 32 and 64 would be interesting. I'm just not sure whether forcing the size down is enough, or whether we would need to implement support for VK_EXT_subgroup_size_control.

@sorasoras If you want, you should be able to try this on your RDNA card using this PR. Just force device->subgroup_size to 32 or 64 in ggml-vulkan.cpp and see if there's any difference in performance.

As suggested in the review, the shader now works with workgroup sizes larger than the subgroup size, and a fixed 16 threads are used to calculate each superblock. This makes inference an additional 20% faster on my card.

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q6_K | 5.53 GiB | 7.24 B | Vulkan | 100 | 8 | tg128 | 11.21 ± 0.04 |

Increasing the workgroup size to 256 or so slows things down a couple percent, and if I force it to 32 I get like 8.5 t/s. So yeah GCN likes it much more when the workgroup size is a multiple of 64.
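To make the 16-threads-per-superblock layout above concrete, here is a loose host-side C++ sketch of the index math as I read the description; the real logic lives in the GLSL shader and the names here are illustrative only.

```cpp
#include <cstdint>

// Each run of 16 consecutive threads in the workgroup handles one Q6_K
// superblock, so any workgroup size that is a multiple of 16 maps cleanly
// onto whole superblocks regardless of the device's subgroup size.
struct block_assignment {
    uint32_t block;  // which superblock this thread contributes to
    uint32_t lane;   // 0..15, the thread's slice within that superblock
};

static block_assignment assign_thread(uint32_t local_id, uint32_t first_block_of_workgroup) {
    block_assignment a;
    a.block = first_block_of_workgroup + local_id / 16;  // 16 sequential threads per block
    a.lane  = local_id % 16;
    return a;
}
```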

@netrunnereve marked this pull request as ready for review on November 29, 2024
@0cc4m (Collaborator) commented Nov 29, 2024

Performance looks good on AMD. No difference on Nvidia or Intel, as expected. RDNA2 seems to do well with either 32 or 64. GCN likes 64.

Radeon Pro VII (which is also GCN):

Master:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | tg128 | 59.73 ± 1.55 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | tg128 | 13.94 ± 0.01 |

With this PR:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | tg128 | 77.74 ± 0.07 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | tg128 | 21.58 ± 1.77 |

ROCm for comparison:

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | ROCm | 99 | tg128 | 113.24 ± 0.21 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | ROCm | 99 | tg128 | 37.06 ± 0.09 |

AMD Radeon RX 6800 XT:

Master:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 143.98 ± 0.05 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.20 ± 0.01 |

With this PR:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 144.21 ± 0.06 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.84 ± 0.37 |

With this PR and Subgroup 32:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | Vulkan | 99 | 8 | tg128 | 144.81 ± 0.06 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | Vulkan | 99 | 8 | tg128 | 43.40 ± 0.01 |

ROCm for comparison:

| model | size | params | backend | ngl | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| phi2 3B Q6_K | 2.13 GiB | 2.78 B | ROCm | 99 | 8 | tg128 | 110.44 ± 0.11 |
| llama 13B Q6_K | 9.36 GiB | 12.25 B | ROCm | 99 | 8 | tg128 | 39.64 ± 0.02 |

@0cc4m (Collaborator) left a comment

Small nitpick, otherwise looks good. If you don't have the time to update the code, I can merge as is. Let me know.

Review thread on ggml/src/ggml-vulkan/ggml-vulkan.cpp (outdated, resolved)
@0cc4m (Collaborator) left a comment

Thank you @netrunnereve. It's really nice to see more people working on the backend.

@0cc4m merged commit 0533e7f into ggerganov:master on Nov 30, 2024 (50 checks passed)

@netrunnereve deleted the vulkan branch on November 30, 2024
@netrunnereve (Collaborator, Author) commented Nov 30, 2024

> It's really nice to see more people working on the backend.

I'll probably be back for more at some point 😄. On the CPU side I've pretty much hit diminishing returns, as I only get maybe 10-15% more performance from tuning the mat_vec routines, and at this point I'm practically the only dev working on AVX1. IMO Vulkan has a lot more potential, and I feel there are more things that could be optimized in the future. For example, for Q6_K we could try doing multiple rows at a time, reading a full 32-bit word per thread, and so forth. And it would be interesting to see tensor parallelism working on those Bitcoin mining systems that no one wants anymore...

The problem here, though, is that it's so hard to debug and profile shaders, and it's mostly just tweaking and testing to see if things end up being faster or not. Mesa supports the RGP profiler, but it doesn't work for compute apps, and the only other option is to basically look through the actual GPU assembly to see what's happening. Being a high-level language, GLSL is pretty easy to write but hard to estimate performance with.

@0cc4m (Collaborator) commented Dec 1, 2024

> The problem here, though, is that it's so hard to debug and profile shaders, and it's mostly just tweaking and testing to see if things end up being faster or not. Mesa supports the RGP profiler, but it doesn't work for compute apps, and the only other option is to basically look through the actual GPU assembly to see what's happening. Being a high-level language, GLSL is pretty easy to write but hard to estimate performance with.

Oh, believe me I know. I haven't managed to get RGP to work at all. Maybe with the proprietary driver? But that's also annoying to set up.

The only traces I managed to generate have been with Nvidia Nsight, and the usefulness has also been limited.

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

* check for subgroup multiple of 16 and greater than 16

* subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45)

* force 16 sequential threads per block

* make 16 subgroup size a constant