vulkan: Dynamic subgroup size support for Q6_K mat_vec #10536
Conversation
scalable version tested for subgroup sizes 16-128
Really, this algorithm just requires that the workgroup size be a multiple of 16. ggml-vulkan.cpp just happens to set the workgroup size (via the spec constant) equal to the subgroup size. We should at least change it to std::max(device->subgroup_size, 16u) to handle implementations with a smaller subgroup size.
Agreed, I'd be surprised if this were to be faster.
It would be interesting to mix group sizes on RDNA2/3.
Mix? You mean trying both 32 and 64? I think by default RDNA uses group size 64, just like GCN.
RDNA2 prefers 32 but can do 64. RDNA3 can do both well enough. RDNA2 is effectively dual 32-wide CUs; in fact, RDNA has native wave32 hardware.
I should have been more precise: Vulkan can tell you the subgroupSize, which should correspond to wave size, and it reports 64 on all AMD GPUs.
Yes, but that doesn't necessarily mean it's optimal. RDNA improves gaming performance by allowing shaders to run in wave32 when necessary to avoid stalls. GCN, on the other hand, needs to run in wave64.
You're right, testing with both 32 and 64 would be interesting. I'm just not sure if forcing the size down is enough or if we would need to implement support for VK_EXT_subgroup_size_control.
LGTM, and I also tested it on an RTX 4070.
@sorasoras If you want, you should be able to try this on your RDNA card using this PR by forcing the subgroup size down.

As suggested in the review, the shader now works with workgroup sizes larger than the subgroup size, and a fixed 16 threads are used to calculate each superblock. This makes inference an additional 20% faster on my card.
Increasing the workgroup size to 256 or so slows things down a couple percent, and if I force it to 32 I get around 8.5 t/s. So yeah, GCN likes it much more when the workgroup size is a multiple of 64.
Performance looks good on AMD. No difference on Nvidia or Intel, as expected. RDNA2 seems to do well with either 32 or 64. GCN likes 64.

Radeon Pro VII (which is also GCN):

Master:
With this PR:
ROCm for comparison:

AMD Radeon RX 6800 XT:

Master:
With this PR:
With this PR and Subgroup 32:
ROCm for comparison:
Small nitpick, otherwise looks good. If you don't have the time to update the code, I can merge as is. Let me know.
Thank you @netrunnereve. It's really nice to see more people working on the backend.
I'll probably be back for more at some point 😄. On the CPU side I've pretty much hit diminishing returns, as I only get maybe 10-15% more performance from tuning the mat vec routines, and at this point I'm like the only dev that works on AVX1.

IMO Vulkan has a lot more potential and I feel there are more things that could be optimized in the future. For example, for Q6_K we can try doing multiple rows at a time, reading a full 32 bit word per thread, and so forth. And it would be interesting to see tensor parallelism working on those Bitcoin mining systems that no one wants anymore...

The problem here though is that it's so hard to debug and profile shaders, and it's mostly just tweaking and testing to see if things end up being faster or not. Mesa supports the RGP profiler but it doesn't work for compute apps, and the only other option is to basically look through the actual GPU assembly to see what's happening. Being a high-level language, GLSL is pretty easy to write but hard to estimate performance with.
Oh, believe me, I know. I haven't managed to get RGP to work at all. Maybe with the proprietary driver? But that's also annoying to set up. The only traces I managed to generate have been with Nvidia Nsight, and the usefulness has also been limited.
* subgroup 64 version with subgroup add. 15% faster
* scalable version tested for subgroup sizes 16-128
* check for subgroup multiple of 16 and greater than 16
* subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45)
* force 16 sequential threads per block
* make 16 subgroup size a constant
As promised in #10206 here's a K-quant mul_mat_vec shader which supports variable subgroup sizes. This one has been tested to work with subgroup sizes between 16 and 128.
I'm leaving this as a draft for two reasons:

1. I wanted to use a non-constant local_size_x to handle weird subgroup sizes, but it seems like Vulkan requires it to be either hardcoded or grabbed from the specialization constant. I guess I could have a check in ggml_vulkan.cpp for this, or maybe there's a better way?
2. I didn't implement support for K_QUANTS_PER_ITERATION=1, as inference ran 40% slower for me with that flag on. I'm curious if anyone is seeing an improvement with this set to 1 versus the default of 2.

On a RX 570:
Honestly I was hoping for a bigger improvement but at least Q6_K is properly optimized for GCN now. I'm new to Vulkan so bear with me if I make some stupid mistakes 😉.