Vulkan Mixture of Experts (MoE) support #7628
Conversation
Thanks for improving the Vulkan backend. Cross-platform support is very important to make LLMs available to everyone.
Compilation fails due to … As for performance, it is noticeably slow: the more layers I offload, the slower tg and pp get. However, it finally works, thank you! Due to how MoE models work, I had to download a 2x7B model in Q3_K_S to test this PR, and suddenly there's no difference in RAM consumption between CPU-only and Vulkan. Another 2x7B model in Q4_K_S still uses more RAM on Vulkan (too much for my system). I don't normally use Q3_K_S, so I tried three other, non-MoE models in the same Q3_K_S, and it didn't seem to help: there's still a difference in RAM consumption. It seems the combination of 2x7B with Q3_K_S works without memory overhead, with both 4096 and 8192 context sizes.
Thank you for testing it. The occasions of …
I get this error with …
The offload_op code looks like the source of the problem (see lines 2924 to 2931 in 750f60c).
@slaren Thank you, I fixed the offload_op code and the split mode none + main GPU case for Vulkan.
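For context, offload_op is the hook a ggml backend uses to tell the scheduler whether an op whose weights live in host memory is worth running on the GPU for the current batch. Below is a minimal sketch of that kind of heuristic; the function name, threshold, and exact conditions are illustrative and not the code from this PR.

```c
// Illustrative sketch (not the exact code from this PR): the scheduler asks the
// backend whether an op on CPU-resident weights should be offloaded. The usual
// heuristic is batch size: for a single token the transfer cost outweighs the
// speedup, for prompt processing it does not. MUL_MAT_ID carries its token
// batch in ne[2] rather than ne[1], so it needs its own check.
#include "ggml.h"

static bool example_vk_offload_op(const struct ggml_tensor * op) {
    const int min_batch_size = 32;  // threshold is illustrative

    if (op->op == GGML_OP_MUL_MAT_ID) {
        // MoE matrix multiplication: the token batch lives in ne[2]
        return op->ne[2] >= min_batch_size;
    }
    // regular ops: offload only when enough rows are processed at once
    return op->ne[1] >= min_batch_size && op->op != GGML_OP_GET_ROWS;
}
```

A check written only against ne[1] would never offload MUL_MAT_ID during prompt processing, which is why the MoE case needs to be handled separately.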
Thanks for the update - inference speed is quite good now! However, prompt processing is still slow, and I noticed that adding more threads (for example, going from 3 to 6) boosts performance greatly. At the same time, I'm also using the OpenMP PR, so that may affect the result. Memory is fixed for MoE now: 2x7B in Q4_K_S uses the same amount of RAM in CPU-only and Vulkan. For non-MoE models the RAM difference is still there. UPD: It looks like it's all about context size - going from 4096 to 8192 costs much more memory on Vulkan, to the point of not fitting into 16 GB RAM and 3 GB VRAM.
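One likely contributor to the context-size jump is simply the KV cache, which grows linearly with context; whatever the Vulkan backend allocates for its own compute and staging buffers comes on top of that. The estimate below is rough only and assumes a 7B-class model with 32 layers, an embedding size of 4096, an f16 KV cache, and no grouped-query attention; none of these values are measurements from the models in this thread.

```c
// Back-of-the-envelope KV cache size. The parameters (32 layers, n_embd 4096,
// f16, no GQA) are illustrative assumptions, not measured from this thread.
#include <stdio.h>

int main(void) {
    const long long n_layer        = 32;
    const long long n_embd         = 4096;
    const long long bytes_per_elem = 2;   // f16

    for (long long n_ctx = 4096; n_ctx <= 8192; n_ctx *= 2) {
        // factor 2 for K and V
        long long kv_bytes = 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
        printf("n_ctx=%lld -> KV cache ~%.1f GiB\n",
               n_ctx, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```

Under those assumptions the KV cache alone goes from roughly 2 GiB at 4096 context to 4 GiB at 8192, which already squeezes a 16 GB RAM + 3 GB VRAM setup before any backend-specific buffers are counted.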
Here's a basic version that can run MoE models. There's some bottleneck in the matrix-vector MUL_MAT_ID shaders (or another MoE-specific one), so generation is rather slow, but at least it runs now.
I had to implement MUL_MAT_ID, SUM_ROWS and DIV for MoE to work. MUL_MAT_ID was the complicated one, obviously.
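To make the op concrete: MUL_MAT_ID is the MoE counterpart of MUL_MAT. Instead of a single weight matrix it takes a stack of expert matrices plus a per-token tensor of expert ids, and each token is multiplied only by the experts selected for it. The plain-C reference below is only meant to pin down those semantics; the layout and names are simplified compared to ggml's actual tensors and quantized formats.

```c
// Reference semantics of a MUL_MAT_ID-style op, simplified:
//   experts: [n_expert][n_out][n_in]   stack of expert weight matrices
//   x:       [n_tokens][n_in]          input activations
//   ids:     [n_tokens][n_used]        expert indices chosen per token
//   y:       [n_tokens][n_used][n_out] one output row per (token, chosen expert)
// ggml's real MUL_MAT_ID operates on ggml_tensor structs and quantized data;
// this only shows which rows get multiplied by which matrices.
#include <stddef.h>

static void mul_mat_id_ref(const float *experts, const float *x, const int *ids,
                           float *y, int n_expert, int n_used,
                           int n_tokens, int n_in, int n_out) {
    (void) n_expert; // unused here; a real implementation would bounds-check ids against it

    for (int t = 0; t < n_tokens; ++t) {
        for (int u = 0; u < n_used; ++u) {
            const float *W  = experts + (size_t) ids[t * n_used + u] * n_out * n_in;
            const float *xt = x + (size_t) t * n_in;
            float       *yt = y + ((size_t) t * n_used + u) * n_out;

            for (int o = 0; o < n_out; ++o) {
                float acc = 0.0f;
                for (int i = 0; i < n_in; ++i) {
                    acc += W[(size_t) o * n_in + i] * xt[i];
                }
                yt[o] = acc;
            }
        }
    }
}
```

SUM_ROWS and DIV show up in the surrounding graph because the routing weights of the selected experts are normalized (divided by their row sum) before the per-expert outputs are combined.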