[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models. #4359
Comments
You can use the Marlin kernels for int4 inference. We have a PR to automatically support GPTQ models with Marlin; it should be merged imminently.
Would this have support for 8-bit quants as well, or just 4-bit?
Right now, 4 bits only. But @alexm-nm is working on an 8-bit version of Marlin at the moment and it should be done relatively soon.
@davidgxue We have initial correctness on 8-bit Marlin; we will do some perf checks and more testing and will put a PR in within a couple of days.
Btw, the new 8-bit Marlin will support all group_sizes and act_order.
Awesome!! Thank you guys for the hard work!
@davidgxue Here we add 8-bit support: #4533
Thank you!!
When can we expect AWQ models to be optimized for inference?
+1 :)
@vidhyat98 AWQ has been added to Marlin.
Resolved! :)
So it looks like we should pass
@alexdauenhauer You don't need to pass any quantization argument; in fact, it is best if you don't. vLLM will automatically choose the best kernel it can use for your quantized model.
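As a concrete illustration of the advice above, here is a minimal sketch of loading a quantized checkpoint without a `quantization` argument; the model name is only an illustrative GPTQ checkpoint, not one taken from this thread.

```python
# Minimal sketch: load a quantized checkpoint without passing `quantization`,
# letting vLLM inspect the model's quantization config and pick the fastest
# kernel it supports (e.g. Marlin for compatible GPTQ weights).
# The model name below is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```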
@mgoin Great to know, thanks!
🚀 The feature, motivation and pitch
While running the vLLM server with quantized models and specifying the quantization type, the following warning is shown: "GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models."
Is this a feature that is in progress, or is there a workaround to handle it?
Let me know if any more details are required from my end.
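For context, this is a minimal sketch of the kind of invocation that triggers the warning, assuming the quantization type is passed explicitly; the model name is only an example of an AWQ checkpoint, not one taken from this issue.

```python
# Minimal sketch: explicitly forcing the quantization method selects that
# specific (unoptimized) kernel and logs the warning quoted in the issue title:
# "awq quantization is not fully optimized yet. The speed can be slower than
#  non-quantized models."
# The model name below is only an example.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
```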
Alternatives
No response
Additional context
No response