
[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models. #4359

Closed
ShubhamVerma16 opened this issue Apr 25, 2024 · 15 comments

Comments

@ShubhamVerma16

🚀 The feature, motivation and pitch

When running the vLLM server with a quantized model and specifying the quantization type, the warning below is shown:

WARNING 04-25 12:26:07 config.py:169] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.

Is this a feature in progress, or is there a workaround to handle this?
Let me know if any more details are required from my end.
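For reference, a minimal sketch of the setup that triggers this (the model name is just an example; any GPTQ checkpoint and the offline LLM API reproduce the same warning):

```python
from vllm import LLM, SamplingParams

# Example GPTQ checkpoint; explicitly requesting the (currently unoptimized) GPTQ kernels.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```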

Alternatives

No response

Additional context

No response

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Apr 25, 2024

You can use the Marlin kernels for int4 inference

We have a PR to automatically support GPTQ models with Marlin. Should be merged imminently

#3922
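A sketch of what that looks like once the PR lands, assuming the gptq_marlin method name it introduces and an example 4-bit GPTQ checkpoint:

```python
from vllm import LLM

# With #3922 merged, a 4-bit GPTQ checkpoint can run on the Marlin kernels.
# Passing "gptq_marlin" forces that path; leaving quantization unset lets vLLM
# pick it automatically when the GPU and checkpoint support it.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",  # example GPTQ model
    quantization="gptq_marlin",
)
```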

@davidgxue

Would this have support for 8 bit quants as well or just 4 bit?

@robertgshaw2-redhat
Collaborator

robertgshaw2-redhat commented Apr 25, 2024

Right now, 4-bit only. But @alexm-nm is working on an 8-bit version of Marlin at the moment, and it should be done relatively soon.

Marlin supports act_order=True and grouping as well

@alexm-redhat
Collaborator

@davidgxue We have initial correctness on 8-bit Marlin; we'll do some perf checks and more testing and put up a PR in a couple of days.

@alexm-redhat
Collaborator

Btw, the new 8-bit Marlin will support all group_sizes and act_order.

@davidgxue

Awesome!! Thank you guys for the hard work!

@alexm-redhat
Collaborator

@davidgxue 8-bit support is added here: #4533

@davidgxue

Thank you!!

@vidhyat98

When can we expect AWQ models to be optimized for inference?

@jugodfroy

+1 :)
(thank you for your work btw)

@alexm-redhat
Collaborator

@vidhyat98 AWQ has been added to Marlin.

@mgoin
Member

mgoin commented Jul 25, 2024

Resolved! :)

@mgoin mgoin closed this as completed Jul 25, 2024
@alexdauenhauer

so it looks like we should pass quantization='awq_marlin' for awq quantized models?

@mgoin
Member

mgoin commented Jul 25, 2024

so it looks like we should pass quantization='awq_marlin' for awq quantized models?

@alexdauenhauer you don't need to pass any quantization argument; in fact, it is best if you don't! vLLM will automatically choose the best kernel it can use for your quantized model.
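For example (a sketch with an example AWQ checkpoint; the point is simply to omit the quantization argument):

```python
from vllm import LLM

# No quantization argument: vLLM reads the checkpoint's quantization config and
# selects the fastest compatible kernel (awq_marlin here, if the GPU supports it).
llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ")  # example AWQ model

print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```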

@alexdauenhauer

@mgoin great to know thanks!
