[Feature]: GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models. #4359
Comments
You can use the Marlin kernels for int4 inference. We have a PR to automatically support GPTQ models with Marlin; it should be merged imminently.
Would this have support for 8-bit quants as well, or just 4-bit?
Right now, 4 bits only. But @alexm-nm is working on an 8-bit version of Marlin at the moment and it should be done relatively soon.
@davidgxue We have initial correctness on 8-bit Marlin; we will do some perf checks and more testing and will put a PR in within a couple of days.
Btw, the new 8-bit Marlin will support all group_sizes and act_order.
Awesome!! Thank you guys for the hard work!
@davidgxue Here we add 8-bit support: #4533
Thank you!!
When can we expect AWQ models to be optimized for inference?
+1 :)
@vidhyat98 AWQ has been added to Marlin.
Resolved! :)
So it looks like we should pass
@alexdauenhauer You don't need to pass any quantization argument; in fact, it is best if you don't. vLLM will automatically choose the best kernel it can use for your quantized model.
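As a concrete illustration of the advice above, here is a minimal sketch of loading a quantized checkpoint without a `quantization` argument; the model name is only an illustrative GPTQ checkpoint, not one taken from this thread.

```python
# Minimal sketch: load a quantized checkpoint without passing `quantization`,
# letting vLLM inspect the model's quantization config and pick the fastest
# kernel it supports (e.g. Marlin for compatible GPTQ weights).
# The model name below is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="TheBloke/Llama-2-7B-Chat-GPTQ")

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```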
@mgoin Great to know, thanks!
🚀 The feature, motivation and pitch
While running the vLLM server with quantized models and specifying the quantization type, the following warning is shown: "GPTQ/AWQ quantization is not fully optimized yet. The speed can be slower than non-quantized models."
Is this a feature that is in progress, or is there a workaround to handle it?
Let me know if any more details are required from my end.
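For context, this is a minimal sketch of the kind of invocation that triggers the warning, assuming the quantization type is passed explicitly; the model name is only an example of an AWQ checkpoint, not one taken from this issue.

```python
# Minimal sketch: explicitly forcing the quantization method selects that
# specific (unoptimized) kernel and logs the warning quoted in the issue title:
# "awq quantization is not fully optimized yet. The speed can be slower than
#  non-quantized models."
# The model name below is only an example.
from vllm import LLM

llm = LLM(model="TheBloke/Llama-2-7B-Chat-AWQ", quantization="awq")
```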
Alternatives
No response
Additional context
No response