[Feature]: Quark quantization format upstream to VLLM #10294
Comments
In general, we welcome a contribution that converts the Quark format to the standardized format of LLM Compressor (https://github.com/vllm-project/llm-compressor); @robertgshaw2-neuralmagic and @mgoin can help provide pointers.
Hi @simon-mo, thanks for your reply. In the first stage we will support Quark FP8 quantization in vLLM, which can currently be converted to the LLM Compressor format. However, there may be quantization configurations in the future that LLM Compressor does not yet support and that therefore cannot be converted. Additionally, we previously used AutoFP8 as the output format for FP8 models, but AutoFP8 is now being deprecated, which has affected our work. Integrating the Quark format directly would improve the maintainability and sustainability of our work.
Our main concern … Finally, a prototype PR to visualize the change can help as well.
Hi @simon-mo, thanks for your detailed response.
To further demonstrate our plan, we will prepare a prototype PR to visualize the changes in vLLM and to keep the modifications as minimal and clean as possible. We hope this addresses your concerns, and we look forward to collaborating further to make the Quark format a valuable addition to vLLM.
I have raised PR #10765. Please feel free to leave your comments.
Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.
Here is the introduction to Quark.
Currently, the format of the quantized models exported by Quark differs from the formats supported by vLLM, so we need to contribute code to vLLM to add support for the Quark format.
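To make the mismatch concrete, here is a hypothetical sketch in Python; the Quark-side field names are illustrative guesses rather than the real export schema, and only the dispatch on quant_method reflects how vLLM selects a quantization backend.

```python
# Hypothetical illustration only; the real Quark export schema may differ.
quark_style_quantization_config = {
    "quant_method": "quark",  # a method name vLLM does not recognize yet
    "weight": {"dtype": "fp8_e4m3", "qscheme": "per_tensor"},
    "activation": {"dtype": "fp8_e4m3", "qscheme": "per_tensor"},
}

# vLLM chooses its quantization backend from the "quant_method" field of the
# checkpoint's quantization_config, so a Quark checkpoint must either be
# converted to an already-supported method (e.g. llm-compressor's
# compressed-tensors format) or gain a dedicated "quark" backend in vLLM.
```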
Quark Format
Design
Add the Quark format to the ROCm/vllm repo by creating a directory for it in vllm/model_executor/layers/quantization and including the following files.
In the first stage, we will integrate FP8 quantization in the Quark format into vLLM; other Quark formats such as INT4/INT8 per_tensor/per_channel/per_group will be integrated later when needed.
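For reference, this is roughly what FP8 (E4M3) per-tensor quantization computes, independent of how Quark serializes it. The sketch below uses hypothetical function names rather than any Quark or vLLM API, and relies on torch.float8_e4m3fn (available in recent PyTorch builds).

```python
import torch

# Illustrative FP8 (E4M3) per-tensor quantization; function names are
# hypothetical, not the Quark or vLLM API. E4M3 can represent magnitudes
# up to 448, so the scale maps the tensor's max |value| onto that range.
FP8_E4M3_MAX = 448.0

def quantize_fp8_per_tensor(weight: torch.Tensor):
    """Quantize a weight tensor to FP8 with a single per-tensor scale."""
    scale = weight.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    qweight = (weight / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
    return qweight.to(torch.float8_e4m3fn), scale

def dequantize_fp8_per_tensor(qweight: torch.Tensor, scale: torch.Tensor):
    """Recover an approximation of the original weight."""
    return qweight.to(torch.float32) * scale

if __name__ == "__main__":
    w = torch.randn(128, 256)
    qw, s = quantize_fp8_per_tensor(w)
    err = (w - dequantize_fp8_per_tensor(qw, s)).abs().max()
    print(f"fp8 weight {tuple(qw.shape)}, scale {s.item():.3e}, "
          f"max abs error {err.item():.3e}")
```

Per-channel and per-group schemes differ only in how many scales are stored and which slice of the tensor each scale applies to.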
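On the vLLM side, a minimal sketch of how a new format typically plugs into the vllm/model_executor/layers/quantization directory mentioned above, assuming the QuantizationConfig / QuantizeMethodBase base classes from base_config.py; the method names and signatures below follow that interface at the time of this issue and may differ from the actual Quark PR, and all parsed fields are placeholders.

```python
from typing import Any, Dict, List, Optional

import torch

# Assumed import path: the base classes live in base_config.py of the same
# quantization directory in vLLM; signatures may change across versions.
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig, QuantizeMethodBase)


class QuarkConfig(QuantizationConfig):
    """Illustrative skeleton of a config class for the Quark checkpoint format."""

    def __init__(self, weight_dtype: str, activation_dtype: str) -> None:
        self.weight_dtype = weight_dtype
        self.activation_dtype = activation_dtype

    def get_name(self) -> str:
        return "quark"

    def get_supported_act_dtypes(self) -> List[torch.dtype]:
        return [torch.bfloat16, torch.float16]

    @classmethod
    def get_min_capability(cls) -> int:
        # Placeholder: the required GPU capability depends on the FP8 kernels used.
        return 80

    @staticmethod
    def get_config_filenames() -> List[str]:
        # File(s) in the checkpoint that describe the quantization scheme.
        return ["config.json"]

    @classmethod
    def from_config(cls, config: Dict[str, Any]) -> "QuarkConfig":
        # Placeholder field names; the real Quark export schema may differ.
        return cls(weight_dtype=config.get("weight_dtype", "fp8_e4m3"),
                   activation_dtype=config.get("activation_dtype", "fp8_e4m3"))

    def get_quant_method(self, layer: torch.nn.Module,
                         prefix: str) -> Optional[QuantizeMethodBase]:
        # The real implementation would return a linear-method object that creates
        # the FP8 weight/scale parameters and runs the quantized forward pass for
        # supported layers; returning None leaves a layer unquantized.
        return None
```

Registering the new config in the quantization method registry of that directory would then let checkpoints carrying the Quark scheme be routed to it.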