[Feature]: Quark quantization format upstream to VLLM #10294

Open · kewang-xlnx opened this issue Nov 13, 2024 · 5 comments

kewang-xlnx (Contributor) commented Nov 13, 2024

Quark is a comprehensive cross-platform toolkit designed to simplify and enhance the quantization of deep learning models. Supporting both PyTorch and ONNX models, Quark empowers developers to optimize their models for deployment on a wide range of hardware backends, achieving significant performance gains without compromising accuracy.
Here is the introduction to Quark.
Currently, the format of the quantized model exported by Quark differs from the formats supported by vLLM, so we would like to contribute code to vLLM to add support for the Quark format.

Quark Format

  1. The configuration file config.json of the Quark format
  2. Key names and data types in the Quark safetensors (a small inspection sketch follows this list)
model.layers.1.self_attn.k_proj.input_scale 	torch.float16
model.layers.1.self_attn.k_proj.weight 	torch.float8_e4m3fn
model.layers.1.self_attn.k_proj.weight_scale 	torch.float16
model.layers.1.self_attn.o_proj.input_scale 	torch.float16
model.layers.1.self_attn.o_proj.weight 	torch.float8_e4m3fn
model.layers.1.self_attn.o_proj.weight_scale 	torch.float16
model.layers.1.self_attn.q_proj.input_scale 	torch.float16
model.layers.1.self_attn.q_proj.weight 	torch.float8_e4m3fn
model.layers.1.self_attn.q_proj.weight_scale 	torch.float16
model.layers.1.self_attn.v_proj.input_scale 	torch.float16
model.layers.1.self_attn.v_proj.weight 	torch.float8_e4m3fn
model.layers.1.self_attn.v_proj.weight_scale 	torch.float16
  3. KV scale format, if the KV cache is used
model.layers.1.self_attn.k_proj.output_scale 	torch.float16
model.layers.1.self_attn.v_proj.output_scale 	torch.float16
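
To make the layout above concrete, here is a minimal inspection sketch that lists the quantization-related tensors and their dtypes in a Quark-exported checkpoint. The file name and the suffix list are assumptions for illustration only, not part of the proposal.

```python
# Minimal sketch: list quantization tensors in a Quark-exported safetensors shard.
# Assumption: the file name "model.safetensors" and the suffix list below are
# illustrative; adapt them to the actual export.
from safetensors import safe_open

QUANT_SUFFIXES = ("input_scale", "weight_scale", "output_scale", ".weight")

with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    for key in sorted(f.keys()):
        if key.endswith(QUANT_SUFFIXES):
            tensor = f.get_tensor(key)
            # e.g. model.layers.1.self_attn.k_proj.weight    torch.float8_e4m3fn
            print(f"{key}\t{tensor.dtype}")
```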

Design

Add the Quark format to the ROCm/vllm repo by creating a directory for it under vllm/model_executor/layers/quantization and including the following files.

  1. quark.py: implements and manages quantization configurations and processing for the Quark quantization format for LLMs.
  2. quark_moe.py: implements and manages quantization configurations and processing for the Quark quantization format for LLMs with an MoE structure.
  3. schemes/quark_scheme.py: an abstract base class for the various quantization schemes in Quark, defining the structure for weight creation, the forward pass, and post-loading weight processing (see the sketch after this list).
  4. schemes/quark_fp8.py: provides the implementation of the W8A8Fp8 quantization scheme within the Quark framework.
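
As a rough illustration of item 3, below is a self-contained sketch of what the abstract scheme interface in schemes/quark_scheme.py could cover: weight creation, the forward pass, and post-loading weight processing. The class and method names here are assumptions for illustration and are not vLLM's actual quantization interfaces.

```python
# Illustrative sketch only: an abstract base class for Quark quantization schemes.
# Class and method names are assumptions, not vLLM's actual interfaces.
from abc import ABC, abstractmethod

import torch


class QuarkScheme(ABC):
    """Common structure shared by Quark quantization schemes (e.g. W8A8 FP8)."""

    @abstractmethod
    def create_weights(self, layer: torch.nn.Module, input_size: int,
                       output_size: int, params_dtype: torch.dtype) -> None:
        """Register the quantized weight and scale parameters on the layer."""

    @abstractmethod
    def process_weights_after_loading(self, layer: torch.nn.Module) -> None:
        """Post-process checkpoint tensors after loading (e.g. re-pack or fuse scales)."""

    @abstractmethod
    def apply(self, layer: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        """Run the quantized forward pass of the layer on input x."""
```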

In the first stage, we will integrate FP8 quantization in the Quark format into vLLM, and later integrate other Quark formats, such as INT4/INT8 per_tensor/per_channel/per_group, as needed.
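
To make the first-stage FP8 scheme concrete, here is a minimal, self-contained reference sketch of per-tensor W8A8 FP8 (e4m3) quantization with the weight_scale/input_scale layout shown above. It is an illustrative dequantize-and-matmul reference path under the assumption of static per-tensor scales, not the fused kernel a real vLLM scheme would dispatch to.

```python
# Reference-path sketch of per-tensor W8A8 FP8 (e4m3) quantization.
# Assumptions: static per-tensor scales and float16 activations; illustrative only.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def quantize_per_tensor(t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return (fp8 tensor, float16 scale) such that t is approximately fp8 * scale."""
    scale = t.abs().max().float().clamp(min=1e-12) / FP8_MAX
    q = (t.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q, scale.to(torch.float16)


def w8a8_fp8_linear(x: torch.Tensor, weight_fp8: torch.Tensor,
                    weight_scale: torch.Tensor,
                    input_scale: torch.Tensor) -> torch.Tensor:
    """Reference matmul: quantize x, then dequantize both operands and multiply."""
    x_q = (x.float() / input_scale.float()).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    x_dq = x_q.to(torch.float16) * input_scale
    w_dq = weight_fp8.to(torch.float16) * weight_scale
    return x_dq @ w_dq.t()


# Tiny usage example with random data (input_scale would be calibrated offline in practice).
w = torch.randn(128, 64, dtype=torch.float16)
x = torch.randn(4, 64, dtype=torch.float16)
w_q, w_s = quantize_per_tensor(w)
_, x_s = quantize_per_tensor(x)
y = w8a8_fp8_linear(x, w_q, w_s, x_s)
print(y.shape)  # torch.Size([4, 128])
```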

simon-mo (Collaborator)

In general, we welcome a contribution that converts the Quark format to the standardized format of LLM Compressor (https://github.com/vllm-project/llm-compressor); @robertgshaw2-neuralmagic and @mgoin can help provide pointers.

kewang-xlnx (Contributor, Author)

Hi @simon-mo, thanks for your reply.
Rather than converting to the LLM Compressor format, our team prefers to integrate our Quark format directly.

In the first stage, we will support Quark FP8 quantization in vLLM, which can currently be converted to the LLM Compressor format. However, there may be quantization configurations in the future that are not yet supported by LLM Compressor and therefore cannot be converted.

Additionally, we previously used AutoFP8 as the output format for FP8 models, but AutoFP8 is now being deprecated, which has affected our work. Integrating the Quark format would improve the maintainability and sustainability of our work.

simon-mo (Collaborator) commented Nov 21, 2024

Our main concerns:

  • If there are technical gaps between Quark and the open-source LLM Compressor and compressed-tensors formats, we would like to know the exact design gaps so we can close them.
  • We want to keep vLLM clean. How many changes do you expect inside vLLM, what is the envisioned maintenance load, and will AMD maintain this feature with a one-week turnaround time?
  • Regarding AutoFP8, we have standardized on LLM Compressor and we don't expect major changes going forward. Additionally, compressed-tensors is also accepted as Hugging Face's endorsed solution.

Finally, a prototype PR to visualize the changes would help as well.

kewang-xlnx (Contributor, Author)

Hi @simon-mo,

Thanks for your detailed response.

  1. We will dedicate engineers to maintaining the Quark format in vLLM. If there are new requirements or updates needed, we will ensure they are addressed within a one-week turnaround time. Our team will closely monitor the integration to minimize any maintenance overhead for the vLLM project.

  2. Currently, there are several technical differences between Quark and compressed-tensors. These gaps will only widen as we introduce more quantization configurations and algorithms, which may not align with compressed-tensors. Direct integration of the Quark format ensures that we can immediately support these evolving requirements while enabling faster progress for our work. In such cases, we commit to contributing updates to vLLM to ensure compatibility and alignment.

  3. In addition to vLLM, we are also working with Hugging Face to explore integration of the Quark format into their ecosystem.

To further demonstrate our plan, we will prepare a prototype PR to visualize the changes in vLLM and ensure the modifications are as minimal and clean as possible.

We hope this addresses your concerns and look forward to collaborating further to make Quark format a valuable addition to vLLM.

kewang-xlnx (Contributor, Author)

I have raised a PR #10765. Please feel free to leave your comments.
