[Bug]: FP8 checkpoints with fused linear modules fail to load scales correctly #5915
Labels: bug

Comments
I thought we handled this already? All the FP8 checkpoints have separate QKV scales, and we merged them after weight loading. Is there anything special in Phi-3?

I'm working on a fix for this right now. The issue is that Phi-3 has fused QKV on disk, so there's already only one scale!

@jon-chuang should be resolved now

@jon-chuang Please rebase on main, this was resolved with #7376
Your current environment
🐛 Describe the bug
Description:
When loading FP8 quantized models with merged linear modules (e.g., Phi-3 with merged qkv_proj and up_gate_proj), the scales for each shard are not handled correctly. This occurs because the vLLM FP8 config assumes separate scales for each shard, but merged layers have a single scale.
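The mismatch can be illustrated with a small sketch. This is not vLLM's actual loader code; the key names and scale values are hypothetical, chosen only to show why a loader that expects one scale per shard fails on a checkpoint that fuses the shards on disk:

```python
# Hypothetical sketch: per-shard scales vs. a single fused scale.

# Most FP8 checkpoints store q/k/v separately, so there are three scales:
separate_ckpt = {
    "q_proj.weight_scale": 0.021,
    "k_proj.weight_scale": 0.020,
    "v_proj.weight_scale": 0.023,
}

# Phi-3 stores qkv_proj fused on disk, so there is only ONE scale:
fused_ckpt = {"qkv_proj.weight_scale": 0.022}

def load_shard_scales(ckpt, shard_names):
    """Loader that assumes one scale per shard; breaks on fused checkpoints."""
    return [ckpt[f"{name}.weight_scale"] for name in shard_names]

# Works for the separate-scale checkpoint:
print(load_shard_scales(separate_ckpt, ["q_proj", "k_proj", "v_proj"]))

# The fused checkpoint has no q_proj/k_proj/v_proj keys, so the same
# call raises KeyError -- the failure mode described in this issue.
```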
Steps to Reproduce:
Expected Behavior:
Scales should be correctly loaded for merged linear modules in FP8 checkpoints.
Proposed Fix:
Modify `process_weights_after_loading` in `MergedColumnParallelLinear` and `QKVParallelLinear` to repeat the merged scale during weight loading.

Temporary Workaround:
Apply the following patch in `vllm/model_executor/layers/linear.py`:

cc @robertgshaw2-neuralmagic @comaniac
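The shape of the proposed fix can be sketched as follows. This is a minimal illustration, not vLLM's actual patch; the function name `expand_merged_scale` and the values are assumptions made for the example. The idea is simply that when a fused module ships a single FP8 scale, repeating it once per logical shard lets the existing per-shard scale handling work unchanged:

```python
# Hypothetical sketch of the fix: repeat a single merged scale so that
# downstream code sees one scale per shard, as it already expects.

def expand_merged_scale(scale, num_shards):
    """Duplicate a fused module's single scale, once per logical shard."""
    return [scale] * num_shards

# qkv_proj fuses 3 shards (q, k, v); up_gate_proj fuses 2 (gate, up).
qkv_scales = expand_merged_scale(0.022, num_shards=3)
gate_up_scales = expand_merged_scale(0.015, num_shards=2)

print(qkv_scales)      # [0.022, 0.022, 0.022]
print(gate_up_scales)  # [0.015, 0.015]
```

Repeating the scale is lossless here because all shards were quantized together on disk and therefore genuinely share one scale.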