[Bug]: FP8 checkpoints with fused linear modules fail to load scales correctly #5915
Labels: bug

Comments
I thought we handled this already? All the FP8 checkpoints have separate QKV scales, and we merged them after weight loading. Is there anything special in Phi-3?

I'm working on a fix for this right now. The issue is that Phi-3 has fused QKV on disk, so there's already only one scale!

@jon-chuang should be resolved now

@jon-chuang Please rebase on main, this was resolved with #7376
Your current environment
🐛 Describe the bug
Description:
When loading FP8 quantized models with merged linear modules (e.g., Phi-3 with merged qkv_proj and up_gate_proj), the scales for each shard are not handled correctly. This occurs because the vLLM FP8 config assumes separate scales for each shard, but merged layers have a single scale.
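The mismatch can be illustrated with a small sketch. This is not vLLM's actual loader code; the key names and scale values are hypothetical, chosen only to show why a loader that expects one scale per shard fails on a checkpoint that fuses the shards on disk:

```python
# Hypothetical sketch: per-shard scales vs. a single fused scale.

# Most FP8 checkpoints store q/k/v separately, so there are three scales:
separate_ckpt = {
    "q_proj.weight_scale": 0.021,
    "k_proj.weight_scale": 0.020,
    "v_proj.weight_scale": 0.023,
}

# Phi-3 stores qkv_proj fused on disk, so there is only ONE scale:
fused_ckpt = {"qkv_proj.weight_scale": 0.022}

def load_shard_scales(ckpt, shard_names):
    """Loader that assumes one scale per shard; breaks on fused checkpoints."""
    return [ckpt[f"{name}.weight_scale"] for name in shard_names]

# Works for the separate-scale checkpoint:
print(load_shard_scales(separate_ckpt, ["q_proj", "k_proj", "v_proj"]))

# The fused checkpoint has no q_proj/k_proj/v_proj keys, so the same
# call raises KeyError -- the failure mode described in this issue.
```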
Steps to Reproduce:
Expected Behavior:
Scales should be correctly loaded for merged linear modules in FP8 checkpoints.
Proposed Fix:
Modify `process_weights_after_loading` in `MergedColumnParallelLinear` and `QKVParallelLinear` to repeat the merged scale during weight loading.

Temporary Workaround:
Apply the following patch in `vllm/model_executor/layers/linear.py`:

cc @robertgshaw2-neuralmagic @comaniac
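The shape of the proposed fix can be sketched as follows. This is a minimal illustration, not vLLM's actual patch; the function name `expand_merged_scale` and the values are assumptions made for the example. The idea is simply that when a fused module ships a single FP8 scale, repeating it once per logical shard lets the existing per-shard scale handling work unchanged:

```python
# Hypothetical sketch of the fix: repeat a single merged scale so that
# downstream code sees one scale per shard, as it already expects.

def expand_merged_scale(scale, num_shards):
    """Duplicate a fused module's single scale, once per logical shard."""
    return [scale] * num_shards

# qkv_proj fuses 3 shards (q, k, v); up_gate_proj fuses 2 (gate, up).
qkv_scales = expand_merged_scale(0.022, num_shards=3)
gate_up_scales = expand_merged_scale(0.015, num_shards=2)

print(qkv_scales)      # [0.022, 0.022, 0.022]
print(gate_up_scales)  # [0.015, 0.015]
```

Repeating the scale is lossless here because all shards were quantized together on disk and therefore genuinely share one scale.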