[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 #5874
Conversation
Force-pushed from ee060a6 to 6e71226
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py
moving comments down
This is much better. I still think we have too much tied logic between … I think the following two changes would make a better interface. Remove BasevLLMParameter and weight_loader_v2 …
Couple nits but LGTM
Force-pushed from 5cbd8b6 to 4449d9b
Force-pushed from 1263b08 to bd2ee38
Sorry for the late review. Overall LGTM, so approving to unblock this PR and the follow-up tasks.
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_24.py
...odel_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py
# WEIGHT SCALE
layer_kwargs = {"weight_loader": weight_loader}
# TODO: update create_xxx_parameter functions to return
Is this still a TODO?
Yes. We're not using the create_xxx_parameter methods here, as they are used in places outside of compressed_tensors (e.g. fp8). As a follow-up, once we've updated the other quantization methods to use these new parameters, we can update the create_xxx_parameter functions to return vLLMParameters; they currently return torch.nn.Parameters.
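For context, a rough sketch of what that follow-up could look like (not the actual vLLM code; it assumes the new classes live in vllm.model_executor.parameter, that PerTensorScaleParameter accepts data and weight_loader keyword arguments, and the helper name create_per_tensor_scale_param_v2 is hypothetical):

import torch
from vllm.model_executor.parameter import PerTensorScaleParameter

def create_per_tensor_scale_param_v2(output_partition_sizes, weight_loader):
    # One scale per logical output partition, filled in during weight loading.
    data = torch.empty(len(output_partition_sizes), dtype=torch.float32)
    # Return a vLLM parameter rather than a plain torch.nn.Parameter so that
    # weight_loader_v2 can delegate loading to the parameter itself.
    return PerTensorScaleParameter(data=data, weight_loader=weight_loader)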
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py
-    channelwise = (self.group_size == -1)
-    group_size = input_size if channelwise else self.group_size
+    channelwise = self.group_size == -1
+    group_size = self.group_size if self.group_size != -1 else input_size
Why change this?
The second form is just clearer about what the group_size is and why.
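For illustration only, a quick check (hypothetical values) that the two forms compute the same group_size:

input_size = 4096  # example value
for gs in (-1, 128):
    old = input_size if (gs == -1) else gs   # original form
    new = gs if gs != -1 else input_size     # new form
    assert old == new  # -1 (channelwise) maps to input_size; otherwise keep gs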
@@ -230,14 +225,16 @@ def _get_scheme_from_parts(
                group_size=weight_quant.group_size)

        # Detect If Activation Quantization.
        # TODO @dsikka: clean-up conditions
Is this still a TODO?
Yes. It's a general follow-up on the state of these conditions.
        if is_activation_quantization_format(self.quant_format):
            if self._is_fp8_w8a8(weight_quant, input_quant):
                is_fp8_w8a8_supported = self._check_scheme_supported(
                    CompressedTensorsW8A8Fp8.get_min_capability(), error=False)
                if is_fp8_w8a8_supported:
                    return CompressedTensorsW8A8Fp8(
                        strategy=weight_quant.strategy,
-                       is_static_input_scheme=(not input_quant.dynamic))
+                       is_static_input_scheme=(input_quant
Why is this changing? Won't we always have input_quant if is_activation_quantization_format is true?
It's just an extra check that the activation config details aren't None and were parsed correctly.
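A minimal sketch of the guard being described, assuming input_quant can be None when the activation config is missing or was not parsed (the helper name is hypothetical):

def _is_static_input_scheme(input_quant) -> bool:
    # If input_quant is None, accessing input_quant.dynamic would raise, so
    # treat a missing activation config as "not a static input scheme".
    return bool(input_quant and not input_quant.dynamic)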
Make sure to unblock the multi-GPU A100 model correctness tests. Nice job!
Summary

- Introduce BasevLLMParameter and the following subclasses: ModelWeightParameter, GroupQuantScaleParameter, ChannelQuantScaleParameter, PerTensorScaleParameter, PackedvLLMParameter, for use by the LinearBase classes
- Rework the weight_loader method in each of the LinearBase classes
- Update weight loading for the compressed-tensors quantization configs by adding a weight_loader_v2 method to each of the LinearBase classes. All other quantization configurations still use the original weight loader, as part of the scope of this PR.

FOLLOW UP:
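As a rough illustration of the flow described in the summary above (a hypothetical simplification, not the actual vLLM implementation; BaseParam, TinyColumnParallelLinear, and load_column_parallel_weight are stand-in names): each parameter object carries its own loading logic, and the layer's weight_loader_v2 simply delegates to it.

import torch

class BaseParam(torch.nn.Parameter):
    # Stand-in for BasevLLMParameter: a Parameter that knows how to load itself.
    def __new__(cls, data, weight_loader=None):
        return super().__new__(cls, data, requires_grad=False)

    def __init__(self, data, weight_loader=None):
        self._weight_loader = weight_loader

    def load_column_parallel_weight(self, loaded_weight: torch.Tensor):
        # Real subclasses would also handle sharding, packing, and scale layouts.
        self.data.copy_(loaded_weight)

class TinyColumnParallelLinear(torch.nn.Module):
    # Stand-in for a LinearBase subclass that uses the new loader.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = BaseParam(torch.empty(out_features, in_features))

    def weight_loader_v2(self, param: BaseParam, loaded_weight: torch.Tensor):
        # Delegation: the parameter, not the layer, decides how to place the
        # checkpoint tensor.
        param.load_column_parallel_weight(loaded_weight)

With this shape, supporting a new quantization scheme is mostly a matter of picking the right parameter subclass rather than adding branches to the layer's weight loading code.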