[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 #5874
Conversation
Force-pushed from ee060a6 to 6e71226
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py
moving comments down
This is much better. I still think we have too much tied logic between … I think the following two changes would make a better interface. Remove BasevLLMParameter and weight_loader_v2 …
Couple nits but LGTM
Force-pushed from 5cbd8b6 to 4449d9b
Force-pushed from 1263b08 to bd2ee38
Sorry for the late review. Overall LGTM, so approving to unblock this PR and the follow-up tasks.
...el_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_unquantized.py
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w4a16_24.py
...odel_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a16_fp8.py
...model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_w8a8_fp8.py
# WEIGHT SCALE
layer_kwargs = {"weight_loader": weight_loader}
# TODO: update create_xxx_parameter functions to return
Is this still a TODO?
Yes. We're not using the create_xxx_parameter methods here, as they are used in places outside of compressed_tensors (e.g. fp8). As a follow-up, once we've updated the other quantization methods to use these new parameters, we can update the create_xxx_parameter functions to return vLLMParameters; they currently return torch.nn.Parameters.
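For context, a rough sketch of what that follow-up could look like (not the actual vLLM code; it assumes the new classes live in vllm.model_executor.parameter, that PerTensorScaleParameter accepts data and weight_loader keyword arguments, and the helper name create_per_tensor_scale_param_v2 is hypothetical):

import torch
from vllm.model_executor.parameter import PerTensorScaleParameter

def create_per_tensor_scale_param_v2(output_partition_sizes, weight_loader):
    # One scale per logical output partition, filled in during weight loading.
    data = torch.empty(len(output_partition_sizes), dtype=torch.float32)
    # Return a vLLM parameter rather than a plain torch.nn.Parameter so that
    # weight_loader_v2 can delegate loading to the parameter itself.
    return PerTensorScaleParameter(data=data, weight_loader=weight_loader)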
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py
-    channelwise = (self.group_size == -1)
-    group_size = input_size if channelwise else self.group_size
+    channelwise = self.group_size == -1
+    group_size = self.group_size if self.group_size != -1 else input_size
Why change this?
The second form is just clearer about what the group_size is and why.
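For illustration only, a quick check (hypothetical values) that the two forms compute the same group_size:

input_size = 4096  # example value
for gs in (-1, 128):
    old = input_size if (gs == -1) else gs   # original form
    new = gs if gs != -1 else input_size     # new form
    assert old == new  # -1 (channelwise) maps to input_size; otherwise keep gs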
@@ -230,14 +225,16 @@ def _get_scheme_from_parts(
                group_size=weight_quant.group_size)

        # Detect If Activation Quantization.
        # TODO @dsikka: clean-up conditions
Is this still a TODO?
Yes. It's a general follow-up on the state of these conditions.
        if is_activation_quantization_format(self.quant_format):
            if self._is_fp8_w8a8(weight_quant, input_quant):
                is_fp8_w8a8_supported = self._check_scheme_supported(
                    CompressedTensorsW8A8Fp8.get_min_capability(), error=False)
                if is_fp8_w8a8_supported:
                    return CompressedTensorsW8A8Fp8(
                        strategy=weight_quant.strategy,
-                       is_static_input_scheme=(not input_quant.dynamic))
+                       is_static_input_scheme=(input_quant
Why is this changing? Won't we always have input_quant if is_activation_quantization_format is true?
It's just an extra check that the activation config details aren't None and were parsed correctly.
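A minimal sketch of the guard being described, assuming input_quant can be None when the activation config is missing or was not parsed (the helper name is hypothetical):

def _is_static_input_scheme(input_quant) -> bool:
    # If input_quant is None, accessing input_quant.dynamic would raise, so
    # treat a missing activation config as "not a static input scheme".
    return bool(input_quant and not input_quant.dynamic)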
Make sure to unblock the multi-GPU A100 model correctness tests. Nice job!
Summary

- Introduce BasevLLMParameter and the following subclasses: ModelWeightParameter, GroupQuantScaleParameter, ChannelQuantScaleParameter, PerTensorScaleParameter, PackedvLLMParameter, for use by the LinearBase classes
- Rework the weight_loader method in each of the LinearBase classes
- Update weight loading for the compressed-tensors quantization configs by adding a weight_loader_v2 method to each of the LinearBase classes. All other quantization configurations still use the original weight loader, as part of the scope of this PR.

FOLLOW UP:
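As a rough illustration of the flow described in the summary above (a hypothetical simplification, not the actual vLLM implementation; BaseParam, TinyColumnParallelLinear, and load_column_parallel_weight are stand-in names): each parameter object carries its own loading logic, and the layer's weight_loader_v2 simply delegates to it.

import torch

class BaseParam(torch.nn.Parameter):
    # Stand-in for BasevLLMParameter: a Parameter that knows how to load itself.
    def __new__(cls, data, weight_loader=None):
        return super().__new__(cls, data, requires_grad=False)

    def __init__(self, data, weight_loader=None):
        self._weight_loader = weight_loader

    def load_column_parallel_weight(self, loaded_weight: torch.Tensor):
        # Real subclasses would also handle sharding, packing, and scale layouts.
        self.data.copy_(loaded_weight)

class TinyColumnParallelLinear(torch.nn.Module):
    # Stand-in for a LinearBase subclass that uses the new loader.
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = BaseParam(torch.empty(out_features, in_features))

    def weight_loader_v2(self, param: BaseParam, loaded_weight: torch.Tensor):
        # Delegation: the parameter, not the layer, decides how to place the
        # checkpoint tensor.
        param.load_column_parallel_weight(loaded_weight)

With this shape, supporting a new quantization scheme is mostly a matter of picking the right parameter subclass rather than adding branches to the layer's weight loading code.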