add modules_in_block_to_quantize arg for gptq #1585
Conversation
optimum/gptq/quantizer.py
Outdated
@@ -123,6 +124,8 @@ def __init__(
    cache_block_outputs (`bool`, defaults to `True`):
        Whether to cache block outputs to reuse as inputs for the succeeding block. It allows optimization of non-standard models
        (e.g. ChatGLM) but can require more time.
    inside_layer_modules (`List[List[str]]`, *optional*, defaults to `None`):
        List of module names to quantize inside block_name_to_quantize. If not set, we will quantize all the linear layers.
Suggested change:
- List of module names to quantize inside block_name_to_quantize. If not set, we will quantize all the linear layers.
+ List of module names to quantize inside block_name_to_quantize. This argument is useful to exclude certain linear modules from being quantized. If `None`, we will quantize all the linear layers. Example: `inside_layer_modules=[["self_attention.query_key_value"], ["mlp.dense_h_to_4h"]]`
Why can't we just use `block_name_to_quantize`?
Yeah, this will be a lot better compared to reusing the same name as AutoGPTQ.
optimum/gptq/quantizer.py
Outdated
@@ -77,6 +77,7 @@ def __init__(
    exllama_config: Dict[str, Any] = None,
    max_input_length: Optional[int] = None,
    cache_block_outputs: Optional[bool] = True,
    inside_layer_modules: Optional[List[List[str]]] = None,
Can we have a more explicit name like `modules_in_block_to_quantize`? Why is it a `List[List[str]]` and not a `List[str]`?
It is related to the quantization process. See the comment below.
tests/gptq/test_quantization.py
Outdated
@@ -300,6 +303,16 @@ class GPTQTestNoBlockCaching(GPTQTest):
    EXPECTED_OUTPUTS.add("Hello my name is John, I am a student in the University of")


class GPTQTestInsideLayerModules(GPTQTest):
    # all layers are quantized apart from self_attention.dense
    inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
Couldn't it be:
- inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
+ inside_layer_modules = ["self_attention.query_key_value", "mlp.dense_h_to_4h", "mlp.dense_4h_to_h"]
No, since we want to quantize the layers sequentially: first update `self_attention.query_key_value`, then `mlp.dense_h_to_4h`, and finally `mlp.dense_4h_to_h`. Using a list of lists is useful when we want, for example, to update the q, k, v layers at the same time, since they are independent. However, for `mlp.dense_h_to_4h`, we want q, k, v quantized before it. We get a better quantized model this way; see the sketch below for the intended ordering.
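A minimal sketch of the sequential, group-wise semantics described above. The `fake_quantize_` helper and the module names are hypothetical and only illustrate the ordering; this is not the actual optimum GPTQ implementation.

```python
import torch
from torch import nn

# Each inner list is one quantization step. Modules within a step are independent
# and can be processed together; later steps see the already-quantized earlier layers.
modules_in_block_to_quantize = [
    ["self_attention.query_key_value"],  # step 1: fused q, k, v first
    ["mlp.dense_h_to_4h"],               # step 2: runs after quantized attention
    ["mlp.dense_4h_to_h"],               # step 3: runs after quantized dense_h_to_4h
]

def fake_quantize_(linear: nn.Linear, bits: int = 4) -> None:
    """Toy stand-in for GPTQ: round weights onto a coarse grid in place."""
    scale = linear.weight.abs().max() / (2 ** (bits - 1) - 1)
    linear.weight.data = torch.round(linear.weight.data / scale) * scale

def quantize_block_sequentially(block: nn.Module) -> None:
    for group in modules_in_block_to_quantize:  # sequential steps
        for name in group:                      # independent modules within a step
            fake_quantize_(block.get_submodule(name))
```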
optimum/gptq/quantizer.py
Outdated
@@ -210,8 +215,13 @@ def convert_model(self, model: nn.Module):
            self.block_name_to_quantize = get_block_name_with_pattern(model)
        block_name = self.block_name_to_quantize
        layers_to_be_replaced = get_layers(model, prefix=block_name)
        if self.inside_layer_modules is not None:
            layers_to_keep = sum(self.inside_layer_modules, [])
I don't understand this
We flatten everything into a single list, since `inside_layer_modules` is a list of lists; see the small example below.
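A self-contained illustration of the `sum(..., [])` flattening used in the hunk above; the module names are just examples.

```python
# Flatten a list of lists of module names into a single list.
inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
layers_to_keep = sum(inside_layer_modules, [])
print(layers_to_keep)
# ['self_attention.query_key_value', 'mlp.dense_h_to_4h', 'mlp.dense_4h_to_h']
```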
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
…m/SunMarc/optimum into add-inside_layer_modules-for-gptq
inside_layer_modules for gptq
modules_in_block_to_quantize arg for gptq
LGTM thanks!
What does this PR do?
This PR adds the `inside_layer_modules` arg for GPTQ quantization. It enables the user to quantize specific modules inside a block. This is needed to quantize, and run, quantized models such as https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ. A rough usage sketch follows.
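A rough usage sketch, assuming the final argument name `modules_in_block_to_quantize` on `optimum.gptq.GPTQQuantizer`; the model checkpoint, dataset choice, and module names are illustrative and not taken from the PR itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "bigscience/bloom-560m"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Quantize only these modules inside each block, in this sequential order;
# self_attention.dense is left unquantized.
quantizer = GPTQQuantizer(
    bits=4,
    dataset="c4",
    modules_in_block_to_quantize=[
        ["self_attention.query_key_value"],
        ["mlp.dense_h_to_4h"],
        ["mlp.dense_4h_to_h"],
    ],
)
quantized_model = quantizer.quantize_model(model, tokenizer)
```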