add modules_in_block_to_quantize arg for gptq #1585
Conversation
optimum/gptq/quantizer.py
Outdated
@@ -123,6 +124,8 @@ def __init__(
    cache_block_outputs (`bool`, defaults to `True`):
        Whether to cache block outputs to reuse as inputs for the succeeding block. It allows optimization of non-standard models
        (e.g. ChatGLM) but can require more time.
    inside_layer_modules (`List[List[str]]`, *optional*, defaults to `None`):
        List of module names to quantize inside block_name_to_quantize. If not set, we will quantize all the linear layers.
Suggested change:
- List of module names to quantize inside block_name_to_quantize. If not set, we will quantize all the linear layers.
+ List of module names to quantize inside block_name_to_quantize. This argument is useful to exclude certain linear modules from being quantized. If `None`, we will quantize all the linear layers. Example: `inside_layer_modules=[["self_attention.query_key_value"], ["mlp.dense_h_to_4h"]]`
Why can't we just use `block_name_to_quantize`?
Yeah, this will be a lot better compared to reusing the same name as AutoGPTQ.
optimum/gptq/quantizer.py
Outdated
@@ -77,6 +77,7 @@ def __init__(
    exllama_config: Dict[str, Any] = None,
    max_input_length: Optional[int] = None,
    cache_block_outputs: Optional[bool] = True,
    inside_layer_modules: Optional[List[List[str]]] = None,
Can we have a more explicit name like `modules_in_block_to_quantize`? Why is it a `List[List[str]]` and not a `List[str]`?
It is related to the quantization process. See the comment below.
tests/gptq/test_quantization.py
Outdated
@@ -300,6 +303,16 @@ class GPTQTestNoBlockCaching(GPTQTest):
    EXPECTED_OUTPUTS.add("Hello my name is John, I am a student in the University of")


class GPTQTestInsideLayerModules(GPTQTest):
    # all layers are quantized apart from self_attention.dense
    inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
Couldn't it be:
- inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
+ inside_layer_modules = ["self_attention.query_key_value", "mlp.dense_h_to_4h", "mlp.dense_4h_to_h"]
No, since we want to quantize the layers sequentially: first update `self_attention.query_key_value`, then `mlp.dense_h_to_4h`, and finally `mlp.dense_4h_to_h`. Using a list of lists is useful when we want, for example, to update the q, k, v layers at the same time, since they are independent. However, for `mlp.dense_h_to_4h`, we want q, k, v quantized before it. We get a better quantized model this way; see the sketch below for the intended ordering.
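A minimal sketch of the sequential, group-wise semantics described above. The `fake_quantize_` helper and the module names are hypothetical and only illustrate the ordering; this is not the actual optimum GPTQ implementation.

```python
import torch
from torch import nn

# Each inner list is one quantization step. Modules within a step are independent
# and can be processed together; later steps see the already-quantized earlier layers.
modules_in_block_to_quantize = [
    ["self_attention.query_key_value"],  # step 1: fused q, k, v first
    ["mlp.dense_h_to_4h"],               # step 2: runs after quantized attention
    ["mlp.dense_4h_to_h"],               # step 3: runs after quantized dense_h_to_4h
]

def fake_quantize_(linear: nn.Linear, bits: int = 4) -> None:
    """Toy stand-in for GPTQ: round weights onto a coarse grid in place."""
    scale = linear.weight.abs().max() / (2 ** (bits - 1) - 1)
    linear.weight.data = torch.round(linear.weight.data / scale) * scale

def quantize_block_sequentially(block: nn.Module) -> None:
    for group in modules_in_block_to_quantize:  # sequential steps
        for name in group:                      # independent modules within a step
            fake_quantize_(block.get_submodule(name))
```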
optimum/gptq/quantizer.py
Outdated
@@ -210,8 +215,13 @@ def convert_model(self, model: nn.Module):
            self.block_name_to_quantize = get_block_name_with_pattern(model)
        block_name = self.block_name_to_quantize
        layers_to_be_replaced = get_layers(model, prefix=block_name)
        if self.inside_layer_modules is not None:
            layers_to_keep = sum(self.inside_layer_modules, [])
I don't understand this
We flatten everything into a single list, since `inside_layer_modules` is a list of lists; see the small example below.
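A self-contained illustration of the `sum(..., [])` flattening used in the hunk above; the module names are just examples.

```python
# Flatten a list of lists of module names into a single list.
inside_layer_modules = [["self_attention.query_key_value"], ["mlp.dense_h_to_4h"], ["mlp.dense_4h_to_h"]]
layers_to_keep = sum(inside_layer_modules, [])
print(layers_to_keep)
# ['self_attention.query_key_value', 'mlp.dense_h_to_4h', 'mlp.dense_4h_to_h']
```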
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
…m/SunMarc/optimum into add-inside_layer_modules-for-gptq
inside_layer_modules for gptq
modules_in_block_to_quantize arg for gptq
LGTM thanks!
What does this PR do?
This PR adds the `inside_layer_modules` arg for GPTQ quantization. It enables the user to quantize specific modules inside a block. This is needed to quantize, and run, quantized models such as https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GPTQ. A rough usage sketch follows.
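A rough usage sketch, assuming the final argument name `modules_in_block_to_quantize` on `optimum.gptq.GPTQQuantizer`; the model checkpoint, dataset choice, and module names are illustrative and not taken from the PR itself.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "bigscience/bloom-560m"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Quantize only these modules inside each block, in this sequential order;
# self_attention.dense is left unquantized.
quantizer = GPTQQuantizer(
    bits=4,
    dataset="c4",
    modules_in_block_to_quantize=[
        ["self_attention.query_key_value"],
        ["mlp.dense_h_to_4h"],
        ["mlp.dense_4h_to_h"],
    ],
)
quantized_model = quantizer.quantize_model(model, tokenizer)
```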