
fix for "two devices" issue due to RoPE changes #630

Merged: 4 commits merged into casper-hansen:main on Nov 14, 2024

Conversation

@davedgd (Contributor) commented Oct 4, 2024

This PR moves the rotary embeddings to the appropriate device to avoid having tensors on multiple devices, and adds compatibility with transformers versions 4.43.0 and newer. Developed with the help of @SunMarc (see the history in this closed transformers PR: huggingface/transformers#33742).

This fixes several AutoAWQ issues: #510, #558, #571, #585

Another relevant transformers issue that this would close: huggingface/transformers#32420

Commit: moves the rotary embeddings to the appropriate device to avoid having tensors on multiple devices; adds compatibility with transformers versions 4.43.0 and newer

awq/quantize/quantizer.py
@@ -542,6 +542,7 @@ def init_quant(self, n_samples=128, max_seq_len=512):
         best_device = get_best_device()
         modules[0] = modules[0].to(best_device)
         self.awq_model.move_embed(self.model, best_device)
+        self.awq_model.model.model.rotary_emb.to(best_device)
@SunMarc (Collaborator) commented on this line:
This won't work for all models; you need to create a separate method, just as move_embed does for the embeddings.
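
For illustration, a minimal sketch of that suggestion for a Llama-style model, assuming the rotary embedding is exposed as model.model.rotary_emb (attribute paths differ between architectures, which is exactly why the move needs to live in the model class rather than being hardcoded in quantizer.py):

# Sketch only: extend the per-model move_embed so the rotary embedding
# follows the input embeddings; the paths are illustrative for Llama-style models.
@staticmethod
def move_embed(model, device):
    model.model.embed_tokens = model.model.embed_tokens.to(device)
    model.model.rotary_emb = model.model.rotary_emb.to(device)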

@davedgd (author) replied:
Ah, I see what you mean now @SunMarc (you mentioned this over in the other PR). I'll go through each of the model files and check how this needs to be adjusted for each one. Sorry for not testing more fully: I tried this simple fix you proposed for several different models and it worked fine, but I see that it's not guaranteed to work depending on the model architecture.

@davedgd (author) commented Oct 7, 2024

Thanks again @SunMarc! I checked (as in tried quantizing) all 26 of the models currently supported (including several variants where relevant) and fixed each of these separately only in cases where the fixes were needed, ensuring the code was adjusted to respect the model-specific move_embed method.

Note that a couple of models are broken for various reasons, and I also noticed a somewhat tangential issue with llava and llava_next (seemingly related to inputs_embeds on this line in transformers for llava and llava_next; this can be tested with llava-hf/llava-1.5-7b-hf and llava-hf/llava-v1.6-mistral-7b-hf, respectively, as long as device_map = None is used). I'm not sure how this ought to be handled on the AWQ side specifically.

Notes

(parentheses indicate models/variants tested)

fixed:

  • cohere (command r)
  • gpt_neox (but the model is currently not working due to a separate issue: prev_op <class 'transformers.activations.FastGELUActivation'> not supported yet!)
  • llama (2, 3, 3.1)
  • qwen2 (2, 2.5)
  • stablelm (1; stablelm 2 not working for unrelated reasons)
  • starcoder2

no issues:

  • aquila
  • baichuan
  • gemma (1, 1.1)
  • gemma2
  • gpt_bigcode (bigcode/gpt_bigcode-santacoder)
  • gptj
  • internlm2
  • minicpm
  • mistral (0.1, 0.2, 0.3)
  • mixtral
  • opt
  • phi3 (3, 3.5)
  • qwen (1, 1.5)
  • yi (1, 1.5)

unfixed:

current issue not noted but model not working (i.e., currently not working for unrelated reasons):

@davedgd (author) commented Oct 13, 2024

@SunMarc: Turns out this issue is more complicated than I thought. I was testing this patch further, and unfortunately it does not work reliably, as it can introduce a secondary error depending on the exact size of the model being quantized. For example, I didn't have issues with Qwen2.5 7B, but 32B will fail about halfway through due to some issue with accelerate, likely related to the device_map:

NotImplementedError: Cannot copy out of meta tensor; no data! (similar to the issue reported here on StackOverflow)

EDIT: I think I may have jumped to conclusions here -- I still get the NotImplementedError: Cannot copy out of meta tensor; no data! error even with the other PR in place. I need to test this further, but it looks like there may be two different issues going on at once, which obviously complicates testing. This one may be related to a recent change in AutoAWQ (#607), since setting device_map = None (the old behavior in the 0.2.6 release) seems to resolve the problem.

Commits:
  • remove changes from quantizer.py and test with qwen2 first
  • update move_embed
  • cohere, gpt_neox, stablelm, starcoder2
@casper-hansen (Owner) commented:

Hi @davedgd, I have not had the time to follow up on this specific issue. My understanding is that it went away after we set device_map='auto' by default. Either way, I think it's fine to apply this patch, since it will fix an edge case that Transformers is not likely to fix.

@casper-hansen merged commit 12c91b7 into casper-hansen:main on Nov 14, 2024
@casper-hansen (Owner) commented:

This PR causes the exact problem it was trying to fix. I think I will have to revert this and publish a new version. I wish I had a CI pipeline to check for this, but I currently don't have GPUs to do that.

Traceback (most recent call last):
  File "/workspace/AutoAWQ/examples/quantize.py", line 15, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/AutoAWQ/awq/models/base.py", line 218, in quantize
    self.quantizer = quantizer_cls(
                     ^^^^^^^^^^^^^^
  File "/workspace/AutoAWQ/awq/quantize/quantizer.py", line 69, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant(
                                                  ^^^^^^^^^^^^^^^^
  File "/workspace/AutoAWQ/awq/quantize/quantizer.py", line 570, in init_quant
    self.model(samples.to(next(self.model.parameters()).device))
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 1164, in forward
    outputs = self.model(
              ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 871, in forward
    position_embeddings = self.rotary_emb(hidden_states, position_ids)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/transformers/models/qwen2/modeling_qwen2.py", line 163, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
             ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

@davedgd (author) commented Dec 2, 2024

@casper-hansen: That's interesting -- I did notice that error once as well since this was merged, but I was having no issues as long as I ran my conversions with device_map = None (which I need so that as much CPU offloading as possible occurs):

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False, device_map=None)

I do have 2 x A6000 Ada 48GB GPUs I can test changes with, so please let me know if I can be of any help. I'm sorry for any inconvenience this has caused!

@casper-hansen (Owner) commented:

@davedgd I think CPU offloading can be achieved with the auto setting, AFAIK. Either way, we need a bit more testing to ensure these types of changes work in many scenarios; maybe it could be something triggered only when device_map is None.
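
One rough way to picture that suggestion (a sketch only, not the merged fix) is to gate the extra move in init_quant on whether accelerate has already dispatched the model; transformers sets an hf_device_map attribute on the model when a device_map is used, and the rotary attribute path below is again only illustrative:

# Hypothetical guard, sketch only: skip the manual rotary move when accelerate
# hooks are already managing device placement (i.e., device_map was not None).
if getattr(self.model, "hf_device_map", None) is None:
    self.awq_model.model.model.rotary_emb.to(best_device)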

@davedgd (author) commented Dec 2, 2024

@casper-hansen: To clarify, the reason I was using device_map=None was to avoid various issues I was running into (i.e., I wasn't able to do it successfully using "auto").

In any case, I saw you already reverted the PR a few minutes ago -- I will update to that version, run some tests (using both None, "auto", etc.) with some larger models, and report back. There have been some changes to the transformers package as well since this issue cropped up, so it'll be good to know where things stand.

@davedgd (author) commented Dec 2, 2024

UPDATE 2

I went ahead and ran the conversion in full after noticing the issue where conversion can fail after a certain number of layers complete. It turns out that with the current latest pip version (i.e., v0.2.7.post2), using device_map = 'auto' leads to a failure after about two-thirds of the layers are processed (i.e., during the switchover from GPU to CPU-offloaded layers) with this error: NotImplementedError: Cannot copy out of meta tensor; no data!

To sum it up, the only approach that reliably works -- and this one I am certain of since I have converted dozens of models using it -- is using device_map = None with the pip version and going one layer at a time.

UPDATE 1

@casper-hansen: Update -- I was wrong earlier -- with the latest patch applied, single-GPU conversion fails after being nearly two-thirds complete with device_map = 'auto'. This seems to happen once the on-GPU layers are processed and the switchover to CPU-offloaded layers begins. Long story short, with commit 9f13358 applied, I would once again not be able to convert a model without fully loading it into GPUs (e.g., with Qwen2.5-72B-Instruct, I expect CPU offloading will fail even with both GPU1 and GPU2 being used simultaneously, since even 96GB VRAM is insufficient to complete the task -- I'm testing this now)...

PS. I would argue that reverting this patch is worse in some ways. While it's true that you can complete quantization if you can fully load the model into GPUs, the reversion also makes it impossible to do the conversion with CPU offloading if you don't have sufficient VRAM (i.e., it's win some, lose some).

Initial Comments

@casper-hansen: Starting with a TLDR summary, you are correct that this patch is better off reverted, especially for multi-GPU. When device_map = 'auto' is used, it's possible to successfully run everything with either one or two GPUs (i.e., with and without CPU offloading). In contrast, with device_map = None, the quantization always fails in the newer commit.

With the prior (current) release, the device_map did not matter if a single GPU was being used, but this has the side effect of breaking multi-GPU completely, which in hindsight is to be expected with the way things were being moved around.

Long story short, yes, reverting it is better -- sorry for the headache...

PS. I will allow the multi-GPU run to fully process using the new commit, and then go back and confirm this with the single-GPU run as well. I will follow up in ~2 hours, but I fully expect everything below to be accurate based on significant prior testing (i.e., if the issue does not crop up by the 2nd or 3rd layer, it won't crop up at all). Follow-up: multi-GPU completed successfully; still waiting on single-GPU results.


Following up with my findings (using Qwen2.5-32B-Instruct, which requires more than 48GB VRAM to process and either two GPUs or one GPU plus CPU offloading). All results use this setup (with device_map and number of GPUs varying, with both GPUs being A6000 Adas):

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False, device_map=...)

Note that SUCCESS is based on reaching at least the third layer, which generally means everything will work (I did not wait out the full hour for each of these, but I plan to go back to do so).


AutoAWQ v0.2.7.post2

One GPU

  • device_map = 'auto': FAILURE -- after 43/64 layers completed (seemingly after the GPU loaded layers are fully emptied and the CPU offloaded ones start being needed) -- NotImplementedError: Cannot copy out of meta tensor; no data!
  • device_map = None: SUCCESS (with only one layer in GPU at a time)

Two GPUs

  • device_map = 'auto': FAILURE -- RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
  • device_map = None: FAILURE -- Starts with GPU1 for the first layer and then fails on the second layer when starting to load onto GPU2 -- RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

AutoAWQ v0.2.7.post2 with Commit 9f13358

One GPU

  • device_map = 'auto': FAILURE -- after 43/64 layers completed (seemingly after the GPU loaded layers are fully emptied and the CPU offloaded ones start being needed) -- NotImplementedError: Cannot copy out of meta tensor; no data!
  • device_map = None: FAILURE -- RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Two GPUs

  • device_map = 'auto': SUCCESS (with VRAM fully utilized and split evenly across GPU1 and GPU2)
  • device_map = None: FAILURE -- RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

@davedgd (author) commented Dec 2, 2024

Adding the full failure for device_map = 'auto' using a single GPU (when the model is too big to fully fit into VRAM) -- the run is prefaced by the following warning when initiated -- Some parameters are on the meta device because they were offloaded to the cpu. (same as 652):

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[2], line 13
     10 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
     12 # Quantize
---> 13 model.quantize(tokenizer, quant_config=quant_config)
     15 # Save quantized model
     16 model.save_quantized(quant_path)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/models/base.py:239, in BaseAWQForCausalLM.quantize(self, tokenizer, quant_config, calib_data, split, text_column, duo_scaling, export_compatible, apply_clip, n_parallel_calib_samples, max_calib_samples, max_calib_seq_len, max_chunk_memory, quantizer_cls, **kwargs)
    216     self.quant_config.modules_to_not_convert = self.modules_to_not_convert
    218 self.quantizer = quantizer_cls(
    219     self,
    220     self.model,
   (...)
    237     **kwargs,
    238 )
--> 239 self.quantizer.quantize()
    241 self.is_quantized = True

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:159, in AwqQuantizer.quantize(self)
    154 # Filter out the linear layers we don't want to exclude
    155 named_linears = exclude_layers_to_not_quantize(
    156     named_linears, self.modules_to_not_convert
    157 )
--> 159 input_feat = self._get_input_feat(self.modules[i], named_linears)
    160 clear_memory()
    162 # [STEP 2]: Compute and apply scale list

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:633, in AwqQuantizer._get_input_feat(self, layer, named_linears)
    626 # get output as next layer's input
    627 
    628 # Sanitize the kwargs in case we use transformers version that contains
    629 # kwargs that are not handled by the module.
    630 # Useful for trust_remote_code models.
    631 module_kwargs = self._sanitize_kwargs(self.module_kwargs, layer)
--> 633 self.inps = self._module_forward(self.inps, layer, module_kwargs)
    634 for h in handles:
    635     h.remove()

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:247, in AwqQuantizer._module_forward(self, x, module, module_kwargs)
    241 @torch.no_grad()
    242 def _module_forward(
    243     self, x: torch.Tensor, module: torch.nn.Module, module_kwargs: Dict
    244 ) -> torch.Tensor:
    245     if self.n_parallel_calib_samples is None:
    246         # runs through all samples at once
--> 247         module_output = module(x, **module_kwargs)
    248         if isinstance(module_output, tuple):
    249             module_output = module_output[0]

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164 def new_forward(module, *args, **kwargs):
--> 165     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    166     if module._hf_hook.no_grad:
    167         with torch.no_grad():

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/hooks.py:364, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    353             self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
    355         set_module_tensor_to_device(
    356             module,
    357             name,
   (...)
    361             tied_params_map=self.tied_params_map,
    362         )
--> 364 return send_to_device(args, self.execution_device), send_to_device(
    365     kwargs, self.execution_device, skip_keys=self.skip_keys
    366 )

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/utils/operations.py:175, in send_to_device(tensor, device, non_blocking, skip_keys)
    173         return tensor.to(device)
    174 elif isinstance(tensor, (tuple, list)):
--> 175     return honor_type(
    176         tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
    177     )
    178 elif isinstance(tensor, Mapping):
    179     if isinstance(skip_keys, str):

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/utils/operations.py:82, in honor_type(obj, generator)
     80     return type(obj)(*list(generator))
     81 else:
---> 82     return type(obj)(generator)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/utils/operations.py:176, in <genexpr>(.0)
    173         return tensor.to(device)
    174 elif isinstance(tensor, (tuple, list)):
    175     return honor_type(
--> 176         tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
    177     )
    178 elif isinstance(tensor, Mapping):
    179     if isinstance(skip_keys, str):

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/utils/operations.py:156, in send_to_device(tensor, device, non_blocking, skip_keys)
    154     device = "xpu:0"
    155 try:
--> 156     return tensor.to(device, non_blocking=non_blocking)
    157 except TypeError:  # .to() doesn't accept non_blocking as kwarg
    158     return tensor.to(device)

NotImplementedError: Cannot copy out of meta tensor; no data!

@davedgd (author) commented Dec 3, 2024

One more update: when trying to convert Qwen2.5-72B-Instruct using my two A6000 Adas with device_map = 'auto', some CPU offloading occurs and the kernel dies every time before any processing occurs. For contrast, on the current release version (i.e., v0.2.7.post2 without the latest commit), I have been able to convert this model using a single GPU (never tried with multiple) with CPU offload, albeit with device_map = None.

I'm about to try constraining the run to a single GPU with device_map = 'auto' on the most recent commit to see what happens and will edit this later. Update: that failed quickly...

Long story short, here's where things stand:

  • AutoAWQ v0.2.7.post2: With a single GPU and sufficient system memory swap space, you can quantize models that won't fit in VRAM using either device_map. Multi-GPU is indeed broken.
  • AutoAWQ v0.2.7.post2 with Commit 9f13358: You can only quantize models that fully fit into available VRAM using device_map = 'auto'. If the model is bigger than your GPUs, it will always fail.

Personally, I feel the second situation (i.e., with the patch that reverts this pull) is far worse for me, as I can at least quantize larger models with CPU offload and some extra swap space, but with the latest commit, I'd no longer be able to work with models larger than 32B unless I had more than 96GB VRAM to spare. More broadly, I feel more people would be able to actually perform greater-than-VRAM conversions without the commit in place, as anyone with a single GPU could still do them, whereas with the commit, you're constrained to models that fully fit into your available VRAM.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[2], line 13
     10 tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
     12 # Quantize
---> 13 model.quantize(tokenizer, quant_config=quant_config)
     15 # Save quantized model
     16 model.save_quantized(quant_path)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/models/base.py:239, in BaseAWQForCausalLM.quantize(self, tokenizer, quant_config, calib_data, split, text_column, duo_scaling, export_compatible, apply_clip, n_parallel_calib_samples, max_calib_samples, max_calib_seq_len, max_chunk_memory, quantizer_cls, **kwargs)
    216     self.quant_config.modules_to_not_convert = self.modules_to_not_convert
    218 self.quantizer = quantizer_cls(
    219     self,
    220     self.model,
   (...)
    237     **kwargs,
    238 )
--> 239 self.quantizer.quantize()
    241 self.is_quantized = True

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:166, in AwqQuantizer.quantize(self)
    162 # [STEP 2]: Compute and apply scale list
    163 module_config: List[Dict] = self.awq_model.get_layers_for_scaling(
    164     self.modules[i], input_feat, self.module_kwargs
    165 )
--> 166 scales_list = [
    167     self._search_best_scale(self.modules[i], **layer)
    168     for layer in module_config
    169 ]
    170 apply_scale(self.modules[i], scales_list, input_feat_dict=input_feat)
    171 scales_list = append_str_prefix(
    172     scales_list, get_op_name(self.model, self.modules[i]) + "."
    173 )

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:167, in <listcomp>(.0)
    162 # [STEP 2]: Compute and apply scale list
    163 module_config: List[Dict] = self.awq_model.get_layers_for_scaling(
    164     self.modules[i], input_feat, self.module_kwargs
    165 )
    166 scales_list = [
--> 167     self._search_best_scale(self.modules[i], **layer)
    168     for layer in module_config
    169 ]
    170 apply_scale(self.modules[i], scales_list, input_feat_dict=input_feat)
    171 scales_list = append_str_prefix(
    172     scales_list, get_op_name(self.model, self.modules[i]) + "."
    173 )

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:330, in AwqQuantizer._search_best_scale(self, module, prev_op, layers, inp, module2inspect, kwargs)
    327     fp16_output = self._module_forward(inp, module2inspect, module_kwargs)
    329 # [STEP 4]: Compute loss
--> 330 best_scales = self._compute_best_scale(
    331     inp, w_mean, x_mean, module2inspect, layers, fp16_output, module_kwargs
    332 )
    334 return (
    335     get_op_name(module, prev_op),
    336     tuple([get_op_name(module, m) for m in layers]),
    337     best_scales,
    338 )

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:395, in AwqQuantizer._compute_best_scale(self, x, w_mean, x_mean, module2inspect, linears2scale, fp16_output, kwargs)
    390     fc.weight.data = (
    391         self.pseudo_quantize_tensor(fc.weight.data)[0] / scales_view
    392     )
    394 # W * X
--> 395 int_w_output = self._module_forward(x, module2inspect, kwargs)
    397 # compute mean squared error (L2 norm)
    398 loss = self._compute_loss(fp16_output, int_w_output, device)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/awq/quantize/quantizer.py:247, in AwqQuantizer._module_forward(self, x, module, module_kwargs)
    241 @torch.no_grad()
    242 def _module_forward(
    243     self, x: torch.Tensor, module: torch.nn.Module, module_kwargs: Dict
    244 ) -> torch.Tensor:
    245     if self.n_parallel_calib_samples is None:
    246         # runs through all samples at once
--> 247         module_output = module(x, **module_kwargs)
    248         if isinstance(module_output, tuple):
    249             module_output = module_output[0]

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/miniforge3/envs/awq/lib/python3.10/site-packages/transformers/models/qwen2/modeling_qwen2.py:223, in Qwen2MLP.forward(self, hidden_state)
    222 def forward(self, hidden_state):
--> 223     return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.66 GiB. GPU 0 has a total capacity of 47.40 GiB of which 1.33 GiB is free. Including non-PyTorch memory, this process has 46.07 GiB memory in use. Of the allocated memory 44.12 GiB is allocated by PyTorch, and 1.45 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

@casper-hansen (Owner) commented Dec 3, 2024

Ok, I see that neither solution is sufficient. The best solution would be to revert to device_map=None with the old behavior that I designed AutoAWQ for: quantizing one layer at a time, with the other layers on any device.

The question is how to achieve this after HF broke this behavior?

  • with your PR, multi-GPU is broken, but single-GPU works in both device_map cases
  • without your PR, multi-GPU works but model must fit into GPU memory

So the question is really how we can use your PR while fixing the multi-GPU usage and switch the device_map default to None?

@davedgd (author) commented Dec 3, 2024

@casper-hansen: Your summary is spot on, and I agree that one layer at a time is the safe bet in many ways, since it allows for the lowest GPU requirement to still process large models.

Quick update: in my earlier post, I only evaluated the runs for a couple of layers, since each conversion takes ~50 minutes. Long story short, as per the update above, please note that device_map = 'auto' does fail with the pip release due to the NotImplementedError: Cannot copy out of meta tensor; no data! error noted in an earlier comment, which seems to trace back to accelerate.

For the time being, the solution I'm certain works reliably for both smaller- and larger-than-VRAM models is device_map = None with the pip release, which I've tested dozens of times across a variety of models.

In terms of a temporary workaround, I would encourage recommending a default of device_map = None on the pip release, with a single GPU (and plenty of system RAM and/or swap space). In a multi-GPU system, it is necessary to restrict the run to only one GPU, e.g.:

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

(or setting it on the command line in a similar fashion, e.g., CUDA_VISIBLE_DEVICES=0 python ...; failing to restrict the devices with this configuration leads to: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!)

Lastly, regarding the broken HF behavior, I'm curious whether @SunMarc can take another look at this, as I'm also at a bit of a loss on how this can ultimately be fixed to allow for more variety in terms of multi-GPU configurations, device_map = 'auto', etc. All of this worked fine prior to the HF change from a few months back. =|
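
Putting the workaround together, a minimal single-GPU recipe based on the settings used throughout this thread (the model path, output path, and quant_config values are placeholders, not a definitive configuration):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # restrict to one GPU before loading torch/awq

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-32B-Instruct"   # placeholder
quant_path = "Qwen2.5-32B-Instruct-AWQ"    # placeholder
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# device_map=None keeps the one-layer-at-a-time behavior with CPU offload
model = AutoAWQForCausalLM.from_pretrained(
    model_path, low_cpu_mem_usage=True, use_cache=False, device_map=None
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)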

@casper-hansen (Owner) commented:

I have a fix ready in #668; I just need to add support for the remaining models.

@wellcasa commented Dec 14, 2024

I am using the latest package, autoawq-0.2.7.post3, to quantize Qwen 72B, but I still encounter this issue. I am using 4 H20 GPUs for quantization. Initialization:

model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype='auto')

@casper-hansen (Owner) commented:

Please use the updated example found in examples/quantize.py
