
Quantizing a model fails with RuntimeError: Expected all tensors to be on the same device #558

Open
ShelterWFF opened this issue Jul 28, 2024 · 23 comments

Comments

@ShelterWFF

ShelterWFF commented Jul 28, 2024

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AwqConfig, AutoConfig
import torch

model_path = ''
quant_path = ''
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    # trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_cache=False,
    # device_map='cuda:0',
    # torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
@ShelterWFF
Author

Downgrading the transformers version fixed it.

@r4dm

r4dm commented Jul 29, 2024

Same problem here. Version 0.2.6 installs transformers 4.43.3, which raises this error during quantization. It is the quantization code that fails; inference works fine. Tested on two different machines. Reinstalling transformers 4.42.4 solves it.

It should not be like this, @casper-hansen.
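If you want to fail fast before a long quantization run, here is a minimal sketch of a version guard. The 4.43.0 threshold is an assumption drawn from the versions reported in this thread, not from the AutoAWQ docs:

import transformers
from packaging import version

# The thread reports quantization breaking with transformers >= 4.43.x
# and working with 4.42.4, so warn before starting a long run.
if version.parse(transformers.__version__) >= version.parse("4.43.0"):
    print(
        f"transformers {transformers.__version__} detected; this thread suggests "
        "downgrading, e.g. pip install transformers==4.42.4"
    )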

@casper-hansen
Owner

The default loading of the model in transformers seems to have changed recently. For now, you can just use device_map when needed.
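For reference, a minimal sketch of what that looks like with the script from the original post (the device string is just an example; use whatever single device you quantize on):

# Load the model onto one explicit device instead of relying on the default placement
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="cuda:0",  # keep all weights on a single device
)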

@r4dm

r4dm commented Jul 29, 2024

This did not help in my case either. I quantize a 70B model on a single A100, and with the default settings this used to work fine. With the new versions of autoawq and transformers, if I set the device map to the CPU, a different error appears:

Traceback (most recent call last):
  File "/home/jupyter/training/to_awq.py", line 17, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/administrator/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'quantize'. Did you mean: 'dequantize'?

And if I set the device map to the GPU, I run out of memory (OOM).

As noted above, downgrading transformers solves the problem. That is fine for me, but it does not seem acceptable for general use.

@casper-hansen casper-hansen reopened this Jul 29, 2024
@FoolMark

FoolMark commented Aug 1, 2024

Similar issue with the following environment:

transformers 4.42.4
AutoAWQ 0.2.6+cu118 
AutoAWQ_Kernels 0.0.6+cu118

Loading with device_map="auto":

model = AutoAWQForCausalLM.from_pretrained(config.model_path, device_map="auto", safetensors=True)

The following error was solved by specifying the device explicitly:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

But what if the model is larger than 80 GB (e.g. Qwen2-72B)?
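One hedged answer, based on how AutoAWQ's quantization loop is commonly described rather than anything confirmed in this thread: leave device_map unset so the weights stay in CPU RAM, since quantization processes the model layer by layer and only needs one layer on the GPU at a time. A sketch, reusing config.model_path from the snippet above:

# Load into CPU RAM (no device_map); AutoAWQ moves layers to the GPU one at a time
# during quantization, so the full model never needs to fit on a single GPU.
model = AutoAWQForCausalLM.from_pretrained(
    config.model_path,
    safetensors=True,
    low_cpu_mem_usage=True,
)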

@billvsme

billvsme commented Aug 1, 2024

To convert meta-llama/Meta-Llama-3.1-70B-Instruct, transformers must be upgraded to 4.43.x. When I use 4.43.3, I get the same error.

@seolhokim

@billvsme I'm using meta-llama/Meta-Llama-3.1-70B-Instruct and I get the same error even with transformers==4.43.3 and 4.44.0. Do I need to share my entire environment?

@supa-thibaud

Same issue. @r4dm's solution doesn't work for me, as I'm trying to quantize a fine-tuned Llama 3.1 model.

@William-Wildridge

Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces an issue with rope_scaling:

ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

Setting device_map="auto" when loading the model unfortunately doesn't work with the latest transformers either.

@r4dm

r4dm commented Aug 19, 2024

The temporary solution works only with Llama 3, not 3.1, because support for 3.1 was added in transformers v4.43.0.

@davedgd
Contributor

davedgd commented Sep 5, 2024

For anyone watching this, consider also tracking this issue in transformers: #32420

@bkutasi

bkutasi commented Sep 12, 2024

Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work. CPU+GPU quantization for Llama 3.1 is still broken as far as I know.

@davedgd
Contributor

davedgd commented Sep 27, 2024

I have a potential fix that may remedy both the "two devices" error and the rope_scaling issue (by way of allowing for a newer transformers version). Feel free to try out the patch here:

https://github.com/davedgd/transformers/tree/patch-1

e.g.,

pip install git+https://github.com/davedgd/transformers@patch-1

@ArlanCooper

+1

@steveepreston

Same issue

@davedgd
Contributor

davedgd commented Jan 10, 2025

Same issue

This was fully fixed in recent versions. Can you confirm which version of autoawq you are using and provide a code sample? I can probably help you resolve it.

@steveepreston

steveepreston commented Jan 10, 2025

Hey @davedgd

My mistake, I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.
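For anyone hitting the same symptom outside AutoAWQ, here is a minimal sketch of both fixes described above (the model id and prompt are placeholders, not from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "..."  # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda:0")

# Fix 1: move the tokenized inputs to the model's device before calling generate()
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)

# Fix 2: pipeline() handles placement of the inputs onto the model's device for you
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Hello", max_new_tokens=32))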

@davedgd
Contributor

davedgd commented Jan 10, 2025

Hey @davedgd

My mistake, I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.

No worries — glad to hear you figured it out!

@steveepreston

@davedgd Thank you!

@yupbank

yupbank commented Jan 20, 2025

still facing this problem :(

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 183, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
And this is my code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_path = 'Qwen/Qwen2-VL-7B-Instruct'
torch.cuda.empty_cache()

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, torch_dtype=torch.float16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize (quant_config and quant_path are defined elsewhere in the script, not shown)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

@davedgd
Contributor

davedgd commented Jan 21, 2025

still facing this problem :( [quoted error and code omitted; see the previous comment]

Try it without device_map="auto" in AutoAWQForCausalLM.from_pretrained, e.g.,

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)

You can probably use your other args, but they shouldn't be needed. Definitely avoid device_map.

@yupbank

yupbank commented Jan 21, 2025

It worked without device_map="auto"... which is mysterious to me. How come?

@davedgd
Contributor

davedgd commented Jan 21, 2025

It worked without device_map="auto"... which is mysterious to me. How come?

The answer is technical, but long story short: the adjustment was made in the multi-GPU fix by @casper-hansen a few versions back, in the 0.2.7 releases. Not using device_map="auto" also aligns with the current examples:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
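For convenience, the recommended pattern (essentially the script from the top of this thread, with no device_map; check the linked example for the authoritative version) looks like:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = ''   # path or hub id of the model to quantize
quant_path = ''   # where to save the quantized model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model (no device_map; AutoAWQ handles device placement during quantization)
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)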
