
Quantizing a model fails with RuntimeError: Expected all tensors to be on the same device #558

Open
ShelterWFF opened this issue Jul 28, 2024 · 23 comments

Comments

@ShelterWFF

ShelterWFF commented Jul 28, 2024

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from transformers import AwqConfig, AutoConfig
import torch

model_path = ''
quant_path = ''
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    # trust_remote_code=True,
    low_cpu_mem_usage=True,
    use_cache=False,
    # device_map='cuda:0',
    # torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
@ShelterWFF
Author

Downgrading the transformers version fixed it.

@r4dm

r4dm commented Jul 29, 2024

Same problem here. Version 0.2.6 installs transformers 4.43.3, which raises this error during quantization. It is the quantization code that fails; inference works fine. Tested on two different machines. Reinstalling transformers 4.42.4 solves it.

It should not be like this, @casper-hansen.
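If you want to fail fast before a long quantization run, here is a minimal sketch of a version guard. The 4.43.0 threshold is an assumption drawn from the versions reported in this thread, not from the AutoAWQ docs:

import transformers
from packaging import version

# The thread reports quantization breaking with transformers >= 4.43.x
# and working with 4.42.4, so warn before starting a long run.
if version.parse(transformers.__version__) >= version.parse("4.43.0"):
    print(
        f"transformers {transformers.__version__} detected; this thread suggests "
        "downgrading, e.g. pip install transformers==4.42.4"
    )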

@casper-hansen
Owner

The default loading of the model in transformers seems to have changed recently. For now, you can just use device_map when needed.
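For reference, a minimal sketch of what that looks like with the script from the original post (the device string is just an example; use whatever single device you quantize on):

# Load the model onto one explicit device instead of relying on the default placement
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    use_cache=False,
    device_map="cuda:0",  # keep all weights on a single device
)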

@r4dm

r4dm commented Jul 29, 2024

This did not help in my case either. I quantize a 70B model on a single A100, and with the default settings this used to work fine. With the new versions of autoawq and transformers, if I set the device map to the CPU, a different error appears:

Traceback (most recent call last):
  File "/home/jupyter/training/to_awq.py", line 17, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/administrator/miniconda3/envs/lmdeploy/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaForCausalLM' object has no attribute 'quantize'. Did you mean: 'dequantize'?

And if I set the device map to the GPU, I run out of memory (OOM).

As noted above, downgrading transformers solves the problem. That is fine for me, but it does not seem acceptable for general use.

@casper-hansen casper-hansen reopened this Jul 29, 2024
@FoolMark

FoolMark commented Aug 1, 2024

Similar issue with the following environment:

transformers 4.42.4
AutoAWQ 0.2.6+cu118 
AutoAWQ_Kernels 0.0.6+cu118

Loading with device_map="auto":

model = AutoAWQForCausalLM.from_pretrained(config.model_path, device_map="auto", safetensors=True)

The following error was solved by specifying the device explicitly:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

But what if the model is larger than 80 GB (e.g. Qwen2-72B)?
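One hedged answer, based on how AutoAWQ's quantization loop is commonly described rather than anything confirmed in this thread: leave device_map unset so the weights stay in CPU RAM, since quantization processes the model layer by layer and only needs one layer on the GPU at a time. A sketch, reusing config.model_path from the snippet above:

# Load into CPU RAM (no device_map); AutoAWQ moves layers to the GPU one at a time
# during quantization, so the full model never needs to fit on a single GPU.
model = AutoAWQForCausalLM.from_pretrained(
    config.model_path,
    safetensors=True,
    low_cpu_mem_usage=True,
)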

@billvsme

billvsme commented Aug 1, 2024

To convert meta-llama/Meta-Llama-3.1-70B-Instruct, transformers must be upgraded to 4.43.x. When I use 4.43.3, I get the same error.

@seolhokim

@billvsme I'm using meta-llama/Meta-Llama-3.1-70B-Instruct and I get the same error even with transformers==4.43.3 and 4.44.0. Do I need to share my entire environment?

@supa-thibaud

Same issue. @r4dm's solution doesn't work for me, as I'm trying to quantize a fine-tuned Llama 3.1 model.

@William-Wildridge

Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces an issue with rope_scaling:

ValueError: rope_scaling must be a dictionary with two fields, type and factor, got {'factor': 8.0, 'high_freq_factor': 4.0, 'low_freq_factor': 1.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}

Setting device_map="auto" when loading the model unfortunately doesn't work with the latest transformers either.

@r4dm

r4dm commented Aug 19, 2024

The temporary solution works only with Llama 3, not 3.1, because support for 3.1 was added in transformers v4.43.0.

@davedgd
Contributor

davedgd commented Sep 5, 2024

For anyone watching this, consider also tracking this issue in transformers: #32420

@bkutasi

bkutasi commented Sep 12, 2024

Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work. CPU+GPU quantization for Llama 3.1 is still broken as far as I know.

@davedgd
Contributor

davedgd commented Sep 27, 2024

I have a potential fix that may remedy both the "two devices" error and the rope_scaling issue (by way of allowing for a newer transformers version). Feel free to try out the patch here:

https://github.com/davedgd/transformers/tree/patch-1

e.g.,

pip install git+https://github.com/davedgd/transformers@patch-1

@ArlanCooper

+1

@steveepreston

Same issue

@davedgd
Contributor

davedgd commented Jan 10, 2025

Same issue

This was fully fixed in recent versions. Can you confirm which version of autoawq you are using and provide a code sample? I can probably help you resolve it.

@steveepreston

steveepreston commented Jan 10, 2025

Hey @davedgd

My mistake, I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.
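For anyone hitting the same symptom outside AutoAWQ, here is a minimal sketch of both fixes described above (the model id and prompt are placeholders, not from this thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "..."  # placeholder: any causal LM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="cuda:0")

# Fix 1: move the tokenized inputs to the model's device before calling generate()
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)

# Fix 2: pipeline() handles placement of the inputs onto the model's device for you
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Hello", max_new_tokens=32))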

@davedgd
Contributor

davedgd commented Jan 10, 2025

Hey @davedgd

My mistake, I'm not using AutoAWQ. I was calling model.generate() on an input that was still on the CPU; switching to pipeline() fixed it.

No worries — glad to hear you figured it out!

@steveepreston

@davedgd Thank you!

@yupbank

yupbank commented Jan 20, 2025

still facing this problem :(

  File "/usr/local/lib/python3.10/dist-packages/transformers/models/qwen2_vl/modeling_qwen2_vl.py", line 183, in forward
    freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(2, 3)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
And this is my code:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_path = 'Qwen/Qwen2-VL-7B-Instruct'
torch.cuda.empty_cache()

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, safetensors=True, torch_dtype=torch.float16, device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize (quant_config and quant_path are defined elsewhere in the script, not shown)
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

@davedgd
Contributor

davedgd commented Jan 21, 2025

still facing this problem :( [quoted error and code omitted; see the previous comment]

Try it without device_map="auto" in AutoAWQForCausalLM.from_pretrained, e.g.,

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)

You can probably use your other args, but they shouldn't be needed. Definitely avoid device_map.

@yupbank

yupbank commented Jan 21, 2025

It worked without device_map="auto"... which is mysterious to me. How come?

@davedgd
Contributor

davedgd commented Jan 21, 2025

It worked without device_map="auto"... which is mysterious to me. How come?

The answer is technical, but long story short: the adjustment was made in the multi-GPU fix by @casper-hansen a few versions back, in the 0.2.7 releases. Not using device_map="auto" also aligns with the current examples:

https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py
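For convenience, the recommended pattern (essentially the script from the top of this thread, with no device_map; check the linked example for the authoritative version) looks like:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = ''   # path or hub id of the model to quantize
quant_path = ''   # where to save the quantized model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model (no device_map; AutoAWQ handles device placement during quantization)
model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, use_cache=False)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model and tokenizer
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)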
