Quantizing model reports error, RuntimeError: Expected all tensors to be on the same device #558
Comments
Reduce the transformers version. |
Same problem here. Version 0.2.6 installs transformers 4.43.3, which gives an error during quantization. In this case it is the quantization code that does not work, while the inference code works fine. Tested on two different machines. It is solved by reinstalling transformers 4.42.4, but it should not be like this @casper-hansen |
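A minimal sketch of the downgrade workaround described above; the exact pins are the ones quoted in this thread and may not fit every setup:

```python
# Hedged sketch of the downgrade workaround from this thread (shell commands
# shown as comments; adjust to your environment):
#   pip install autoawq==0.2.6
#   pip install transformers==4.42.4   # overrides the 4.43.3 pulled in by autoawq
import transformers

# Verify the downgraded build is actually the one being imported before quantizing.
print(transformers.__version__)  # expected: 4.42.4
```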
The default loading of the model in |
This also did not help in my case. I quantize a 70B model on a single A100, and with default settings this used to work fine. With the new versions of autoawq and transformers, if I specify the device map on the CPU, another error appears; if I specify the device map on the GPU, I get an OOM (sketched below). As indicated above, the problem is solved by downgrading transformers. For me this is not a problem, but for general use it does not seem normal. |
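For context, a sketch of the load variants this comment describes; `model_path` is a placeholder, the outcomes in the comments are the ones reported above, and whether `device_map` is forwarded unchanged may vary by autoawq version:

```python
from awq import AutoAWQForCausalLM

model_path = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder 70B checkpoint

# Default load: reported to hit the "two devices" RuntimeError on transformers 4.43.x.
model = AutoAWQForCausalLM.from_pretrained(model_path)

# Forcing everything onto CPU: reported to raise a different error during quantization.
# model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cpu")

# Forcing everything onto a single GPU: OOMs for a 70B model on one A100.
# model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="cuda:0")
```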
Similar issue with the following environments: loading with device_map="auto" hits the error; it is solved by specifying the device explicitly. |
But what if the model is larger than 80 GB (e.g. qwen2-72b)? |
convert |
@billvsme I'm using |
same issue |
Unfortunately, simply installing transformers==4.42.4 doesn't work for Llama 3.1, as this reintroduces a ValueError about rope_scaling. Setting device_map="auto" in the model loading also doesn't work with the latest transformers. |
The temporary solution works only with Llama 3, not 3.1, because support for 3.1 was added in transformers v4.43.0. |
For anyone watching this, consider also tracking this issue in transformers: #32420 |
Same issue, but if you have enough VRAM or multiple GPUs, you can set device_map="auto" and it should work (see the sketch below). CPU+GPU quantization for Llama 3.1 is still broken as far as I know. |
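A sketch of the multi-GPU path this comment reports as working, assuming the full fp16 weights fit across the visible GPUs; the model name is a placeholder:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # placeholder

# Shard the fp16 model across all visible GPUs before quantizing; this avoids
# the CPU+GPU split that triggers the "two devices" error.
model = AutoAWQForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)
```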
I have a potential fix that may remedy both the "two devices" error and the issue above. The fix lives in this transformers branch: https://github.com/davedgd/transformers/tree/patch-1 (see the install sketch below). |
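If you want to try that branch, a hedged sketch of installing it; the branch name comes from the URL above, and whether it is pip-installable as-is in your environment is an assumption:

```python
# Install transformers from the patch-1 branch linked above, then confirm which
# build is active before re-running quantization (shell command shown as a comment):
#   pip install -U "git+https://github.com/davedgd/transformers@patch-1"
import transformers

print(transformers.__version__)  # should report the source build after the install
```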
+1 |
Same issue |
This was fully fixed in recent versions. Can you confirm what version of autoawq you are using and provide a code sample? I can probably help you resolve it. |
Hey @davedgd, my mistake, I'm not using |
No worries — glad to hear you figured it out! |
@davedgd Thank you! |
Still facing this problem :( |
Try it without those loading arguments:

```python
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
```

You can probably use your other args, but they shouldn't be needed. Definitely avoid |
without |
The answer is technical, but long story short: the adjustment was made in the multi-GPU fix by @casper-hansen a few versions back, in the 0.2.7 releases. Not using it matches the official example: https://github.com/casper-hansen/AutoAWQ/blob/main/examples/quantize.py |
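For completeness, a sketch that mirrors the shape of the linked quantize.py example; the paths and quant config values below are placeholders rather than lines taken verbatim from the repo:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder source model
quant_path = "mistral-7b-instruct-awq"              # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the fp16 model without a device_map and let AutoAWQ manage placement.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize, then save both the quantized weights and the tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```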
For reference, the full error from the original report:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
```