
Recent changes are causing "found at least two devices" #32420

Closed
casper-hansen opened this issue Aug 5, 2024 · 17 comments

@casper-hansen

System Info

transformers 4.43.3, python 3.10, linux

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have received multiple reports that model loading behaviour recently changed in a way that causes a device error. This can usually be fixed by specifying the device_map, but prior to the recent changes (I don't know exactly when this happened), the model loaded and could run inference on multiple GPUs without any issues.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0

Referenced issues:
casper-hansen/AutoAWQ#510
casper-hansen/AutoAWQ#558
casper-hansen/AutoAWQ#571

Expected behavior

The expected behavior is that we do not see these errors with the default setting of device_map=None. I am generally not sure what exactly changed, so it is hard to be more precise.

@ArthurZucker
Collaborator

Thanks for reporting, could you try with the latest release?
Otherwise, sorry for the inconvenience and cc @SunMarc !

@davedgd

davedgd commented Aug 28, 2024

Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience and cc @SunMarc !

I tested this today, and it's still an issue with the latest transformers release (v4.44.2) at the time of writing.

@ArthurZucker
Collaborator

😢 I don't know if this is accelerate or not, so pinging @muellerzr as well.

@muellerzr
Contributor

Looks like AWQ is another model that can't be fast-loaded. Will put in a fix.

@muellerzr
Contributor

Potentially. I'm not too familiar with the AWQ codebase. The PR that likely broke this is here: #31771

In the model definition we need to set _supports_param_buffer_assignment = False, which needs to be done on the AWQ side
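For anyone who wants to experiment before an upstream fix lands, a rough sketch of that suggestion applied from the AWQ side could look like the following (the target class and checkpoint here are illustrative; AutoAWQ resolves the real class from config.model_type):

    # Sketch only: opt the target class out of the fast param/buffer assignment
    # introduced in #31771, so from_pretrained falls back to the copy-based loading path.
    import torch
    import transformers

    target_cls = transformers.LlamaForCausalLM  # illustrative; resolved from config.model_type in AutoAWQ
    target_cls._supports_param_buffer_assignment = False

    model = target_cls.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
        torch_dtype=torch.float16,
    )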

@casper-hansen
Author

casper-hansen commented Aug 29, 2024

@muellerzr For reference, this error occurs when we want to run inference in order to quantize the model. I have not received similar reports for inference of already-quantized models. In other words, this happens when running models in BF16/FP16.

EDIT: Model loading is a fairly standard use of the auto class in transformers; the relevant method is shown below.

    @classmethod
    def from_pretrained(
        self,
        model_path: Annotated[str, Doc("A Huggingface path or local path to a model.")],
        model_type: Annotated[str, Doc("The model type, loaded from config.json.")],
        torch_dtype: Annotated[
            torch.dtype,
            Doc(
                "The dtype to load the model as. May not work with other values than float16."
            ),
        ] = torch.float16,
        trust_remote_code: Annotated[
            bool,
            Doc(
                "Useful for Huggingface repositories that have not been integrated into transformers yet."
            ),
        ] = True,
        safetensors: Annotated[
            bool, Doc("Whether to download/load safetensors instead of torch weights.")
        ] = True,
        device_map: Annotated[
            Union[str, Dict],
            Doc(
                "A device map that will be passed onto the model loading method from transformers."
            ),
        ] = None,
        download_kwargs: Annotated[
            Dict,
            Doc("Used for configure download model"),
        ] = None,
        **model_init_kwargs: Annotated[
            Dict,
            Doc(
                "Additional kwargs that are passed to the model during initialization."
            ),
        ],
    ):
        """A method for initialization of pretrained models, usually in FP16."""
        # Get weights path and quant config
        model_weights_path, config, quant_config = self._load_config(
            self,
            model_path,
            "",
            safetensors,
            trust_remote_code=trust_remote_code,
            download_kwargs=download_kwargs,
        )

        target_cls_name = TRANSFORMERS_AUTO_MAPPING_DICT[config.model_type]
        target_cls = getattr(transformers, target_cls_name)

        processor = None
        if target_cls_name == "AutoModelForVision2Seq":
            processor = AutoProcessor.from_pretrained(model_weights_path)
            processor: CLIPImageProcessor = processor.image_processor

        # If not quantized, must load with AutoModelForCausalLM
        model = target_cls.from_pretrained(
            model_weights_path,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch_dtype,
            use_safetensors=safetensors,
            device_map=device_map,
            **model_init_kwargs,
        )

        model.eval()

        return self(
            model,
            model_type,
            is_quantized=False,
            config=config,
            quant_config=quant_config,
            processor=processor,
        )
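For context, a typical AutoAWQ quantization flow that exercises the loader above looks roughly like this (model path and quantization settings are illustrative); the device error surfaces during the calibration inference that quantize() runs:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative checkpoint
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model via the method shown above; device_map stays at its default of None.
    model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantization runs calibration inference, which is where the device error appears.
    model.quantize(tokenizer, quant_config=quant_config)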

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@JeevanBhoot

Has there been any fix for this? This is also affecting autogptq: #729

davedgd added a commit to davedgd/transformers that referenced this issue Sep 27, 2024
Fixes huggingface#32420 by placing both inv_freq_expanded and position_ids_expanded on the same device. This avoids the following error on this line:

freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)

Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Allows autoawq and other packages to correctly perform CPU offloading during quantization.
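For readers who want to see the shape of the change, here is a small standalone illustration of the idea with dummy tensors (not the exact diff; the variable names mirror the transformers rotary-embedding code quoted above): align the inv_freq operand with the device of position_ids before the batched matmul.

    import torch

    inv_freq = torch.arange(1, 65, dtype=torch.float32)           # stand-in for the cpu-resident inv_freq buffer
    position_ids = torch.arange(16, device="cuda").unsqueeze(0)   # stand-in for cuda position ids

    # Moving the expanded inv_freq onto position_ids' device avoids mixing cpu and cuda operands.
    inv_freq_expanded = (
        inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(position_ids.device)
    )
    position_ids_expanded = position_ids[:, None, :].float()
    freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)  # no device-mismatch RuntimeError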
@davedgd

davedgd commented Sep 27, 2024

I've just added a pull request for a patch that I believe resolves this issue. Feel free to try installing this patch to confirm the fix: https://github.com/davedgd/transformers/tree/patch-1

@muellerzr: Please note I did try to set _supports_param_buffer_assignment = False on the AWQ side based on your suggestion, but this appeared to be a red herring in my testing.

@trevor-m

trevor-m commented Oct 7, 2024

I am also encountering this issue when using dynamic rope scaling, and here is what's happening:

  1. During LlamaAttention.__init__(), the LlamaRotaryEmbedding module is initialized. No device arg is provided:
    self.rotary_emb = LlamaRotaryEmbedding(config=self.config)
  2. In LlamaRotaryEmbedding.__init__(), the inv_freq and original_inv_freq tensors are created, and since no device is provided, they are placed on the cpu.
  3. During execution, in _dynamic_frequency_update(), if inv_freq is growing, the tensor is recomputed and placed correctly on the specified device (cuda for me):
    inv_freq, self.attention_scaling = self.rope_init_fn(
    self.config, device, seq_len=seq_len, **self.rope_kwargs
  4. However, if it needs to reset, the original_inv_freq is used, which, as mentioned above, was placed on the cpu. This causes the two-device error in the forward() call:
    self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)

This can be fixed with a simple change like this: trevor-m@1a7e62a
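For readers following along, a minimal sketch of that kind of change (not the exact commit; it assumes the reset branch of _dynamic_frequency_update quoted in step 4 and the device argument that method receives):

    # Sketch: move the cached original_inv_freq onto the active device before
    # re-registering it as the inv_freq buffer, so forward() no longer mixes
    # cpu and cuda tensors.
    self.original_inv_freq = self.original_inv_freq.to(device)
    self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)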

@SunMarc
Member

SunMarc commented Oct 8, 2024

Thanks for the nice report! It seems that there is indeed a device mismatch here. However, one point I don't get is why original_inv_freq still stays on the cpu if we move the whole model to cuda. Could you share a reproducer? That would be very helpful!
Also, if the model is split across different devices thanks to accelerate hooks, we shouldn't have issues. The only issue happens when the rope module is still on the cpu while the rest of the model is on cuda without accelerate hooks; in that case it is expected that we get a device mismatch.

@trevor-m

Let me try to make a small reproducer. I wonder if using register_buffer for the original_inv_freq would allow it to move alongside the whole model when we change devices.
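In case it helps while that reproducer is being put together, an untested sketch along those lines could look like this (the checkpoint and rope_scaling values are illustrative; it assumes original_inv_freq is a plain attribute rather than a registered buffer, which is what the comment above suggests):

    # Untested sketch of a minimal reproducer: grow and then reset the dynamic
    # rope frequencies after a plain .to("cuda"), without accelerate hooks.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        rope_scaling={"type": "dynamic", "factor": 2.0},  # enable dynamic rope scaling
    )
    model.to("cuda")  # assumes original_inv_freq is not a registered buffer, so .to() leaves it on the cpu

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 1) A prompt longer than max_position_embeddings grows inv_freq; it is
    #    recomputed directly on cuda, so this call should succeed.
    long_inputs = tokenizer("hello " * 5000, return_tensors="pt").to("cuda")
    with torch.no_grad():
        model(**long_inputs)

    # 2) A short prompt then triggers the reset path, which re-registers the
    #    cpu-resident original_inv_freq and should reproduce the error.
    short_inputs = tokenizer("hello", return_tensors="pt").to("cuda")
    with torch.no_grad():
        model(**short_inputs)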

@davedgd

davedgd commented Oct 13, 2024

Please ignore my earlier comment (I deleted it to avoid confusion) -- it turns out there are multiple issues with AutoAWQ that are complicating my testing of relevant fixes, and I need to more thoroughly evaluate what's going on to figure out the best solution(s).


github-actions bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArlanCooper

+1, same issue with qwen2.5-72b-instruct.

@ArlanCooper

Has this been solved?

@ArthurZucker
Collaborator

Is it the same as #35505? 🤗
