
Recent changes are causing "found at least two devices" #32420

Closed
casper-hansen opened this issue Aug 5, 2024 · 17 comments

@casper-hansen

System Info

transformers 4.43.3, python 3.10, linux

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I have received multiple reports that model loading behaviour recently changed in a way that causes a device error. This can usually be fixed by specifying the device_map, but prior to the recent changes (I don't know exactly when this happened), the model loaded and could run inference on multiple GPUs without any issues.

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0

Referenced issues:
casper-hansen/AutoAWQ#510
casper-hansen/AutoAWQ#558
casper-hansen/AutoAWQ#571

Expected behavior

The expected behavior is that we do not see these errors with the default setting of device_map=None. I am generally not sure what exactly changed, so it is hard to be more precise.

@ArthurZucker
Collaborator

Thanks for reporting, could you try with the latest release?
Otherwise, sorry for the inconvenience and cc @SunMarc !

@davedgd

davedgd commented Aug 28, 2024

Thanks for reporting, could you try with the latest release? Otherwise, sorry for the inconvenience and cc @SunMarc !

I tested this today, and it's still an issue with the latest transformers release (v4.44.2) at the time of writing.

@ArthurZucker
Collaborator

😢 I don't know if this is accelerate or not, so pinging @muellerzr as well.

@muellerzr
Contributor

Looks like AWQ is another model that can't be fast-loaded. Will put in a fix.

@muellerzr
Contributor

Potentially. I'm not too familiar with the AWQ codebase. The PR that likely broke this is here: #31771

In the model definition we need to set _supports_param_buffer_assignment = False, which needs to be done on the AWQ side
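For anyone who wants to experiment before an upstream fix lands, a rough sketch of that suggestion applied from the AWQ side could look like the following (the target class and checkpoint here are illustrative; AutoAWQ resolves the real class from config.model_type):

    # Sketch only: opt the target class out of the fast param/buffer assignment
    # introduced in #31771, so from_pretrained falls back to the copy-based loading path.
    import torch
    import transformers

    target_cls = transformers.LlamaForCausalLM  # illustrative; resolved from config.model_type in AutoAWQ
    target_cls._supports_param_buffer_assignment = False

    model = target_cls.from_pretrained(
        "meta-llama/Llama-2-7b-hf",  # illustrative checkpoint
        torch_dtype=torch.float16,
    )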

@casper-hansen
Author

casper-hansen commented Aug 29, 2024

@muellerzr For reference, this error occurs when we want to run inference in order to quantize the model. I have not received similar reports for inference of already-quantized models. In other words, this happens when running models in BF16/FP16.

EDIT: Model loading is a fairly standard use of the auto class in transformers; the relevant method is shown below.

    @classmethod
    def from_pretrained(
        self,
        model_path: Annotated[str, Doc("A Huggingface path or local path to a model.")],
        model_type: Annotated[str, Doc("The model type, loaded from config.json.")],
        torch_dtype: Annotated[
            torch.dtype,
            Doc(
                "The dtype to load the model as. May not work with other values than float16."
            ),
        ] = torch.float16,
        trust_remote_code: Annotated[
            bool,
            Doc(
                "Useful for Huggingface repositories that have not been integrated into transformers yet."
            ),
        ] = True,
        safetensors: Annotated[
            bool, Doc("Whether to download/load safetensors instead of torch weights.")
        ] = True,
        device_map: Annotated[
            Union[str, Dict],
            Doc(
                "A device map that will be passed onto the model loading method from transformers."
            ),
        ] = None,
        download_kwargs: Annotated[
            Dict,
            Doc("Used for configure download model"),
        ] = None,
        **model_init_kwargs: Annotated[
            Dict,
            Doc(
                "Additional kwargs that are passed to the model during initialization."
            ),
        ],
    ):
        """A method for initialization of pretrained models, usually in FP16."""
        # Get weights path and quant config
        model_weights_path, config, quant_config = self._load_config(
            self,
            model_path,
            "",
            safetensors,
            trust_remote_code=trust_remote_code,
            download_kwargs=download_kwargs,
        )

        target_cls_name = TRANSFORMERS_AUTO_MAPPING_DICT[config.model_type]
        target_cls = getattr(transformers, target_cls_name)

        processor = None
        if target_cls_name == "AutoModelForVision2Seq":
            processor = AutoProcessor.from_pretrained(model_weights_path)
            processor: CLIPImageProcessor = processor.image_processor

        # If not quantized, must load with AutoModelForCausalLM
        model = target_cls.from_pretrained(
            model_weights_path,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch_dtype,
            use_safetensors=safetensors,
            device_map=device_map,
            **model_init_kwargs,
        )

        model.eval()

        return self(
            model,
            model_type,
            is_quantized=False,
            config=config,
            quant_config=quant_config,
            processor=processor,
        )
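For context, a typical AutoAWQ quantization flow that exercises the loader above looks roughly like this (model path and quantization settings are illustrative); the device error surfaces during the calibration inference that quantize() runs:

    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative checkpoint
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the FP16 model via the method shown above; device_map stays at its default of None.
    model = AutoAWQForCausalLM.from_pretrained(model_path, safetensors=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # Quantization runs calibration inference, which is where the device error appears.
    model.quantize(tokenizer, quant_config=quant_config)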

github-actions bot

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@JeevanBhoot

Has there been any fix for this? This is also affecting autogptq: #729

davedgd added a commit to davedgd/transformers that referenced this issue Sep 27, 2024
Fixes huggingface#32420 by placing both inv_freq_expanded and position_ids_expanded on the same device. This avoids the following error on this line:

freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)

Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

Allows autoawq and other packages to correctly perform CPU offloading during quantization.
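For readers who want to see the shape of the change, here is a small standalone illustration of the idea with dummy tensors (not the exact diff; the variable names mirror the transformers rotary-embedding code quoted above): align the inv_freq operand with the device of position_ids before the batched matmul.

    import torch

    inv_freq = torch.arange(1, 65, dtype=torch.float32)           # stand-in for the cpu-resident inv_freq buffer
    position_ids = torch.arange(16, device="cuda").unsqueeze(0)   # stand-in for cuda position ids

    # Moving the expanded inv_freq onto position_ids' device avoids mixing cpu and cuda operands.
    inv_freq_expanded = (
        inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(position_ids.device)
    )
    position_ids_expanded = position_ids[:, None, :].float()
    freqs = (inv_freq_expanded @ position_ids_expanded).transpose(1, 2)  # no device-mismatch RuntimeError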
@davedgd

davedgd commented Sep 27, 2024

I've just added a pull request for a patch that I believe resolves this issue. Feel free to try installing this patch to confirm the fix: https://github.com/davedgd/transformers/tree/patch-1

@muellerzr: Please note I did try to set _supports_param_buffer_assignment = False on the AWQ side based on your suggestion, but this appeared to be a red herring in my testing.

@trevor-m

trevor-m commented Oct 7, 2024

I am also encountering this issue when using dynamic rope scaling, and here is what's happening:

  1. During LlamaAttention.__init__(), the LlamaRotaryEmbedding module is initialized. No device arg is provided:
    self.rotary_emb = LlamaRotaryEmbedding(config=self.config)
  2. In LlamaRotaryEmbedding.__init__(), the inv_freq and original_inv_freq tensors are created, and since no device is provided, they are placed on the cpu.
  3. During execution, in _dynamic_frequency_update(), if inv_freq is growing, the tensor is recomputed and placed correctly on the specified device (cuda for me):
    inv_freq, self.attention_scaling = self.rope_init_fn(
    self.config, device, seq_len=seq_len, **self.rope_kwargs
  4. However, if it needs to reset, the original_inv_freq is used, which, as mentioned above, was placed on the cpu. This causes the two-device error in the forward() call:
    self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)

This can be fixed with a simple change like this: trevor-m@1a7e62a
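For readers following along, a minimal sketch of that kind of change (not the exact commit; it assumes the reset branch of _dynamic_frequency_update quoted in step 4 and the device argument that method receives):

    # Sketch: move the cached original_inv_freq onto the active device before
    # re-registering it as the inv_freq buffer, so forward() no longer mixes
    # cpu and cuda tensors.
    self.original_inv_freq = self.original_inv_freq.to(device)
    self.register_buffer("inv_freq", self.original_inv_freq, persistent=False)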

@SunMarc
Member

SunMarc commented Oct 8, 2024

Thanks for the nice report! It seems that there is indeed a device mismatch here. However, one point I don't get is why original_inv_freq still stays on the cpu if we move the whole model to cuda. Could you share a reproducer? That would be very helpful!
Also, if the model is split across different devices thanks to accelerate hooks, we shouldn't have issues. The only issue happens when the rope module is still on the cpu while the rest of the model is on cuda without accelerate hooks; in that case it is expected that we get a device mismatch.

@trevor-m

Let me try to make a small reproducer. I wonder if using register_buffer for the original_inv_freq would allow it to move alongside the whole model when we change devices.
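In case it helps while that reproducer is being put together, an untested sketch along those lines could look like this (the checkpoint and rope_scaling values are illustrative; it assumes original_inv_freq is a plain attribute rather than a registered buffer, which is what the comment above suggests):

    # Untested sketch of a minimal reproducer: grow and then reset the dynamic
    # rope frequencies after a plain .to("cuda"), without accelerate hooks.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        rope_scaling={"type": "dynamic", "factor": 2.0},  # enable dynamic rope scaling
    )
    model.to("cuda")  # assumes original_inv_freq is not a registered buffer, so .to() leaves it on the cpu

    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # 1) A prompt longer than max_position_embeddings grows inv_freq; it is
    #    recomputed directly on cuda, so this call should succeed.
    long_inputs = tokenizer("hello " * 5000, return_tensors="pt").to("cuda")
    with torch.no_grad():
        model(**long_inputs)

    # 2) A short prompt then triggers the reset path, which re-registers the
    #    cpu-resident original_inv_freq and should reproduce the error.
    short_inputs = tokenizer("hello", return_tensors="pt").to("cuda")
    with torch.no_grad():
        model(**short_inputs)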

@davedgd

davedgd commented Oct 13, 2024

Please ignore my earlier comment (I deleted it to avoid confusion) -- it turns out there are multiple issues with AutoAWQ that are complicating my testing of relevant fixes, and I need to more thoroughly evaluate what's going on to figure out the best solution(s).


github-actions bot commented Nov 7, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@ArlanCooper

+1, same issue with qwen2.5-72b-instruct.

@ArlanCooper

Has this been solved?

@ArthurZucker
Collaborator

Is it the same as #35505? 🤗
