Gemma2: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select) #34706

Open
Terrencezzj opened this issue Nov 12, 2024 · 17 comments
Labels: Accelerate, Big Model Inference (problems related to the Big Model Inference capabilities provided by Accelerate), bug

Comments

@Terrencezzj

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.15.0-1052-oracle-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.25.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.0+cu124 (True)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = "Any Context"
input_ids = tokenizer.encode(messages, return_tensors="pt").to("cuda")

gen_tokens = model(input_ids)  # plain forward pass, not generate()

Expected behavior

Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.09s/it]
Traceback (most recent call last):
  File "/host/ckpts/transformers/script.py", line 34, in <module>
    gen_tokens = model(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host/ckpts/transformers/src/transformers/models/gemma2/modeling_gemma2.py", line 1052, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host/ckpts/transformers/src/transformers/models/gemma2/modeling_gemma2.py", line 785, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

If only one GPU is visible, there is no error.
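
For reference, a minimal sketch of that single-GPU case, assuming visibility is restricted via CUDA_VISIBLE_DEVICES (one of several ways to do it; it must be set before torch initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before the first CUDA call

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # whole model fits on the one visible GPU

input_ids = tokenizer.encode("Any Context", return_tensors="pt").to("cuda")
out = model(input_ids)  # forward pass succeeds: weights and inputs are all on cuda:0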

@LysandreJik
Member

Can you take a look at this @SunMarc or @MekkCyber ?

@LysandreJik added the Accelerate and Big Model Inference labels Nov 15, 2024
@Terrencezzj
Author

v4.43.4 doesn't have the issue

@hchings
Contributor

hchings commented Nov 15, 2024

Hi @LysandreJik, this potential feature regression in Transformers has caused issues in our library (NVIDIA TensorRT Model Optimizer) when our users want to run quantization on multiple GPUs. Currently, our users need to revert to Transformers v4.43.4. We would appreciate it if you could help prioritize this. Thanks!

@MekkCyber
Contributor

Hi @hchings @Terrencezzj, thanks for confirming that v4.43.4 works. I am looking into the problem.

@kameshkanna

kameshkanna commented Nov 18, 2024

Hi, I found that there was no Accelerate config. You can create an Accelerate config file with a multi-GPU setup and try running the snippet below, assuming the model and tokenizer are already loaded.

Refer to these docs to create a config.yaml file that supports multi-GPU inference:
https://huggingface.co/docs/accelerate/v1.1.0/en/package_reference/accelerator#accelerate.Accelerator
https://huggingface.co/docs/accelerate/v1.1.0/en/package_reference/utilities#accelerate.DistributedType


from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(model)
input_ids = tokenizer.encode("Any Context", return_tensors="pt").to(accelerator.device)
gen_tokens = model.generate(input_ids)

@Terrencezzj
Author

> (quoting @kameshkanna's suggestion above)

Hi, I got the same error with Accelerator. Please note that my script is not doing inference with generate(); it calls model(input_ids) directly.

@MekkCyber
Contributor

The issue arises from how accelerate computes the device_map using infer_auto_device_map. When the model is small but a large number of GPUs are used (such as 8 GPUs in this case), the embed_tokens layer cannot fit on the first GPUs. As a result, the entire model ends up being placed on the last GPU, causing an incorrect device_map. We are currently working on a fix on the accelerate side.
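
To see the placement described above, here is a minimal sketch that dumps the resolved device map (the module names and placements shown in the comments are illustrative and depend on your GPU count):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map="auto")

# hf_device_map is populated by from_pretrained when device_map="auto" is used
print(model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'lm_head': 7}  (illustrative)

# device of the embedding table that input_ids must end up on
print(model.get_input_embeddings().weight.device)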

@alexrs-cohere
Contributor

In case this is relevant: if the model is in training mode, it seems to work:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.train()
messages = "Any Context"
input_ids = tokenizer.encode(messages, return_tensors="pt").to("cuda")

gen_tokens = model(input_ids)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@alexrs-cohere
Contributor

This issue has not been resolved.

@SunMarc SunMarc reopened this Jan 2, 2025
@aklemen

aklemen commented Jan 22, 2025

Can confirm that this is still an issue. An exception is raised when I run the following (almost the same as the example from the docs; I only changed the model from 2B to 9B and called the model directly):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model(**input_ids)

Using model.generate(**input_ids) works for some reason.

System Info

  • transformers version: 4.45.1
  • Platform: Linux-5.15.0-1070-nvidia-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.2.1
  • Accelerate config: /
  • PyTorch version: 2.5.0a0+e000cf0ad9.nv24.10
  • GPU type: NVIDIA A100 80GB (2x)

> The issue arises from how accelerate computes the device_map using infer_auto_device_map. [...]

@MekkCyber should I create an issue on accelerate repository or is it already there (I couldn't find it)?

@SunMarc
Member

SunMarc commented Jan 22, 2025

This should be fixed by this PR from @zucchini-nlp! Please give it a try. cc @alexrs-cohere

@MekkCyber
Contributor

@alexrs-cohere sorry, my mistake: the issue with accelerate is not the root cause here. The root cause was how the cache was initialized. With the fix from @zucchini-nlp, the cache is now initialized on the meta device before being moved to the appropriate device. This should resolve the issue even when the model is loaded on a single GPU.
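
For anyone unfamiliar with the pattern, here is a generic PyTorch illustration of "initialize on meta, then materialize on the target device"; this is only the underlying idea, not the actual transformers fix:

import torch

t = torch.empty(2, 4, device="meta")     # shape/dtype metadata only, no memory allocated
print(t.device)                          # meta

target = "cuda:0" if torch.cuda.is_available() else "cpu"
t = torch.empty_like(t, device=target)   # materialized on the device that will actually use it
print(t.device)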

@HichemAK

HichemAK commented Feb 4, 2025

The issue is still not fixed. I reused the code from @aklemen using "google/gemma-2-2b":

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model(**input_ids)

Error:

{
	"name": "RuntimeError",
	"message": "indices should be either on cpu or on the same device as the indexed tensor (cuda:0)",
	"stack": "---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 12
      9 input_text = \"Write me a poem about Machine Learning.\"
     10 input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")
---> 12 outputs = model(**input_ids)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:876, in Gemma2ForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep, **loss_kwargs)
    874 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    875 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 876 outputs = self.model(
    877     input_ids=input_ids,
    878     attention_mask=attention_mask,
    879     position_ids=position_ids,
    880     past_key_values=past_key_values,
    881     inputs_embeds=inputs_embeds,
    882     use_cache=use_cache,
    883     output_attentions=output_attentions,
    884     output_hidden_states=output_hidden_states,
    885     return_dict=return_dict,
    886     cache_position=cache_position,
    887     **loss_kwargs,
    888 )
    890 hidden_states = outputs[0]
    891 # Only compute necessary logits, and do not upcast them to float if we are not computing the loss

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:662, in Gemma2Model.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, last_cache_position, **flash_attn_kwargs)
    649     layer_outputs = self._gradient_checkpointing_func(
    650         decoder_layer.__call__,
    651         hidden_states,
   (...)
    659         last_cache_position,
    660     )
    661 else:
--> 662     layer_outputs = decoder_layer(
    663         hidden_states,
    664         position_embeddings=position_embeddings,
    665         attention_mask=causal_mask,
    666         position_ids=position_ids,
    667         past_key_value=past_key_values,
    668         output_attentions=output_attentions,
    669         use_cache=use_cache,
    670         cache_position=cache_position,
    671         last_cache_position=last_cache_position,
    672         **flash_attn_kwargs,
    673     )
    675 hidden_states = layer_outputs[0]
    677 if output_attentions:

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:319, in Gemma2DecoderLayer.forward(self, hidden_states, position_embeddings, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, last_cache_position, **kwargs)
    316 hidden_states = self.input_layernorm(hidden_states)
    318 # Self Attention
--> 319 hidden_states, self_attn_weights = self.self_attn(
    320     hidden_states=hidden_states,
    321     position_embeddings=position_embeddings,
    322     attention_mask=attention_mask,
    323     position_ids=position_ids,
    324     past_key_value=past_key_value,
    325     output_attentions=output_attentions,
    326     use_cache=use_cache,
    327     cache_position=cache_position,
    328     **kwargs,
    329 )
    330 hidden_states = self.post_attention_layernorm(hidden_states)
    331 hidden_states = residual + hidden_states

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:229, in Gemma2Attention.forward(self, hidden_states, position_embeddings, attention_mask, past_key_value, cache_position, **kwargs)
    221 if past_key_value is not None:
    222     # sin and cos are specific to RoPE models; cache_position needed for the static cache
    223     cache_kwargs = {
    224         \"sin\": sin,
    225         \"cos\": cos,
    226         \"cache_position\": cache_position,
    227         \"sliding_window\": self.sliding_window,
    228     }
--> 229     key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
    231     # Here we need to slice as we use a static cache by default, but FA2 does not support it
    232     if attention_mask is not None and self.config._attn_implementation == \"flash_attention_2\":

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/cache_utils.py:1717, in HybridCache.update(self, key_states, value_states, layer_idx, cache_kwargs)
   1714 else:
   1715     update_fn = self._static_update
-> 1717 return update_fn(
   1718     cache_position,
   1719     layer_idx,
   1720     key_states,
   1721     value_states,
   1722     k_out,
   1723     v_out,
   1724     k_out.shape[2],
   1725 )

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/cache_utils.py:1680, in HybridCache._sliding_update(self, cache_position, layer_idx, key_states, value_states, k_out, v_out, max_cache_len)
   1678 to_shift = cache_position >= max_cache_len - 1
   1679 indices = (slicing + to_shift[-1].int() - 1) % max_cache_len
-> 1680 k_out = k_out[:, :, indices]
   1681 v_out = v_out[:, :, indices]
   1683 k_out[:, :, cache_position] = key_states

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)"
}

System Info

  • transformers version: 4.48.2
  • Platform: Linux
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.3.0
  • PyTorch version: 2.5.1+cu124
  • GPU type: RTX3090 (4x)

@zucchini-nlp
Member

@HichemAK the fix was not part of the last release, so for now you can install from main with !pip install --upgrade git+https://github.com/huggingface/transformers.git

@HichemAK

HichemAK commented Feb 5, 2025

Oh, my bad! Thank you for the response.

@ariG23498
Contributor

Maybe I am missing something, but using model.device instead of "cuda" for the inputs seems to work:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) # <---- Change here

outputs = model(**input_ids)
