Gemma2: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select) #34706

Open
Terrencezzj opened this issue Nov 12, 2024 · 17 comments
Labels: Accelerate, Big Model Inference (problems related to the Big Model Inference capabilities provided by Accelerate), bug

Comments

@Terrencezzj

System Info

  • transformers version: 4.47.0.dev0
  • Platform: Linux-5.15.0-1052-oracle-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.25.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.1.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.0+cu124 (True)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 80GB HBM3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = "Any Context"
input_ids = tokenizer.encode(messages, return_tensors="pt").to("cuda")

gen_tokens = model(input_ids)  # plain forward pass, not generate()

Expected behavior

Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 3/3 [00:03<00:00,  1.09s/it]
Traceback (most recent call last):
  File "/host/ckpts/transformers/script.py", line 34, in <module>
    gen_tokens = model(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host/ckpts/transformers/src/transformers/models/gemma2/modeling_gemma2.py", line 1052, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/host/ckpts/transformers/src/transformers/models/gemma2/modeling_gemma2.py", line 785, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 190, in forward
    return F.embedding(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2551, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:7 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

If only one GPU is visible, there is no error.
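
For reference, a minimal sketch of that single-GPU case, assuming visibility is restricted via CUDA_VISIBLE_DEVICES (one of several ways to do it; it must be set before torch initializes CUDA):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before the first CUDA call

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # whole model fits on the one visible GPU

input_ids = tokenizer.encode("Any Context", return_tensors="pt").to("cuda")
out = model(input_ids)  # forward pass succeeds: weights and inputs are all on cuda:0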

@LysandreJik
Member

Can you take a look at this @SunMarc or @MekkCyber ?

@LysandreJik added the Accelerate and Big Model Inference labels Nov 15, 2024
@Terrencezzj
Author

v4.43.4 doesn't have the issue

@hchings
Contributor

hchings commented Nov 15, 2024

Hi @LysandreJik, this potential feature regression in Transformers has caused issues in our library (NVIDIA TensorRT Model Optimizer) when our users want to run quantization on multiple GPUs. Currently, our users need to revert to Transformers v4.43.4. We would appreciate it if you could help prioritize this. Thanks!

@MekkCyber
Contributor

Hi @hchings @Terrencezzj, thanks for confirming that v4.43.4 works. I am looking into the problem.

@kameshkanna

kameshkanna commented Nov 18, 2024

Hi, I found that there was no Accelerate config. You can create an Accelerate config file with a multi-GPU setup and try running the snippet below, assuming the model and tokenizer are already loaded.

Refer to these docs to create a config.yaml file that supports multi-GPU inference:
https://huggingface.co/docs/accelerate/v1.1.0/en/package_reference/accelerator#accelerate.Accelerator
https://huggingface.co/docs/accelerate/v1.1.0/en/package_reference/utilities#accelerate.DistributedType


from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(model)
input_ids = tokenizer.encode("Any Context", return_tensors="pt").to(accelerator.device)
gen_tokens = model.generate(input_ids)

@Terrencezzj
Author

> (quoting @kameshkanna's suggestion above)

Hi, I got the same error with Accelerator. Please note that my script is not doing inference with generate(); it calls model(input_ids) directly.

@MekkCyber
Contributor

The issue arises from how accelerate computes the device_map using infer_auto_device_map. When the model is small but a large number of GPUs are used (such as 8 GPUs in this case), the embed_tokens layer cannot fit on the first GPUs. As a result, the entire model ends up being placed on the last GPU, causing an incorrect device_map. We are currently working on a fix on the accelerate side.
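
To see the placement described above, here is a minimal sketch that dumps the resolved device map (the module names and placements shown in the comments are illustrative and depend on your GPU count):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b", device_map="auto")

# hf_device_map is populated by from_pretrained when device_map="auto" is used
print(model.hf_device_map)
# e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ..., 'lm_head': 7}  (illustrative)

# device of the embedding table that input_ids must end up on
print(model.get_input_embeddings().weight.device)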

@alexrs-cohere
Contributor

In case this is relevant: if the model is in training mode, it seems to work:

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = 'google/gemma-2-2b'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.train()
messages = "Any Context"
input_ids = tokenizer.encode(messages, return_tensors="pt").to("cuda")

gen_tokens = model(input_ids)


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@alexrs-cohere
Contributor

This issue has not been resolved.

@SunMarc SunMarc reopened this Jan 2, 2025
@aklemen

aklemen commented Jan 22, 2025

Can confirm that this is still an issue. An exception is raised when I run the following (almost the same as the example from the docs; I only changed the model from 2B to 9B and called the model directly):

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model(**input_ids)

Using model.generate(**input_ids) works for some reason.

System Info

  • transformers version: 4.45.1
  • Platform: Linux-5.15.0-1070-nvidia-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.27.0
  • Safetensors version: 0.4.5
  • Accelerate version: 1.2.1
  • Accelerate config: /
  • PyTorch version: 2.5.0a0+e000cf0ad9.nv24.10
  • GPU type: NVIDIA A100 80GB (2x)

> The issue arises from how accelerate computes the device_map using infer_auto_device_map. [...]

@MekkCyber should I create an issue on accelerate repository or is it already there (I couldn't find it)?

@SunMarc
Member

SunMarc commented Jan 22, 2025

This should be fixed by this PR from @zucchini-nlp! Please give it a try. cc @alexrs-cohere

@MekkCyber
Contributor

@alexrs-cohere sorry, my mistake: the issue with accelerate is not the root cause here. The root cause was how the cache was initialized. With the fix from @zucchini-nlp, the cache is now initialized on the meta device before being moved to the appropriate device. This should resolve the issue even when the model is loaded on a single GPU.
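
For anyone unfamiliar with the pattern, here is a generic PyTorch illustration of "initialize on meta, then materialize on the target device"; this is only the underlying idea, not the actual transformers fix:

import torch

t = torch.empty(2, 4, device="meta")     # shape/dtype metadata only, no memory allocated
print(t.device)                          # meta

target = "cuda:0" if torch.cuda.is_available() else "cpu"
t = torch.empty_like(t, device=target)   # materialized on the device that will actually use it
print(t.device)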

@HichemAK

HichemAK commented Feb 4, 2025

The issue is still not fixed. I reused the code from @aklemen using "google/gemma-2-2b":

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model(**input_ids)

Error:

{
	"name": "RuntimeError",
	"message": "indices should be either on cpu or on the same device as the indexed tensor (cuda:0)",
	"stack": "---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 12
      9 input_text = \"Write me a poem about Machine Learning.\"
     10 input_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")
---> 12 outputs = model(**input_ids)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:876, in Gemma2ForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep, **loss_kwargs)
    874 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    875 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 876 outputs = self.model(
    877     input_ids=input_ids,
    878     attention_mask=attention_mask,
    879     position_ids=position_ids,
    880     past_key_values=past_key_values,
    881     inputs_embeds=inputs_embeds,
    882     use_cache=use_cache,
    883     output_attentions=output_attentions,
    884     output_hidden_states=output_hidden_states,
    885     return_dict=return_dict,
    886     cache_position=cache_position,
    887     **loss_kwargs,
    888 )
    890 hidden_states = outputs[0]
    891 # Only compute necessary logits, and do not upcast them to float if we are not computing the loss

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:662, in Gemma2Model.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, last_cache_position, **flash_attn_kwargs)
    649     layer_outputs = self._gradient_checkpointing_func(
    650         decoder_layer.__call__,
    651         hidden_states,
   (...)
    659         last_cache_position,
    660     )
    661 else:
--> 662     layer_outputs = decoder_layer(
    663         hidden_states,
    664         position_embeddings=position_embeddings,
    665         attention_mask=causal_mask,
    666         position_ids=position_ids,
    667         past_key_value=past_key_values,
    668         output_attentions=output_attentions,
    669         use_cache=use_cache,
    670         cache_position=cache_position,
    671         last_cache_position=last_cache_position,
    672         **flash_attn_kwargs,
    673     )
    675 hidden_states = layer_outputs[0]
    677 if output_attentions:

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:319, in Gemma2DecoderLayer.forward(self, hidden_states, position_embeddings, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, last_cache_position, **kwargs)
    316 hidden_states = self.input_layernorm(hidden_states)
    318 # Self Attention
--> 319 hidden_states, self_attn_weights = self.self_attn(
    320     hidden_states=hidden_states,
    321     position_embeddings=position_embeddings,
    322     attention_mask=attention_mask,
    323     position_ids=position_ids,
    324     past_key_value=past_key_value,
    325     output_attentions=output_attentions,
    326     use_cache=use_cache,
    327     cache_position=cache_position,
    328     **kwargs,
    329 )
    330 hidden_states = self.post_attention_layernorm(hidden_states)
    331 hidden_states = residual + hidden_states

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1736, in Module._wrapped_call_impl(self, *args, **kwargs)
   1734     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1735 else:
-> 1736     return self._call_impl(*args, **kwargs)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1747, in Module._call_impl(self, *args, **kwargs)
   1742 # If we don't have any hooks, we want to skip the rest of the logic in
   1743 # this function, and just call forward.
   1744 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1745         or _global_backward_pre_hooks or _global_backward_hooks
   1746         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1747     return forward_call(*args, **kwargs)
   1749 result = None
   1750 called_always_called_hooks = set()

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py:229, in Gemma2Attention.forward(self, hidden_states, position_embeddings, attention_mask, past_key_value, cache_position, **kwargs)
    221 if past_key_value is not None:
    222     # sin and cos are specific to RoPE models; cache_position needed for the static cache
    223     cache_kwargs = {
    224         \"sin\": sin,
    225         \"cos\": cos,
    226         \"cache_position\": cache_position,
    227         \"sliding_window\": self.sliding_window,
    228     }
--> 229     key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
    231     # Here we need to slice as we use a static cache by default, but FA2 does not support it
    232     if attention_mask is not None and self.config._attn_implementation == \"flash_attention_2\":

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/cache_utils.py:1717, in HybridCache.update(self, key_states, value_states, layer_idx, cache_kwargs)
   1714 else:
   1715     update_fn = self._static_update
-> 1717 return update_fn(
   1718     cache_position,
   1719     layer_idx,
   1720     key_states,
   1721     value_states,
   1722     k_out,
   1723     v_out,
   1724     k_out.shape[2],
   1725 )

File ~/knowledge-editing/.venv/lib/python3.11/site-packages/transformers/cache_utils.py:1680, in HybridCache._sliding_update(self, cache_position, layer_idx, key_states, value_states, k_out, v_out, max_cache_len)
   1678 to_shift = cache_position >= max_cache_len - 1
   1679 indices = (slicing + to_shift[-1].int() - 1) % max_cache_len
-> 1680 k_out = k_out[:, :, indices]
   1681 v_out = v_out[:, :, indices]
   1683 k_out[:, :, cache_position] = key_states

RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cuda:0)"
}

System Info

  • transformers version: 4.48.2
  • Platform: Linux
  • Python version: 3.11.9
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.3.0
  • PyTorch version: 2.5.1+cu124
  • GPU type: RTX3090 (4x)

@zucchini-nlp
Member

@HichemAK the fix was not part of the last release, so for now you can install from main with !pip install --upgrade git+https://github.com/huggingface/transformers.git

@HichemAK

HichemAK commented Feb 5, 2025

Oh, my bad! Thank you for the response.

@ariG23498
Contributor

Maybe I am missing something, but using model.device instead of "cuda" for the inputs seems to work:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device) # <---- Change here

outputs = model(**input_ids)
