[MusicGen] SDPA gives nans/infs during sampling #30020

Closed · 1 of 4 tasks
sanchit-gandhi opened this issue Apr 3, 2024 · 5 comments

Comments

sanchit-gandhi (Contributor)

System Info

  • transformers version: 4.40.0.dev0
  • Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.22.1
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config:
    • compute_environment: LOCAL_MACHINE
    • distributed_type: NO
    • mixed_precision: bf16
    • use_cpu: False
    • debug: False
    • num_processes: 1
    • machine_rank: 0
    • num_machines: 1
    • gpu_ids: 0
    • rdzv_backend: static
    • same_network: True
    • main_training_function: main
    • downcast_bf16: no
    • tpu_use_cluster: False
    • tpu_use_sudo: False
    • tpu_env: []
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): 2.13.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.2 (cpu)
  • Jax version: 0.4.13
  • JaxLib version: 0.4.13
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Following #29939, running the following raises an inf/nan error during sampling:

from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small", attn_implementation="sdpa")

unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

Traceback

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[11], line 3
      1 unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
----> 3 audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

File ~/hf/lib/python3.8/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/transformers/src/transformers/models/musicgen/modeling_musicgen.py:2822, in MusicgenForConditionalGeneration.generate(self, inputs, generation_config, logits_processor, stopping_criteria, synced_gpus, streamer, **kwargs)
   2814     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2815         input_ids=input_ids,
   2816         expand_size=generation_config.num_return_sequences,
   2817         is_encoder_decoder=self.config.is_encoder_decoder,
   2818         **model_kwargs,
   2819     )
   2821     # 12. run sample
-> 2822     outputs = self._sample(
   2823         input_ids,
   2824         logits_processor=logits_processor,
   2825         logits_warper=logits_warper,
   2826         stopping_criteria=stopping_criteria,
   2827         pad_token_id=generation_config.pad_token_id,
   2828         eos_token_id=generation_config.eos_token_id,
   2829         output_scores=generation_config.output_scores,
   2830         return_dict_in_generate=generation_config.return_dict_in_generate,
   2831         synced_gpus=synced_gpus,
   2832         streamer=streamer,
   2833         **model_kwargs,
   2834     )
   2836 else:
   2837     raise ValueError(
   2838         "Got incompatible mode for generation, should be one of greedy or sampling. "
   2839         "Ensure that beam search is de-activated by setting `num_beams=1` and `num_beam_groups=1`."
   2840     )

File ~/transformers/src/transformers/generation/utils.py:2771, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2769 # sample
   2770 probs = nn.functional.softmax(next_token_scores, dim=-1)
-> 2771 next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
   2773 # finished sentences should have their next token be a padding token
   2774 if eos_token_id is not None:

RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Expected behavior

With eager, the code functions as expected:

from transformers import MusicgenForConditionalGeneration

model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small", attn_implementation="eager")

unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)

Could you have a quick look to see if there's a bug in the sdpa implementation @ylacombe? We could also add an integration test that confirms we get sensible outputs with the checkpoint "facebook/musicgen-small".
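
For reference, a minimal sketch of what such an integration test could look like (the test name and the finiteness assertion here are illustrative, not an existing test in the repo):

import torch
from transformers import MusicgenForConditionalGeneration

def test_musicgen_small_sdpa_sampling_is_finite():
    # Hypothetical test: load the checkpoint with SDPA attention and check that
    # sampled audio contains no nan/inf values.
    model = MusicgenForConditionalGeneration.from_pretrained(
        "facebook/musicgen-small", attn_implementation="sdpa"
    )
    unconditional_inputs = model.get_unconditional_inputs(num_samples=1)
    audio_values = model.generate(**unconditional_inputs, do_sample=True, max_new_tokens=256)
    assert torch.isfinite(audio_values).all(), "generated audio contains nan/inf"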

ylacombe (Contributor) commented Apr 3, 2024

Hey @sanchit-gandhi, thanks for opening the issue!
It works in my environment, but that might be explained by the torch version I'm using (2.2).
Nonetheless, before I dive deeper, could you verify that you still get nans/infs when using a GPU and/or when loading with torch_dtype=torch.float16?
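
For anyone trying the GPU + float16 combination, one way to check is to load the model in half precision, generate from a text prompt via the processor (which keeps device handling simple), and inspect the output for nans/infs; the prompt and the checks below are purely illustrative:

import torch
from transformers import AutoProcessor, MusicgenForConditionalGeneration

processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-small",
    attn_implementation="sdpa",
    torch_dtype=torch.float16,  # half precision, as suggested above
).to("cuda")

inputs = processor(text=["80s pop track with bassy drums"], padding=True, return_tensors="pt").to("cuda")
audio_values = model.generate(**inputs, do_sample=True, max_new_tokens=256)

print("any nan:", torch.isnan(audio_values).any().item())
print("any inf:", torch.isinf(audio_values).any().item())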

cjekel commented Apr 3, 2024

@ylacombe Is there a known issue with GPU + float16 and SDPA? I was searching and could not find anything, yet I'm having issues with other models (mistral, mixtral) sampling with SDPA. Happy to make a separate issue if it has not been reported.

ylacombe (Contributor) commented Apr 4, 2024

Hey @cjekel, not that I'm aware of! The current issue happens without a GPU and in fp32!
Feel free to open an issue for the other models with a reproducing script (and to tag me as well)!

pranav-bot commented Jun 2, 2024

@cjekel
@sanchit-gandhi
@vanpelt
@ylacombe
There have been reported issues with mixed-precision (float16) use of certain operations, such as scaled dot-product attention (SDPA), on some GPUs.
What I think causes the error here is numerical instability: SDPA involves matrix multiplications and a softmax, which can become unstable in low precision (float16). This instability can show up as NaN (Not a Number) or infinity values, especially when intermediate values become very small or very large.
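
As a toy illustration of the mechanism (not MusicGen code), a softmax computed naively overflows once the logits are large for the dtype, while subtracting the max first keeps it finite; float16 hits this at much smaller magnitudes, since its maximum value is 65504:

import torch

logits = torch.tensor([100.0, 10.0, 1.0])            # float32; exp(100) overflows to inf
naive = torch.exp(logits) / torch.exp(logits).sum()  # inf / inf -> nan
stable = torch.exp(logits - logits.max()) / torch.exp(logits - logits.max()).sum()

print(naive)   # contains nan
print(stable)  # finite probabilities summing to ~1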

One possible fix I can think of:
Softmax stability: when computing the softmax probabilities for multinomial sampling, use a numerically stable implementation to avoid overflow or underflow. A common approach is to subtract the maximum value from the logits before applying the softmax, for example:

if do_sample:
    # Subtract the row-wise max from the logits for numerical stability
    # (softmax is shift-invariant, so this does not change the distribution)
    next_token_scores = next_token_scores - next_token_scores.max(dim=-1, keepdim=True).values
    probs = F.softmax(next_token_scores, dim=-1)
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
else:
    next_tokens = torch.argmax(next_token_scores, dim=-1)

Alternatively, temperature scaling can be used: divide the logits by a temperature parameter before applying the softmax. Threading a new temperature argument through the original function would, however, require a fair amount of refactoring.
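
A minimal sketch of that idea at the sampling step (the helper below is a standalone illustration, not how _sample is structured):

import torch
import torch.nn.functional as F

def sample_with_temperature(next_token_scores: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    # Hypothetical helper: divide the logits by a temperature before the softmax.
    # temperature < 1 sharpens the distribution, temperature > 1 flattens it.
    probs = F.softmax(next_token_scores / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(1)

Note that generate() already exposes this via the temperature argument (applied as a TemperatureLogitsWarper when do_sample=True), so that route should not require refactoring _sample itself.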

The problem gets worse when the inputs are very small or very large, mainly because of precision issues when working in half precision.

ylacombe (Contributor)

Should be fixed by #31208!
