
mt5 getting nans with fp16 #10819

Closed
dorost1234 opened this issue Mar 20, 2021 · 11 comments


dorost1234 commented Mar 20, 2021

Environment info

  • transformers version: 4.4.2
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8
  • Tensorflow version (GPU?): -
  • Using GPU in script?: -
  • Using distributed or parallel set-up in script?: -

Who can help

t5: @patrickvonplaten, @patil-suraj

Information

I am using the mt5-small model:

  • the problem arises when using fp16 with mt5

The task I am working on is:

  • translation

To reproduce

Steps to reproduce the behavior:

python run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir test/tst-translation --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --max_train_samples 100 --fp16

outputs:

***** eval metrics *****
  epoch                     =     3.0
  eval_bleu                 =  0.0039
  eval_gen_len              =    2.95
  eval_loss                 =     nan
  eval_mem_cpu_alloc_delta  =     4MB
  eval_mem_cpu_peaked_delta =     5MB
  eval_mem_gpu_alloc_delta  =     0MB
  eval_mem_gpu_peaked_delta =  1080MB
  eval_runtime              = 72.1865
  eval_samples              =    1999
  eval_samples_per_second   =  27.692

Expected behavior

Being able to use fp16 with mt5 models. Thank you very much for your help; running these models with fp16 is crucial for me to fit more data onto the older GPUs I have access to, and I appreciate your help a lot.

dorost1234 changed the title from "mt5 getting nans with deepspeed" to "mt5 getting nans with fp16" on Mar 20, 2021
@patrickvonplaten (Contributor)

Duplicate of #10830

patrickvonplaten marked this as a duplicate of #10830 on Mar 22, 2021
@dorost1234 (Author)

Hi @patrickvonplaten, this is not an exact duplicate: I am using mt5-small, while the other user in #10830 is using t5-large. I would appreciate it if both were considered, thank you.


stas00 commented Mar 29, 2021

@dorost1234, please kindly test if this PR fixes the problem: #10956

@dorost1234 (Author)

@stas00 thank you very much for the contribution; mt5-small now works for me. I am running some more experiments with it and will post an update.


dorost1234 commented Mar 31, 2021

Dear @stas00,
I tested the code further. Without deepspeed it works fine when the feed-forward layer is kept in float32, as suggested in the PR, but the moment I switch to deepspeed I still get NaNs in my code. I would greatly appreciate it if you could spare a moment and suggest how to handle the same problem under deepspeed. Thank you very much.
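
For reference, a minimal sketch of the float32 feed-forward idea described above, assuming it is applied under native AMP autocast (the class name, dimensions, and structure below are illustrative, not the actual code from the PR):

```python
import torch
import torch.nn as nn


class FP32FeedForward(nn.Module):
    # Illustrative stand-in for a T5-style feed-forward block whose math is
    # forced to run in float32 even when the surrounding model runs under
    # torch.cuda.amp.autocast in fp16.
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Leave autocast for this block so the large GELU activations
        # cannot overflow fp16, then cast back to the caller's dtype.
        with torch.cuda.amp.autocast(enabled=False):
            h = self.wi(hidden_states.float())
            h = self.wo(self.act(h))
        return h.to(hidden_states.dtype)
```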

I also used your debug code; here is the output:

WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 5 has inf
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerFF has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerFF has inf
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Stack loop end has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Stack loop start has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm variance has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states before return has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerSelfAttention has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block before T5LayerFF has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm variance has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states before return has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 1 has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 2 has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 3 has nans
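
For context, the warnings above come from the reporter's own seq2seq.third_party.models.t5.debug_utils module. A generic sketch of that kind of nan/inf instrumentation (illustrative only, not the actual debug_utils code) might look like this:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def report_nonfinite(tensor: torch.Tensor, label: str) -> None:
    # Log a warning whenever a tensor contains nan or inf values, so the
    # first layer that produces non-finite activations can be located.
    if torch.isnan(tensor).any():
        logger.warning("%s has nans", label)
    if torch.isinf(tensor).any():
        logger.warning("%s has inf", label)


# Usage: call it around suspect layers, e.g. inside T5Block.forward:
#   report_nonfinite(hidden_states, "T5Block after T5LayerFF")
```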


stas00 commented Mar 31, 2021

I was just thinking about it, so thank you for confirming that.

Deepspeed is not using autocast, so in essence the proposed fix makes no difference under Deepspeed, as we aren't running under autocast in the first place. Let's ask the DeepSpeed developers: microsoft/DeepSpeed#908

Though let's continue the deepspeed discussion in the other issue, since these are related but different problems. That is, we may fix one but not the other, or the fixes may come at different times, so it's easier to track them as separate issues.

Or, if there is not yet an issue specific to t5/mt5 + deepspeed, please open one. Thank you.
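
To make the autocast point above concrete, a small sketch (assuming that native AMP keeps parameters in fp32 and downcasts per-op, whereas DeepSpeed's fp16 mode casts the model weights themselves to half precision):

```python
import torch
import torch.nn as nn

# Under native AMP the parameters stay fp32, so locally disabling autocast
# restores genuine fp32 math for this call.
linear = nn.Linear(512, 512).cuda()                         # fp32 weights
x = torch.randn(4, 512, device="cuda", dtype=torch.float16)

with torch.cuda.amp.autocast(enabled=False):
    y = linear(x.float())                                   # real fp32 matmul

# Under DeepSpeed fp16 the module itself has already been cast to half
# precision, so the same guard changes nothing: the weights would also need
# an fp32 copy (e.g. linear.float()) for the computation to run in fp32.
```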


dorost1234 commented Mar 31, 2021

Dear @stas00,
Sure, thank you very much for getting back to me. With your permission I will open an issue on this.
Thank you very much.


stas00 commented Mar 31, 2021

I already did - please see the link in my last comment. Please do not worry, we will surely find a way to resolve this one way or another.

@dorost1234 (Author)

oh, great, thank you very much

@dorost1234 (Author)

Dear @stas00,
I tested the code more (without deepspeed) at a larger scale: training mt5-small on opus100 (20 of its languages), I still get NaNs after about 2000 iterations even with the fix applied. I will share a reproducible example with you soon. Thanks a lot for all the great work.

github-actions bot closed this as completed on May 5, 2021
huggingface deleted a comment from the github-actions bot on May 5, 2021
stas00 reopened this on May 5, 2021
@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
