
mt5 getting nans with fp16 #10819

Closed
dorost1234 opened this issue Mar 20, 2021 · 11 comments


dorost1234 commented Mar 20, 2021

Environment info

  • transformers version: 4.4.2
  • Platform: linux
  • Python version: 3.7
  • PyTorch version (GPU?): 1.8
  • Tensorflow version (GPU?): -
  • Using GPU in script?: -
  • Using distributed or parallel set-up in script?: -

Who can help

t5: @patrickvonplaten, @patil-suraj

Information

I am using the mt5-small model:

  • the problem arises when using fp16 with mt5

The task I am working on is:

  • translation

To reproduce

Steps to reproduce the behavior:

python run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir test/tst-translation --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --max_train_samples 100 --fp16

outputs:

***** eval metrics *****
  epoch                     =     3.0
  eval_bleu                 =  0.0039
  eval_gen_len              =    2.95
  eval_loss                 =     nan
  eval_mem_cpu_alloc_delta  =     4MB
  eval_mem_cpu_peaked_delta =     5MB
  eval_mem_gpu_alloc_delta  =     0MB
  eval_mem_gpu_peaked_delta =  1080MB
  eval_runtime              = 72.1865
  eval_samples              =    1999
  eval_samples_per_second   =  27.692

Expected behavior

Being able to use fp16 with mt5 models. Thank you very much for your help; running these models with fp16 is crucial for me to fit more data onto the older GPUs I have access to, and I appreciate your help a lot.

dorost1234 changed the title from "mt5 getting nans with deepspeed" to "mt5 getting nans with fp16" on Mar 20, 2021
@patrickvonplaten (Contributor)

Duplicate of #10830

patrickvonplaten marked this as a duplicate of #10830 on Mar 22, 2021
@dorost1234 (Author)

Hi @patrickvonplaten, this is not an exact duplicate: I am using mt5-small, while the other user in #10830 is using t5-large. I would appreciate it if both were considered, thank you.


stas00 commented Mar 29, 2021

@dorost1234, please kindly test if this PR fixes the problem: #10956

@dorost1234 (Author)

@stas00 thank you very much for the contribution; mt5-small now works for me. I am running some more experiments with it and will post an update.


dorost1234 commented Mar 31, 2021

Dear @stas00,
I tested the code further. Without deepspeed it works fine when the feed-forward layer is kept in float32, as suggested in the PR, but the moment I switch to deepspeed I still get NaNs in my code. I would greatly appreciate it if you could spare a moment and suggest how to handle the same problem under deepspeed. Thank you very much.
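
For reference, a minimal sketch of the float32 feed-forward idea described above, assuming it is applied under native AMP autocast (the class name, dimensions, and structure below are illustrative, not the actual code from the PR):

```python
import torch
import torch.nn as nn


class FP32FeedForward(nn.Module):
    # Illustrative stand-in for a T5-style feed-forward block whose math is
    # forced to run in float32 even when the surrounding model runs under
    # torch.cuda.amp.autocast in fp16.
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff, bias=False)
        self.wo = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Leave autocast for this block so the large GELU activations
        # cannot overflow fp16, then cast back to the caller's dtype.
        with torch.cuda.amp.autocast(enabled=False):
            h = self.wi(hidden_states.float())
            h = self.wo(self.act(h))
        return h.to(hidden_states.dtype)
```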

I also used your debug code; here is the output:

WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 5 has inf
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerFF has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerFF has inf
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Stack loop end has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Stack loop start has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm variance has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states before return has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block after T5LayerSelfAttention has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5Block before T5LayerFF has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm variance has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:T5LayerNorm hidden_states before return has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 1 has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 2 has nans
WARNING:seq2seq.third_party.models.t5.debug_utils:gelu 3 has nans
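
For context, the warnings above come from the reporter's own seq2seq.third_party.models.t5.debug_utils module. A generic sketch of that kind of nan/inf instrumentation (illustrative only, not the actual debug_utils code) might look like this:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def report_nonfinite(tensor: torch.Tensor, label: str) -> None:
    # Log a warning whenever a tensor contains nan or inf values, so the
    # first layer that produces non-finite activations can be located.
    if torch.isnan(tensor).any():
        logger.warning("%s has nans", label)
    if torch.isinf(tensor).any():
        logger.warning("%s has inf", label)


# Usage: call it around suspect layers, e.g. inside T5Block.forward:
#   report_nonfinite(hidden_states, "T5Block after T5LayerFF")
```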


stas00 commented Mar 31, 2021

I was just thinking about it, so thank you for confirming that.

Deepspeed is not using autocast, so in essence the proposed fix makes no difference under Deepspeed, as we aren't running under autocast in the first place. Let's ask the DeepSpeed developers: microsoft/DeepSpeed#908

Though let's continue the deepspeed discussion in the other issue, since these are related but different problems. That is, we may fix one but not the other, or the fixes may come at different times, so it's easier to track them as separate issues.

Or, if there is not yet an issue specific to t5/mt5 + deepspeed, please open one. Thank you.
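
To make the autocast point above concrete, a small sketch (assuming that native AMP keeps parameters in fp32 and downcasts per-op, whereas DeepSpeed's fp16 mode casts the model weights themselves to half precision):

```python
import torch
import torch.nn as nn

# Under native AMP the parameters stay fp32, so locally disabling autocast
# restores genuine fp32 math for this call.
linear = nn.Linear(512, 512).cuda()                         # fp32 weights
x = torch.randn(4, 512, device="cuda", dtype=torch.float16)

with torch.cuda.amp.autocast(enabled=False):
    y = linear(x.float())                                   # real fp32 matmul

# Under DeepSpeed fp16 the module itself has already been cast to half
# precision, so the same guard changes nothing: the weights would also need
# an fp32 copy (e.g. linear.float()) for the computation to run in fp32.
```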


dorost1234 commented Mar 31, 2021

Dear @stas00,
Sure, thank you very much for getting back to me. With your permission I will open an issue on this.
Thank you very much.


stas00 commented Mar 31, 2021

I already did - please see the link in my last comment. Please do not worry, we will surely find a way to resolve this one way or another.

@dorost1234 (Author)

oh, great, thank you very much

@dorost1234 (Author)

Dear @stas00,
I tested the code more (without deepspeed) at a larger scale: training mt5-small on opus100 (20 of its languages), I still get NaNs after about 2000 iterations even with the fix applied. I will share a reproducible example with you soon. Thanks a lot for all the great work.

github-actions bot closed this as completed on May 5, 2021
huggingface deleted a comment from the github-actions bot on May 5, 2021
stas00 reopened this on May 5, 2021
@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
