
run_ner.py example MobileBERT FP16 returns nan loss #11327

Closed
tblattner opened this issue Apr 19, 2021 · 3 comments

@tblattner
Contributor

tblattner commented Apr 19, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-5.8.0-44-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes (RTX 2080 Ti)
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger @stas00 @patil-suraj

Information

Model I am using: MobileBERT

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: conll2003
  • my own task or dataset: (give details below)

To reproduce

Using the example: https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py

Steps to reproduce the behavior:

  1. Add training_args.fp16 = True to main() after initializing training_args (see the sketch after this list)
  2. Pass the following parameters to run_ner.py:
     --model_name_or_path google/mobilebert-uncased
     --dataset_name conll2003
     --output_dir /path/to/output
     --do_eval
     --do_train
     --do_predict
  3. The training loss returns NaN
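For reference, the step-1 change is a one-line edit after the argument parsing in run_ner.py's main(); the sketch below shows roughly where it goes (ModelArguments and DataTrainingArguments are the dataclasses already defined in that script), and the same effect can be had by passing --fp16 on the command line:

```python
# Sketch of the step-1 edit inside run_ner.py's main(); ModelArguments and
# DataTrainingArguments are defined earlier in that script.
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

# Force FP16 mixed-precision training; equivalent to passing --fp16 on the command line.
training_args.fp16 = True
```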

I first observed NaNs appearing in the encoder output within the forward call of the MobileBertModel class:
https://huggingface.co/transformers/_modules/transformers/modeling_mobilebert.html
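A small debugging helper like the one below (not part of run_ner.py, just a sketch) can be used to localize this: it registers a forward hook on every submodule and reports the first one whose output contains non-finite values.

```python
# Hypothetical debugging helper: flag the first submodule whose forward
# output contains NaN or Inf.
import torch

def register_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, dict):          # e.g. transformers ModelOutput
                tensors = output.values()
            elif isinstance(output, (tuple, list)):
                tensors = output
            else:
                tensors = (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"Non-finite output first seen in: {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```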

Expected behavior

When running without FP16, the model trains as expected. Other models that I have tested did not have this issue and converge well with fp16 enabled: RoBERTa, BERT, and DistilBERT.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

It looks like MobileBERT was pretrained on TPUs using bfloat16, which often results in NaNs when FP16 is used for further fine-tuning (see #11076 or #10956). You'll be best off training in FP32 or using another model compatible with FP16.
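The range mismatch behind this is easy to demonstrate: bfloat16 keeps the float32 exponent range, while float16 overflows past ~65504, so activations that were harmless during bf16 pretraining can turn into inf and then NaN under fp16. A minimal illustration (not from the thread):

```python
# Minimal illustration of the bf16-vs-fp16 range mismatch.
import torch

x = torch.tensor(1e5)        # larger than the float16 maximum (~65504)
print(x.to(torch.bfloat16))  # stays finite: bfloat16 shares float32's exponent range
print(x.to(torch.float16))   # overflows to inf
y = x.to(torch.float16)
print(y - y)                 # inf - inf = nan, which then propagates through the loss
```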

@tblattner
Contributor Author

Makes sense! It's interesting that this affects training on GPUs. I will pass this info on to my colleague who deals with reproducibility, and for now I shall stick with FP32 when fine-tuning the MobileBERT model.

Many thanks for the reply!

@stas00
Contributor

stas00 commented Apr 20, 2021

You'll be best off training in FP32 or use another model compatible with FP16.

And at some point we should also add a --bf16 mode to Trainer, for those who want to do fine-tuning and inference on hardware that supports it, e.g. high-end Ampere GPUs (RTX 3090, A100), and of course TPU v2+.

Does it make sense?

FYI, bf16 AMP is being discussed here: pytorch/pytorch#55374
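For later readers: bf16 autocast did land in PyTorch (1.10+). A bare-bones usage sketch, separate from anything Trainer supported at the time of this thread, looks roughly like this on bf16-capable hardware (e.g. Ampere GPUs):

```python
# Hedged sketch of bf16 autocast in PyTorch 1.10+; not what Trainer did
# at the time of this discussion.
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

# No GradScaler needed: bf16 has the same exponent range as fp32,
# so dynamic loss scaling is generally unnecessary.
loss.backward()
optimizer.step()
```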
