
run_ner.py example MobileBERT FP16 returns nan loss #11327

Closed
tblattner opened this issue Apr 19, 2021 · 3 comments

@tblattner
Contributor

tblattner commented Apr 19, 2021

Environment info

  • transformers version: 4.5.1
  • Platform: Linux-5.8.0-44-generic-x86_64-with-glibc2.10
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.7.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes (RTX 2080 Ti)
  • Using distributed or parallel set-up in script?: No

Who can help

@sgugger @stas00 @patil-suraj

Information

Model I am using: MobileBERT

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: conll2003
  • my own task or dataset: (give details below)

To reproduce

Using the example: https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner.py

Steps to reproduce the behavior:

  1. Add training_args.fp16 = True to main() after initializing training_args (see the sketch after this list)
  2. Pass the following parameters to run_ner.py:
     --model_name_or_path google/mobilebert-uncased
     --dataset_name conll2003
     --output_dir /path/to/output
     --do_eval
     --do_train
     --do_predict
  3. The training loss returns NaN
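For reference, the step-1 change is a one-line edit after the argument parsing in run_ner.py's main(); the sketch below shows roughly where it goes (ModelArguments and DataTrainingArguments are the dataclasses already defined in that script), and the same effect can be had by passing --fp16 on the command line:

```python
# Sketch of the step-1 edit inside run_ner.py's main(); ModelArguments and
# DataTrainingArguments are defined earlier in that script.
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()

# Force FP16 mixed-precision training; equivalent to passing --fp16 on the command line.
training_args.fp16 = True
```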

I first observed NaNs appearing in the encoder output within the forward call of the MobileBertModel class:
https://huggingface.co/transformers/_modules/transformers/modeling_mobilebert.html
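A small debugging helper like the one below (not part of run_ner.py, just a sketch) can be used to localize this: it registers a forward hook on every submodule and reports the first one whose output contains non-finite values.

```python
# Hypothetical debugging helper: flag the first submodule whose forward
# output contains NaN or Inf.
import torch

def register_nan_hooks(model):
    def make_hook(name):
        def hook(module, inputs, output):
            if isinstance(output, dict):          # e.g. transformers ModelOutput
                tensors = output.values()
            elif isinstance(output, (tuple, list)):
                tensors = output
            else:
                tensors = (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"Non-finite output first seen in: {name} ({module.__class__.__name__})")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))
```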

Expected behavior

When running without FP16, the model trains as expected. Other models that I have tested did not have this issue and converge well with fp16 enabled: RoBERTa, BERT, and DistilBERT.

@sgugger
Collaborator

sgugger commented Apr 20, 2021

It looks like MobileBERT was pretrained on TPUs using bfloat16, which often results in NaNs when FP16 is used for further fine-tuning (see #11076 or #10956). You'll be best off training in FP32 or using another model compatible with FP16.
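The range mismatch behind this is easy to demonstrate: bfloat16 keeps the float32 exponent range, while float16 overflows past ~65504, so activations that were harmless during bf16 pretraining can turn into inf and then NaN under fp16. A minimal illustration (not from the thread):

```python
# Minimal illustration of the bf16-vs-fp16 range mismatch.
import torch

x = torch.tensor(1e5)        # larger than the float16 maximum (~65504)
print(x.to(torch.bfloat16))  # stays finite: bfloat16 shares float32's exponent range
print(x.to(torch.float16))   # overflows to inf
y = x.to(torch.float16)
print(y - y)                 # inf - inf = nan, which then propagates through the loss
```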

@tblattner
Contributor Author

Makes sense! It's interesting that this affects training on GPUs. I will pass this info on to my colleague who deals with reproducibility, and for now I shall stick with FP32 when fine-tuning the MobileBERT model.

Many thanks for the reply!

@stas00
Contributor

stas00 commented Apr 20, 2021

You'll be best off training in FP32 or use another model compatible with FP16.

And at some point we should also add a --bf16 mode to Trainer, for those who want to do fine-tuning and inference on hardware that supports it, e.g. high-end Ampere GPUs (RTX 3090, A100), and of course TPU v2+.

Does it make sense?

FYI, bf16 AMP is being discussed here: pytorch/pytorch#55374
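For later readers: bf16 autocast did land in PyTorch (1.10+). A bare-bones usage sketch, separate from anything Trainer supported at the time of this thread, looks roughly like this on bf16-capable hardware (e.g. Ampere GPUs):

```python
# Hedged sketch of bf16 autocast in PyTorch 1.10+; not what Trainer did
# at the time of this discussion.
import torch

model = torch.nn.Linear(16, 4).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
inputs = torch.randn(8, 16, device="cuda")
targets = torch.randn(8, 4, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(inputs), targets)

# No GradScaler needed: bf16 has the same exponent range as fp32,
# so dynamic loss scaling is generally unnecessary.
loss.backward()
optimizer.step()
```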
