Running run_translation.py with mt5 model, but loss is always 0.0 #22467

Closed
2 of 4 tasks
SefaZeng opened this issue Mar 30, 2023 · 4 comments

SefaZeng commented Mar 30, 2023

System Info

transformers version 4.28.0.dev

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Training script:
python3 -m torch.distributed.launch --nproc_per_node=8 \
  --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT ${code_dir}/run_translation.py \
    --model_name_or_path ${work_dir}/../pretrain_models/mt0-base \
    --train_file ${data_dir}/ja2zh.json \
    --validation_file ${data_dir}/ja2zh-head10.json \
    --source_lang ja \
    --target_lang zh \
    --source_prefix "translate Japanese to Chinese: " \
    --warmup_ratio 0.1 \
    --save_total_limit 10 \
    --save_steps 5000 \
    --logging_steps 1 \
    --weight_decay 0.001 \
    --adam_beta2 0.98 \
    --learning_rate 2e-4 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --cache_dir ${data_dir}/cache/ \
    --do_train \
    --do_eval \
    --fp16 \
    --output_dir ${ckpt_dir}/hf \
    --preprocessing_num_workers 40 \
2>&1 |tee ${LOG_FILE}

mt0-base was cloned from the Hugging Face Hub, and the loss is always 0.0:

[INFO|trainer.py:598] 2023-03-30 09:56:13,151 >> Using cuda_amp half precision backend
/home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1743] 2023-03-30 09:56:13,677 >> ***** Running training *****
[INFO|trainer.py:1744] 2023-03-30 09:56:13,677 >>   Num examples = 31729970
[INFO|trainer.py:1745] 2023-03-30 09:56:13,677 >>   Num Epochs = 1
[INFO|trainer.py:1746] 2023-03-30 09:56:13,677 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1747] 2023-03-30 09:56:13,677 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1748] 2023-03-30 09:56:13,677 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1749] 2023-03-30 09:56:13,677 >>   Total optimization steps = 991562
[INFO|trainer.py:1750] 2023-03-30 09:56:13,680 >>   Number of trainable parameters = 1229581312
[WARNING|logging.py:280] 2023-03-30 09:56:19,819 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-03-30 09:56:20,010 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}

But if I train an mt5 model from scratch on my MT data, the loss looks fine. Did I miss something?
Any advice is appreciated! Thanks in advance!
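
A quick way to check whether the zero loss comes from half-precision overflow (a minimal sketch, not part of the original report; the Hub checkpoint name and the toy sentence pair are assumptions) is to compare a single forward pass in fp32 and under fp16 autocast:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/mt0-base"  # assumption: replace with the local mt0-base checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda().eval()

# Toy ja->zh pair, prefixed the same way as in the training command
inputs = tokenizer("translate Japanese to Chinese: こんにちは", return_tensors="pt").to("cuda")
labels = tokenizer(text_target="你好", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    fp32_loss = model(**inputs, labels=labels).loss
    with torch.autocast("cuda", dtype=torch.float16):
        fp16_loss = model(**inputs, labels=labels).loss

print("fp32 loss:", fp32_loss.item())  # expected: a finite, non-zero value
print("fp16 loss:", fp16_loss.item())  # nan/inf here points to fp16 overflow in the model

If the fp16 value is nan or inf while the fp32 value is finite, the problem lies in the precision setting rather than in the data.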

Expected behavior

The loss should be larger than 0.0 and the model parameters should update.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed May 7, 2023

Victordongy commented Aug 26, 2023

It seems this issue still persists; maybe we could consider re-opening it. I also hit a similar issue with the loss being 0 after one iteration when using 8-bit or fp16, on transformers version 4.32.0. @younesbelkada could you help take a look at this issue?

My system info is as follows:

  • transformers version: 4.32.0
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.20.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

@ArthurZucker
Collaborator

Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉
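
For reference, a minimal sketch of what that thread usually recommends (not a verified fix for this exact setup): keep the model out of pure fp16 by training in bf16 on hardware that supports it, or in full fp32. Only the precision-related arguments are shown; the other values mirror the command in the original report.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="hf",                 # placeholder output path
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    num_train_epochs=1,
    bf16=True,                       # instead of fp16=True
)

With the example script this corresponds to passing --bf16 instead of --fp16.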

@Victordongy

Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉

Hi @ArthurZucker, quoting from #10956 (comment): it seems the experimental change has not been merged, and there are not many related performance experiments either. However, from PR #20760 I noticed that the 8-bit workaround first converts some of the modules to fp16 while leaving the others unchanged. I wonder whether this might also be a feasible solution for fp16 training?
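
To make that question concrete, here is a rough illustration of the mixed-dtype idea (an assumption-laden sketch, not the actual mechanism of the 8-bit integration): cast the model to fp16 but put numerically sensitive modules, such as the layer norms and the T5 feed-forward wo projections discussed around #20760, back in fp32.

import torch
from transformers import AutoModelForSeq2SeqLM

# "bigscience/mt0-base" is an assumed checkpoint name; use your own path
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-base")
model = model.half()

# Keep layer norms and the feed-forward output projections ("wo") in fp32
for name, module in model.named_modules():
    if "layer_norm" in name or name.endswith(".wo"):
        module.to(torch.float32)

Caveat: this only helps if the forward pass also upcasts the activations to match these modules (recent T5 code casts hidden states to wo's dtype before that projection), so treat it as a starting point for experiments rather than a verified recipe.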
