Running run_translation.py with mt5 model, but loss is always 0.0 #22467

Closed
2 of 4 tasks
SefaZeng opened this issue Mar 30, 2023 · 4 comments

SefaZeng commented Mar 30, 2023

System Info

transformers version 4.28.0.dev

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Training script:
python3 -m torch.distributed.launch --nproc_per_node=8 \
  --nnodes=${WORLD_SIZE} --node_rank=${RANK} --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT ${code_dir}/run_translation.py \
    --model_name_or_path ${work_dir}/../pretrain_models/mt0-base \
    --train_file ${data_dir}/ja2zh.json \
    --validation_file ${data_dir}/ja2zh-head10.json \
    --source_lang ja \
    --target_lang zh \
    --source_prefix "translate Japanese to Chinese: " \
    --warmup_ratio 0.1 \
    --save_total_limit 10 \
    --save_steps 5000 \
    --logging_steps 1 \
    --weight_decay 0.001 \
    --adam_beta2 0.98 \
    --learning_rate 2e-4 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --cache_dir ${data_dir}/cache/ \
    --do_train \
    --do_eval \
    --fp16 \
    --output_dir ${ckpt_dir}/hf \
    --preprocessing_num_workers 40 \
2>&1 |tee ${LOG_FILE}

mt0-base was cloned from the Hugging Face Hub, and the loss is always 0.0:

[INFO|trainer.py:598] 2023-03-30 09:56:13,151 >> Using cuda_amp half precision backend
/home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/user/miniconda/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1743] 2023-03-30 09:56:13,677 >> ***** Running training *****
[INFO|trainer.py:1744] 2023-03-30 09:56:13,677 >>   Num examples = 31729970
[INFO|trainer.py:1745] 2023-03-30 09:56:13,677 >>   Num Epochs = 1
[INFO|trainer.py:1746] 2023-03-30 09:56:13,677 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1747] 2023-03-30 09:56:13,677 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1748] 2023-03-30 09:56:13,677 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1749] 2023-03-30 09:56:13,677 >>   Total optimization steps = 991562
[INFO|trainer.py:1750] 2023-03-30 09:56:13,680 >>   Number of trainable parameters = 1229581312
[WARNING|logging.py:280] 2023-03-30 09:56:19,819 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[WARNING|logging.py:280] 2023-03-30 09:56:20,010 >> You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1303] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}
{'loss': 0.0, 'learning_rate': 0.0, 'epoch': 0.0}

But if I train an mt5 model from scratch on my MT data, the loss looks fine. Did I miss something?
Any advice is appreciated! Thanks in advance!
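
A quick way to check whether the zero loss comes from half-precision overflow (a minimal sketch, not part of the original report; the Hub checkpoint name and the toy sentence pair are assumptions) is to compare a single forward pass in fp32 and under fp16 autocast:

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/mt0-base"  # assumption: replace with the local mt0-base checkpoint used above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).cuda().eval()

# Toy ja->zh pair, prefixed the same way as in the training command
inputs = tokenizer("translate Japanese to Chinese: こんにちは", return_tensors="pt").to("cuda")
labels = tokenizer(text_target="你好", return_tensors="pt").input_ids.to("cuda")

with torch.no_grad():
    fp32_loss = model(**inputs, labels=labels).loss
    with torch.autocast("cuda", dtype=torch.float16):
        fp16_loss = model(**inputs, labels=labels).loss

print("fp32 loss:", fp32_loss.item())  # expected: a finite, non-zero value
print("fp16 loss:", fp16_loss.item())  # nan/inf here points to fp16 overflow in the model

If the fp16 value is nan or inf while the fp32 value is finite, the problem lies in the precision setting rather than in the data.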

Expected behavior

The loss should be larger than 0.0 and the model parameters should update.

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this as completed May 7, 2023

Victordongy commented Aug 26, 2023

It seems this issue still persists; maybe we could consider re-opening it. I also hit a similar issue with the loss being 0 after one iteration when using 8-bit or fp16, on transformers version 4.32.0. @younesbelkada could you help take a look at this issue?

My system info is as follows:

  • transformers version: 4.32.0
  • Platform: Linux-5.4.0-150-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.20.3
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.0+cu117 (True)
  • Tensorflow version (GPU?): 2.12.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

@ArthurZucker
Collaborator

Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉
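
For reference, a minimal sketch of what that thread usually recommends (not a verified fix for this exact setup): keep the model out of pure fp16 by training in bf16 on hardware that supports it, or in full fp32. Only the precision-related arguments are shown; the other values mirror the command in the original report.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="hf",                 # placeholder output path
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    num_train_epochs=1,
    bf16=True,                       # instead of fp16=True
)

With the example script this corresponds to passing --bf16 instead of --fp16.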

@Victordongy

Inviting you to read #10956, which has a very detailed explanation and a potential solution for you 😉

Hi @ArthurZucker, quoting from #10956 (comment): it seems the experimental change has not been merged, and there are not many related performance experiments either. However, from PR #20760 I noticed that the 8-bit workaround first converts some of the modules to fp16 while leaving the others unchanged. I wonder whether this might also be a feasible solution for fp16 training?
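
To make that question concrete, here is a rough illustration of the mixed-dtype idea (an assumption-laden sketch, not the actual mechanism of the 8-bit integration): cast the model to fp16 but put numerically sensitive modules, such as the layer norms and the T5 feed-forward wo projections discussed around #20760, back in fp32.

import torch
from transformers import AutoModelForSeq2SeqLM

# "bigscience/mt0-base" is an assumed checkpoint name; use your own path
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-base")
model = model.half()

# Keep layer norms and the feed-forward output projections ("wo") in fp32
for name, module in model.named_modules():
    if "layer_norm" in name or name.endswith(".wo"):
        module.to(torch.float32)

Caveat: this only helps if the forward pass also upcasts the activations to match these modules (recent T5 code casts hidden states to wo's dtype before that projection), so treat it as a starting point for experiments rather than a verified recipe.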
