Resume from checkpoint does not work #5
Comments
Actually, the issue was caused by a wrong checkpoint location. Previously we gave the location like this:
$ python run_qa.py --model_name_or_path roberta-base --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 24 --per_device_eval_batch_size 8 --use_habana --use_lazy_mode --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./output/ --resume_from_checkpoint ./output/
It works fine with the correct checkpoint path.
Actually, it is supposed to work when given the last saved checkpoint folder. We found that there is an issue on the trainer side.
Currently, resuming from a checkpoint does not work if the training run ends abruptly in the middle of an epoch. It does not pick up the global last saved checkpoint step. Instead, it picks up the last step that ended gracefully.
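To make the expected behavior concrete, here is a minimal sketch of picking the most recently saved checkpoint folder by its step number. It assumes checkpoints are written as `checkpoint-<global_step>` subfolders of the output directory, which is the usual Trainer layout; the helper name `find_last_checkpoint` is hypothetical and is not the trainer's actual code. It also does not check that the newest folder was written completely, which matters when a run is killed mid-save.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints live in subfolders named "checkpoint-<global_step>".
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

The returned path, rather than the bare output directory, is what would be passed to `--resume_from_checkpoint`. If I remember correctly, `transformers.trainer_utils.get_last_checkpoint` offers similar lookup logic.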
Could you tell me if you still encounter this issue with an up-to-date version of the package?
Can't reproduce after pull request 11.
Add latest main
Error Message:
Command used to run training:
Method for reproducing the issue:
Attached Log file:
albert_xxlarge_bf16_squad_continued.log