Resume from checkpoint does not work #5

Closed
MohitIntel opened this issue Mar 29, 2022 · 5 comments

Comments

MohitIntel (Collaborator) commented Mar 29, 2022

Error Message:

Traceback (most recent call last):
  File "examples/question-answering/run_qa.py", line 664, in <module>
    main()
  File "examples/question-answering/run_qa.py", line 605, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/optimum/habana/trainer.py", line 517, in train
    self._load_optimizer_and_scheduler(resume_from_checkpoint)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1795, in _load_optimizer_and_scheduler
    torch.load(os.path.join(checkpoint, OPTIMIZER_NAME), map_location=map_location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 846, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 827, in restore_location
    return default_restore_location(storage, str(map_location))
  File "/usr/local/lib/python3.8/dist-packages/torch/serialization.py", line 178, in default_restore_location
    raise RuntimeError("don't know how to restore data location of "
RuntimeError: don't know how to restore data location of torch.FloatStorage (tagged with hpu)
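
For context, the failure happens because torch.load encounters tensor storages that were serialized while they lived on an HPU device, so each one carries an "hpu" location tag that stock PyTorch has no restore handler for. A minimal sketch of the usual workaround, assuming a hypothetical checkpoint path (this is not necessarily the fix that was later shipped):

import torch

# Hypothetical path for illustration. Loading this file without a
# map_location raises the RuntimeError above, because every storage inside
# the pickle is tagged "hpu" and plain torch.load cannot restore that tag.
ckpt = "./albert_xxlarge_bf16_squad/checkpoint-5000/optimizer.pt"

# Workaround: deserialize every storage onto CPU first; the trainer can then
# move the state tensors back to the accelerator itself.
state = torch.load(ckpt, map_location="cpu")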

Command used to run training:

python examples/question-answering/run_qa.py --model_name_or_path albert-xxlarge-v1 --dataset_name squad  --do_train --do_eval --per_device_train_batch_size=12 --learning_rate=5e-06 --num_train_epochs 2 --save_steps 5000 --seed 42 --doc_stride 128 --max_seq_length 384 --per_device_eval_batch_size 2 --use_lazy_mode  --use_habana --output_dir=./albert_xxlarge_bf16_squad 2>&1 | tee albert_xxlarge_bf16_squad_continued.log

Method for reproducing the issue:

  1. Use the above command to run the training.
  2. Halt the training after a few steps/epochs.
  3. Resume the training using the same command with the --resume_from_checkpoint flag pointing to the output directory of the above command (the flag reaches the trainer as sketched after these steps).
  4. The above error is encountered.
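
For reference, this is roughly how the example script hands the flag to the trainer; a paraphrased sketch of run_qa.py's main() (get_last_checkpoint is the standard transformers helper; training_args and trainer are defined earlier in the script):

from transformers.trainer_utils import get_last_checkpoint

# Resolve which checkpoint to resume from: an explicit
# --resume_from_checkpoint value wins; otherwise fall back to the newest
# checkpoint-* folder found in the output directory, if any.
last_checkpoint = get_last_checkpoint(training_args.output_dir)
checkpoint = training_args.resume_from_checkpoint or last_checkpoint
train_result = trainer.train(resume_from_checkpoint=checkpoint)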

Attached log file:
albert_xxlarge_bf16_squad_continued.log

yeonsily (Collaborator) commented Mar 31, 2022

Actually, the issue was caused by a wrong checkpoint location. Previously we passed the location like this:
--resume_from_checkpoint ./output/checkpoint-3500
but it is supposed to be just ./output.

It works fine with the correct checkpoint path. Here is an example command to verify it:

$ python run_qa.py --model_name_or_path roberta-base --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 24 --per_device_eval_batch_size 8 --use_habana --use_lazy_mode --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./output/ --resume_from_checkpoint ./output/
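
As a side note, Trainer.train() also accepts resume_from_checkpoint=True, in which case it looks up the newest checkpoint-* folder under output_dir by itself. A quick way to see which checkpoints are actually on disk, assuming the ./output layout from the command above:

import os
import re

# Print the checkpoint-* folders the trainer could resume from, together
# with the files each one contains (optimizer.pt and scheduler.pt are the
# pieces being deserialized in the traceback above).
output_dir = "./output"
for name in sorted(os.listdir(output_dir)):
    if re.fullmatch(r"checkpoint-\d+", name):
        print(name, sorted(os.listdir(os.path.join(output_dir, name))))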

yeonsily (Collaborator) commented Apr 1, 2022

Actually, it is supposed to work when given the last saved checkpoint folder, e.g.:
--resume_from_checkpoint ./output/checkpoint-3500

We found that there is an issue on the trainer side.
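
Given the traceback, one plausible shape for such a trainer-side fix would be to deserialize the optimizer and scheduler state on CPU before handing it over. This is only a sketch of the idea, not the actual optimum-habana patch:

import os
import torch

# Sketch: load the serialized state onto CPU so the unknown "hpu" location
# tag is never dereferenced; Optimizer.load_state_dict then casts the state
# tensors to match the device of the live parameters.
def load_optimizer_and_scheduler(trainer, checkpoint):
    if checkpoint is None:
        return
    trainer.optimizer.load_state_dict(
        torch.load(os.path.join(checkpoint, "optimizer.pt"), map_location="cpu")
    )
    trainer.lr_scheduler.load_state_dict(
        torch.load(os.path.join(checkpoint, "scheduler.pt"), map_location="cpu")
    )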

yeonsily reopened this Apr 1, 2022
MohitIntel (Collaborator, Author) commented:

Currently, resuming from a checkpoint does not work if the training run ends abruptly in the middle of an epoch. It does not pick up the globally last saved checkpoint step; instead, it picks up the last step that ended gracefully.
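
A quick way to confirm which step a checkpoint actually captured is to read the trainer_state.json that transformers writes into every checkpoint folder (hypothetical path below):

import json

# trainer_state.json records the global_step the checkpoint was written at.
# If a resumed run starts from an earlier step than the newest checkpoint's
# global_step, resume picked up the wrong checkpoint.
with open("./output/checkpoint-5000/trainer_state.json") as f:
    state = json.load(f)
print(state["global_step"])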

regisss (Collaborator) commented Apr 4, 2022

Could you tell me if you still encounter this issue with an up-to-date version of the package?

libinta closed this as completed Apr 7, 2022
libinta (Collaborator) commented Apr 7, 2022

Can't reproduce after pull request #11.

regisss pushed a commit that referenced this issue Dec 3, 2024
Co-authored-by: Urszula Golowicz <urszula.golowicz@intel.com>
asafkar pushed a commit to asafkar/optimum-habana that referenced this issue Mar 6, 2025