fix redundant checkpointing in example training scripts #33131

eminorhan · 2024-08-26T19:29:35Z

What does this PR do?

Briefly, in several of the example training scripts, running with gradient_accumulation_steps > 1 currently saves the same checkpoint gradient_accumulation_steps times in step-based checkpointing mode, because the checkpointing condition is evaluated on every batch while completed_steps only advances once per accumulation cycle. This PR fixes the issue by adding a clause to the step-based checkpointing condition so that saving happens only once at the appropriate checkpointing step.
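
The duplication can be reproduced without running any training. The following is a minimal sketch, not code from the example scripts: `simulate_saves` is a hypothetical helper, and it assumes that an optimizer step completes on every `gradient_accumulation_steps`-th batch while the checkpoint check runs on every batch.

```python
def simulate_saves(num_batches, gradient_accumulation_steps, checkpointing_steps):
    """Return (step, completed_steps) for every batch on which a save would fire."""
    completed_steps = 0
    saves = []
    for step in range(num_batches):
        # Assumption: an optimizer step completes on every N-th batch.
        if (step + 1) % gradient_accumulation_steps == 0:
            completed_steps += 1
        # Pre-fix condition: evaluated on every batch, including the
        # accumulation-only batches where completed_steps has not changed.
        if completed_steps % checkpointing_steps == 0:
            saves.append((step, completed_steps))
    return saves


saves = simulate_saves(num_batches=16, gradient_accumulation_steps=4, checkpointing_steps=2)
repeats_at_step_2 = [s for s in saves if s[1] == 2]
print(len(repeats_at_step_2))
```

With these illustrative values, the checkpoint for completed step 2 is written on four consecutive batches, once per accumulation step.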

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@ArthurZucker

LysandreJik

Thanks for your PR @eminorhan!

Just a quick ping @SunMarc as you work with the Trainer and example scripts much more than I do and might see something I don't

SunMarc

Thanks for finding and fixing the issue @eminorhan ! I left a suggestion to propagate to other scripts if you think it is better !

SunMarc · 2024-08-27T12:33:23Z

examples/pytorch/image-classification/run_image_classification_no_trainer.py

@@ -544,7 +544,7 @@ def collate_fn(examples):
                completed_steps += 1

            if isinstance(checkpointing_steps, int):
-                if completed_steps % checkpointing_steps == 0:
+                if completed_steps % checkpointing_steps == 0 and step % args.gradient_accumulation_steps == 0:


Suggested change

- if completed_steps % checkpointing_steps == 0 and step % args.gradient_accumulation_steps == 0:
+ if completed_steps % checkpointing_steps == 0 and accelerator.sync_gradients:
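
To see how the two candidate conditions differ, here is a small sketch that replays both on a simulated loop. It is not taken from the scripts: `saves_for` is a hypothetical helper, and the local `sync_gradients` flag is only a stand-in for `accelerator.sync_gradients`, assumed here to be true on every `gradient_accumulation_steps`-th batch.

```python
def saves_for(condition, num_batches=16, gradient_accumulation_steps=4, checkpointing_steps=2):
    """Return (step, completed_steps) for each batch on which `condition` would save."""
    completed_steps = 0
    saves = []
    for step in range(num_batches):
        # Stand-in for accelerator.sync_gradients (assumption: True on every N-th batch).
        sync_gradients = (step + 1) % gradient_accumulation_steps == 0
        if sync_gradients:
            completed_steps += 1

        due = completed_steps % checkpointing_steps == 0
        if condition == "step_modulo":
            fire = due and step % gradient_accumulation_steps == 0
        elif condition == "sync_gradients":
            fire = due and sync_gradients
        else:
            raise ValueError(f"unknown condition: {condition}")

        if fire:
            saves.append((step, completed_steps))
    return saves


step_modulo_saves = saves_for("step_modulo")
sync_gradients_saves = saves_for("sync_gradients")
print(step_modulo_saves)
print(sync_gradients_saves)
```

Under these assumptions both conditions save once per due checkpoint. The step-modulo form also fires on the very first batch, before any optimizer step has completed, and otherwise fires one batch after the update. The `sync_gradients` form fires on the batch that completes the update. The real flag may also be set at the end of a dataloader, which this sketch ignores.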

eminorhan · 2024-08-27

@SunMarc your suggestion looks good to me, but I noticed after committing your change that it doesn't change all the scripts. Is there a quick way to fix this?

SunMarc

Here you go ! I added the remaining suggestion

examples/pytorch/image-pretraining/run_mim_no_trainer.py

examples/pytorch/instance-segmentation/run_instance_segmentation_no_trainer.py

examples/pytorch/language-modeling/run_clm_no_trainer.py

examples/pytorch/language-modeling/run_fim_no_trainer.py

examples/pytorch/language-modeling/run_mlm_no_trainer.py

examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py

examples/pytorch/summarization/run_summarization_no_trainer.py

examples/pytorch/text-classification/run_glue_no_trainer.py

examples/pytorch/token-classification/run_ner_no_trainer.py

examples/pytorch/translation/run_translation_no_trainer.py

HuggingFaceDocBuilderDev · 2024-08-27T14:10:04Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…33131)

* fix redundant checkpointing in example scripts
* Update examples/pytorch/image-classification/run_image_classification_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/translation/run_translation_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/token-classification/run_ner_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/text-classification/run_glue_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/summarization/run_summarization_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/semantic-segmentation/run_semantic_segmentation_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_mlm_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_fim_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/language-modeling/run_clm_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/image-pretraining/run_mim_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/instance-segmentation/run_instance_segmentation_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/multiple-choice/run_swag_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/question-answering/run_qa_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/object-detection/run_object_detection_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update examples/pytorch/question-answering/run_qa_beam_search_no_trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

fix redundant checkpointing in example scripts

94c314c

eminorhan changed the title ~~fix redundant checkpointing in example scripts~~ fix redundant checkpointing in example training scripts Aug 26, 2024

eminorhan mentioned this pull request Aug 26, 2024

redundant checkpointing in example scripts #32653

Closed

4 tasks

LysandreJik approved these changes Aug 27, 2024

SunMarc approved these changes Aug 27, 2024

Update examples/pytorch/image-classification/run_image_classification…

1c57a74

…_no_trainer.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

SunMarc approved these changes Aug 27, 2024

eminorhan and others added 14 commits August 27, 2024 09:27

Update examples/pytorch/translation/run_translation_no_trainer.py

10f3aed

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/token-classification/run_ner_no_trainer.py

c533d41

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/text-classification/run_glue_no_trainer.py

cd7b366

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/summarization/run_summarization_no_trainer.py

01e941b

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/semantic-segmentation/run_semantic_segmentati…

8eaa23c

…on_no_trainer.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/language-modeling/run_mlm_no_trainer.py

ac63abe

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/language-modeling/run_fim_no_trainer.py

e25fcd4

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/language-modeling/run_clm_no_trainer.py

7a5b2df

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/image-pretraining/run_mim_no_trainer.py

7f87ee6

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/instance-segmentation/run_instance_segmentati…

51dbd37

…on_no_trainer.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/multiple-choice/run_swag_no_trainer.py

72be88c

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/question-answering/run_qa_no_trainer.py

823ff12

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/object-detection/run_object_detection_no_trai…

3bffa0f

…ner.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update examples/pytorch/question-answering/run_qa_beam_search_no_trai…

f91c518

…ner.py Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

SunMarc merged commit d47a9e8 into huggingface:main Aug 27, 2024
7 checks passed

eminorhan deleted the remove-redundant-checkpointing branch August 27, 2024 17:54
