Floating-point operations logging in trainer #6768
Conversation
Codecov Report

@@            Coverage Diff            @@
##           master    #6768      +/-  ##
=========================================
+ Coverage   78.47%   79.65%   +1.17%
=========================================
  Files         157      157
  Lines       28569    28625      +56
=========================================
+ Hits        22420    22800     +380
+ Misses       6149     5825     -324

Continue to review the full report at Codecov.
Thanks for the PR! Got a few comments on my side.
src/transformers/trainer.py (Outdated)

@@ -690,6 +701,12 @@ def train(self, model_path: Optional[str] = None, trial: Union["optuna.Trial", D

                tr_loss += self.training_step(model, inputs)

                try:
Can we make a cleaner test with isinstance(model, nn.DataParallel)?
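For illustration, such a check might look like the sketch below (the helper name and its usage are assumptions, not code from this PR):

import torch.nn as nn


def unwrap_data_parallel(model: nn.Module) -> nn.Module:
    # An explicit isinstance check is clearer than a try/except around
    # attribute access: DataParallel keeps the wrapped model on .module.
    if isinstance(model, nn.DataParallel):
        return model.module
    return model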
src/transformers/trainer.py (Outdated)

        # Save a trained model and configuration using `save_pretrained()`.
        # They can then be reloaded using `from_pretrained()`
        if not isinstance(self.model, PreTrainedModel):
            raise ValueError("Trainer.model appears to not be a PreTrainedModel")

        xm.rendezvous("saving_checkpoint")
        # Storing the number of floating-point operations that went into the model
Those 7 lines are duplicated, maybe put them in a private method to refactor a bit?
agreed, done
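For illustration, the shared lines could move into a private Trainer method roughly like this (the name _store_flos and the exact condition are assumptions, not necessarily the PR's final code):

    def _store_flos(self):
        # Store the number of floating-point operations that went into the
        # model on its config, so the count is saved with every checkpoint
        # and can be reloaded later.
        if self.total_flos is not None and hasattr(self.model, "config"):
            self.model.config.total_flos = self.total_flos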
src/transformers/trainer.py (Outdated)

            concat = concat[:num_total_examples]
        return concat

    def distributed_broadcast_scalars(
This doesn't seem to use self (and neither does distributed_concat), so maybe those two should be functions rather than methods?
Agreed, and I will move them out. Do you think we should keep a redirection so that distributed_concat stays backwards-compatible?
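A sketch of what the module-level version could look like, with assumed names and signatures (not the PR's actual code):

from typing import Optional

import torch
import torch.distributed as dist


def distributed_concat(tensor: torch.Tensor, num_total_examples: Optional[int] = None) -> torch.Tensor:
    # Gather the tensor from every process and concatenate along dim 0.
    output_tensors = [tensor.clone() for _ in range(dist.get_world_size())]
    dist.all_gather(output_tensors, tensor)
    concat = torch.cat(output_tensors, dim=0)
    if num_total_examples is not None:
        # Drop the samples added as padding by the distributed sampler.
        concat = concat[:num_total_examples]
    return concat

# A deprecated Trainer.distributed_concat method could then simply emit a
# FutureWarning and delegate to this function to stay backwards-compatible.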
            another model, either implement such a method in the model or override this method.

            Args:
                model (:obj:`nn.Module`):
Can't we use self.model?
Yep, changed it; it also lets us save a few lines in the main method.
We can remove the docstring as well
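The method in question appears to be the Trainer's floating-point-operations helper; a sketch of the simplified shape after the change (assumed names, not the exact final code):

    def floating_point_ops(self, inputs):
        # Reads self.model directly instead of taking a model argument, so
        # the `model` entry in the docstring above can be dropped as well.
        if hasattr(self.model, "floating_point_ops"):
            return self.model.floating_point_ops(inputs)
        return 0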
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
LGTM, left a few comments.
        # in case the model has no config
        combined_dict = {**self.args.to_sanitized_dict()}
Is there an example of a model without a configuration?
Ah yes, it's something @sgugger mentioned as well: when writing for Trainer we assume the model is a PretrainedModel, but the model used in the test doesn't inherit from PretrainedModel, which is why I put this in. @julien-c also liked the idea of Trainer being domain-agnostic (i.e. not only NLP), so I figured I might as well add this line since it isn't expensive. In the end it's something we might want to think about, since there are a lot of references to model.config elsewhere (for example when training on TPU, which the test doesn't cover).
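Concretely, the guard around the snippet above could look something like this (a sketch; the surrounding wandb/comet setup code is assumed):

        # Only merge the model config into the logged hyperparameters if the
        # model actually has one, so plain nn.Module models still work.
        combined_dict = {**self.args.to_sanitized_dict()}
        if hasattr(model, "config") and model.config is not None:
            combined_dict = {**model.config.to_dict(), **combined_dict}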
        self.total_flos = getattr(model.config, "total_flos", 0)
Wouldn't this fail if the model didn't have a config?
Yes, the dummy test model doesn't go through this code path since it doesn't have a method to calculate flos, so I didn't catch it! See above: I think we'll have to decide whether or not we want to assume the model has a config.
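If the config is made optional, the lookup could be hedged along these lines (a sketch, not the PR's final code):

        # Fall back to 0 when the model has no config or no stored FLO count.
        if hasattr(model, "config") and model.config is not None:
            self.total_flos = getattr(model.config, "total_flos", 0)
        else:
            self.total_flos = 0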
I think even with domain-agnostic models we'd like to keep the configuration, no? I'm not sure the trainer would behave correctly without a configuration, so if we want to remove the dependency on configurations, we might as well do it all at once, right? Would the goal be to have the trainer accept all …
As agreed upon internally, we will move to Trainer accepting models that instantiate a base abstract class / conform to some protocol. I think the config will be among the required fields, but I have to work a bit more on this to be sure. In any case, this is work for a subsequent PR :-)
* neFLOs calculation, logging, and reloading (huggingface#1)
* testing distributed consecutive batches
* fixed AttributeError from DataParallel
* removed verbosity
* rotate with use_mtime=True
* removed print
* fixed interaction with gradient accumulation
* indent formatting
* distributed neflo counting
* fixed typo
* fixed typo
* mean distributed losses
* exporting log history
* moved a few functions
* floating_point_ops clarification for transformers with parameter-reuse
* code quality
* double import
* made flo estimation more task-agnostic
* only logging flos if computed
* code quality
* unused import
* Update src/transformers/trainer.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* Update src/transformers/modeling_utils.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* Sylvain review
* Update src/transformers/modeling_utils.py (Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>)
* black

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
First of two PRs to implement #4847.

This directly logs floating-point operations in wandb and comet, and creates a log_history.json file with the training metrics. To do so, it adds methods to PretrainedModel to count parameters with and without embeddings, and to count the number of floating-point operations. It also includes a few Trainer fixes, most importantly averaging the eval loss across processes rather than logging only the loss from process 0, and a fix for a bug with checkpoint folder creation.
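For context, a rough sketch of the kind of per-step estimate such a method can use, based on the common 6 x tokens x parameters rule of thumb (the exact formula, method names, and embedding handling in the PR may differ):

import torch.nn as nn


def estimate_training_flos(model: nn.Module, batch_size: int, sequence_length: int) -> int:
    # Rule of thumb: each parameter contributes roughly 2 FLOs per token in
    # the forward pass and 4 in the backward pass, hence the factor of 6.
    # The PR also distinguishes counts with and without embedding parameters;
    # this sketch simply counts every parameter.
    num_parameters = sum(p.numel() for p in model.parameters())
    return 6 * batch_size * sequence_length * num_parameters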