Tpu save #4309

Merged — 27 commits merged from tpu_save into master on Dec 2, 2020
Conversation

@lezwon (Contributor) commented on Oct 22, 2020

What does this PR do?

Fixes #2700
Fixes #2303
Fixes #3660

  • Move the XLA tensors within a checkpoint to CPU before saving (see the sketch below).
  • Lazily check whether a TPU device exists.
  • Removed the accelerator.barrier() call on TPU when calling .test(), since multiprocessing begins only in .fit().
  • Moved the TPU device check to run when the Trainer is initialized.

Related to #3044
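
To illustrate the first bullet, here is a minimal sketch (not the PR's exact code; the helper names are made up for illustration) of copying every tensor in a checkpoint dict to CPU before serializing, so the file can be loaded on machines without torch_xla:

import torch

def _move_to_cpu(obj):
    # Recursively detach tensors and copy them to CPU; leave everything else untouched.
    if isinstance(obj, torch.Tensor):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: _move_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_move_to_cpu(v) for v in obj)
    return obj

def save_checkpoint(checkpoint: dict, filepath: str) -> None:
    # Saving CPU tensors means the checkpoint loads on CPU- or GPU-only machines.
    torch.save(_move_to_cpu(checkpoint), filepath)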

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@lezwon self-assigned this on Oct 22, 2020
@lezwon added the bug, checkpointing, and accelerator: tpu labels on Oct 22, 2020
@lezwon added this to the 1.1 milestone on Oct 22, 2020
codecov bot commented on Oct 25, 2020

Codecov Report

Merging #4309 (eef0849) into master (0c763b2) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4309   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         124     124           
  Lines        9320    9303   -17     
======================================
- Hits         8640    8616   -24     
- Misses        680     687    +7     

@lezwon modified the milestones: 1.1, 1.0.x on Oct 25, 2020
@lezwon marked this pull request as ready for review on October 25, 2020 at 10:48
@williamFalcon (Contributor) left a comment
@lezwon does that move to CPU really belong in this file?
Why is this needed for TPUs?
You realize this will affect more than just TPUs, no? This is why I don't like making these changes outside the accelerator.

If it turns out this is the correct place for this, please add an on_save() function to each accelerator, and make all of them a no-op:

def on_save(self, ...):
    pass

Then add the change ONLY to the TPU accelerator:

def on_save(self, ...):
    your_changes()

Going forward, any changes that are accelerator-specific should not be done inside methods like this. Instead, each accelerator needs to implement this method, which is then called like:

self.accelerator.on_save()

The reason is that we are trying to break up all the underlying accelerator code so the accelerators are easier to debug and changes to one accelerator won't break all the others.
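
A rough sketch of the hook pattern described above (illustrative only; the class layout and the use of Lightning's move_data_to_device utility are assumptions about how the TPU-specific override could look, not the merged code):

import torch
from pytorch_lightning.utilities.apply_func import move_data_to_device

class Accelerator:
    def on_save(self, checkpoint: dict) -> dict:
        # Default hook: a no-op, so CPU/GPU accelerators are unaffected.
        return checkpoint

class TPUAccelerator(Accelerator):
    def on_save(self, checkpoint: dict) -> dict:
        # TPU-specific behaviour: move XLA tensors to CPU so the saved
        # checkpoint can be loaded on machines without torch_xla.
        return move_data_to_device(checkpoint, torch.device("cpu"))

The checkpoint logic would then call self.accelerator.on_save(checkpoint) right before serializing, keeping the CPU move out of the shared code path.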

@lezwon (Contributor, Author) commented on Oct 25, 2020

> @lezwon does that move to cpu really belong in this file? [...]

The XLA guide recommends that users move the tensors to CPU before saving so that they can be loaded on non-TPU devices. Lightning users have faced this issue: they train a model on a TPU and then aren't able to use it on a GPU or CPU. Hence we need to move these tensors to CPU before saving. I get your point about separating such code and making it accelerator-specific. Will refactor this into on_save. 👍
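
For reference, a sketch of the pattern the XLA guide recommends (assumes torch_xla is installed; the toy module here is only for illustration):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)  # toy module living on a TPU core

# xm.save copies XLA tensors to CPU before serializing, so the file can
# later be loaded with torch.load on a CPU- or GPU-only machine.
xm.save(model.state_dict(), "model.pt")

# Roughly equivalent manual form:
# torch.save({k: v.cpu() for k, v in model.state_dict().items()}, "model.pt")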

mergify bot commented on Oct 25, 2020

This pull request is now in conflict... :(

@SeanNaren (Contributor) commented

Hey @lezwon, any activity here? Were you able to reproduce this using the boring model, btw? I don't mind picking this up.

@lezwon (Contributor, Author) commented on Nov 4, 2020

Hey @SeanNaren, I've been making some refactors to make this change TPU-specific. Will push an updated branch soon :)

@Borda (Member) left a comment
lgtm

Review threads on tests/utilities/test_xla_device_utils.py (resolved)
@Borda added the ready label on Dec 2, 2020
from pytorch_lightning.core.grads import GradInformation
from pytorch_lightning.core.hooks import CheckpointHooks, DataHooks, ModelHooks
from pytorch_lightning.core.memory import ModelSummary
from pytorch_lightning.core.saving import ALLOWED_CONFIG_TYPES, PRIMITIVE_TYPES, ModelIO
Contributor commented:
where did this go? cc @tchaton

Contributor commented:
I guess useless imports.

Member commented:
shall not be there at all in the first place, @SeanNaren

@SeanNaren merged commit 12cb994 into master on Dec 2, 2020
@SeanNaren deleted the tpu_save branch on December 2, 2020 at 13:05
Labels: accelerator: tpu, bug, checkpointing, priority: 1, ready
Projects: none yet
10 participants