Tpu save #4309

Merged — 27 commits merged from tpu_save into master on Dec 2, 2020
Conversation

@lezwon (Contributor) commented on Oct 22, 2020

What does this PR do?

Fixes #2700
Fixes #2303
Fixes #3660

  • Move the XLA tensors within a checkpoint to CPU before saving (see the sketch below).
  • Lazily check whether a TPU device exists.
  • Removed the accelerator.barrier() call on TPU when calling .test(), since multiprocessing begins only in .fit().
  • Moved the TPU device check to run when the Trainer is initialized.

Related to #3044
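
To illustrate the first bullet, here is a minimal sketch (not the PR's exact code; the helper names are made up for illustration) of copying every tensor in a checkpoint dict to CPU before serializing, so the file can be loaded on machines without torch_xla:

import torch

def _move_to_cpu(obj):
    # Recursively detach tensors and copy them to CPU; leave everything else untouched.
    if isinstance(obj, torch.Tensor):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {k: _move_to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_move_to_cpu(v) for v in obj)
    return obj

def save_checkpoint(checkpoint: dict, filepath: str) -> None:
    # Saving CPU tensors means the checkpoint loads on CPU- or GPU-only machines.
    torch.save(_move_to_cpu(checkpoint), filepath)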

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@lezwon self-assigned this on Oct 22, 2020
@lezwon added the bug, checkpointing, and accelerator: tpu labels on Oct 22, 2020
@lezwon added this to the 1.1 milestone on Oct 22, 2020
codecov bot commented on Oct 25, 2020

Codecov Report

Merging #4309 (eef0849) into master (0c763b2) will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #4309   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         124     124           
  Lines        9320    9303   -17     
======================================
- Hits         8640    8616   -24     
- Misses        680     687    +7     

@lezwon modified the milestones: 1.1, 1.0.x on Oct 25, 2020
@lezwon marked this pull request as ready for review on October 25, 2020 at 10:48
@williamFalcon (Contributor) left a comment
@lezwon does that move to CPU really belong in this file?
Why is this needed for TPUs?
You realize this will affect more than just TPUs, no? This is why I don't like making these changes outside the accelerator.

If it turns out this is the correct place for this, please add an on_save() function to each accelerator, and make all of them a no-op:

def on_save(self, ...):
    pass

Then add the change ONLY to the TPU accelerator:

def on_save(self, ...):
    your_changes()

Going forward, any changes that are accelerator-specific should not be done inside methods like this. Instead, each accelerator needs to implement this method, which is then called like:

self.accelerator.on_save()

The reason is that we are trying to break up all the underlying accelerator code so the accelerators are easier to debug and changes to one accelerator won't break all the others.
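
A rough sketch of the hook pattern described above (illustrative only; the class layout and the use of Lightning's move_data_to_device utility are assumptions about how the TPU-specific override could look, not the merged code):

import torch
from pytorch_lightning.utilities.apply_func import move_data_to_device

class Accelerator:
    def on_save(self, checkpoint: dict) -> dict:
        # Default hook: a no-op, so CPU/GPU accelerators are unaffected.
        return checkpoint

class TPUAccelerator(Accelerator):
    def on_save(self, checkpoint: dict) -> dict:
        # TPU-specific behaviour: move XLA tensors to CPU so the saved
        # checkpoint can be loaded on machines without torch_xla.
        return move_data_to_device(checkpoint, torch.device("cpu"))

The checkpoint logic would then call self.accelerator.on_save(checkpoint) right before serializing, keeping the CPU move out of the shared code path.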

@lezwon (Contributor, Author) commented on Oct 25, 2020

> @lezwon does that move to cpu really belong in this file? [...]

The XLA guide recommends that users move the tensors to CPU before saving so that they can be loaded on non-TPU devices. Lightning users have faced this issue: they train a model on a TPU and then aren't able to use it on a GPU or CPU. Hence we need to move these tensors to CPU before saving. I get your point about separating such code and making it accelerator-specific. Will refactor this into on_save. 👍
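
For reference, a sketch of the pattern the XLA guide recommends (assumes torch_xla is installed; the toy module here is only for illustration):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(4, 2).to(device)  # toy module living on a TPU core

# xm.save copies XLA tensors to CPU before serializing, so the file can
# later be loaded with torch.load on a CPU- or GPU-only machine.
xm.save(model.state_dict(), "model.pt")

# Roughly equivalent manual form:
# torch.save({k: v.cpu() for k, v in model.state_dict().items()}, "model.pt")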

mergify bot commented on Oct 25, 2020

This pull request is now in conflict... :(

@SeanNaren (Contributor) commented

Hey @lezwon, any activity here? Were you able to reproduce this using the boring model, btw? I don't mind picking this up.

@lezwon (Contributor, Author) commented on Nov 4, 2020

Hey @SeanNaren, I've been making some refactors to make this change TPU-specific. Will push an updated branch soon :)

@Borda (Member) left a comment
lgtm

Review threads on tests/utilities/test_xla_device_utils.py (resolved)
@Borda added the ready label on Dec 2, 2020
from pytorch_lightning.core.grads import GradInformation
from pytorch_lightning.core.hooks import CheckpointHooks, DataHooks, ModelHooks
from pytorch_lightning.core.memory import ModelSummary
from pytorch_lightning.core.saving import ALLOWED_CONFIG_TYPES, PRIMITIVE_TYPES, ModelIO
Contributor commented:
where did this go? cc @tchaton

Contributor commented:
I guess useless imports.

Member commented:
shall not be there at all in the first place, @SeanNaren

@SeanNaren merged commit 12cb994 into master on Dec 2, 2020
@SeanNaren deleted the tpu_save branch on December 2, 2020 at 13:05
Labels: accelerator: tpu, bug, checkpointing, priority: 1, ready
Projects: none yet
10 participants