fixing TPU tests #2632

Merged
merged 55 commits into from
Jul 27, 2020

Conversation

Borda
Member

@Borda Borda commented Jul 17, 2020

What does this PR do?

Fixes #2124
Fixes #1956

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@Borda Borda added bug Something isn't working ci Continuous Integration labels Jul 17, 2020
@Borda Borda added this to the 0.8.x milestone Jul 17, 2020
@mergify mergify bot requested a review from a team July 17, 2020 13:22
@codecov

codecov bot commented Jul 17, 2020

Codecov Report

Merging #2632 into master will increase coverage by 0%.
The diff coverage is 49%.

@@          Coverage Diff           @@
##           master   #2632   +/-   ##
======================================
  Coverage      91%     91%           
======================================
  Files          82      82           
  Lines        6770    6784   +14     
======================================
+ Hits         6127    6151   +24     
+ Misses        643     633   -10     

@Borda
Member Author

Borda commented Jul 17, 2020

GPU available: False, used: False
TPU available: True, using: 8 TPU cores
training on 8 TPU cores
Exception in device=TPU:0: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/distrib_parts.py", line 195, in tpu_train
    self._device = xm.xla_device(tpu_core_idx) if tpu_core_idx is not None else xm.xla_device()
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/core/xla_model.py", line 239, in xla_device
    torch_xla._XLAC._xla_set_default_device(device)
RuntimeError: Invalid device string: 'xla:EvalModelTemplate(
  (c_d1): Linear(in_features=784, out_features=1000, bias=True)
  (c_d1_bn): BatchNorm1d(1000, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (c_d1_drop): Dropout(p=0.2, inplace=False)
  (c_d2): Linear(in_features=1000, out_features=10, bias=True)
)'
Traceback (most recent call last):
  File "/content/pytorch-lightning/tests/base/develop_utils.py", line 102, in inner_f
    func(**kwargs)
  File "/content/pytorch-lightning/tests/models/test_tpu.py", line 240, in test_dataloaders_passed_to_fit
    val_dataloaders=model.val_dataloader(),
  File "/content/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1029, in fit
    start_method=start_method,
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 395, in spawn
    start_method=start_method)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/spawn.py", line 113, in join
    (error_index, exitcode)
Exception: process 0 terminated with exit code 17
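
The frame `fn(gindex, *args)` in the traceback suggests that the spawn helper prepends the process index to the user arguments. The sketch below is a pure-Python simulation of that calling convention, not torch_xla code: `fake_spawn`, the toy `tpu_train`, and the `drops_index` wrapper are hypothetical names, and the faulty wrapper is only one assumed way the model could end up in the core-index slot.

```python
def fake_spawn(fn, args=(), nprocs=1):
    """Stand-in for a spawn helper that calls fn(index, *args) per process."""
    return [fn(index, *args) for index in range(nprocs)]


def tpu_train(tpu_core_idx, model):
    """Toy entry point: only builds the device string, no real device is touched."""
    return f"xla:{tpu_core_idx}"


model = "EvalModelTemplate(...)"  # placeholder for the module's repr

# Expected case: the prepended index fills tpu_core_idx, the model stays the model.
good = fake_spawn(tpu_train, args=(model,))


def drops_index(index, *args):
    # Hypothetical faulty wrapper: it discards the index, so every argument
    # shifts one position to the left and the model becomes tpu_core_idx.
    return tpu_train(*args, None)


bad = fake_spawn(drops_index, args=(model,))

print(good)  # ['xla:0']
print(bad)   # ['xla:EvalModelTemplate(...)']
```

The second result has the same shape as the invalid device string in the log above.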

@Borda
Member Author

Borda commented Jul 21, 2020

It seems that the first argument of tpu_train is ignored, so the model was wrongly treated as the index...
The actual failure is:

Traceback (most recent call last):
  File "/content/pytorch-lightning/tests/base/develop_utils.py", line 102, in inner_f
    func(**kwargs)
  File "/content/pytorch-lightning/tests/models/test_tpu.py", line 39, in test_model_tpu_cores_1
    tpipes.run_model_test(trainer_options, model, on_gpu=False, with_hpc=False)
  File "/content/pytorch-lightning/tests/base/develop_pipelines.py", line 50, in run_model_test
    result = trainer.fit(model)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1029, in fit
    start_method=start_method,
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 387, in spawn
    _start_fn(0, pf_cfg, fn, args)
  File "/usr/local/lib/python3.6/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/distrib_parts.py", line 223, in tpu_train
    self.run_pretrain_routine(model)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1193, in run_pretrain_routine
    self._run_sanity_check(ref_model, model)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/trainer.py", line 1220, in _run_sanity_check
    False)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/evaluation_loop.py", line 294, in _evaluate
    output = self.evaluation_forward(model, batch, batch_idx, dataloader_idx, test_mode)
  File "/content/pytorch-lightning/pytorch_lightning/trainer/evaluation_loop.py", line 481, in evaluation_forward
    output = model.validation_step(*args)
  File "/content/pytorch-lightning/tests/base/model_valid_steps.py", line 25, in validation_step
    val_acc = torch.sum(y == labels_hat).item() / (len(y) * 1.0)
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:991 : Check failed: device == xrt_data.device() (TPU:0 vs. CPU:0)
*** Begin stack trace ***
	tensorflow::CurrentStackTrace[abi:cxx11]()
	xla::XrtComputationClient::GetArgumentsInputs(absl::lts_2020_02_25::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
	xla::XrtComputationClient::CreateExecuteOps(std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, xla::XrtSessionCache::Ref, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, xla::XrtSessionCache::Ref> > >*, xla::XrtComputationClient::XrtComputation const&, std::vector<std::vector<std::shared_ptr<xla::ComputationClient::Data>, std::allocator<std::shared_ptr<xla::ComputationClient::Data> > >, std::allocator<std::vector<std::shared_ptr<xla::ComputationClient::Data>, std::allocator<std::shared_ptr<xla::ComputationClient::Data> > > > > const&, bool, absl::lts_2020_02_25::Span<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const>, std::unordered_map<tensorflow::Output, tensorflow::Input::Initializer, tensorflow::OutputHash, std::equal_to<tensorflow::Output>, std::allocator<std::pair<tensorflow::Output const, tensorflow::Input::Initializer> > >*)
	xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_2020_02_25::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, xla::ComputationClient::ExecuteComputationOptions const&)	
	clone
*** End stack trace ***

It is interesting, as both are on the same device beforehand...

>>> model.device
xla:0
>>> y.device
xla:1
>>> labels_hat.device
xla:1

similar to pytorch/xla#894

@dlibenzi do you have any advice on what I am doing wrong?

@Borda
Member Author

Borda commented Jul 21, 2020

Explanation of why the error is raised when .item() is called...

XLA Tensors are Lazy

CPU and CUDA tensors launch operations immediately or eagerly. XLA tensors, on the other hand, are lazy. They record operations in a graph until the results are needed. Deferring execution like this lets XLA optimize it.

The same error occurs even with device = xm.xla_device():

  File "/content/pytorch-lightning/tests/base/model_valid_steps.py", line 43, in validation_step
    torch.sum(y.to(device) == labels_hat.to(device)).item()
RuntimeError: tensorflow/compiler/xla/xla_client/xrt_computation_client.cc:991 : Check failed: device == xrt_data.device() (TPU:0 vs. CPU:0)
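
To illustrate why a lazy tensor can report a problem only at `.item()`, here is a minimal pure-Python sketch of deferred execution. `LazyScalar` is a hypothetical toy class, and its device check is an assumption made for illustration. It does not reproduce how XLA tracks or validates devices.

```python
class LazyScalar:
    """Toy lazy value: operations are recorded and only run when .item() is called."""

    def __init__(self, compute, devices):
        self._compute = compute    # deferred computation
        self._devices = devices    # devices of every input recorded so far

    @classmethod
    def constant(cls, value, device):
        return cls(lambda: value, {device})

    def __add__(self, other):
        # Recording never fails, even when the inputs live on different devices.
        return LazyScalar(
            lambda: self._compute() + other._compute(),
            self._devices | other._devices,
        )

    def item(self):
        # Materializing the result is the first point where the mismatch is checked.
        if len(self._devices) > 1:
            raise RuntimeError(f"device mismatch: {sorted(self._devices)}")
        return self._compute()


same = LazyScalar.constant(1, "TPU:0") + LazyScalar.constant(2, "TPU:0")
mixed = LazyScalar.constant(1, "TPU:0") + LazyScalar.constant(2, "CPU:0")  # no error yet

error = None
try:
    mixed.item()
except RuntimeError as exc:
    error = str(exc)

print(same.item())
print(error)
```

In this toy model the faulty addition itself succeeds, and the failure appears only once a concrete value is requested, which matches the line the traceback points at.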

@davidel

davidel commented Jul 21, 2020

As I explained countless times, you cannot do stuff like this:

https://github.com/PyTorchLightning/pytorch-lightning/pull/2632/files#diff-d673a98c14aa5cd59610bcc447178852R197

You should get rid of those TPU IDs, which suggest that you can control which device ordinal you get within a process. You cannot.
In multi-processing mode (xmp.spawn()) you should simply call xm.xla_device() and get the only device that is assigned to the calling process.
You cannot control the TPU ID, so you should just get rid of the API that tries to control the TPU IDs.

If you want to have some sort of information about which ordinal within the distributed system the calling process has, you can use the xm.get_ordinal() API.
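
A minimal sketch of the pattern described above, under the assumption that the per-process function receives the `xm` module as a parameter. In real code `xm` would come from `import torch_xla.core.xla_model as xm`. `StubXM` and `worker` are hypothetical names used so the sketch runs without a TPU.

```python
class StubXM:
    """Hypothetical stand-in for torch_xla.core.xla_model so the sketch runs anywhere."""

    def __init__(self, ordinal):
        self._ordinal = ordinal

    def xla_device(self):
        # The runtime, not the caller, decides which device this process gets.
        return f"<device assigned to process {self._ordinal}>"

    def get_ordinal(self):
        return self._ordinal


def worker(index, xm):
    """Per-process entry point: asks for 'the' device instead of a specific core."""
    device = xm.xla_device()    # no ordinal argument
    ordinal = xm.get_ordinal()  # informational only, e.g. for logging or sharding
    return device, ordinal


results = [worker(i, StubXM(i)) for i in range(2)]
for device, ordinal in results:
    print(ordinal, device)
```

The point of the design is that no TPU ID is passed in anywhere. The ordinal is read back, never chosen.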

@williamFalcon
Contributor

@davidel, I understand... can you explain what @lezwon did in the kaggle kernel then? he seems to have run 8 copies of a model each on its own TPU core. Or is that incorrect @lezwon ?

@Borda
Member Author

Borda commented Jul 21, 2020

@davidel I am going to change the TPU ID in the next step, but now I am getting this error in test_model_tpu_cores_1 with just xm.xla_device(), without setting the ID... So is it possible that the data are not pushed to the device because it is "lazy"?

# call setup after the ddp process has connected
self.setup('fit')
if self.is_function_implemented('setup', model):
    model.setup('fit')

# put model on tpu
self._device = xm.xla_device(self.tpu_id) if self.tpu_id is not None else xm.xla_device()
xm.get_ordinal()
# TODO, wrong definition of TPU index
Member Author

this will be changed in the next step of corrections... cc: @davidel

@lezwon
Contributor

lezwon commented Jul 21, 2020

> @davidel, I understand... can you explain what @lezwon did in the kaggle kernel then? he seems to have run 8 copies of a model each on its own TPU core. Or is that incorrect @lezwon ?

@davidel @williamFalcon When a TPU ID is provided, the training does not run in multi-processing mode, i.e. xmp.spawn(). I just fetch the device using xm.xla_device() and train the model on it. I assumed this enabled us to train a separate model on every core, as explained here: pytorch/xla#2041 (comment). It seemed to work too. Did we miss something?

@davidel

davidel commented Jul 21, 2020

I will be flying today. Will get back to this once on the other side of the pond 😄

@zcain117
Contributor

> @davidel, I understand... can you explain what @lezwon did in the kaggle kernel then? he seems to have run 8 copies of a model each on its own TPU core. Or is that incorrect @lezwon ?
>
> @davidel @williamFalcon When TPU ID is provided, the training does not run in multi-processing mode, i.e. xmp.spawn(). I just fetch the device using xm.xla_device() and train the model on it. I assumed this enabled us to train a separate model on every core as explained here pytorch/xla#2041 (comment). It seemed to work too. Did we miss something?

I had some follow-up questions:

  1. Do you mean 1 independent copy of the same model on each of the 8 cores? Or different types of models training simultaneously, each using 1 core?
  2. If you mean the former, what is the use case? I think it would be better to have the 8 cores working together by sharing the gradients/weights between processes using xmp.spawn given that the smallest TPU a user can have is 8-cores, so it's not like a user can save on cost by using only 1 core

@zcain117
Contributor

@Borda I see this as the latest error in the Github Actions run for the 8-core test:

AttributeError: Can't pickle local object 'test_model_16bit_tpu_cores_8.<locals>.long_train_loader'

I talked with Davide about this last week and here is an explanation of what is happening:

  • As you know, when you use xmp.spawn you pass in some kind of mp_fn. That function is the code that will run on each TPU core.
  • If you have Python objects that are defined outside that function, they will be pickled and sent over the wire to the TPU cores. I think this is what is happening with your dataloader in this case.
  • If you create the dataloader within the mp_fn, then the TPU core will start up, run the code, and create the dataloader and there will be no need for pickling. This is what we recommend, otherwise you'll need some kind of dataloader that is able to be pickled.

Here are 2 examples of defining the dataloader inside the mp_fn:

  1. Our canonical example. Note that the dataloader is instantiated within train_mnist here, which is part of the mp_fn code here
  2. This Kaggle example. Note that the dataloader is instantiated within run(), which is then wrapped in mp_fn
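
The pickling failure described above can be reproduced with the standard library alone. This is a sketch with stand-in names: `build_test`, `can_pickle`, and `mp_fn` are hypothetical, and plain lists stand in for real dataloaders. The exact exception type and message for local objects vary between Python versions, so the helper catches several.

```python
import pickle


def module_level_loader():
    """Defined at module scope, so pickle can refer to it by its qualified name."""
    return [1, 2, 3]


def build_test():
    def long_train_loader():  # local object, like the one named in the error
        return [1, 2, 3]

    return long_train_loader


def can_pickle(obj):
    """Return True if pickle.dumps succeeds for obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, TypeError, pickle.PicklingError):
        return False


local_ok = can_pickle(build_test())
print("local function picklable:", local_ok)
print("module-level function picklable:", can_pickle(module_level_loader))


def mp_fn(index):
    # Recommended pattern from the comment above: build the loader inside the
    # per-process function, so nothing has to be pickled and sent over.
    loader = [1, 2, 3]
    return index, sum(loader)


print(mp_fn(0))
```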

@Borda
Member Author

Borda commented Jul 22, 2020

> 1. Do you mean 1 independent copy of the same model on each of the 8 cores? Or different types of models training simultaneously, each using 1 core?

Yes, the point is to speed up training.

> 2. If you mean the former, what is the use case? I think it would be better to have the 8 cores working together by sharing the gradients/weights between processes using xmp.spawn given that the smallest TPU a user can have is 8-cores, so it's not like a user can save on cost by using only 1 core

I agree. The trainer trains just one model, so the core-selection case was about letting multiple users each train their own model/trainer while sharing one physical device.

@Borda
Member Author

Borda commented Jul 22, 2020

> • If you create the dataloader within the mp_fn, then the TPU core will start up, run the code, and create the dataloader and there will be no need for pickling. This is what we recommend, otherwise you'll need some kind of dataloader that is able to be pickled.

In that case, it is very similar to DDP, where we run a Python script on each GPU separately, right?

@zcain117
Contributor

> • If you create the dataloader within the mp_fn, then the TPU core will start up, run the code, and create the dataloader and there will be no need for pickling. This is what we recommend, otherwise you'll need some kind of dataloader that is able to be pickled.
>
> in such case, it is very similar to DDP when we run a python script on each GPU separately, right?

I'm not familiar with DDP but I think generally this is similar to the multi-GPU case. Each TPU core is running its own copy of the same code, then after each backprop, we synchronize the gradients using xm.optimizer_step, e.g. here

See here for xm.optimizer_step code but it's basically just sharing gradients from backprop between TPU cores.

This means that each TPU core should end up with the same weights and it means that the TPU cores are working together to make training progress rather than having independent copies of your model training on each core.

Note that we are sharing gradients but each process instantiates its own weights. One pitfall is if each process ends up with different initial weights. It's important to set the same seed between processes so that initial weights are the same, like we do here in our example. Note that we set the seed immediately. This is important since any code that runs before setting the seed can change the seed for that process and lead to different weights per process, which can lead to e.g. model failing to converge.
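
The two points above (identical seeds give identical initial weights, and shared gradients keep replicas in step) can be sketched in pure Python. This is a toy simulation, not the `xm.optimizer_step` implementation: `init_weights`, `local_gradient`, and `all_reduce_mean` are hypothetical helpers, and the "gradient" is a made-up quadratic example.

```python
import random


def init_weights(seed, n=3):
    # Seed first, before anything else draws random numbers.
    rng = random.Random(seed)
    return [rng.uniform(-1, 1) for _ in range(n)]


def local_gradient(weights, shard):
    # Toy gradient of 0.5 * (w - x)^2 averaged over the shard, per weight.
    mean_x = sum(shard) / len(shard)
    return [w - mean_x for w in weights]


def all_reduce_mean(grads_per_replica):
    # Average the gradients element-wise across replicas.
    n = len(grads_per_replica)
    return [sum(g) / n for g in zip(*grads_per_replica)]


shards = [[1.0, 2.0], [3.0, 4.0]]  # each replica sees different data
replicas = [init_weights(seed=1234) for _ in shards]

for _ in range(5):
    grads = [local_gradient(w, s) for w, s in zip(replicas, shards)]
    shared = all_reduce_mean(grads)  # every replica applies the same update
    replicas = [[wi - 0.1 * gi for wi, gi in zip(w, shared)] for w in replicas]

print(replicas[0] == replicas[1])
```

If the replicas started from different seeds, the same averaged update would be applied to different starting points and the copies would no longer agree.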

@williamFalcon
Contributor

yes, this is exactly how ddp works and what we are set up to do. This is how we setup the training.
In fact, even in our tests the seed is being set. (we don’t set seeds for users, so this is set by us for test purpose only)

@lezwon
Contributor

lezwon commented Jul 23, 2020

> 1. Do you mean 1 independent copy of the same model on each of the 8 cores? Or different types of models training simultaneously, each using 1 core?

Yes, an independent instance of a model on each core with a different fold.

> 2. If you mean the former, what is the use case? I think it would be better to have the 8 cores working together by sharing the gradients/weights between processes using xmp.spawn given that the smallest TPU a user can have is 8-cores, so it's not like a user can save on cost by using only 1 core

The use case is K-fold training. The user can train K models simultaneously with a different fold on every core.
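
As a rough sketch of that use case, the fold assignment could look like the following. `kfold_indices` is a hypothetical helper, the interleaved split is just one possible scheme, and no TPU or Lightning API is involved.

```python
def kfold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds (train, val) index pairs."""
    folds = [list(range(k, n_samples, n_folds)) for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        val = folds[k]
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train, val))
    return splits


n_cores = 8
splits = kfold_indices(n_samples=16, n_folds=n_cores)

# One independent model per core: core k would train on splits[k][0]
# and validate on splits[k][1].
for core, (train_idx, val_idx) in enumerate(splits):
    print(core, len(train_idx), val_idx)
```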

@zcain117
Contributor

zcain117 commented Jul 23, 2020

> 1. Do you mean 1 independent copy of the same model on each of the 8 cores? Or different types of models training simultaneously, each using 1 core?
>
> Yes, an independent instance of a model on each core with a different fold.
>
> 2. If you mean the former, what is the use case? I think it would be better to have the 8 cores working together by sharing the gradients/weights between processes using xmp.spawn given that the smallest TPU a user can have is 8-cores, so it's not like a user can save on cost by using only 1 core
>
> The use case is K-fold training. The user can train K models simultaneously with a different fold on every core.

I see. I think we can make this work. I'm assuming you would also want to write out 8 different model weights.

For your normal 8-core case, you'll want to use xm.save as shown in the example Kaggle script

If you want every core to write its weights, you'll want to call xm.save(..., master_only=False, ...) (see method here)

@lezwon
Contributor

lezwon commented Jul 23, 2020

I'm not sure if xm.save is being used in Lightning right now. In my kernel here: https://www.kaggle.com/lezwon/parallel-kfold-training-on-tpu-using-pytorch-li, I created a separate trainer instance for each core with a checkpoint callback. Seems to have worked.

@davidel

davidel commented Jul 23, 2020

> @davidel, I understand... can you explain what @lezwon did in the kaggle kernel then? he seems to have run 8 copies of a model each on its own TPU core. Or is that incorrect @lezwon ?
>
> @davidel @williamFalcon When TPU ID is provided, the training does not run in multi-processing mode, i.e. xmp.spawn(). I just fetch the device using xm.xla_device() and train the model on it. I assumed this enabled us to train a separate model on every core as explained here pytorch/xla#2041 (comment). It seemed to work too. Did we miss something?

If it does not run as multi-core it might be OK ... but honestly specifying the core ID makes no sense from an API standpoint.

@Borda
Member Author

Borda commented Jul 25, 2020

Checkpoint loading fails, maybe similar to #2700:

Traceback (most recent call last):
  File "/content/pytorch-lightning/tests/base/develop_utils.py", line 102, in inner_f
    func(**kwargs)
  File "/content/pytorch-lightning/tests/models/test_tpu.py", line 100, in test_model_tpu_cores_8
    tpipes.run_model_test(trainer_options, model, on_gpu=False, with_hpc=False)
  File "/content/pytorch-lightning/tests/base/develop_pipelines.py", line 56, in run_model_test
    pretrained_model = load_model_from_checkpoint(logger, trainer.checkpoint_callback.best_model_path)
  File "/content/pytorch-lightning/tests/base/develop_utils.py", line 64, in load_model_from_checkpoint
    trained_model = module_class.load_from_checkpoint(root_weights_dir)
  File "/content/pytorch-lightning/pytorch_lightning/core/saving.py", line 142, in load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
  File "/content/pytorch-lightning/pytorch_lightning/utilities/cloud_io.py", line 10, in load
    return torch.load(path_or_url, map_location=map_location)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 581, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/usr/local/lib/python3.6/dist-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: ''

EDIT: It seems that no checkpoint is saved during training.
EDIT2: It seems that the checkpoint from the spawned process is not propagated back to the main process, so the global trainer does not know about the children's checkpoints...

cc: @lezwon @zcain117
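
The empty path in the FileNotFoundError is consistent with EDIT2. A small simulation of the effect follows, assuming that a spawned process works on its own copy of the trainer. `ToyTrainer` is a hypothetical class, the checkpoint path is made up, and `copy.deepcopy` stands in for the process boundary.

```python
import copy


class ToyTrainer:
    def __init__(self):
        self.best_model_path = ""

    def fit_in_child(self):
        """Pretend training run inside a spawned process; returns the path explicitly."""
        self.best_model_path = "checkpoints/epoch=0.ckpt"  # made-up path
        return self.best_model_path


parent = ToyTrainer()

# A spawned process works on its own copy of the trainer (simulated with deepcopy).
child = copy.deepcopy(parent)
returned_path = child.fit_in_child()

# The parent's copy is untouched, so loading from it would use an empty path.
path_seen_by_parent = parent.best_model_path
print(repr(path_seen_by_parent))

# One possible fix: hand the value back explicitly (a queue or a file in a real setup).
parent.best_model_path = returned_path
print(repr(parent.best_model_path))
```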

Borda and others added 22 commits July 27, 2020 23:59
@Borda
Member Author

Borda commented Jul 27, 2020

@williamFalcon let's fix master... :]
cc: @PyTorchLightning/core-contributors

@Borda Borda added the ready PRs ready to be merged label Jul 27, 2020
@williamFalcon williamFalcon merged commit 0fe933e into master Jul 27, 2020
@Borda Borda deleted the tpu/fix-tests branch July 28, 2020 05:03
@Borda Borda mentioned this pull request Sep 21, 2020
7 tasks
Labels
bug Something isn't working ci Continuous Integration priority: 0 High priority task ready PRs ready to be merged
9 participants