Fix no_op_model algorithm #614

Merged · merged 3 commits into dev from daya/fix_no_op on Feb 28, 2022
Conversation

dskhudia (Contributor)
```
composer -n 1 examples/run_composer_trainer.py -f composer/yamls/models/resnet9_cifar10.yaml --max_duration 1ep --datadir /localdisk/CIFAR10 --validate_every_n_epochs=0 --algorithms no_op_model
```

The above command results in the following error without this fix:

```
Traceback (most recent call last):
  File "examples/run_composer_trainer.py", line 60, in <module>
    main()
  File "examples/run_composer_trainer.py", line 56, in main
    trainer.fit()
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 774, in fit
    self._train_loop()
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 936, in _train_loop
    total_loss = self._train_batch(microbatches)
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 1009, in _train_batch
    return self._train_batch_inner(microbatches)
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 1040, in _train_batch_inner
    state.outputs = state.model.forward(state.batch)
  File "/mnt/cota/daya/composer/composer/algorithms/no_op_model/no_op_model.py", line 46, in forward
    return y * self.weights
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
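For context, the failing line multiplies the model output (already moved to the GPU) by a weights tensor that never left the CPU. A minimal standalone sketch of the same failure mode (hypothetical variable names, not Composer code):

```python
import torch

if torch.cuda.is_available():
    y = torch.ones(4, device="cuda")  # stands in for the model output on the GPU
    weights = torch.tensor([1.5])     # stands in for the plain-attribute weights left on the CPU
    y * weights                       # RuntimeError: Expected all tensors to be on the same device
```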

```diff
@@ -26,7 +26,7 @@ class NoOpModelClass(ComposerModel):
 
     def __init__(self, original_model: torch.nn.Module):
         super().__init__()
-        self.weights = torch.tensor([1.5], requires_grad=True, dtype=torch.float)
+        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
```
Contributor
I'm a bit confused on how this would fix a device error... would we need to set the device explicitly, somewhat like this?

Suggested change:

```diff
-self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
+original_device = next(original_model.parameters()).device
+self.weights = torch.nn.Parameter(torch.Tensor([1.5]), device=original_device)
```

dskhudia (Author)
The no_op_model algorithm runs on the init event (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L649), i.e., before the model parameters are moved to the GPU, so original_device in your code above would still be cpu. Once we make the weights an nn.Parameter, module_to_device (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L710) takes care of moving them.
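To illustrate the mechanism (a toy sketch, not Composer's actual code): `Module.to()` only moves tensors that are registered with the module, i.e. parameters and buffers, so a plain tensor attribute stays behind on the CPU while an `nn.Parameter` travels with the module:

```python
import torch

class PlainAttr(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # plain tensor attribute: invisible to Module.to()
        self.weights = torch.tensor([1.5], requires_grad=True)

class RegisteredParam(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter: registered, so Module.to() moves it
        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))

if torch.cuda.is_available():
    print(PlainAttr().to("cuda").weights.device)        # cpu     -> the bug
    print(RegisteredParam().to("cuda").weights.device)  # cuda:0  -> the fix
```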

Contributor
Ahhh, makes sense. Thanks!

@hanlint (Contributor) left a comment:
LGTM, thanks for catching this!

@dskhudia merged commit eda1b34 into dev on Feb 28, 2022
@dskhudia deleted the daya/fix_no_op branch on February 28, 2022 at 22:38