Fix no_op_model algorithm #614

Merged · merged 3 commits into dev from daya/fix_no_op on Feb 28, 2022
Conversation

dskhudia (Contributor)
```
composer -n 1 examples/run_composer_trainer.py -f composer/yamls/models/resnet9_cifar10.yaml --max_duration 1ep --datadir /localdisk/CIFAR10 --validate_every_n_epochs=0 --algorithms no_op_model
```

The above command results in the following error without this fix:

```
Traceback (most recent call last):
  File "examples/run_composer_trainer.py", line 60, in <module>
    main()
  File "examples/run_composer_trainer.py", line 56, in main
    trainer.fit()
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 774, in fit
    self._train_loop()
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 936, in _train_loop
    total_loss = self._train_batch(microbatches)
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 1009, in _train_batch
    return self._train_batch_inner(microbatches)
  File "/mnt/cota/daya/composer/composer/trainer/trainer.py", line 1040, in _train_batch_inner
    state.outputs = state.model.forward(state.batch)
  File "/mnt/cota/daya/composer/composer/algorithms/no_op_model/no_op_model.py", line 46, in forward
    return y * self.weights
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
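For context, the failing line multiplies the model output (already moved to the GPU) by a weights tensor that never left the CPU. A minimal standalone sketch of the same failure mode (hypothetical variable names, not Composer code):

```python
import torch

if torch.cuda.is_available():
    y = torch.ones(4, device="cuda")  # stands in for the model output on the GPU
    weights = torch.tensor([1.5])     # stands in for the plain-attribute weights left on the CPU
    y * weights                       # RuntimeError: Expected all tensors to be on the same device
```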

```diff
@@ -26,7 +26,7 @@ class NoOpModelClass(ComposerModel):
 
     def __init__(self, original_model: torch.nn.Module):
         super().__init__()
-        self.weights = torch.tensor([1.5], requires_grad=True, dtype=torch.float)
+        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
```
Contributor
I'm a bit confused on how this would fix a device error... would we need to set the device explicitly, somewhat like this?

Suggested change:

```diff
-self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
+original_device = next(original_model.parameters()).device
+self.weights = torch.nn.Parameter(torch.Tensor([1.5]), device=original_device)
```

dskhudia (Author)
The no_op_model algorithm runs on the init event (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L649), i.e., before the model parameters are moved to the GPU, so original_device in your code above would still be cpu. Once we make the weights an nn.Parameter, module_to_device (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L710) takes care of moving them.
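To illustrate the mechanism (a toy sketch, not Composer's actual code): `Module.to()` only moves tensors that are registered with the module, i.e. parameters and buffers, so a plain tensor attribute stays behind on the CPU while an `nn.Parameter` travels with the module:

```python
import torch

class PlainAttr(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # plain tensor attribute: invisible to Module.to()
        self.weights = torch.tensor([1.5], requires_grad=True)

class RegisteredParam(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # nn.Parameter: registered, so Module.to() moves it
        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))

if torch.cuda.is_available():
    print(PlainAttr().to("cuda").weights.device)        # cpu     -> the bug
    print(RegisteredParam().to("cuda").weights.device)  # cuda:0  -> the fix
```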

Contributor
Ahhh, makes sense. Thanks!

@hanlint (Contributor) left a comment:
LGTM, thanks for catching this!

@dskhudia merged commit eda1b34 into dev on Feb 28, 2022
@dskhudia deleted the daya/fix_no_op branch on February 28, 2022 at 22:38