Fix no_op_model algorithm #614
@@ -26,7 +26,7 @@ class NoOpModelClass(ComposerModel):

     def __init__(self, original_model: torch.nn.Module):
         super().__init__()
-        self.weights = torch.tensor([1.5], requires_grad=True, dtype=torch.float)
+        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
I'm a bit confused about how this would fix a device error. Would we need to set the device explicitly, somewhat like this?
-        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))
+        original_device = next(original_model.parameters()).device
+        self.weights = torch.nn.Parameter(torch.Tensor([1.5]), device=original_device)
The no_op_model algorithm runs on the init event (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L649), i.e., before the model parameters are moved to the GPU. So original_device in your code above is still cpu. Once we make weights a parameter, module_to_device (https://github.com/mosaicml/composer/blob/dev/composer/trainer/trainer.py#L710) takes care of moving it.
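For illustration, here is a minimal sketch of why registering the tensor as a parameter matters. It uses plain PyTorch only, not Composer. The class names PlainTensorModel and ParameterModel are hypothetical, and a dtype change via Module.to stands in for the device move so that the snippet also runs on a CPU-only machine. The assumption is that module_to_device ends up going through the usual Module.to path, which only touches registered parameters and buffers.

```python
import torch


class PlainTensorModel(torch.nn.Module):
    """Hypothetical stand-in for the pre-fix model: weights is a bare tensor attribute."""

    def __init__(self):
        super().__init__()
        self.weights = torch.tensor([1.5], requires_grad=True, dtype=torch.float)


class ParameterModel(torch.nn.Module):
    """Hypothetical stand-in for the post-fix model: weights is a registered parameter."""

    def __init__(self):
        super().__init__()
        self.weights = torch.nn.Parameter(torch.Tensor([1.5]))


plain = PlainTensorModel()
param = ParameterModel()

# Only the nn.Parameter is registered with the module.
plain_names = [name for name, _ in plain.named_parameters()]
param_names = [name for name, _ in param.named_parameters()]

# Module.to converts registered parameters and buffers only.
# A dtype change is used here as a device-free stand-in for .to("cuda").
plain.to(torch.float64)
param.to(torch.float64)

plain_dtype = plain.weights.dtype  # bare attribute is left untouched
param_dtype = param.weights.dtype  # registered parameter is converted

print(plain_names, plain_dtype)
print(param_names, param_dtype)
```

The bare tensor is skipped by Module.to, which matches the device mismatch described above when the rest of the model is moved to the GPU after the init event.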
Ahhh, makes sense. Thanks!
LGTM, thanks for catching this!
composer -n 1 examples/run_composer_trainer.py -f composer/yamls/models/resnet9_cifar10.yaml --max_duration 1ep --datadir /localdisk/CIFAR10 --validate_every_n_epochs=0 --algorithms no_op_model
The above command results in the following error without this fix