When I run the following command:

`composer -n 2 examples/run_composer_trainer.py -f composer/yamls/models/resnet50.yaml --algorithms stochastic_depth --algorithms.stochastic_depth.target_layer_name ResNetBottleneck --algorithms.stochastic_depth.drop_rate 0.0 --loggers tqdm`
The training loop freezes. Using only 1 GPU, or switching to sample-wise stochastic depth (`--algorithms.stochastic_depth.stochastic_method sample`), works as expected.
This appears to be a bug with the block-wise stochastic depth layers and may be related to #296. It should probably be fixed before v0.4.
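For context on why block-wise stochastic depth interacts badly with DDP at all, here is a minimal sketch (not Composer's actual implementation): when a block is skipped, none of its parameters receive gradients on that rank, and if ranks make different drop decisions they can disagree about which gradients need to be all-reduced. Note that the repro above uses `drop_rate 0.0`, so dropping itself is probably not what triggers the hang here.

```python
import torch
import torch.nn as nn

class BlockwiseStochasticDepth(nn.Module):
    """Illustrative sketch only: wraps a residual block and skips it entirely
    with probability ``drop_rate`` during training."""

    def __init__(self, block: nn.Module, drop_rate: float = 0.2):
        super().__init__()
        self.block = block
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_rate:
            # The whole block is skipped: none of its parameters receive
            # gradients on this rank for this step. If another rank kept the
            # block, the ranks now disagree about which parameters need an
            # all-reduce, which is the kind of mismatch that can hang DDP.
            return x
        return x + self.block(x)
```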
Been looking into this. Somehow, the code is hanging at this line in PyTorch's native gradient scaler. Accessing `optimizer_state["found_inf_per_device"]` seems to trigger some kind of distributed communication call, and it appears to be linked to the `_MultiDeviceReplicator` defined earlier in that file.
Debugging this by tracing looks like it's going to get very difficult very quickly, so I'll try another approach tomorrow.
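One thing worth keeping in mind while tracing: in distributed training, the line a process appears to hang on is often not the line that caused the hang. A toy example (this script intentionally never exits; it assumes the `gloo` backend is available): if one rank enters a collective that the other never reaches, the first rank blocks inside the collective, and the other rank's next sync point, such as an `.item()` call like the one in the scaler, is where a debugger will stop.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    t = torch.ones(1)
    if rank == 0:
        # Rank 0 enters the collective...
        dist.all_reduce(t)
    # ...but rank 1 never does, so rank 0 blocks above forever. Whatever line
    # rank 1 happens to stop at next is where a debugger will report the
    # "hang", even though the real problem is the missing collective on the
    # other rank.
    print(f"rank {rank} reached the end")  # only rank 1 ever prints

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)  # hangs by design
```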
Not sure if this helps, but the code freezes when `find_unused_parameters=True` even if no layers are dropped. If `find_unused_parameters=False` and layers are dropped, it also freezes. Could there be issues with finding the unused parameters? Any possibility it is related to manually adjusting the parameters in the optimizer after surgery?
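On the surgery question: DDP builds its gradient buckets and hooks from the parameters that exist when the model is wrapped, so the usual safe ordering is surgery, then optimizer, then DDP. A minimal sketch of that ordering, assuming the process group is already initialized (`apply_surgery` is a hypothetical stand-in for the stochastic depth module swap, not a Composer API):

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model(base_model: nn.Module, apply_surgery, lr: float = 0.1):
    """Sketch of a safe ordering. `apply_surgery` is a hypothetical callable
    that swaps submodules in place (e.g. ResNetBottleneck -> stochastic block).
    Assumes torch.distributed is already initialized."""
    apply_surgery(base_model)                                     # 1. swap modules first
    optimizer = torch.optim.SGD(base_model.parameters(), lr=lr)   # 2. then build the optimizer
    # 3. wrap last: DDP registers hooks/buckets for the parameters present now;
    #    swapping modules after this point leaves the reducer pointing at stale
    #    parameters, one way ranks can end up waiting on gradients that never arrive.
    ddp_model = DDP(base_model, find_unused_parameters=False)
    return ddp_model, optimizer
```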
…_parameters is set (#1087)
`FORCED_SYNC` currently seems to be unreliable. Switch to using `MULTI_AUTO_SYNC` instead when `find_unused_parameters` is set. See #1086.
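For anyone unfamiliar with the two strategies, here is a rough sketch of the difference as I understand it, not Composer's actual implementation (`train_microbatches`, `loss_fn`, and `microbatches` are made-up names): `MULTI_AUTO_SYNC` lets DDP all-reduce gradients on every microbatch's backward pass, while `FORCED_SYNC` suppresses DDP's automatic sync via `no_sync()` and all-reduces the accumulated gradients manually at the end, which is fragile when some parameters received no gradient on some ranks.

```python
import contextlib
import torch
import torch.distributed as dist

def train_microbatches(ddp_model, optimizer, microbatches, loss_fn, forced_sync: bool):
    """Illustrative sketch of the two sync styles for gradient accumulation."""
    optimizer.zero_grad()
    # forced_sync=True: suppress DDP's per-backward all-reduce entirely.
    # forced_sync=False: DDP all-reduces gradients on every microbatch.
    ctx = ddp_model.no_sync() if forced_sync else contextlib.nullcontext()
    with ctx:
        for inputs, targets in microbatches:
            loss_fn(ddp_model(inputs), targets).backward()
    if forced_sync:
        # Manual gradient all-reduce. If some parameters have no gradient on
        # some ranks (e.g. dropped blocks), ranks issue different numbers of
        # collectives here and the job hangs.
        for p in ddp_model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad)
                p.grad /= dist.get_world_size()
    optimizer.step()
```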
Closes #492