
Training freezes when using multi-gpu jobs with block-wise stochastic depth #492

Closed
Landanjs opened this issue Feb 16, 2022 · 2 comments · Fixed by #1087
Assignees: Landanjs
Labels: bug (Something isn't working), research (Non-engineering enhancements for specific models, algorithms, or datasets)

Comments

@Landanjs (Contributor)

To reproduce

When I run the following command:
composer -n 2 examples/run_composer_trainer.py -f composer/yamls/models/resnet50.yaml --algorithms stochastic_depth --algorithms.stochastic_depth.target_layer_name ResNetBottleneck --algorithms.stochastic_depth.drop_rate 0.0 --loggers tqdm

The training loop freezes. Using only 1 GPU, or switching to sample-wise stochastic depth (`--algorithms.stochastic_depth.stochastic_method sample`), seems to work as expected.

This appears to be a bug with the block-wise stochastic depth layers and may be related to #296. It should probably be fixed before v0.4.
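For context, block-wise stochastic depth randomly skips entire residual blocks during training, so on any given step some parameters receive no gradients at all. A minimal sketch of the idea is below; it is illustrative only, not Composer's implementation, and the class and argument names are made up (the rescaling that real implementations apply is omitted):

```python
import torch
from torch import nn

class BlockwiseStochasticDepth(nn.Module):
    """Wrap a residual block; with probability ``drop_rate`` the whole block
    is skipped and only the identity path is used for that forward pass."""

    def __init__(self, block: nn.Module, drop_rate: float):
        super().__init__()
        self.block = block
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_rate:
            # Block dropped: its parameters get no gradients on this step.
            return x
        return x + self.block(x)
```

Because whole blocks can produce no gradients, the wrapped parameters may never fire their DDP gradient hooks, which is the kind of situation that can stall gradient synchronization across ranks.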

@Landanjs Landanjs added the bug Something isn't working label Feb 16, 2022
@Landanjs Landanjs added this to the v0.4 milestone Feb 16, 2022
@Landanjs Landanjs self-assigned this Feb 16, 2022
@jbloxham (Contributor)

Been looking into this. Somehow, the code is hanging at this line in PyTorch's native gradient scaler. `optimizer_state["found_inf_per_device"]` appears to trigger a distributed communication call when it is accessed; it seems to be linked to the `_MultiDeviceReplicator` defined earlier in that file.

Debugging this by tracing looks like it's going to get very difficult very quickly. I'll try another approach tomorrow.
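To illustrate the failure mode being described here (a hypothetical standalone example, not the actual scaler code path), a collective such as `all_reduce` only returns once every rank in the process group has entered it. If one rank takes a branch that skips the call, the ranks that did call it block silently instead of erroring:

```python
import torch
import torch.distributed as dist

def step(found_inf: torch.Tensor, skip_sync: bool) -> None:
    # Every rank must enter the same collectives in the same order.
    # If one rank passes skip_sync=True while the others don't, the ranks
    # that do reach all_reduce wait forever -- a silent hang rather than an
    # error, consistent with the freeze described in this issue.
    if not skip_sync:
        dist.all_reduce(found_inf, op=dist.ReduceOp.MAX)
```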

@hanlint hanlint modified the milestones: v0.4, Backlog Feb 17, 2022
@Landanjs (Contributor, Author) commented on Feb 17, 2022

Not sure if this helps, but the code freezes when `find_unused_parameters=True` even if no layers are dropped. If `find_unused_parameters=False` and layers are dropped, it also freezes. Could there be issues with finding the unused parameters? Could it be related to manually adjusting the optimizer's parameters after surgery?
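For reference, here is a toy example (module and names are hypothetical, unrelated to the ResNet surgery) of the situation `find_unused_parameters` is meant to handle: when a wrapped block is skipped on a step, its parameters produce no gradients, and DDP's reducer has to be told to detect that instead of waiting on gradient hooks that never fire.

```python
import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(8, 8)
        self.head = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor, drop_block: bool) -> torch.Tensor:
        if not drop_block:
            # When drop_block is True, self.block's parameters are unused
            # for this step and never receive gradients.
            x = x + self.block(x)
        return self.head(x)

# DDP must be told to look for parameters that did not participate:
# model = DDP(TinyNet().to(device), find_unused_parameters=True)
```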

@ravi-mosaicml ravi-mosaicml removed this from the Backlog milestone Feb 28, 2022
@ravi-mosaicml ravi-mosaicml added the research Non-engineering enhancements for specific models, algorithms, or datasets. label Mar 31, 2022
ravi-mosaicml pushed a commit that referenced this issue May 25, 2022
…_parameters is set (#1087)

`FORCED_SYNC` currently seems to be unreliable. Switch to using `MULTI_AUTO_SYNC` instead when `find_unused_parameters` is set. See #1086.

Closes #492