
Training freezes when using multi-gpu jobs with block-wise stochastic depth #492

Closed
Landanjs opened this issue Feb 16, 2022 · 2 comments · Fixed by #1087
Assignees: Landanjs
Labels: bug (Something isn't working), research (Non-engineering enhancements for specific models, algorithms, or datasets)

Comments

@Landanjs (Contributor)

To reproduce

When I run the following command:
composer -n 2 examples/run_composer_trainer.py -f composer/yamls/models/resnet50.yaml --algorithms stochastic_depth --algorithms.stochastic_depth.target_layer_name ResNetBottleneck --algorithms.stochastic_depth.drop_rate 0.0 --loggers tqdm

The training loop freezes. Using only 1 GPU, or switching to sample-wise stochastic depth (`--algorithms.stochastic_depth.stochastic_method sample`), seems to work as expected.

This appears to be a bug with the block-wise stochastic depth layers and may be related to #296. It should probably be fixed before v0.4.
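For context, block-wise stochastic depth randomly skips entire residual blocks during training, so on any given step some parameters receive no gradients at all. A minimal sketch of the idea is below; it is illustrative only, not Composer's implementation, and the class and argument names are made up (the rescaling that real implementations apply is omitted):

```python
import torch
from torch import nn

class BlockwiseStochasticDepth(nn.Module):
    """Wrap a residual block; with probability ``drop_rate`` the whole block
    is skipped and only the identity path is used for that forward pass."""

    def __init__(self, block: nn.Module, drop_rate: float):
        super().__init__()
        self.block = block
        self.drop_rate = drop_rate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training and torch.rand(1).item() < self.drop_rate:
            # Block dropped: its parameters get no gradients on this step.
            return x
        return x + self.block(x)
```

Because whole blocks can produce no gradients, the wrapped parameters may never fire their DDP gradient hooks, which is the kind of situation that can stall gradient synchronization across ranks.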

@Landanjs Landanjs added the bug Something isn't working label Feb 16, 2022
@Landanjs Landanjs added this to the v0.4 milestone Feb 16, 2022
@Landanjs Landanjs self-assigned this Feb 16, 2022
@jbloxham (Contributor)

Been looking into this. Somehow, the code is hanging at this line in PyTorch's native gradient scaler. `optimizer_state["found_inf_per_device"]` appears to trigger a distributed communication call when it is accessed; it seems to be linked to the `_MultiDeviceReplicator` defined earlier in that file.

Debugging this by tracing looks like it's going to get very difficult very quickly. I'll try another approach tomorrow.
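To illustrate the failure mode being described here (a hypothetical standalone example, not the actual scaler code path), a collective such as `all_reduce` only returns once every rank in the process group has entered it. If one rank takes a branch that skips the call, the ranks that did call it block silently instead of erroring:

```python
import torch
import torch.distributed as dist

def step(found_inf: torch.Tensor, skip_sync: bool) -> None:
    # Every rank must enter the same collectives in the same order.
    # If one rank passes skip_sync=True while the others don't, the ranks
    # that do reach all_reduce wait forever -- a silent hang rather than an
    # error, consistent with the freeze described in this issue.
    if not skip_sync:
        dist.all_reduce(found_inf, op=dist.ReduceOp.MAX)
```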

@hanlint hanlint modified the milestones: v0.4, Backlog Feb 17, 2022
@Landanjs (Contributor, Author) commented on Feb 17, 2022

Not sure if this helps, but the code freezes when `find_unused_parameters=True` even if no layers are dropped. If `find_unused_parameters=False` and layers are dropped, it also freezes. Could there be issues with finding the unused parameters? Could it be related to manually adjusting the optimizer's parameters after surgery?
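For reference, here is a toy example (module and names are hypothetical, unrelated to the ResNet surgery) of the situation `find_unused_parameters` is meant to handle: when a wrapped block is skipped on a step, its parameters produce no gradients, and DDP's reducer has to be told to detect that instead of waiting on gradient hooks that never fire.

```python
import torch
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(8, 8)
        self.head = nn.Linear(8, 2)

    def forward(self, x: torch.Tensor, drop_block: bool) -> torch.Tensor:
        if not drop_block:
            # When drop_block is True, self.block's parameters are unused
            # for this step and never receive gradients.
            x = x + self.block(x)
        return self.head(x)

# DDP must be told to look for parameters that did not participate:
# model = DDP(TinyNet().to(device), find_unused_parameters=True)
```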

@ravi-mosaicml ravi-mosaicml removed this from the Backlog milestone Feb 28, 2022
@ravi-mosaicml ravi-mosaicml added the research Non-engineering enhancements for specific models, algorithms, or datasets. label Mar 31, 2022
ravi-mosaicml pushed a commit that referenced this issue May 25, 2022
…_parameters is set (#1087)

`FORCED_SYNC` currently seems to be unreliable. Switch to using `MULTI_AUTO_SYNC` instead when `find_unused_parameters` is set. See #1086.

Closes #492