
[Train] Make prepare_model always use the correct device #29104

Merged
amogkam merged 1 commit into ray-project:master on Oct 6, 2022

Conversation

@amogkam (Contributor) commented on Oct 6, 2022

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>

Why are these changes needed?

Previously, prepare_model would use the local rank as the device index, even though the local rank may not match the actual device index. This mismatch can happen, for example, when CUDA_VISIBLE_DEVICES is set, which Ray Train does by default.

We should always use the device returned by train.torch.get_device() when wrapping the model in DDP.

Related issue number

Closes #28996
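
For context, a minimal sketch of the before/after behavior, assuming a standard DDP setup. Only train.torch.get_device() is taken from this PR; MyModel and the commented-out local_rank line are illustrative, not the exact prepare_model internals:

```python
import torch
from torch.nn.parallel import DistributedDataParallel
from ray import train

# Before (buggy): derive the device from the worker's local rank.
# Once Ray Train sets CUDA_VISIBLE_DEVICES for the worker, the local rank
# may not correspond to the index of the GPU torch actually sees, so the
# model could be moved to, and DDP-wrapped on, the wrong device.
# device = torch.device(f"cuda:{local_rank}")

# After (fixed): ask Ray Train which device this worker was assigned.
device = train.torch.get_device()

model = MyModel()  # hypothetical user nn.Module
model = model.to(device)
if device.type == "cuda":
    # Wrap with DDP on the worker's actual device, not its local rank.
    model = DistributedDataParallel(
        model,
        device_ids=[device.index],
        output_device=device.index,
    )
```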

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com>
@bveeramani (Member) left a comment

Seems reasonable

@amogkam amogkam merged commit 2217f0c into ray-project:master Oct 6, 2022
@amogkam amogkam deleted the fix-ddp-wrap branch October 6, 2022 04:16
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request on Dec 19, 2022:
[Train] Make prepare_model always use the correct device (ray-project#29104)

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
Development

Successfully merging this pull request may close these issues.

[release][CI] air_benchmark_tune_torch_mnist_large_gpu failed