
[CI] linux://doc:datasets_train is failing/flaky on master. #29300

Closed · rickyyx opened this issue Oct 13, 2022 · 4 comments · Fixed by #29305

Labels: flaky-tracker (Issue created via Flaky Test Tracker, https://flaky-tests.ray.io/), release-blocker (P0: Issue that blocks the release), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

rickyyx (Member) commented Oct 13, 2022

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc:datasets_train-END
....

rickyyx added the release-blocker, triage, flaky-tracker, and r2.1-failure labels on Oct 13, 2022
rickyyx (Member, Author) commented Oct 13, 2022

matthewdeng (Contributor) commented:

Hmmm @amogkam any ideas about this one?

  opt/bin/doc/datasets_train.runfiles/com_github_ray_project_ray/doc/source/ray-core/_examples/datasets_train/datasets_train.py", line 350, in train_epoch
    loss = criterion(outputs, labels.float())
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 716, in forward
    reduction=self.reduction)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2960, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I originally thought it was due to #29104, but I see that it's already checked in...

amogkam (Contributor) commented Oct 13, 2022

The example should use train.torch.get_device() instead of deriving the device manually from the local rank:

# Setup device.
device = torch.device(
    f"cuda:{session.get_local_rank()}"
    if use_gpu and torch.cuda.is_available()
    else "cpu"
)
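
For comparison, a minimal sketch of what the suggested change could look like, assuming the Ray 2.x Train API where ray.train.torch.get_device() returns the device already assigned to the current worker (use_gpu stands in for the flag used in the example's training function):

import torch
import ray.train.torch  # provides get_device(); assumes Ray 2.x with Train installed

use_gpu = True  # stands in for the example's use_gpu flag

# Setup device: let Ray Train resolve the device assigned to this worker
# instead of building it from the local rank, which may not line up with
# the CUDA device index.
device = (
    ray.train.torch.get_device()
    if use_gpu and torch.cuda.is_available()
    else torch.device("cpu")
)

Letting Ray Train hand out the device keeps the example correct regardless of how GPUs are mapped to workers on a node.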

amogkam (Contributor) commented Oct 13, 2022

#29305

amogkam added a commit that referenced this issue Oct 13, 2022
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com

The example should use train.torch.get_device() rather than setting the device manually from local_rank, because the local rank may not always match the device index.

Closes #29300
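
The commit message notes that the local rank may not always match the device index. As a hedged illustration (an assumed setup, not taken from this issue): if the launcher restricts a worker to a single GPU via CUDA_VISIBLE_DEVICES, PyTorch renumbers that GPU as cuda:0, so f"cuda:{local_rank}" can point at the wrong or a nonexistent device:

import os

# Hypothetical worker environment: local rank 1, but only physical GPU 1 visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
local_rank = 1

import torch  # imported after setting the env var so CUDA sees the restriction

if torch.cuda.is_available():
    print(torch.cuda.device_count())  # 1 -> the only valid index is cuda:0
    # torch.device(f"cuda:{local_rank}") would be cuda:1, which does not exist here;
    # train.torch.get_device() would return the correct cuda:0 for this worker.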
rickyyx pushed a commit that referenced this issue Oct 19, 2022 (same commit message as above; Closes #29300).
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022 (same commit message; Closes ray-project#29300).
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>