
[CI] linux://doc:datasets_train is failing/flaky on master. #29300

Closed · rickyyx opened this issue Oct 13, 2022 · 4 comments · Fixed by #29305

Labels: flaky-tracker (Issue created via Flaky Test Tracker, https://flaky-tests.ray.io/), release-blocker (P0: Issue that blocks the release), triage (Needs triage, e.g. priority, bug/not-bug, and owning component)

Comments

rickyyx (Member) commented Oct 13, 2022

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc:datasets_train-END
....

rickyyx added the release-blocker, triage, flaky-tracker, and r2.1-failure labels on Oct 13, 2022
rickyyx (Member, Author) commented Oct 13, 2022

matthewdeng (Contributor) commented:

Hmmm @amogkam any ideas about this one?

  opt/bin/doc/datasets_train.runfiles/com_github_ray_project_ray/doc/source/ray-core/_examples/datasets_train/datasets_train.py", line 350, in train_epoch
    loss = criterion(outputs, labels.float())
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 716, in forward
    reduction=self.reduction)
  File "/opt/miniconda/lib/python3.7/site-packages/torch/nn/functional.py", line 2960, in binary_cross_entropy_with_logits
    return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I originally thought it was due to #29104, but I see that it's already checked in...

amogkam (Contributor) commented Oct 13, 2022

The example should use train.torch.get_device() instead of deriving the device manually from the local rank:

# Setup device.
device = torch.device(
    f"cuda:{session.get_local_rank()}"
    if use_gpu and torch.cuda.is_available()
    else "cpu"
)
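
For comparison, a minimal sketch of what the suggested change could look like, assuming the Ray 2.x Train API where ray.train.torch.get_device() returns the device already assigned to the current worker (use_gpu stands in for the flag used in the example's training function):

import torch
import ray.train.torch  # provides get_device(); assumes Ray 2.x with Train installed

use_gpu = True  # stands in for the example's use_gpu flag

# Setup device: let Ray Train resolve the device assigned to this worker
# instead of building it from the local rank, which may not line up with
# the CUDA device index.
device = (
    ray.train.torch.get_device()
    if use_gpu and torch.cuda.is_available()
    else torch.device("cpu")
)

Letting Ray Train hand out the device keeps the example correct regardless of how GPUs are mapped to workers on a node.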

amogkam (Contributor) commented Oct 13, 2022

#29305

amogkam added a commit that referenced this issue Oct 13, 2022
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com

The example should use train.torch.get_device() rather than setting the device manually from local_rank, because the local rank may not always match the device index.

Closes #29300
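
The commit message notes that the local rank may not always match the device index. As a hedged illustration (an assumed setup, not taken from this issue): if the launcher restricts a worker to a single GPU via CUDA_VISIBLE_DEVICES, PyTorch renumbers that GPU as cuda:0, so f"cuda:{local_rank}" can point at the wrong or a nonexistent device:

import os

# Hypothetical worker environment: local rank 1, but only physical GPU 1 visible.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
local_rank = 1

import torch  # imported after setting the env var so CUDA sees the restriction

if torch.cuda.is_available():
    print(torch.cuda.device_count())  # 1 -> the only valid index is cuda:0
    # torch.device(f"cuda:{local_rank}") would be cuda:1, which does not exist here;
    # train.torch.get_device() would return the correct cuda:0 for this worker.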
rickyyx pushed a commit that referenced this issue Oct 19, 2022 (same commit message as above; Closes #29300).
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022 (same commit message; Closes ray-project#29300).
Signed-off-by: Amog Kamsetty amogkamsetty@yahoo.com
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>