
[data] release test failure: dataset_shuffle_push_based_random_shuffle_100tb #34170

Closed
clarng opened this issue Apr 7, 2023 · 13 comments · Fixed by #34224
Assignees: jianoaix
Labels: bug (Something that is supposed to be working, but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release)

Comments


clarng commented Apr 7, 2023

What happened + What you expected to happen

Started failing recently: the test times out after running for 8 hours.

e.g. https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga

A previously successful run took 3 hours to finish: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_vc3t9r3se8lkd1za9w1enfssng

[Screenshots attached: Screen Shot 2023-04-07 at 12 45 10 PM, Screen Shot 2023-04-07 at 12 44 59 PM]

Versions / Dependencies

master and 2.4.0

Reproduction script

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga

Issue Severity

None

clarng added the bug, release-blocker, and triage (needs triage: priority, bug/not-bug, owning component) labels on Apr 7, 2023
jianoaix added the P0 label and removed the triage label on Apr 7, 2023
c21 assigned jianoaix and unassigned c21 and scottjlee on Apr 7, 2023

jianoaix commented Apr 7, 2023

Was it failing after the cherry-picks made in the last couple of days?


clarng commented Apr 7, 2023

The first failing run was on Apr 2 on master; the branch cut happened on the 4th.


jianoaix commented Apr 7, 2023

The test tool itself is broken; it seems to be missing a dependency:

    Traceback (most recent call last):
      File "ray_release/scripts/run_release_test.py", line 18, in <module>
        from ray_release.glue import run_release_test
      File "/tmp/release-u8BE1A8z33/release/ray_release/glue.py", line 11, in <module>
        from ray_release.command_runner.anyscale_job_runner import AnyscaleJobRunner
      File "/tmp/release-u8BE1A8z33/release/ray_release/command_runner/anyscale_job_runner.py", line 22, in <module>
        from ray_release.file_manager.job_file_manager import JobFileManager
      File "/tmp/release-u8BE1A8z33/release/ray_release/file_manager/job_file_manager.py", line 8, in <module>
        from google.cloud import storage
    ImportError: cannot import name 'storage' from 'google.cloud' (unknown location)
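
For context on the ImportError above: google.cloud.storage is provided by the separately installed google-cloud-storage package, so a likely cause is that this package is missing from the tooling environment. A minimal, illustrative guard (the pip command is the standard way to install the package; it is not taken from the release tooling's requirements):

    # Illustrative only: fail fast with a clear message if the optional
    # google-cloud-storage dependency is missing from the environment.
    try:
        from google.cloud import storage  # provided by the google-cloud-storage package
    except ImportError as exc:
        raise SystemExit(
            "Missing dependency 'google-cloud-storage'; "
            "install it with: pip install google-cloud-storage"
        ) from exc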


jianoaix commented Apr 7, 2023

While the test tool is broken for bisecting, I'm looking at the logs. The successful run actually seems to have had more spilling:

  • Successful run: (raylet, ip=10.0.10.197) Spilled 2099323 MiB, 367864 objects, write throughput 885 MiB/s.
  • Failed run: (raylet, ip=10.0.40.230) Spilled 524833 MiB, 100387 objects, write throughput 1093 MiB/s.

This may indicate the failed run was completely stuck and not making progress (not even spilling to disk), and then timed out.
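
As an aside, a quick way to compare spill progress across runs is to aggregate the raylet "Spilled ..." lines quoted above; this is an illustrative helper based on that log format, not part of the release tooling:

    import re
    from collections import defaultdict

    # Matches raylet progress lines like:
    #   (raylet, ip=10.0.10.197) Spilled 2099323 MiB, 367864 objects, write throughput 885 MiB/s.
    SPILL_RE = re.compile(
        r"\(raylet, ip=(?P<ip>[\d.]+)\) Spilled (?P<mib>\d+) MiB, (?P<objs>\d+) objects"
    )

    def spilled_mib_per_node(log_text: str) -> dict:
        """Latest cumulative spilled MiB reported by each raylet in the log."""
        latest = defaultdict(int)
        for m in SPILL_RE.finditer(log_text):
            # Treat reports as cumulative totals: keep the largest value seen per node.
            latest[m.group("ip")] = max(latest[m.group("ip")], int(m.group("mib")))
        return dict(latest)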


ericl commented Apr 10, 2023

It actually makes sense that that PR would regress this (in fact, that was one of the main potential regression concerns with that change).

Let's split this into a short-term and a long-term resolution?

  • Short term, I think we should add a Ray config to allow reverting to the previous behavior. We can turn it on by default for this test only, since this is a niche use case for now.

  • Longer term, we should probably adapt the scheduling API to allow a more flexible level of softness.

@jjyao what do you think of this plan?

@jjyao
Copy link
Collaborator

jjyao commented Apr 10, 2023

Given that it does have a negative impact on some workloads, should we keep the old behavior by default for the sake of backward compatibility and safety?

For 2.4 we can introduce a private _spill_on_unavailable option (used by Datasets only) on NodeAffinitySchedulingStrategy, and only Datasets would set it to true. It should be a pretty small change and I can get it done by EOD.
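
A rough sketch of how the proposed option could look at a call site, assuming it lands as a keyword argument on NodeAffinitySchedulingStrategy (the name _spill_on_unavailable comes from the proposal above; the final signature may differ):

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init()

    @ray.remote
    def shuffle_map_block():
        return "ok"

    target_node_id = ray.get_runtime_context().get_node_id()

    # Soft affinity to the target node; with the proposed private flag, the task
    # could be spilled to another node when the target is unavailable instead of
    # queueing on it, restoring the pre-regression scheduling behavior.
    strategy = NodeAffinitySchedulingStrategy(
        node_id=target_node_id,
        soft=True,
        _spill_on_unavailable=True,  # proposed private option, Datasets-only
    )
    result = ray.get(shuffle_map_block.options(scheduling_strategy=strategy).remote())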


ericl commented Apr 10, 2023

I'm ok with that plan, but can we also figure out the proper API in 2.5 (where we also make the new behavior the default)?


jjyao commented Apr 10, 2023

Yes. For 2.4 we use the private option; in 2.5 we figure out the proper API, remove the private option, and pick the default behavior.

jianoaix commented:

@jjyao The PR seems to break the CI: https://buildkite.com/ray-project/oss-ci-build-pr/builds/18212


jjyao commented Apr 12, 2023

@jianoaix which CI test? I checked https://flakey-tests.ray.io/ which looks good.

jianoaix commented:

Yeah, I think it's just that the PR wasn't synced. Sorry for the spam; it should be good.
