
[data] release test failure: dataset_shuffle_push_based_random_shuffle_100tb #34170

Closed
clarng opened this issue Apr 7, 2023 · 13 comments · Fixed by #34224
Assignees: jianoaix
Labels: bug (Something that is supposed to be working, but isn't), P0 (Issues that should be fixed in short order), release-blocker (P0 Issue that blocks the release)

Comments


clarng commented Apr 7, 2023

What happened + What you expected to happen

Started failing recently: the test times out after running for 8 hours.

e.g. https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga

A previously successful run took 3 hours to finish: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_vc3t9r3se8lkd1za9w1enfssng

[Screenshots attached: Screen Shot 2023-04-07 at 12 45 10 PM, Screen Shot 2023-04-07 at 12 44 59 PM]

Versions / Dependencies

master and 2.4.0

Reproduction script

https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga

Issue Severity

None

clarng added the bug, release-blocker, and triage (needs triage: priority, bug/not-bug, owning component) labels on Apr 7, 2023
jianoaix added the P0 label and removed the triage label on Apr 7, 2023
c21 assigned jianoaix and unassigned c21 and scottjlee on Apr 7, 2023

jianoaix commented Apr 7, 2023

Was it failing after the cherry-picks made in the last couple of days?


clarng commented Apr 7, 2023

The first failing run was on Apr 2 on master; the branch cut happened on the 4th.


jianoaix commented Apr 7, 2023

The test tool itself is broken; it seems to be missing a dependency:

    Traceback (most recent call last):
      File "ray_release/scripts/run_release_test.py", line 18, in <module>
        from ray_release.glue import run_release_test
      File "/tmp/release-u8BE1A8z33/release/ray_release/glue.py", line 11, in <module>
        from ray_release.command_runner.anyscale_job_runner import AnyscaleJobRunner
      File "/tmp/release-u8BE1A8z33/release/ray_release/command_runner/anyscale_job_runner.py", line 22, in <module>
        from ray_release.file_manager.job_file_manager import JobFileManager
      File "/tmp/release-u8BE1A8z33/release/ray_release/file_manager/job_file_manager.py", line 8, in <module>
        from google.cloud import storage
    ImportError: cannot import name 'storage' from 'google.cloud' (unknown location)
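
For context on the ImportError above: google.cloud.storage is provided by the separately installed google-cloud-storage package, so a likely cause is that this package is missing from the tooling environment. A minimal, illustrative guard (the pip command is the standard way to install the package; it is not taken from the release tooling's requirements):

    # Illustrative only: fail fast with a clear message if the optional
    # google-cloud-storage dependency is missing from the environment.
    try:
        from google.cloud import storage  # provided by the google-cloud-storage package
    except ImportError as exc:
        raise SystemExit(
            "Missing dependency 'google-cloud-storage'; "
            "install it with: pip install google-cloud-storage"
        ) from exc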


jianoaix commented Apr 7, 2023

While the test tool is broken for bisecting, I'm looking at the logs. The successful run actually seems to have had more spilling:

  • Successful run: (raylet, ip=10.0.10.197) Spilled 2099323 MiB, 367864 objects, write throughput 885 MiB/s.
  • Failed run: (raylet, ip=10.0.40.230) Spilled 524833 MiB, 100387 objects, write throughput 1093 MiB/s.

This may indicate the failed run was completely stuck and not making progress (not even spilling to disk), and then timed out.
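
As an aside, a quick way to compare spill progress across runs is to aggregate the raylet "Spilled ..." lines quoted above; this is an illustrative helper based on that log format, not part of the release tooling:

    import re
    from collections import defaultdict

    # Matches raylet progress lines like:
    #   (raylet, ip=10.0.10.197) Spilled 2099323 MiB, 367864 objects, write throughput 885 MiB/s.
    SPILL_RE = re.compile(
        r"\(raylet, ip=(?P<ip>[\d.]+)\) Spilled (?P<mib>\d+) MiB, (?P<objs>\d+) objects"
    )

    def spilled_mib_per_node(log_text: str) -> dict:
        """Latest cumulative spilled MiB reported by each raylet in the log."""
        latest = defaultdict(int)
        for m in SPILL_RE.finditer(log_text):
            # Treat reports as cumulative totals: keep the largest value seen per node.
            latest[m.group("ip")] = max(latest[m.group("ip")], int(m.group("mib")))
        return dict(latest)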


ericl commented Apr 10, 2023

It actually makes sense that that PR would regress this (in fact, that was one of the main potential regression concerns with that change).

Let's split this into a short-term and a long-term resolution?

  • Short term, I think we should add a Ray config to allow reverting to the previous behavior. We can turn it on by default for this test only, since this is a niche use case for now.

  • Longer term, we should probably adapt the scheduling API to allow a more flexible level of softness.

@jjyao what do you think of this plan?

@jjyao
Copy link
Collaborator

jjyao commented Apr 10, 2023

Given that it does have a negative impact on some workloads, should we keep the old behavior by default for the sake of backward compatibility and safety?

For 2.4 we can introduce a private _spill_on_unavailable option (used by Datasets only) on NodeAffinitySchedulingStrategy, and only Datasets would set it to true. It should be a pretty small change and I can get it done by EOD.
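
A rough sketch of how the proposed option could look at a call site, assuming it lands as a keyword argument on NodeAffinitySchedulingStrategy (the name _spill_on_unavailable comes from the proposal above; the final signature may differ):

    import ray
    from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

    ray.init()

    @ray.remote
    def shuffle_map_block():
        return "ok"

    target_node_id = ray.get_runtime_context().get_node_id()

    # Soft affinity to the target node; with the proposed private flag, the task
    # could be spilled to another node when the target is unavailable instead of
    # queueing on it, restoring the pre-regression scheduling behavior.
    strategy = NodeAffinitySchedulingStrategy(
        node_id=target_node_id,
        soft=True,
        _spill_on_unavailable=True,  # proposed private option, Datasets-only
    )
    result = ray.get(shuffle_map_block.options(scheduling_strategy=strategy).remote())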


ericl commented Apr 10, 2023

I'm ok with that plan, but can we also figure out the proper API in 2.5 (where we also make the new behavior the default)?


jjyao commented Apr 10, 2023

Yes. For 2.4 we use the private option; in 2.5 we figure out the proper API, remove the private option, and pick the default behavior.

jianoaix commented:

@jjyao The PR seems to break the CI: https://buildkite.com/ray-project/oss-ci-build-pr/builds/18212


jjyao commented Apr 12, 2023

@jianoaix which CI test? I checked https://flakey-tests.ray.io/ which looks good.

jianoaix commented:

Yeah, I think it's just that the PR wasn't synced. Sorry for the spam; it should be good.
