[data] release test failure: dataset_shuffle_push_based_random_shuffle_100tb #34170
Comments
Did it start failing after the cherry-picks made in the last couple of days?
The first failing run was on Apr 2 on master; the branch cut happened on the 4th.
The test tool itself is broken; it seems to be missing a dependency.
While the test tool is broken for bisecting, I'm looking at the logs. It seems the successful run actually had more spilling.
This may indicate that the failed run was completely stuck and not making progress (not even spilling to disk), and then timed out.
So the bisecting pointed to this PR as the culprit: adb6775. The commit right before this PR passed (eae4e78): https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_4a8rsymheiljps4javg4gfbb6v. All trials earlier than this PR also passed.
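For anyone following along, the bisection logic described above can be sketched roughly as below. This is only an illustration and not the release test tool's actual implementation: `first_bad_commit` and `passes` are hypothetical names, and the sketch assumes the commits are ordered oldest to newest, the first one passes, the last one fails, and the result flips from pass to fail exactly once.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], passes: Callable[[str], bool]) -> str:
    """Binary-search for the first failing commit.

    Assumptions (hypothetical helper, not the real tool):
      - commits are ordered oldest -> newest
      - commits[0] passes and commits[-1] fails
      - there is a single pass -> fail transition
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if passes(commits[mid]):
            lo = mid  # regression was introduced after mid
        else:
            hi = mid  # regression was introduced at or before mid
    return commits[hi]


# Usage: only eae4e78 and adb6775 come from this thread; the other
# entries are placeholders standing in for neighbouring commits.
history = ["older-a", "older-b", "eae4e78", "adb6775", "newer-a"]
known_bad = {"adb6775", "newer-a"}
culprit = first_bad_commit(history, lambda c: c not in known_bad)
print(culprit)
```

In practice each `passes` call would be a full release test run, so the number of runs grows with the logarithm of the commit range rather than linearly.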
It actually makes sense that this PR would regress the test (that was one of the main potential regression concerns of that change). Let's split this into a short-term and a long-term resolution?
@jjyao what do you think of this plan?
Given that it does have a negative impact on some workloads, should we keep the old behavior by default for the sake of backward compatibility and safety? For 2.4 we can introduce a private option (used by dataset only).
I'm ok with that plan, but can we also figure out the proper API in 2.5 (where we also make the new behavior the default)?
Yes. So in 2.4 we use the private option. In 2.5 we figure out the proper API, remove the private option, and pick the default behavior.
@jjyao The PR seems to break the CI: https://buildkite.com/ray-project/oss-ci-build-pr/builds/18212
@jianoaix which CI test? I checked https://flakey-tests.ray.io/, which looks good.
Yeah, I think it's just that the PR wasn't synced. Sorry for the spam, it should be good.
What happened + What you expected to happen
The test started failing recently (timing out after running for 8 hrs),
e.g. https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga
A successful run before that took 3 hrs to finish: https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_vc3t9r3se8lkd1za9w1enfssng
Versions / Dependencies
master and 2.4.0
Reproduction script
https://console.anyscale-staging.com/o/anyscale-internal/jobs/prodjob_jv6r66hzdef9zx4dylhkmfxsga
Issue Severity
None