[core][tests] chaos_dataset_shuffle_push_based_sort_1tb fails with ray.exceptions.WorkerCrashedError #28411

Closed
stephanie-wang opened this issue Sep 9, 2022 · 3 comments
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks)

Comments

@stephanie-wang
Contributor

What happened + What you expected to happen

==== Driver memory summary ====
max: 0.490664/GB
rss: 0.502439936/GB
--- Aggregate object store stats across all nodes ---
Plasma memory usage 183414 MiB, 727 objects, 52.16% full, 24.37% needed
Plasma filesystem mmap usage: 3141 MiB
Spilled 3441152 MiB, 114668 objects, avg write throughput 3129 MiB/s
Restored 2818676 MiB, 10636 objects, avg read throughput 2107 MiB/s
Objects consumed by Ray tasks: 2780039 MiB.


Stage 0 read: 1/1000 blocks executed in 2293.37s, 999/1000 blocks split from parent
* Remote wall time: 1.05s min, 1.05s max, 1.05s mean, 1.05s total
* Remote cpu time: 1.11s min, 1.11s max, 1.11s mean, 1.11s total
* Peak heap memory usage (MiB): 16989260000.0 min, 16989260000.0 max, 16989260000 mean
* Output num rows: 125000000 min, 125000000 max, 125000000 mean, 125000000000 total
* Output size bytes: 1000000000 min, 1000000000 max, 1000000000 mean, 1000000000000 total
* Tasks per node: 1 min, 1 max, 1 mean; 1 nodes used
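For scale, the spill and restore volumes in the summary above can be compared against the dataset size. This is only a rough sketch: the numbers are copied from the log, MiB is assumed to mean 2**20 bytes, and the variable names are illustrative.

```python
# Values copied from the memory summary and stage stats above.
# Assumption: "MiB" in the log means 2**20 bytes.
MIB = 2**20

dataset_bytes = 1_000_000_000_000  # "Output size bytes: ... total"
total_rows = 125_000_000_000       # "Output num rows: ... total"
spilled_mib = 3_441_152            # "Spilled ... MiB"
restored_mib = 2_818_676           # "Restored ... MiB"

# Average row width implied by the totals.
bytes_per_row = dataset_bytes / total_rows

# How many times the full dataset was written to and read back from disk.
spill_ratio = spilled_mib * MIB / dataset_bytes
restore_ratio = restored_mib * MIB / dataset_bytes

print(f"bytes per row:     {bytes_per_row:.1f}")
print(f"spilled/dataset:   {spill_ratio:.2f}x")
print(f"restored/dataset:  {restore_ratio:.2f}x")
```

Under these assumptions the run spilled between three and four times the dataset size and restored roughly three times the dataset size before failing.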

Traceback (most recent call last):
  File "dataset/sort.py", line 165, in <module>
    raise exc
  File "dataset/sort.py", line 119, in <module>
    ds = ds.sort(key="c_0")
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 1716, in sort
    return Dataset(plan, self._epoch, self._lazy)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/dataset.py", line 203, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 309, in execute
    blocks, clear_input_blocks, self._run_by_consumer
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/plan.py", line 756, in __call__
    blocks, clear_input_blocks, self.block_udf, self.ray_remote_args
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/stage_impl.py", line 194, in do_sort
    return sort_impl(blocks, clear_input_blocks, key, descending)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/sort.py", line 153, in sort_impl
    clear_input_blocks,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/push_based_shuffle.py", line 495, in execute
    reduce_stage_metadata += next(reduce_stage_executor)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/push_based_shuffle.py", line 155, in __next__
    prev_metadata_refs
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/worker.py", line 2279, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError: ray::reduce() (pid=1286, ip=172.31.66.173)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::merge() (pid=144151, ip=172.31.86.48)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
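
The last five lines of the traceback nest three errors. As a reading aid, here is a minimal sketch that pulls the chain apart using only the standard library. `parse_error_chain` is a hypothetical helper written for this note, not part of Ray, and the regular expressions only assume the message layout seen in this one traceback.

```python
import re

# Tail of the traceback above, copied verbatim.
ERROR_TAIL = """\
ray.exceptions.RayTaskError: ray::reduce() (pid=1286, ip=172.31.66.173)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::merge() (pid=144151, ip=172.31.86.48)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.WorkerCrashedError: The worker died unexpectedly while executing this task. Check python-core-worker-*.log files for more information.
"""

# Matches lines such as:
#   ray.exceptions.RayTaskError: ray::reduce() (pid=1286, ip=172.31.66.173)
TASK_ERROR = re.compile(
    r"^ray\.exceptions\.RayTaskError: ray::(?P<task>\w+)\(\) "
    r"\(pid=(?P<pid>\d+), ip=(?P<ip>[\d.]+)\)"
)
# Any other ray.exceptions.* line is treated as the innermost cause.
OTHER_ERROR = re.compile(r"^ray\.exceptions\.(?P<kind>\w+): ")


def parse_error_chain(text):
    """Return ([(task, pid, ip), ...], innermost_error_kind) from outermost to innermost."""
    tasks = []
    innermost = None
    for line in text.splitlines():
        m = TASK_ERROR.match(line)
        if m:
            tasks.append((m["task"], int(m["pid"]), m["ip"]))
            continue
        m = OTHER_ERROR.match(line)
        if m:
            innermost = m["kind"]
    return tasks, innermost


tasks, root_cause = parse_error_chain(ERROR_TAIL)
for task, pid, ip in tasks:
    print(f"{task} failed on {ip} (pid {pid}): an input could not be computed")
print(f"innermost error: {root_cause}")
```

Read this way, `reduce` failed because an input produced by `merge` was unavailable, and `merge` in turn failed because one of its own inputs was lost to a `WorkerCrashedError`. The message does not name the task that was running on the crashed worker.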

Versions / Dependencies

Commit: ca705fd
Last successful commit: e12d436

Reproduction script

chaos_dataset_shuffle_push_based_sort_1tb

Issue Severity

No response

stephanie-wang added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Sep 9, 2022
stephanie-wang added this to the Core Nightly/CI Regressions milestone on Sep 9, 2022
stephanie-wang added the P0 (Issues that should be fixed in short order) label and removed the triage label on Sep 9, 2022
@stephanie-wang
Contributor Author

Duplicate of #26922.

scv119 added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label on Sep 19, 2022
@rickyyx
Contributor

rickyyx commented Sep 26, 2022

What's the status of this?

The test has started failing again: #28774

@hora-anyscale
Contributor

Per Triage Sync: Flaky Test


5 participants