
[CI] linux://doc/source/tune/examples:bohb_example is failing/flaky on master. #35428

Closed
justinvyu opened this issue May 17, 2023 · 2 comments · Fixed by #36951
Labels
bug (Something that is supposed to be working; but isn't) · flaky-tracker (Issue created via Flaky Test Tracker, https://flaky-tests.ray.io/) · tune (Tune-related issues)

Comments

justinvyu (Contributor) commented May 17, 2023

See an example failure here: https://buildkite.com/ray-project/oss-ci-build-branch/builds/3997#018830f1-91f5-4f3d-a4ae-7ee8abf313ab

This is the flaky example: https://docs.ray.io/en/latest/tune/examples/bohb_example.html

Note that this flakiness is DIFFERENT from the recent tune/bohb_example.py flakiness. This issue is tracking flakiness that has been present for longer.

[Screenshot: Screen Shot 2023-05-23 at 1.34.46 AM]

Stack trace:



Traceback (most recent call last):
  File "/tmp/tmpvdzg3wez", line 151, in <module>
    results = tuner.fit()
  File "/ray/python/ray/tune/tuner.py", line 346, in fit
    return self._local_tuner.fit()
  File "/ray/python/ray/tune/impl/tuner_internal.py", line 587, in fit
    analysis = self._fit_internal(trainable, param_space)
  File "/ray/python/ray/tune/impl/tuner_internal.py", line 713, in _fit_internal
    **args,
  File "/ray/python/ray/tune/tune.py", line 1059, in run
    runner.step()
  File "/ray/python/ray/tune/execution/tune_controller.py", line 256, in step
    if not self._actor_manager.next(timeout=0.1):
  File "/ray/python/ray/air/execution/_internal/actor_manager.py", line 222, in next
    self._actor_state_events.resolve_future(future)
  File "/ray/python/ray/air/execution/_internal/event_manager.py", line 118, in resolve_future
    on_result(result)
  File "/ray/python/ray/air/execution/_internal/actor_manager.py", line 382, in on_actor_start
    tracked_actor=tracked_actor, future=future
  File "/ray/python/ray/air/execution/_internal/actor_manager.py", line 243, in _actor_start_resolved
    tracked_actor._on_start(tracked_actor)
  File "/ray/python/ray/tune/execution/tune_controller.py", line 719, in _actor_started
    self._unstage_trial_with_resources(trial)
  File "/ray/python/ray/tune/execution/tune_controller.py", line 647, in _unstage_trial_with_resources
    "Started a trial with resources requested by a different trial, but "
RuntimeError: Started a trial with resources requested by a different trial, but this trial was lost. This is an error in Ray Tune's execution logic. Please raise a GitHub issue at https://github.com/ray-project/ray/issues
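
For context on where this raises, here is a paraphrased sketch of _unstage_trial_with_resources, reconstructed from the stack trace and the discussion below (an approximation, not the verbatim Ray source): the method expects either the started trial itself, or another staged trial with the same resource request, to be present in _staged_trials, and raises otherwise.

    # Paraphrased sketch of TuneController._unstage_trial_with_resources,
    # reconstructed from the stack trace; not the verbatim Ray source.
    def _unstage_trial_with_resources(self, trial):
        # Case 1: the trial that just started was itself staged.
        if trial in self._staged_trials:
            self._staged_trials.remove(trial)
            return

        # Case 2: a different staged trial requested identical resources;
        # unstage that one instead.
        for staged_trial in self._staged_trials:
            if staged_trial.placement_group_factory == trial.placement_group_factory:
                self._staged_trials.remove(staged_trial)
                return

        # Neither case matched: the controller's bookkeeping is inconsistent.
        raise RuntimeError(
            "Started a trial with resources requested by a different trial, "
            "but this trial was lost. This is an error in Ray Tune's "
            "execution logic. Please raise a GitHub issue at "
            "https://github.com/ray-project/ray/issues"
        )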

Todo

  1. Try reproducing the error (unstaging the actor seems to be the issue). Not able to repro locally so far; see the hypothetical sketch below.
  2. Determine whether this is due to the new Tune execution backend. Is this a release blocker for 2.5?
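
For reference, a hypothetical repro attempt could look like the following sketch: the BOHB searcher/scheduler combination from the doc example, with actor reuse enabled (which routes trials through the cached-actor path discussed in the comments below). This configuration is illustrative only and has not been confirmed to trigger the failure.

    # Hypothetical repro sketch (not confirmed to trigger the failure).
    # TuneBOHB requires `pip install ConfigSpace hpbandster`.
    from ray import tune
    from ray.air import session
    from ray.tune.schedulers import HyperBandForBOHB
    from ray.tune.search.bohb import TuneBOHB

    def train_fn(config):
        # Dummy training loop; report a metric so BOHB can pause/promote trials.
        for step in range(100):
            session.report({"episode_reward_mean": config["width"] * step})

    tuner = tune.Tuner(
        train_fn,
        tune_config=tune.TuneConfig(
            metric="episode_reward_mean",
            mode="max",
            search_alg=TuneBOHB(),
            scheduler=HyperBandForBOHB(time_attr="training_iteration", max_t=100),
            num_samples=10,
            # Actor caching is what routes trials through the cached-actor path.
            reuse_actors=True,
        ),
        param_space={"width": tune.uniform(0, 1)},
    )
    results = tuner.fit()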

Oldest flakiness I found (as of when this issue was created):

....
Generated from flaky test tracker. Please do not edit the signature in this section.
DataCaseName-linux://doc/source/tune/examples:bohb_example-END
....

justinvyu added the bug, tune, and flaky-tracker labels on May 17, 2023
justinvyu (Contributor, Author) commented May 23, 2023

Initial investigation:

Still need to figure out how to reproduce this locally, but here are some questions:

  • If a trial gets started through this path:

    ###
    # 3: Start any trial that can be started with a cached actor
    if self._actor_cache.num_cached_objects:
        for resource in self._resources_to_pending_trials:
            if not self._resources_to_pending_trials[resource]:
                continue
            if not self._actor_cache.has_cached_object(resource):
                continue
            start_trial = self._resources_to_pending_trials[resource].pop()
            logger.debug(
                f"Trying to re-use actor for enqueued trial: {start_trial}"
            )
            if not self._maybe_reuse_cached_actor(start_trial):
                self._resources_to_pending_trials[resource].add(start_trial)

    • It will not be added to _staged_trials, but it will get an actor in _trial_to_actor.
    • If _staged_trials is empty, this leads to the runtime error, since _schedule_trial_reset -> _on_trial_reset -> _actor_started -> _unstage_trial_with_resources (the last call is in the stack trace above).

Would this happen every time we get to this 3rd case in _maybe_add_actors?

        Lastly, we see if we have cached actors that we can assign to a pending or
        paused trial. This can be the case when a trial has not been staged, yet,
        for instance because the number of staging trials was too large.
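
To make the suspected inconsistency concrete, here is a toy model (a hypothetical stand-in written for this issue, not Ray code; only the attribute names mirror TuneController):

    # Toy model of the suspected bookkeeping inconsistency.
    class ToyController:
        def __init__(self):
            self._staged_trials = set()  # trials staged while waiting for an actor
            self._trial_to_actor = {}    # trials that have been given an actor

        def start_via_staging(self, trial, actor):
            # Cases 1 and 2 of _maybe_add_actors: the trial is staged first.
            self._staged_trials.add(trial)
            self._trial_to_actor[trial] = actor

        def start_via_cached_actor(self, trial, actor):
            # Case 3: the trial gets a cached actor but is never staged.
            self._trial_to_actor[trial] = actor

        def unstage(self, trial):
            # Mirrors _unstage_trial_with_resources: assumes the trial (or a
            # peer with the same resources) was staged.
            if trial in self._staged_trials:
                self._staged_trials.remove(trial)
                return
            raise RuntimeError(
                "Started a trial with resources requested by a different "
                "trial, but this trial was lost."
            )

    controller = ToyController()
    controller.start_via_cached_actor("trial_a", "cached_actor")
    controller.unstage("trial_a")  # raises: trial_a was never in _staged_trials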

krfricke (Contributor) commented
I think this is the right track!

So we

  • have cached actor(s)
  • have a trial that is PENDING but not the next trial selected by the scheduler (another one comes first)
  • thus start the trial with actor reuse because of the logic in the third block.

When the reset is resolved, it triggers the unstage and that fails.

Notice that in both other cases (1 and 2) we add the trials to self._staged_trials. I believe the correct fix is to do this here, too.

It might be that this only comes up if we have more than 2 cached actors. Generally I think the main reason we didn't notice this so far is that the situation is relatively unique, maybe even specific to BOHB.
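
A sketch of what that fix could look like in the third block (assuming _staged_trials is a set; the actual change that landed in #36951 may differ):

    # Sketch of the suggested fix: track trials started via cached-actor
    # reuse in _staged_trials, mirroring cases 1 and 2.
    if self._actor_cache.num_cached_objects:
        for resource in self._resources_to_pending_trials:
            if not self._resources_to_pending_trials[resource]:
                continue
            if not self._actor_cache.has_cached_object(resource):
                continue
            start_trial = self._resources_to_pending_trials[resource].pop()
            logger.debug(
                f"Trying to re-use actor for enqueued trial: {start_trial}"
            )
            if not self._maybe_reuse_cached_actor(start_trial):
                self._resources_to_pending_trials[resource].add(start_trial)
            else:
                # New: remember the trial so a later
                # _unstage_trial_with_resources call can find it.
                self._staged_trials.add(start_trial)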
