-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Tune] bohb_example
patch fix
#36192
Conversation
if t.status == Trial.PAUSED: | ||
trial_runner.stop_trial(t) | ||
elif t.status == Trial.RUNNING: | ||
elif t.status in Trial.RUNNING: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be
elif t.status in Trial.RUNNING: | |
elif t.status in {Trial.RUNNING, Trial.PENDING}: |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially did this, but it started giving some errors on the execution side.
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/execution/trial_runner.py", line 791, in _process_trial_result
decision = self._scheduler_alg.on_trial_result(
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/schedulers/hb_bohb.py", line 88, in on_trial_result
bracket.update_trial_stats(trial, result)
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/schedulers/hyperband.py", line 470, in update_trial_stats
assert trial in self._live_trials
AssertionError
The trial is pending, gets cleaned up in the scheduler state, then starts running and returns a result. Then, the scheduler is confused because it already cleaned up the trial.
Basically, we should be calling stop_trial
for bad trials, but that didn't really work when I tried it -- the example started to hang after a certain number of trials (see comment above).
if t.status == Trial.PAUSED: | ||
trial_runner.stop_trial(t) | ||
elif t.status == Trial.RUNNING: | ||
elif t.status in Trial.RUNNING: | ||
bracket.cleanup_trial(t) | ||
action = TrialScheduler.STOP |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we stop_trial
on a paused trial? It's already stopped (doesn't have a live actor). Doesn't this just change the status from TERMINATED
to ERROR
?
Why do we not stop_trial
on a running or pending trial? Why do we cleanup the bracket instead? Doesn't this cause us to not track a running trial?
Instead, we return STOP
as the action to take if any of the bad trials is running, even though the returned action is not even for the running trial that we are inspecting in the loop. It's some arbitrary trial that got a result and started this process_bracket
logic.
if t.status == Trial.PAUSED: | ||
trial_runner.stop_trial(t) | ||
elif t.status == Trial.RUNNING: | ||
elif t.status in Trial.RUNNING: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially did this, but it started giving some errors on the execution side.
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/execution/trial_runner.py", line 748, in _process_trial_results
decision = self._process_trial_result(trial, result)
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/execution/trial_runner.py", line 791, in _process_trial_result
decision = self._scheduler_alg.on_trial_result(
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/schedulers/hb_bohb.py", line 88, in on_trial_result
bracket.update_trial_stats(trial, result)
File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/schedulers/hyperband.py", line 470, in update_trial_stats
assert trial in self._live_trials
AssertionError
The trial is pending, gets cleaned up in the scheduler state, then starts running and returns a result. Then, the scheduler is confused because it already cleaned up the trial.
Basically, we should be calling stop_trial
for bad trials, but that didn't really work when I tried it -- the example started to hang after a certain number of trials (see comment above).
Why are these changes needed?
The
bohb_example
is super flaky after #35338 modified the example to run more trials. This is just a patch fix to deflake the example. There is still some underlying root cause to investigate. The other BOHB example (the Jupyter notebook) is also flaky, but for a different reason. See #35428.In general, I unsure about the current hyperband implementation. See comments below for parts that I don't really understand.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.