
[Tune] [PBT] Maintain consistent Trial/TrialRunner state when pausing and resuming trial #28511

Merged
merged 12 commits on Sep 22, 2022

Conversation

justinvyu (Contributor) commented on Sep 14, 2022

Why are these changes needed?

The problem

  • When running synchronous PBT with checkpointing on every perturbation, the experiment can reach a state where trial A is RUNNING but hangs forever without ever performing another training step, while trial B is PAUSED, waiting for A to reach the specified perturbation_interval.

Why does this happen?

  1. Synchronous PBT waits for the last trial to report its result before performing exploit/explore for all trials.
  2. PBT can call TrialExecutor.stop_trial(trial) within PBT._exploit before one of the other trials has finished saving (trial.is_saving is still True, and a decision for this trial is still sitting in TrialRunner._cached_trial_decisions).
  3. TrialExecutor.stop_trial() clears all futures that were to be handled for the trial, including the SAVING_RESULT event whose handler (TrialRunner._process_trial_save) is what clears trial.saving_to and pops the entry from TrialRunner._cached_trial_decisions.
  4. As a result, trial.saving_to is never cleared and trial.is_saving stays True.
  5. When the trial is started again (resuming from checkpoint), another training result event comes in via on_pg_ready.
    • The train → process_trial_result path takes the trial.is_saving branch, which only adds the decision to the cache; with no SAVING_RESULT event to move it to the decision queue, the trial hangs forever.
    • TrialRunner._post_process_on_training_saving_result does nothing, since it requires that the trial is not in TrialRunner._cached_trial_decisions.
      • No actions are ever executed (a condensed toy sketch of this stuck state follows below).
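
To make the stuck state concrete, here is a condensed, runnable toy model of the bookkeeping described above. The classes are illustrative stand-ins, not Ray Tune's real Trial/TrialRunner.

from typing import Dict, List, Optional


class ToyTrial:
    def __init__(self, trial_id: str) -> None:
        self.trial_id = trial_id
        self.saving_to: Optional[str] = None  # set while a checkpoint save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


class ToyRunner:
    def __init__(self) -> None:
        self.cached_decisions: Dict[str, str] = {}  # parked until SAVING_RESULT arrives
        self.queued_decisions: Dict[str, str] = {}  # ready to be executed
        self.executed: List[str] = []

    def process_trial_result(self, trial: ToyTrial, decision: str) -> None:
        if trial.is_saving:
            # Parked: only a SAVING_RESULT event moves it on, but that event was
            # dropped when the trial's futures were cleared on stop.
            self.cached_decisions[trial.trial_id] = decision
        else:
            self.queued_decisions[trial.trial_id] = decision

    def post_process_on_training_saving_result(self, trial: ToyTrial) -> None:
        if trial.trial_id not in self.cached_decisions:
            decision = self.queued_decisions.pop(trial.trial_id, None)
            if decision:
                self.executed.append(decision)


runner, trial_a = ToyRunner(), ToyTrial("A")
trial_a.saving_to = "/tmp/ckpt"  # a save was in flight when PBT paused the trial
runner.process_trial_result(trial_a, "CONTINUE")  # first result after resuming
runner.post_process_on_training_saving_result(trial_a)
assert runner.executed == []  # nothing ever runs: trial A hangs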

Fix in the PR

  • The main culprits are the inconsistent Trial.saving_to/Trial.is_saving state and the stale entry in TrialRunner._cached_trial_decisions. These are now reset for the trial upon pausing (a minimal sketch of the idea follows below).
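
A minimal, standalone sketch of the idea behind the fix, again using illustrative stand-ins rather than Ray Tune's real classes: pausing clears the in-flight save marker and drops any cached decision for the trial.

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyTrial:
    trial_id: str
    saving_to: Optional[str] = None  # non-None while a checkpoint save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


@dataclass
class ToyRunner:
    cached_trial_decisions: Dict[str, str] = field(default_factory=dict)

    def pause_trial(self, trial: ToyTrial) -> None:
        # The essence of the fix: pausing discards any in-flight save state so the
        # trial cannot resume with is_saving == True and a stale cached decision.
        trial.saving_to = None
        self.cached_trial_decisions.pop(trial.trial_id, None)


runner = ToyRunner(cached_trial_decisions={"A": "CONTINUE"})
trial = ToyTrial(trial_id="A", saving_to="/tmp/ckpt_000005")
runner.pause_trial(trial)
assert not trial.is_saving and "A" not in runner.cached_trial_decisions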

Testing

  • This PR includes a test that reproduces this failure mode on current master and passes with the fix. The test artificially creates the scenario by making one trial's checkpointing take a long time (5 seconds) while PBT tries to pause that trial in order to exploit the other one (a rough user-level sketch of the setup follows below).
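
For orientation, a rough user-level sketch of the kind of setup that can trigger the race, using Tune's public APIs (around Ray 2.0). The PR's actual test lives in Tune's scheduler unit tests and drives the internals directly, so treat the class and arguments below as an approximation, not the test itself.

import time

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class SlowSaveTrainable(tune.Trainable):
    def setup(self, config):
        self.score = 0

    def step(self):
        self.score += 1
        return {"mean_accuracy": self.score}

    def save_checkpoint(self, checkpoint_dir):
        # Artificially slow checkpoint: widens the window in which synchronous
        # PBT tries to pause this trial while its save is still in flight.
        time.sleep(5)
        return {"score": self.score}

    def load_checkpoint(self, checkpoint):
        self.score = checkpoint["score"]


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    perturbation_interval=1,
    synch=True,  # synchronous PBT, as in the failure description above
    hyperparam_mutations={"lr": [0.01, 0.001]},
)

tune.run(
    SlowSaveTrainable,
    config={"lr": 0.01},
    num_samples=2,
    scheduler=pbt,
    checkpoint_freq=1,  # checkpoint at every perturbation interval
    stop={"training_iteration": 4},
)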

Future TODOs

  • PBT directly calling trial_runner.pause_trial(trial) is not ideal to begin with, and it is the root cause of this issue.
  • Refactor this in the future to clearly separate responsibilities between the scheduler and the trial runner/executor.
  • Make sure that experiment restore works when PBT pauses trials while other trials are checkpointing.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

… pause

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@justinvyu added the labels "bug" (Something that is supposed to be working; but isn't) and "tune" (Tune-related issues) on Sep 14, 2022
@justinvyu self-assigned this on Sep 14, 2022

    def continue_training(self, trial: Trial) -> None:
        """Continues the training of this trial."""
        self._train(trial)

-   def pause_trial(self, trial: Trial) -> None:
+   def pause_trial(self, trial: Trial, should_checkpoint: bool = True) -> None:
justinvyu (Contributor, Author) commented:
Should I avoid having this be configurable? I think a lot of other places in the code depend on a paused trial always having an in-memory checkpoint. This one case of PBT will never use the trial's own in-memory checkpoint, which is why I would want this as False.

Contributor commented:
I think this is fine - it's the main reason why we call stop and then set_status in PBT at the moment, and I think this is cleaner.

Contributor commented:
Could we add some documentation here?
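
For illustration, a standalone stub of what such documentation could say. This is a sketch, not the docstring that landed in the PR, and the class below is a stand-in rather than Ray's TrialRunner.

class _ExampleTrialRunner:
    """Stand-in class used only to host the illustrative docstring."""

    def pause_trial(self, trial, should_checkpoint: bool = True) -> None:
        """Pause the given trial.

        By default the trial takes an in-memory checkpoint so it can later be
        resumed from its own state. Callers that restore the trial from a
        different checkpoint anyway (e.g. PBT exploiting another trial) can
        pass should_checkpoint=False to skip that checkpoint.
        """
        raise NotImplementedError("documentation sketch only")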

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@krfricke (Contributor) left a comment:

LGTM!


Comment on lines 980 to 983
        if trial.trial_id not in self._cached_trial_decisions:
            final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
            if final_decision:
                self._execute_action(trial, final_decision)
Contributor commented:
I actually think we can get rid of line 980 altogether (previously it was always True).

Suggested change:
-        if trial.trial_id not in self._cached_trial_decisions:
-            final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
-            if final_decision:
-                self._execute_action(trial, final_decision)
+        final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
+        if final_decision:
+            self._execute_action(trial, final_decision)

@@ -264,6 +267,102 @@ def testSynchPassLast(self):
)
)

def testExploitWhileSavingTrial(self):
Contributor commented:
Thank you for adding this test case!

@xwjiang2010 (Contributor) left a comment:

Thank you Justin!

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@krfricke (Contributor) left a comment:

Awesome!
