[Tune] [PBT] Maintain consistent Trial/TrialRunner state when pausing and resuming trial #28511
… pause Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
     def continue_training(self, trial: Trial) -> None:
         """Continues the training of this trial."""
         self._train(trial)

-    def pause_trial(self, trial: Trial) -> None:
+    def pause_trial(self, trial: Trial, should_checkpoint: bool = True) -> None:
Should I avoid having this be configurable? I think a lot of other places in the code depend on a paused trial always having an in-memory checkpoint. This one case of PBT will never use the trial's own in-memory checkpoint, which is why I would want this as False.
I think this is fine - it's the main reason why we call stop and then set_status in PBT at the moment, and I think this is cleaner.
Could we add some documentation here?
…h_pbt_hanging_trial_fix
LGTM!
if trial.trial_id not in self._cached_trial_decisions:
    final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
    if final_decision:
        self._execute_action(trial, final_decision)
I actually think we can get rid of line 980 altogether (previously it was always True)
- if trial.trial_id not in self._cached_trial_decisions:
-     final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
-     if final_decision:
-         self._execute_action(trial, final_decision)
+ final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
+ if final_decision:
+     self._execute_action(trial, final_decision)
@@ -264,6 +267,102 @@ def testSynchPassLast(self):
             )
         )

+    def testExploitWhileSavingTrial(self):
Thank you for adding this test case!
Thank you Justin!
Awesome!
Why are these changes needed?
The problem
Trial A is RUNNING but hanging forever without ever performing another train step, and trial B is PAUSED waiting for A to reach the specified perturbation_interval.
Why does this happen?
- TrialExecutor.stop_trial(trial) is called within PBT._exploit before one of the other trials is finished saving (trial.is_saving is still True, and there is a decision in TrialRunner._cached_trial_decisions associated with this trial).
- TrialExecutor.stop_trial() will clear all the futures that were to be handled by the trial (including TrialRunner._process_trial_save on the SAVING_RESULT event, which is the event that clears trial.saving_to and pops from TrialRunner._cached_trial_decisions).
- This causes trial.saving_to to never be cleared, and trial.is_saving will remain True.
- on_pg_ready is handled when the trial starts again (resuming from checkpoint).
- The trial then goes through the trial.is_saving code path, which only adds the decision to the cache (without a SAVING_RESULT to move it to the decision queue) → trial is hanging forever.
- TrialRunner._post_process_on_training_saving_result will not do anything, since it checks that the trial is not in the TrialRunner._cached_trial_decisions.
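The sequence above can be sketched with a toy model. This is only an illustration: ToyTrial, ToyRunner, and all of their attributes and methods are hypothetical stand-ins invented here, not Ray Tune's actual Trial or TrialRunner classes. It assumes that a decision is cached while a save is in flight, and that only the save result moves it to the queue.

```python
from typing import Dict, Optional, Set


class ToyTrial:
    """Hypothetical stand-in for a trial; not Ray Tune's Trial class."""

    def __init__(self, trial_id: str) -> None:
        self.trial_id = trial_id
        self.saving_to: Optional[str] = None  # set while a save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


class ToyRunner:
    """Hypothetical stand-in for the runner's decision bookkeeping."""

    def __init__(self) -> None:
        self.cached: Dict[str, str] = {}  # decisions waiting for a save to finish
        self.queued: Dict[str, str] = {}  # decisions ready to be executed
        self.pending_saves: Set[str] = set()  # trials with an in-flight save future

    def start_save(self, trial: ToyTrial) -> None:
        trial.saving_to = "in-flight-checkpoint"
        self.pending_saves.add(trial.trial_id)

    def on_training_result(self, trial: ToyTrial, decision: str) -> None:
        # While a save is in flight, the decision is only cached.
        if trial.is_saving:
            self.cached[trial.trial_id] = decision
        else:
            self.queued[trial.trial_id] = decision

    def on_saving_result(self, trial: ToyTrial) -> None:
        # The save result clears saving_to and promotes the cached decision.
        self.pending_saves.discard(trial.trial_id)
        trial.saving_to = None
        decision = self.cached.pop(trial.trial_id, None)
        if decision:
            self.queued[trial.trial_id] = decision

    def stop_trial(self, trial: ToyTrial) -> None:
        # Stopping drops the in-flight future but leaves trial state untouched.
        self.pending_saves.discard(trial.trial_id)


runner = ToyRunner()
trial_a = ToyTrial("A")
runner.start_save(trial_a)
runner.on_training_result(trial_a, "CONTINUE")  # cached, waiting for the save
runner.stop_trial(trial_a)  # the save result will now never arrive
runner.on_training_result(trial_a, "CONTINUE")  # after resume: cached again

stuck = (
    trial_a.is_saving
    and trial_a.trial_id in runner.cached
    and trial_a.trial_id not in runner.pending_saves
    and trial_a.trial_id not in runner.queued
)
print(stuck)  # True
```

In this toy model nothing can ever call on_saving_result for trial A, so its decision stays in the cache and is never executed.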
Fix in the PR
The fix keeps the following state consistent: Trial.saving_to / Trial.is_saving and TrialRunner._cached_trial_decisions. These are now reset for the trial upon pausing.
Testing
The added test reproduces the issue on master, which is fixed with the PR. The test artificially creates the scenario by having one trial's checkpointing take a long time (5s), while PBT tries to pause that trial to exploit the other one.
Future TODOs
PBT directly calling trial_runner.pause_trial(trial) is not ideal to begin with, and it's the cause of this issue in the first place.
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
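As a rough illustration of the idea under "Fix in the PR" (resetting save-related state when a trial is paused), here is a minimal sketch. ToyTrial, pause_and_reset, and route_decision are hypothetical names invented for this example and do not mirror Ray Tune's real signatures. It assumes a decision is cached while the trial is saving and queued otherwise.

```python
from typing import Dict, Optional


class ToyTrial:
    """Hypothetical stand-in for a trial; not Ray Tune's Trial class."""

    def __init__(self, trial_id: str) -> None:
        self.trial_id = trial_id
        self.saving_to: Optional[str] = None  # set while a save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


def pause_and_reset(trial: ToyTrial, cached: Dict[str, str]) -> None:
    """Hypothetical pause hook: drop state that no save event will clear anymore."""
    trial.saving_to = None
    cached.pop(trial.trial_id, None)


def route_decision(
    trial: ToyTrial, decision: str, cached: Dict[str, str], queued: Dict[str, str]
) -> None:
    """Cache the decision while a save is in flight, otherwise queue it."""
    if trial.is_saving:
        cached[trial.trial_id] = decision
    else:
        queued[trial.trial_id] = decision


cached: Dict[str, str] = {}
queued: Dict[str, str] = {}
trial_a = ToyTrial("A")

# The trial is mid-save with a cached decision when it gets paused.
trial_a.saving_to = "in-flight-checkpoint"
route_decision(trial_a, "CONTINUE", cached, queued)
pause_and_reset(trial_a, cached)

# After resuming, the next decision is queued instead of being stuck in the cache.
route_decision(trial_a, "CONTINUE", cached, queued)
print(trial_a.is_saving, cached, queued)
```

With the reset in place, the resumed trial no longer takes the is_saving path, so its next decision reaches the queue.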