
[Tune] [PBT] Maintain consistent Trial/TrialRunner state when pausing and resuming trial #28511

Merged
merged 12 commits on Sep 22, 2022

Conversation

justinvyu (Contributor) commented on Sep 14, 2022

Why are these changes needed?

The problem

  • When running synchronous PBT with checkpointing on every perturbation, the experiment can reach a state where trial A is RUNNING but hangs forever without ever performing another training step, while trial B is PAUSED, waiting for A to reach the specified perturbation_interval.

Why does this happen?

  1. Synchronous PBT waits for the last trial to report its result before performing exploit/explore for all trials.
  2. PBT can call TrialExecutor.stop_trial(trial) within PBT._exploit before one of the other trials has finished saving (trial.is_saving is still True, and a decision for this trial is still sitting in TrialRunner._cached_trial_decisions).
  3. TrialExecutor.stop_trial() clears all futures that were to be handled for the trial, including the SAVING_RESULT event whose handler (TrialRunner._process_trial_save) is what clears trial.saving_to and pops the entry from TrialRunner._cached_trial_decisions.
  4. As a result, trial.saving_to is never cleared and trial.is_saving stays True.
  5. When the trial is started again (resuming from checkpoint), another training result event comes in via on_pg_ready.
    • The train → process_trial_result path takes the trial.is_saving branch, which only adds the decision to the cache; with no SAVING_RESULT event to move it to the decision queue, the trial hangs forever.
    • TrialRunner._post_process_on_training_saving_result does nothing, since it requires that the trial is not in TrialRunner._cached_trial_decisions.
      • No actions are ever executed (a condensed toy sketch of this stuck state follows below).
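
To make the stuck state concrete, here is a condensed, runnable toy model of the bookkeeping described above. The classes are illustrative stand-ins, not Ray Tune's real Trial/TrialRunner.

from typing import Dict, List, Optional


class ToyTrial:
    def __init__(self, trial_id: str) -> None:
        self.trial_id = trial_id
        self.saving_to: Optional[str] = None  # set while a checkpoint save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


class ToyRunner:
    def __init__(self) -> None:
        self.cached_decisions: Dict[str, str] = {}  # parked until SAVING_RESULT arrives
        self.queued_decisions: Dict[str, str] = {}  # ready to be executed
        self.executed: List[str] = []

    def process_trial_result(self, trial: ToyTrial, decision: str) -> None:
        if trial.is_saving:
            # Parked: only a SAVING_RESULT event moves it on, but that event was
            # dropped when the trial's futures were cleared on stop.
            self.cached_decisions[trial.trial_id] = decision
        else:
            self.queued_decisions[trial.trial_id] = decision

    def post_process_on_training_saving_result(self, trial: ToyTrial) -> None:
        if trial.trial_id not in self.cached_decisions:
            decision = self.queued_decisions.pop(trial.trial_id, None)
            if decision:
                self.executed.append(decision)


runner, trial_a = ToyRunner(), ToyTrial("A")
trial_a.saving_to = "/tmp/ckpt"  # a save was in flight when PBT paused the trial
runner.process_trial_result(trial_a, "CONTINUE")  # first result after resuming
runner.post_process_on_training_saving_result(trial_a)
assert runner.executed == []  # nothing ever runs: trial A hangs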

Fix in the PR

  • The main culprits are the inconsistent Trial.saving_to/Trial.is_saving state and the stale entry in TrialRunner._cached_trial_decisions. These are now reset for the trial upon pausing (a minimal sketch of the idea follows below).
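
A minimal, standalone sketch of the idea behind the fix, again using illustrative stand-ins rather than Ray Tune's real classes: pausing clears the in-flight save marker and drops any cached decision for the trial.

from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyTrial:
    trial_id: str
    saving_to: Optional[str] = None  # non-None while a checkpoint save is in flight

    @property
    def is_saving(self) -> bool:
        return self.saving_to is not None


@dataclass
class ToyRunner:
    cached_trial_decisions: Dict[str, str] = field(default_factory=dict)

    def pause_trial(self, trial: ToyTrial) -> None:
        # The essence of the fix: pausing discards any in-flight save state so the
        # trial cannot resume with is_saving == True and a stale cached decision.
        trial.saving_to = None
        self.cached_trial_decisions.pop(trial.trial_id, None)


runner = ToyRunner(cached_trial_decisions={"A": "CONTINUE"})
trial = ToyTrial(trial_id="A", saving_to="/tmp/ckpt_000005")
runner.pause_trial(trial)
assert not trial.is_saving and "A" not in runner.cached_trial_decisions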

Testing

  • This PR includes a test that reproduces this failure mode on current master and passes with the fix. The test artificially creates the scenario by making one trial's checkpointing take a long time (5 seconds) while PBT tries to pause that trial in order to exploit the other one (a rough user-level sketch of the setup follows below).
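
For orientation, a rough user-level sketch of the kind of setup that can trigger the race, using Tune's public APIs (around Ray 2.0). The PR's actual test lives in Tune's scheduler unit tests and drives the internals directly, so treat the class and arguments below as an approximation, not the test itself.

import time

from ray import tune
from ray.tune.schedulers import PopulationBasedTraining


class SlowSaveTrainable(tune.Trainable):
    def setup(self, config):
        self.score = 0

    def step(self):
        self.score += 1
        return {"mean_accuracy": self.score}

    def save_checkpoint(self, checkpoint_dir):
        # Artificially slow checkpoint: widens the window in which synchronous
        # PBT tries to pause this trial while its save is still in flight.
        time.sleep(5)
        return {"score": self.score}

    def load_checkpoint(self, checkpoint):
        self.score = checkpoint["score"]


pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="mean_accuracy",
    mode="max",
    perturbation_interval=1,
    synch=True,  # synchronous PBT, as in the failure description above
    hyperparam_mutations={"lr": [0.01, 0.001]},
)

tune.run(
    SlowSaveTrainable,
    config={"lr": 0.01},
    num_samples=2,
    scheduler=pbt,
    checkpoint_freq=1,  # checkpoint at every perturbation interval
    stop={"training_iteration": 4},
)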

Future TODOs

  • PBT directly calling trial_runner.pause_trial(trial) is not ideal to begin with, and it is the root cause of this issue.
  • Refactor this in the future to clearly separate responsibilities between the scheduler and the trial runner/executor.
  • Make sure that experiment restore works when PBT pauses trials while other trials are checkpointing.

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

… pause

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@justinvyu added the labels "bug" (Something that is supposed to be working; but isn't) and "tune" (Tune-related issues) on Sep 14, 2022
@justinvyu self-assigned this on Sep 14, 2022

    def continue_training(self, trial: Trial) -> None:
        """Continues the training of this trial."""
        self._train(trial)

-   def pause_trial(self, trial: Trial) -> None:
+   def pause_trial(self, trial: Trial, should_checkpoint: bool = True) -> None:
justinvyu (Contributor, Author) commented:
Should I avoid having this be configurable? I think a lot of other places in the code depend on a paused trial always having an in-memory checkpoint. This one case of PBT will never use the trial's own in-memory checkpoint, which is why I would want this as False.

Contributor commented:
I think this is fine - it's the main reason why we call stop and then set_status in PBT at the moment, and I think this is cleaner.

Contributor commented:
Could we add some documentation here?
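
For illustration, a standalone stub of what such documentation could say. This is a sketch, not the docstring that landed in the PR, and the class below is a stand-in rather than Ray's TrialRunner.

class _ExampleTrialRunner:
    """Stand-in class used only to host the illustrative docstring."""

    def pause_trial(self, trial, should_checkpoint: bool = True) -> None:
        """Pause the given trial.

        By default the trial takes an in-memory checkpoint so it can later be
        resumed from its own state. Callers that restore the trial from a
        different checkpoint anyway (e.g. PBT exploiting another trial) can
        pass should_checkpoint=False to skip that checkpoint.
        """
        raise NotImplementedError("documentation sketch only")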

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@krfricke (Contributor) left a comment:

LGTM!


Comment on lines 980 to 983
        if trial.trial_id not in self._cached_trial_decisions:
            final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
            if final_decision:
                self._execute_action(trial, final_decision)
Contributor commented:
I actually think we can get rid of line 980 altogether (previously it was always True).

Suggested change:
-        if trial.trial_id not in self._cached_trial_decisions:
-            final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
-            if final_decision:
-                self._execute_action(trial, final_decision)
+        final_decision = self._queued_trial_decisions.pop(trial.trial_id, None)
+        if final_decision:
+            self._execute_action(trial, final_decision)

@@ -264,6 +267,102 @@ def testSynchPassLast(self):
)
)

def testExploitWhileSavingTrial(self):
Contributor commented:
Thank you for adding this test case!

@xwjiang2010 (Contributor) left a comment:

Thank you Justin!

Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
@krfricke (Contributor) left a comment:

Awesome!
