
[tune] Only keep cached actors if search has not ended #31974

Merged

Conversation

krfricke
Contributor

Signed-off-by: Kai Fricke kai@anyscale.com

Why are these changes needed?

We currently keep one actor cached if no other trials have been staged, to prevent us from removing actors that we may need in one of the next iterations. However, when the search algorithm won't produce any new trials, this actor is kept needlessly. That keeps resources occupied and prevents downscaling, as caught in #31883.

This PR passes another flag to the trial executor cleanup method indicating whether new trials are to be expected. If not, we can clean up unneeded actors even if no further trials are staged.
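
To make the intent concrete, here is a minimal sketch of that cleanup rule. The names (`CachedActorPool`, `_cleanup_cached_actors`, `search_ended`) are hypothetical and the handling of staged trials is omitted; this is not the actual `RayTrialExecutor` API.

```python
# Hypothetical sketch only: illustrates "keep one spare actor while the
# search can still produce trials", not the real RayTrialExecutor code.

class CachedActorPool:
    def __init__(self):
        self._cached_actors = []  # actors kept around for potential reuse

    def _cleanup_cached_actors(self, search_ended: bool) -> None:
        # While the search can still produce trials, keep one spare actor
        # cached for reuse; once the search has ended, release everything
        # so the cluster can scale down.
        keep = 0 if search_ended else 1
        while len(self._cached_actors) > keep:
            self._release(self._cached_actors.pop())

    def _release(self, actor) -> None:
        # Placeholder for returning the actor's resources to the cluster.
        ...
```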

This behavior is tested in the cluster_tune_scale_up_down release test. We won't add a unit test for this, as it would effectively mimic the release test. Instead, we can add proper actor reuse testing once we've made more progress with the execution refactor.

Related issue number

Closes #31883

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>
@xwjiang2010
Contributor

Thanks!
The test is written very well to catch it!
Since we are at it, do you mind adding a small description at the top of tune_scale_up_down.py, like the ones we have in the tune scalability envelope tests?

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke
Contributor Author

Updated the description of the test!

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke
Contributor Author

Release test passed: https://buildkite.com/ray-project/release-tests-pr/builds/26587

Waiting for rest of CI to pass before merging.

@krfricke krfricke merged commit 6bcaa9c into ray-project:master Jan 27, 2023
@krfricke krfricke deleted the tune/cleanup-actors-on-search-end branch January 27, 2023 01:02
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
krfricke added a commit that referenced this pull request Mar 22, 2023
A similar regression to the one originally fixed in #31974 re-surfaced after #33045.

With actor re-use, we speculatively keep one cached actor around in case it is needed when new trials are added (e.g. if we add trials one-by-one). However, we should only do this while the search has not ended, as otherwise we keep an extra actor until the end of the experiment, wasting resources. This caused the tune_scale_up_down release test to fail.

In our refactor in #33045 we generalized object caching, but the eviction logic explicitly keeps one object in cache. Our previous implementation in RayTrialExecutor relied on not calling the eviction function at all while trials were still coming up, essentially never adjusting the number of cached actors down. This is rarely a problem in practice, but it does not make for a clean contract when separated out into a component.

In this PR, we change the logic as follows: when the search has ended, no trials are pending execution, and we don't want to explicitly cache an actor, we force eviction of all cached objects.
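
As a rough illustration of that condition (a sketch only; `ObjectCache`, `evict`, and `force_all` are assumed names, not the actual interface from the #33045 refactor):

```python
# Illustrative sketch of the eviction rule described above. The class and
# method names here are assumptions, not the real Ray Tune component.

class ObjectCache:
    def __init__(self, max_cached: int = 1):
        self.max_cached = max_cached  # normally keep one spare cached object
        self._cached = []

    def evict(self, force_all: bool = False) -> None:
        # Evict cached objects down to the allowed limit. With
        # force_all=True the limit is ignored and everything is evicted,
        # which is what we want once the search has ended.
        keep = 0 if force_all else self.max_cached
        while len(self._cached) > keep:
            self._cached.pop()


def maybe_evict(cache: ObjectCache, search_ended: bool, pending_trials: list) -> None:
    # Force full eviction only when no new work can arrive: the search has
    # ended and nothing is pending execution.
    cache.evict(force_all=search_ended and not pending_trials)
```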

Since we are refactoring our execution backend, I believe it's sufficient to rely on the release test to catch this regression. In the new backend we can add lightweight unit tests to capture this behavior.

Signed-off-by: Kai Fricke <kai@anyscale.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023