
[tune] Only keep cached actors if search has not ended #31974

Merged

Conversation

krfricke
Contributor

Signed-off-by: Kai Fricke kai@anyscale.com

Why are these changes needed?

We currently keep one actor cached if no other trials have been staged, to prevent us from removing actors that we may need in one of the next iterations. However, when the search algorithm won't produce any new trials, this actor is kept needlessly. That keeps resources occupied and prevents downscaling, as caught in #31883.

This PR passes another flag to the trial executor cleanup method indicating whether new trials are to be expected. If not, we can clean up unneeded actors even if no further trials are staged.
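
To make the intent concrete, here is a minimal sketch of that cleanup rule. The names (`CachedActorPool`, `_cleanup_cached_actors`, `search_ended`) are hypothetical and the handling of staged trials is omitted; this is not the actual `RayTrialExecutor` API.

```python
# Hypothetical sketch only: illustrates "keep one spare actor while the
# search can still produce trials", not the real RayTrialExecutor code.

class CachedActorPool:
    def __init__(self):
        self._cached_actors = []  # actors kept around for potential reuse

    def _cleanup_cached_actors(self, search_ended: bool) -> None:
        # While the search can still produce trials, keep one spare actor
        # cached for reuse; once the search has ended, release everything
        # so the cluster can scale down.
        keep = 0 if search_ended else 1
        while len(self._cached_actors) > keep:
            self._release(self._cached_actors.pop())

    def _release(self, actor) -> None:
        # Placeholder for returning the actor's resources to the cluster.
        ...
```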

This behavior is tested in the cluster_tune_scale_up_down release test. We won't add a unit test for this, as it would effectively mimic the release test. Instead, we can add proper actor reuse testing once we've made more progress with the execution refactor.

Related issue number

Closes #31883

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Kai Fricke <kai@anyscale.com>
@xwjiang2010
Contributor

Thanks!
The test is written very well to catch it!
Since we are at it, do you mind adding a small description at the top of tune_scale_up_down.py, like the ones we have in the tune scalability envelope tests?

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke
Contributor Author

Updated the description of the test!

Signed-off-by: Kai Fricke <kai@anyscale.com>
@krfricke
Contributor Author

Release test passed: https://buildkite.com/ray-project/release-tests-pr/builds/26587

Waiting for rest of CI to pass before merging.

@krfricke krfricke merged commit 6bcaa9c into ray-project:master Jan 27, 2023
@krfricke krfricke deleted the tune/cleanup-actors-on-search-end branch January 27, 2023 01:02
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
krfricke added a commit that referenced this pull request Mar 22, 2023
A similar regression to the one originally fixed in #31974 re-surfaced after #33045.

With actor re-use, we speculatively keep one cached actor around in case it is needed when new trials are added (e.g. if we add trials one-by-one). However, we should only do this while the search has not ended, as otherwise we keep an extra actor until the end of the experiment, wasting resources. This caused the tune_scale_up_down release test to fail.

In our refactor in #33045 we generalized object caching, but the eviction logic explicitly keeps one object in cache. Our previous implementation in RayTrialExecutor relied on not calling the eviction function at all while trials were still coming up, essentially never adjusting the number of cached actors down. This is rarely a problem in practice, but it does not make for a clean contract when separated out into a component.

In this PR, we change the logic as follows: when the search has ended, no trials are pending execution, and we don't want to explicitly cache an actor, we force eviction of all cached objects.
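
As a rough illustration of that condition (a sketch only; `ObjectCache`, `evict`, and `force_all` are assumed names, not the actual interface from the #33045 refactor):

```python
# Illustrative sketch of the eviction rule described above. The class and
# method names here are assumptions, not the real Ray Tune component.

class ObjectCache:
    def __init__(self, max_cached: int = 1):
        self.max_cached = max_cached  # normally keep one spare cached object
        self._cached = []

    def evict(self, force_all: bool = False) -> None:
        # Evict cached objects down to the allowed limit. With
        # force_all=True the limit is ignored and everything is evicted,
        # which is what we want once the search has ended.
        keep = 0 if force_all else self.max_cached
        while len(self._cached) > keep:
            self._cached.pop()


def maybe_evict(cache: ObjectCache, search_ended: bool, pending_trials: list) -> None:
    # Force full eviction only when no new work can arrive: the search has
    # ended and nothing is pending execution.
    cache.evict(force_all=search_ended and not pending_trials)
```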

Since we are refactoring our execution backend, I believe it's sufficient to rely on the release test to catch this regression. In the new backend we can add lightweight unit tests to capture this behavior.

Signed-off-by: Kai Fricke <kai@anyscale.com>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023