-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tune] Only keep cached actors if search has not ended #31974
Merged
krfricke
merged 3 commits into
ray-project:master
from
krfricke:tune/cleanup-actors-on-search-end
Jan 27, 2023
Merged
[tune] Only keep cached actors if search has not ended #31974
krfricke
merged 3 commits into
ray-project:master
from
krfricke:tune/cleanup-actors-on-search-end
Jan 27, 2023
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Kai Fricke <kai@anyscale.com>
Thanks! |
xwjiang2010
approved these changes
Jan 26, 2023
Signed-off-by: Kai Fricke <kai@anyscale.com>
Updated the description of the test! |
Signed-off-by: Kai Fricke <kai@anyscale.com>
Release test passed: https://buildkite.com/ray-project/release-tests-pr/builds/26587 Waiting for rest of CI to pass before merging. |
8 tasks
edoakes
pushed a commit
to edoakes/ray
that referenced
this pull request
Mar 22, 2023
…1974) We currently keep one actor cached if no other trials have been staged to prevent us removing actors when we may need them in one of the next iterations. However, when the search algorithm won't produce new trials anymore, we keep this actor needlessly. This can keep resources occupied needlessly and prevent downscaling, as caught in ray-project#31883. This PR passes another flag to the trial executor cleanup method indicating if new trials are to be expected. If not, we can cleanup unneeded actors even if no further trials are staged. This behavior is tested in the `cluster_tune_scale_up_down` release test. We won't add a unit test for this as this would effectively mimic the release test. Instead, we can add proper actor reuse testing once we made more progress with the execution refactor. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
krfricke
added a commit
that referenced
this pull request
Mar 22, 2023
Similar regression that was originally fixed in #31974, but re-surfaced after #33045. With actor re-use, we speculatively keep one cached actor around in case it is needed when new trials are added (e.g. if we add trials one-by-one). However, we should only do this when the search has not ended, as otherwise we keep an extra actor until the end of the experiment, wasting resources. This leads the tune_scale_up_down release test fail. In our refactor to #33045 we generalized object caching, but the eviction logic here explicitly keeps one object in cache. Our previous implementation in RayTrialExecutor relied on not calling the eviction function at all when trials were still coming up, essentially not adjusting the number of cached actors down. This is rarely a problem in practice, but does not make a clean contract when separated out into a component. In this PR, we change the logic as follows: When the search ended, no trials are pending execution, and we don't want to explicitly cache an actor, we force eviction of all cached objects. We are refactoring our execution backend, thus I believe it's sufficient to keep the release test to catch this regression. In the new backend we can add light weight unit tests to capture this behavior. Signed-off-by: Kai Fricke <kai@anyscale.com>
elliottower
pushed a commit
to elliottower/ray
that referenced
this pull request
Apr 22, 2023
…ject#33593) Similar regression that was originally fixed in ray-project#31974, but re-surfaced after ray-project#33045. With actor re-use, we speculatively keep one cached actor around in case it is needed when new trials are added (e.g. if we add trials one-by-one). However, we should only do this when the search has not ended, as otherwise we keep an extra actor until the end of the experiment, wasting resources. This leads the tune_scale_up_down release test fail. In our refactor to ray-project#33045 we generalized object caching, but the eviction logic here explicitly keeps one object in cache. Our previous implementation in RayTrialExecutor relied on not calling the eviction function at all when trials were still coming up, essentially not adjusting the number of cached actors down. This is rarely a problem in practice, but does not make a clean contract when separated out into a component. In this PR, we change the logic as follows: When the search ended, no trials are pending execution, and we don't want to explicitly cache an actor, we force eviction of all cached objects. We are refactoring our execution backend, thus I believe it's sufficient to keep the release test to catch this regression. In the new backend we can add light weight unit tests to capture this behavior. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: elliottower <elliot@elliottower.com>
ProjectsByJackHe
pushed a commit
to ProjectsByJackHe/ray
that referenced
this pull request
May 4, 2023
…ject#33593) Similar regression that was originally fixed in ray-project#31974, but re-surfaced after ray-project#33045. With actor re-use, we speculatively keep one cached actor around in case it is needed when new trials are added (e.g. if we add trials one-by-one). However, we should only do this when the search has not ended, as otherwise we keep an extra actor until the end of the experiment, wasting resources. This leads the tune_scale_up_down release test fail. In our refactor to ray-project#33045 we generalized object caching, but the eviction logic here explicitly keeps one object in cache. Our previous implementation in RayTrialExecutor relied on not calling the eviction function at all when trials were still coming up, essentially not adjusting the number of cached actors down. This is rarely a problem in practice, but does not make a clean contract when separated out into a component. In this PR, we change the logic as follows: When the search ended, no trials are pending execution, and we don't want to explicitly cache an actor, we force eviction of all cached objects. We are refactoring our execution backend, thus I believe it's sufficient to keep the release test to catch this regression. In the new backend we can add light weight unit tests to capture this behavior. Signed-off-by: Kai Fricke <kai@anyscale.com> Signed-off-by: Jack He <jackhe2345@gmail.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Signed-off-by: Kai Fricke kai@anyscale.com
Why are these changes needed?
We currently keep one actor cached if no other trials have been staged to prevent us removing actors when we may need them in one of the next iterations. However, when the search algorithm won't produce new trials anymore, we keep this actor needlessly. This can keep resources occupied needlessly and prevent downscaling, as caught in #31883.
This PR passes another flag to the trial executor cleanup method indicating if new trials are to be expected. If not, we can cleanup unneeded actors even if no further trials are staged.
This behavior is tested in the
cluster_tune_scale_up_down
release test. We won't add a unit test for this as this would effectively mimic the release test. Instead, we can add proper actor reuse testing once we made more progress with the execution refactor.Related issue number
Closes #31883
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.