[tune] Add timeout to retry_fn to catch hanging syncs #28155
Conversation
python/ray/tune/utils/util.py
Outdated
proc = threading.Thread(target=_retry_fn)
proc.daemon = True
proc.start()
proc.join(timeout=timeout)
now that you have a thread, imagine eventually we checkpoint on the side while the training just keeps going 🤯 😄
One nit: I also think the timeout should be per-retry (i.e., the effective total here would be num_retries * timeout). Otherwise the actual per-attempt timeout depends on how many retries you set here. Although, admittedly, num_retries is not even configurable.
Yeah, I thought about this, but I kept a global timeout because it is a) simpler/cleaner to implement and b) we basically want to define a maximum time we are willing to block training, so I think we should be fine with this. Let me know if you'd prefer per-retry.
Ah, actually, as discussed, let's have a timeout per retry. Otherwise, if the first sync hangs, we will never get to try again. Updated the PR.
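For illustration, here is a minimal sketch of what a per-retry timeout can look like. The helper name `retry_fn` and the parameters `num_retries`, `sleep_time`, and `timeout` are assumptions for this sketch; the merged code in `python/ray/tune/utils/util.py` may differ in detail.

```python
import threading
import time
from typing import Callable, Optional, Type


def retry_fn(
    fn: Callable[[], None],
    exception_type: Type[Exception] = Exception,
    num_retries: int = 3,
    sleep_time: float = 1.0,
    timeout: Optional[float] = None,
) -> bool:
    """Run ``fn`` up to ``num_retries`` times; abandon a hanging attempt
    after ``timeout`` seconds so the remaining retries still get a chance."""
    for _ in range(num_retries):
        success = threading.Event()

        def _try_once():
            try:
                fn()
                success.set()
            except exception_type:
                pass  # failed attempt; fall through to the next retry

        # Daemon thread: if the sync hangs past the timeout, we stop waiting
        # for it instead of blocking training indefinitely.
        proc = threading.Thread(target=_try_once)
        proc.daemon = True
        proc.start()
        proc.join(timeout=timeout)  # per-retry timeout, not a global one
        if success.is_set():
            return True
        time.sleep(sleep_time)
    return False
```

With this shape, a first attempt that hangs is abandoned after `timeout` seconds and the remaining retries still run, instead of one hanging attempt consuming the whole time budget.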
cool man. 2 nits, but let's give this a try.
feel free to merge after you address the minor comments.
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations. Signed-off-by: Kai Fricke <kai@anyscale.com>
#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Why are these changes needed?
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.
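As a purely hypothetical illustration of where this helps, a pyarrow-based upload can be wrapped in a timeout-enabled retry helper like the `retry_fn` sketch above. `upload_to_cloud`, the local path, and the bucket URI are made up for this example; they are not actual Ray Tune APIs.

```python
import pyarrow.fs


def upload_to_cloud(local_dir: str, uri: str) -> None:
    # The kind of pyarrow copy that has occasionally been observed to hang.
    fs, path = pyarrow.fs.FileSystem.from_uri(uri)
    pyarrow.fs.copy_files(local_dir, path, destination_filesystem=fs)


# Give each sync attempt at most 30 minutes instead of blocking training forever.
retry_fn(
    lambda: upload_to_cloud("/tmp/my_experiment", "s3://my-bucket/my_experiment"),
    num_retries=3,
    timeout=1800,
)
```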
Todo:
Related issue number
Closes #26802
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.