[tune] Add timeout to retry_fn to catch hanging syncs #28155
Conversation
python/ray/tune/utils/util.py
Outdated
proc = threading.Thread(target=_retry_fn)
proc.daemon = True
proc.start()
proc.join(timeout=timeout)
now that you have a thread, imagine eventually we checkpoint on the side while the training just keeps going 🤯 😄
One nit: I also think the timeout should be per-retry (i.e., the effective total here would be num_retries * timeout). Otherwise the actual per-attempt timeout depends on how many retries you set here. Although, admittedly, num_retries is not even configurable.
Yeah, I thought about this, but I kept a global timeout because it is a) simpler/cleaner to implement and b) we basically want to define a maximum time we are willing to block training, so I think we should be fine with this. Let me know if you'd prefer per-retry.
Ah, actually, as discussed, let's have a timeout per retry. Otherwise, if the first sync hangs, we will never get to try again. Updated the PR.
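For illustration, here is a minimal sketch of what a per-retry timeout can look like. The helper name `retry_fn` and the parameters `num_retries`, `sleep_time`, and `timeout` are assumptions for this sketch; the merged code in `python/ray/tune/utils/util.py` may differ in detail.

```python
import threading
import time
from typing import Callable, Optional, Type


def retry_fn(
    fn: Callable[[], None],
    exception_type: Type[Exception] = Exception,
    num_retries: int = 3,
    sleep_time: float = 1.0,
    timeout: Optional[float] = None,
) -> bool:
    """Run ``fn`` up to ``num_retries`` times; abandon a hanging attempt
    after ``timeout`` seconds so the remaining retries still get a chance."""
    for _ in range(num_retries):
        success = threading.Event()

        def _try_once():
            try:
                fn()
                success.set()
            except exception_type:
                pass  # failed attempt; fall through to the next retry

        # Daemon thread: if the sync hangs past the timeout, we stop waiting
        # for it instead of blocking training indefinitely.
        proc = threading.Thread(target=_try_once)
        proc.daemon = True
        proc.start()
        proc.join(timeout=timeout)  # per-retry timeout, not a global one
        if success.is_set():
            return True
        time.sleep(sleep_time)
    return False
```

With this shape, a first attempt that hangs is abandoned after `timeout` seconds and the remaining retries still run, instead of one hanging attempt consuming the whole time budget.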
cool man. 2 nits, but let's give this a try.
feel free to merge after you address the minor comments.
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations. Signed-off-by: Kai Fricke <kai@anyscale.com>
#28155 introduced a sync timeout for trainable checkpoint syncing to the cloud, in the case that the sync operation (default is with pyarrow) hangs. This PR adds a similar timeout for experiment checkpoint cloud syncing. Signed-off-by: Justin Yu <justinvyu@berkeley.edu>
Signed-off-by: Kai Fricke <kai@anyscale.com>
Why are these changes needed?
Syncing sometimes hangs in pyarrow for unknown reasons. We should introduce a timeout for these syncing operations.
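As a purely hypothetical illustration of where this helps, a pyarrow-based upload can be wrapped in a timeout-enabled retry helper like the `retry_fn` sketch above. `upload_to_cloud`, the local path, and the bucket URI are made up for this example; they are not actual Ray Tune APIs.

```python
import pyarrow.fs


def upload_to_cloud(local_dir: str, uri: str) -> None:
    # The kind of pyarrow copy that has occasionally been observed to hang.
    fs, path = pyarrow.fs.FileSystem.from_uri(uri)
    pyarrow.fs.copy_files(local_dir, path, destination_filesystem=fs)


# Give each sync attempt at most 30 minutes instead of blocking training forever.
retry_fn(
    lambda: upload_to_cloud("/tmp/my_experiment", "s3://my-bucket/my_experiment"),
    num_retries=3,
    timeout=1800,
)
```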
Todo:
Related issue number
Closes #26802
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.