[Train] Syncer fails to write to S3 after 2 days of training #35519

Closed
gilvikra opened this issue May 18, 2023 · 1 comment · Fixed by #35938
Labels
bug (Something that is supposed to be working; but isn't) · tune (Tune-related issues)

Comments

@gilvikra

What happened + What you expected to happen

A training job had been running for 2+ days on an AWS VM when it hit this error:
ERROR trainable.py:597 -- Could not upload checkpoint even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported.. For large checkpoints, consider increasing SyncConfig(sync_timeout) (current value: 1800 seconds). Checkpoint URI: s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp/TorchTrainer_2023-05-15_22-41-59/TorchTrainer_b3daa_00000_0_2023-05-15_22-42-00/checkpoint_000002
After hitting the error, the job stopped emitting Ray's periodic status output (trial name, status, etc.) for some time, but then recovered. The VM uses an IAM role that still has valid S3 permissions. My SyncConfig in TorchTrainer looks like:

SyncConfig(upload_dir="s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp")

If pyarrow.fs.FileSystem.from_uri(uri) in remote_storage.py does not refresh credentials after a while, then never purging the cached filesystem may be a bug in Ray:
cache_key = (parsed.scheme, parsed.netloc)

# A cached filesystem is reused indefinitely, so the credentials it was
# created with are never refreshed.
if cache_key in _cached_fs:
    fs = _cached_fs[cache_key]
    return fs, path

try:
    fs, path = pyarrow.fs.FileSystem.from_uri(uri)
    _cached_fs[cache_key] = fs
    return fs, path
except pyarrow.lib.ArrowInvalid:
    ...  # fallback handling in remote_storage.py elided from this excerpt
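
For illustration only (not Ray's actual implementation): one way to avoid reusing a filesystem with expired credentials would be to attach a timestamp to each cache entry and re-resolve the URI once the entry is older than some TTL. The _CACHE_TTL_S value and get_fs_and_path wrapper below are hypothetical:

import time
from urllib.parse import urlparse

import pyarrow.fs

_cached_fs = {}
_CACHE_TTL_S = 3600  # hypothetical: re-resolve (and re-credential) every hour

def get_fs_and_path(uri: str):
    parsed = urlparse(uri)
    # For S3-style URIs, "s3://bucket/key" -> "bucket/key", matching from_uri()
    path = parsed.netloc + parsed.path
    cache_key = (parsed.scheme, parsed.netloc)

    entry = _cached_fs.get(cache_key)
    if entry is not None:
        fs, created_at = entry
        if time.monotonic() - created_at < _CACHE_TTL_S:
            return fs, path
        # Entry is stale: drop it so from_uri() below picks up fresh credentials.
        del _cached_fs[cache_key]

    fs, resolved_path = pyarrow.fs.FileSystem.from_uri(uri)
    _cached_fs[cache_key] = (fs, time.monotonic())
    return fs, resolved_path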

Versions / Dependencies

ray 2.3.1

Reproduction script

Use an S3-URI-based SyncConfig in a Ray Trainer and let training run for 2+ days; a sketch of such a setup follows.
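
A minimal sketch of such a run, assuming the Ray 2.3-era AIR APIs (TorchTrainer, RunConfig, SyncConfig); the bucket/prefix and epoch count are placeholders:

from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.syncer import SyncConfig

def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        # ... long-running training step here ...
        # Each report triggers a checkpoint sync to the S3 upload_dir.
        session.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 100_000},  # enough iterations to exceed ~2 days
    scaling_config=ScalingConfig(num_workers=1),
    run_config=RunConfig(
        sync_config=SyncConfig(
            upload_dir="s3://<bucket>/<prefix>",  # placeholder S3 location
        ),
    ),
)
result = trainer.fit()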

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@gilvikra added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, eg: priority, bug/not-bug, and owning component) labels on May 18, 2023
@gjoliver changed the title to "[Train] Syncer fails to write to S3 after 2 days of training" on May 19, 2023
@gjoliver added the tune (Tune-related issues) and air labels and removed the triage label on May 19, 2023
@krfricke
Contributor

@gilvikra do you still have the full log output? In particular, I'm looking for a line like

                    Caught sync error: [...].
                    Retrying after sleeping for [...] seconds...

to indicate what the actual root cause was.

krfricke added a commit that referenced this issue Jun 30, 2023
This is a test to make sure that cloud storage access still works after multi-day training. See e.g. #35519

This PR also adds a minuscule concurrency group to release test runners for long running tests to avoid congestion during release cycles.

Signed-off-by: Kai Fricke <kai@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
This is a test to make sure that cloud storage access still works after multi-day training. See e.g. ray-project#35519

This PR also adds a minuscule concurrency group to release test runners for long running tests to avoid congestion during release cycles.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>