[Train] Syncer fails to write to S3 after 2 days of training #35519

Closed
gilvikra opened this issue May 18, 2023 · 1 comment · Fixed by #35938
Labels
bug (Something that is supposed to be working; but isn't) · tune (Tune-related issues)

Comments

@gilvikra

What happened + What you expected to happen

A training job had been running for 2+ days on an AWS VM when it hit this error:
ERROR trainable.py:597 -- Could not upload checkpoint even after 3 retries.Please check if the credentials expired and that the remote filesystem is supported.. For large checkpoints, consider increasing SyncConfig(sync_timeout) (current value: 1800 seconds). Checkpoint URI: s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp/TorchTrainer_2023-05-15_22-41-59/TorchTrainer_b3daa_00000_0_2023-05-15_22-42-00/checkpoint_000002
After hitting the error, the job stopped emitting Ray's periodic status output (trial name, status, etc.) for some time, but then recovered. The VM uses an IAM role that still has valid S3 permissions. My SyncConfig in TorchTrainer looks like:

SyncConfig(upload_dir="s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp")

If pyarrow.fs.FileSystem.from_uri(uri) in remote_storage.py does not refresh credentials after a while, then never purging the cached filesystem may be a bug in Ray:
cache_key = (parsed.scheme, parsed.netloc)

# A cached filesystem is reused indefinitely, so the credentials it was
# created with are never refreshed.
if cache_key in _cached_fs:
    fs = _cached_fs[cache_key]
    return fs, path

try:
    fs, path = pyarrow.fs.FileSystem.from_uri(uri)
    _cached_fs[cache_key] = fs
    return fs, path
except pyarrow.lib.ArrowInvalid:
    ...  # fallback handling in remote_storage.py elided from this excerpt
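
For illustration only (not Ray's actual implementation): one way to avoid reusing a filesystem with expired credentials would be to attach a timestamp to each cache entry and re-resolve the URI once the entry is older than some TTL. The _CACHE_TTL_S value and get_fs_and_path wrapper below are hypothetical:

import time
from urllib.parse import urlparse

import pyarrow.fs

_cached_fs = {}
_CACHE_TTL_S = 3600  # hypothetical: re-resolve (and re-credential) every hour

def get_fs_and_path(uri: str):
    parsed = urlparse(uri)
    # For S3-style URIs, "s3://bucket/key" -> "bucket/key", matching from_uri()
    path = parsed.netloc + parsed.path
    cache_key = (parsed.scheme, parsed.netloc)

    entry = _cached_fs.get(cache_key)
    if entry is not None:
        fs, created_at = entry
        if time.monotonic() - created_at < _CACHE_TTL_S:
            return fs, path
        # Entry is stale: drop it so from_uri() below picks up fresh credentials.
        del _cached_fs[cache_key]

    fs, resolved_path = pyarrow.fs.FileSystem.from_uri(uri)
    _cached_fs[cache_key] = (fs, time.monotonic())
    return fs, resolved_path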

Versions / Dependencies

ray 2.3.1

Reproduction script

Use an S3-URI-based SyncConfig in a Ray Trainer and let training run for 2+ days; a sketch of such a setup follows.
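
A minimal sketch of such a run, assuming the Ray 2.3-era AIR APIs (TorchTrainer, RunConfig, SyncConfig); the bucket/prefix and epoch count are placeholders:

from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.syncer import SyncConfig

def train_loop_per_worker(config):
    for epoch in range(config["num_epochs"]):
        # ... long-running training step here ...
        # Each report triggers a checkpoint sync to the S3 upload_dir.
        session.report(
            {"epoch": epoch},
            checkpoint=Checkpoint.from_dict({"epoch": epoch}),
        )

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 100_000},  # enough iterations to exceed ~2 days
    scaling_config=ScalingConfig(num_workers=1),
    run_config=RunConfig(
        sync_config=SyncConfig(
            upload_dir="s3://<bucket>/<prefix>",  # placeholder S3 location
        ),
    ),
)
result = trainer.fit()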

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@gilvikra added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage, eg: priority, bug/not-bug, and owning component) labels on May 18, 2023
@gjoliver changed the title to "[Train] Syncer fails to write to S3 after 2 days of training" on May 19, 2023
@gjoliver added the tune (Tune-related issues) and air labels and removed the triage label on May 19, 2023
@krfricke
Contributor

@gilvikra do you still have the full log output? In particular, I'm looking for a line like

                    Caught sync error: [...].
                    Retrying after sleeping for [...] seconds...

to indicate what the actual root cause was.

krfricke added a commit that referenced this issue Jun 30, 2023
This is a test to make sure that cloud storage access still works after multi-day training. See e.g. #35519

This PR also adds a minuscule concurrency group to release test runners for long running tests to avoid congestion during release cycles.

Signed-off-by: Kai Fricke <kai@anyscale.com>
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023
This is a test to make sure that cloud storage access still works after multi-day training. See e.g. ray-project#35519

This PR also adds a minuscule concurrency group to release test runners for long running tests to avoid congestion during release cycles.

Signed-off-by: Kai Fricke <kai@anyscale.com>
Signed-off-by: e428265 <arvind.chandramouli@lmco.com>