gilvikra added the bug ("Something that is supposed to be working; but isn't") and triage ("Needs triage (eg: priority, bug/not-bug, and owning component)") labels on May 18, 2023.
This is a test to make sure that cloud storage access still works after multi-day training. See e.g. #35519
This PR also adds a minuscule concurrency group to the release test runners for long-running tests, to avoid congestion during release cycles.
Signed-off-by: Kai Fricke <kai@anyscale.com>
What happened + What you expected to happen
A training job had been running for 2+ days on an AWS VM when it hit this error:

ERROR trainable.py:597 -- Could not upload checkpoint even after 3 retries. Please check if the credentials expired and that the remote filesystem is supported. For large checkpoints, consider increasing SyncConfig(sync_timeout) (current value: 1800 seconds). Checkpoint URI: s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp/TorchTrainer_2023-05-15_22-41-59/TorchTrainer_b3daa_00000_0_2023-05-15_22-42-00/checkpoint_000002

After the error, the job stopped printing Ray's usual progress output (trial name, status, etc.) for some time, but it eventually recovered. The VM uses an IAM role that still has valid S3 permissions. My SyncConfig in the TorchTrainer looks like:
SyncConfig(upload_dir="s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp")
If pyarrow.fs.FileSystem.from_uri(uri) in remote_storage.py does not refresh credentials after a while, then never purging the cached filesystem may be a bug in Ray. The cache is keyed only on the URI's scheme and netloc:
cache_key = (parsed.scheme, parsed.netloc)
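To illustrate the suspicion, here is a rough sketch (my approximation, not Ray's verbatim source) of that caching pattern: the filesystem returned by pyarrow.fs.FileSystem.from_uri() is cached per (scheme, netloc) and reused for the life of the process, so an S3 filesystem created from temporary IAM-role credentials would keep using them after they expire.

# Approximate sketch of the suspected caching pattern in remote_storage.py
# (not Ray's exact code).
from urllib.parse import urlparse

import pyarrow.fs

_cached_fs = {}

def get_fs_and_path(uri: str):
    parsed = urlparse(uri)
    path = parsed.netloc + parsed.path
    cache_key = (parsed.scheme, parsed.netloc)

    if cache_key in _cached_fs:
        # Cache hit: the filesystem (and the credentials it was created with)
        # is reused indefinitely -- it is never refreshed or purged.
        return _cached_fs[cache_key], path

    fs, _ = pyarrow.fs.FileSystem.from_uri(uri)
    _cached_fs[cache_key] = fs
    return fs, path

If that is what happens, purging the cache entry (or giving it a TTL) when an upload fails with an authentication error would force pyarrow to pick up the refreshed IAM-role credentials.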
Versions / Dependencies
ray 2.3.1
Reproduction script
Use an S3 URI-based SyncConfig in a Ray Trainer and let the training run for 2+ days; a minimal sketch is below.
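The following is a minimal reproduction sketch, assuming Ray 2.3.x APIs (TorchTrainer, ray.air.config.RunConfig/ScalingConfig, ray.tune.syncer.SyncConfig). The bucket path is the one from the report; the training loop, worker count, and sync_timeout value are placeholders, not the original job's settings.

# Minimal reproduction sketch (assumed setup, not the original training script).
from ray.air.config import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer
from ray.tune.syncer import SyncConfig

def train_loop_per_worker(config):
    # Placeholder: a long-running loop that reports checkpoints periodically,
    # so checkpoint uploads keep happening over 2+ days.
    ...

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),  # placeholder values
    run_config=RunConfig(
        name="parisexp",
        sync_config=SyncConfig(
            upload_dir="s3://a9vs-photon-fsx-data-repository/sync-with-fsx/Results/parisexp",
            sync_timeout=1800,  # seconds; the error suggests raising this for large checkpoints
        ),
    ),
)
trainer.fit()

Run this on an EC2 instance that authenticates to S3 through an IAM instance role (temporary, rotating credentials) and watch whether checkpoint uploads start failing once the initially issued credentials expire.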
Issue Severity
Medium: It is a significant difficulty but I can work around it.