After writing a checkpoint to the parallel file system, a later job attempts to restart. SCR detects that the checkpoint exists, but it fails when trying to fetch the files.
```
SCR v3.0.0: rank 0 on frontier00010: NPROCS=24576
SCR v3.0.0: rank 0 on frontier00010: NNODES=3072
SCR v3.0.0: rank 0 on frontier00010: Stopping all async flush operations
SCR v3.0.0: rank 0 on frontier00010: Attempting fetch: cycle=40
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to add files to AXL transfer handle 0 @ /gpfs/scr-v3.0.1/scr/src/scr_util_mpi.c:354
SCR v3.0.0: rank 0 on frontier00010: Deleting dataset 1 `cycle=40' from cache
SCR v3.0.0: rank 0 on frontier00010: One or more processes failed to read its files @ /gpfs/scr-v3.0.1/scr/src/scr_fetch.c:471
SCR v3.0.0: rank 0 on frontier00010: scr_fetch_latest: return code 1, 2.088182 secs
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to fetch checkpoint set into cache. Restarting from the beginning @ /gpfs/scr-v3.0.1/scr/src/scr.c:2549
```
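For triage, it may help to pull just the `ERROR` entries and their source locations out of a longer job log. The following is a minimal sketch, not part of SCR: the helper names `parse_scr_line` and `error_entries` are hypothetical, and the line format is assumed purely from the output shown above, so other SCR messages may not match it.

```python
import re

# Line format assumed from the log excerpt above (not an official SCR format):
#   SCR v<version>[ ERROR]: rank <n> on <host>: <message>[ @ <file>:<line>]
LINE_RE = re.compile(
    r"^SCR v(?P<version>[\d.]+)(?P<error> ERROR)?: "
    r"rank (?P<rank>\d+) on (?P<host>\S+): "
    r"(?P<message>.*?)(?: @ (?P<file>\S+):(?P<line>\d+))?$"
)


def parse_scr_line(text):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(text.strip())
    if m is None:
        return None
    return {
        "version": m.group("version"),
        "is_error": m.group("error") is not None,
        "rank": int(m.group("rank")),
        "host": m.group("host"),
        "message": m.group("message"),
        # Source location is optional; both fields are None when absent.
        "file": m.group("file"),
        "line": int(m.group("line")) if m.group("line") else None,
    }


def error_entries(lines):
    """Return the parsed entries for lines marked ERROR, in log order."""
    parsed = (parse_scr_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["is_error"]]
```

Applied to the log above, `error_entries` would return the two `ERROR` lines, the first pointing at `scr_util_mpi.c:354`, which is where the AXL transfer handle rejected the file list.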