SW4: fetch is failing to find files and/or add them to AXL for transfer #540

Open
adammoody opened this issue May 3, 2023 · 0 comments

@adammoody
Contributor

After writing a checkpoint to the parallel file system, a later job attempts to restart. SCR detects that the checkpoint exists, but it fails when trying to fetch the files.

SCR v3.0.0: rank 0 on frontier00010: NPROCS=24576
SCR v3.0.0: rank 0 on frontier00010: NNODES=3072
SCR v3.0.0: rank 0 on frontier00010: Stopping all async flush operations
SCR v3.0.0: rank 0 on frontier00010: Attempting fetch: cycle=40
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to add files to AXL transfer handle 0 @ /gpfs/scr-v3.0.1/scr/src/scr_util_mpi.c:354  
SCR v3.0.0: rank 0 on frontier00010: Deleting dataset 1 `cycle=40' from cache
SCR v3.0.0: rank 0 on frontier00010: One or more processes failed to read its files @ /gpfs/scr-v3.0.1/scr/src/scr_fetch.c:471
SCR v3.0.0: rank 0 on frontier00010: scr_fetch_latest: return code 1, 2.088182 secs
SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to fetch checkpoint set into cache. Restarting from the beginning @ /gpfs/scr-v3.0.1/scr/src/scr.c:2549
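
For triage, the error lines and their source locations can be pulled out of a log like the one above with a short script. This is only a sketch: `parse_scr_line` and `error_records` are hypothetical helper names, not part of SCR, and the regular expression is inferred from the lines in this report rather than from any documented SCR log format.

```python
import re

# Pattern inferred from the log lines in this report (an assumption,
# not an official SCR format): optional " ERROR" tag after the version,
# and an optional " @ file:line" source location at the end.
LINE_RE = re.compile(
    r"^SCR v(?P<version>\S+)(?P<error> ERROR)?: "
    r"rank (?P<rank>\d+) on (?P<host>\S+): "
    r"(?P<message>.*?)(?: @ (?P<file>\S+):(?P<line>\d+))?\s*$"
)


def parse_scr_line(line):
    """Return a dict of fields for one SCR log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["error"] = rec["error"] is not None  # True only for lines tagged ERROR
    rec["rank"] = int(rec["rank"])
    if rec["line"] is not None:
        rec["line"] = int(rec["line"])
    return rec


def error_records(lines):
    """Keep only the parsed records that carry the ERROR tag."""
    parsed = (parse_scr_line(line) for line in lines)
    return [rec for rec in parsed if rec is not None and rec["error"]]


if __name__ == "__main__":
    # Two lines copied from the log in this report.
    sample = [
        "SCR v3.0.0: rank 0 on frontier00010: Attempting fetch: cycle=40",
        "SCR v3.0.0 ERROR: rank 0 on frontier00010: Failed to add files to AXL "
        "transfer handle 0 @ /gpfs/scr-v3.0.1/scr/src/scr_util_mpi.c:354",
    ]
    for rec in error_records(sample):
        print(rec["file"], rec["line"], rec["message"])
```

Running the same filter over the full job output would list each failing call site (here `scr_util_mpi.c:354` and `scr.c:2549`), which may help narrow down where the fetch goes wrong.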