chgres_cube consistency test failures on Orion #609

Closed
GeorgeGayno-NOAA opened this issue Dec 9, 2021 · 2 comments · Fixed by #611

GeorgeGayno-NOAA commented Dec 9, 2021

Occasionally, some of the chgres_cube tests fail with a 'bus error'. The failures are random. The system admins recommend explicitly requesting the memory each job needs in the driver script, for example with --mem=50G. Preliminary tests show this solves the problem. (The default memory allocated by Slurm on Orion is 54GB.)
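
A minimal sketch of what such a request might look like in a Slurm batch script. The job name and the commented-out run line are hypothetical placeholders, not taken from the actual driver script; only the --mem=50G value comes from the recommendation above. The same option can also be passed on the sbatch command line.

```shell
#!/bin/bash
# Hypothetical job name, for illustration only.
#SBATCH --job-name=chgres_test
#SBATCH --nodes=1
# Explicit per-node memory request, as recommended by the system admins.
#SBATCH --mem=50G

# Slurm exports the --mem request to the job as SLURM_MEM_PER_NODE.
echo "Memory requested per node: ${SLURM_MEM_PER_NODE:-not set}"

# Hypothetical placeholder for the real test command:
# srun ./run_consistency_test.sh
```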

GeorgeGayno-NOAA added the bug (Something isn't working) label Dec 9, 2021
GeorgeGayno-NOAA self-assigned this Dec 9, 2021
GeorgeGayno-NOAA commented

The system admins said to use this command to determine how much memory a job is using.

sacct -j 3908276 --format=jobid,jobname,state,alloctres%35,maxrss
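
As a rough illustration of turning the MaxRSS column from that sacct output into a --mem request, the sketch below converts a MaxRSS value to megabytes and adds headroom. The function names, the headroom percentage, and the assumption that MaxRSS carries a K, M, G, or T suffix are illustrative choices, not something the admins or the commit specify.

```shell
#!/bin/bash

# Convert a MaxRSS value such as "4194304K", "512M" or "1.5G" to whole
# megabytes, rounding up. Assumes a K/M/G/T suffix; anything else is
# rejected.
maxrss_to_mb() {
  local value="$1"
  local unit="${value: -1}"
  local number="${value%?}"
  local kb_per_unit

  case "$unit" in
    K) kb_per_unit=1 ;;
    M) kb_per_unit=1024 ;;
    G) kb_per_unit=1048576 ;;
    T) kb_per_unit=1073741824 ;;
    *) echo "unsupported MaxRSS value: $value" >&2; return 1 ;;
  esac

  if ! [[ "$number" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
    echo "unsupported MaxRSS value: $value" >&2
    return 1
  fi

  awk -v n="$number" -v f="$kb_per_unit" 'BEGIN {
    kb = n * f
    mb = int(kb / 1024)
    if (mb * 1024 < kb) mb++   # round up to the next whole megabyte
    printf "%d\n", mb
  }'
}

# Add a percentage of headroom to a megabyte count and print it in the
# form accepted by --mem (for example "4916M"). Rounds up.
mem_request() {
  local mb="$1"
  local percent="$2"
  echo "$(( (mb * (100 + percent) + 99) / 100 ))M"
}
```

For example, mem_request "$(maxrss_to_mb 4194304K)" 20 prints 4916M, which could then be passed to the job as --mem=4916M.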

GeorgeGayno-NOAA added a commit to GeorgeGayno-NOAA/UFS_UTILS that referenced this issue Dec 10, 2021
GeorgeGayno-NOAA commented

Using the sacct command, I adjusted the requested memory for each test (b5d6ab6). Then I tested the updated script on Orion.

All tests were successfully run six times in a row. Previously, one test (of the 16) would always fail.
