Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in METplus 3.1 #132

Closed
DWesl opened this issue Aug 29, 2024 · 5 comments · Fixed by #135
Closed

Race condition in METplus 3.1 #132

DWesl opened this issue Aug 29, 2024 · 5 comments · Fixed by #135

Comments

@DWesl
Copy link
Contributor

DWesl commented Aug 29, 2024

When running as standalone with multiple times or multiple models, verif-global reports a crash here:
metplus/util/metplus_util.py in preprocess_file:

if not os.path.exists(dirname):
    os.makedirs(dirname, mode=0o0775)

verif-global launches multiple instances of METplus, which all reach the test at the same time, verify that the destination directory does not currently exist, then proceed to the next line, at which point all but one instance complains that it can't create the directory because it already exists.
Changing the call to os.makedirs(..., exist_ok=True) should work, as would enclosing the makedirs in a

try:
   ...
except FileExistsError:
    pass

construct. METplus 5 fixed this (METplus 4.1 has the same problem), but I suspect there's reasons this project hasn't switched already.

Given that, the simplest resolution within this project would likely be staggering the start of the METplus runs by five-to-ten seconds so the later scripts don't get to the check until the first has finished creating the directory. Unfortunately, I can't figure out where this would be so I can test it.

EDIT: I suspect the jobs are started as part of the same srun invocation, which doesn't have an obvious delay-start-by-task-number feature. The next option would be modifying ush/create_METplus_job_scripts.py to put a 'sleep {:d}; '.format(iproc-1) before the job script to achieve the delays that way.

@DavidHuber-NOAA
Copy link
Contributor

Yes, these are annoying. For the time being in the global-workflow, we are setting the number of tasks to 1. Unfortunately, verif-global will not work with newer versions of METplus and there won't be a newer release of METplus v3.x. If you are adventurous, you can install your own METplus version and make the changes locally. Eventually, verif-global will be replaced by EVS (as has already be done in production), which uses a newer version of METplus (6.0).

@DWesl
Copy link
Contributor Author

DWesl commented Aug 30, 2024

Closed by NOAA-EMC/global-workflow#2804, which changes the number of threads to one and thereby forces verif-global to run serially.

@DWesl DWesl closed this as completed Aug 30, 2024
@DWesl
Copy link
Contributor Author

DWesl commented Sep 11, 2024

Adding job_file.write('sleep {:d}\n\n'.format(njob)) on line 178 of ush/create_METplus_job_scripts.py also seems to space out the execution enough to avoid the race condition.

job_file = open(job_filename, 'w')
job_file.write('#!/bin/sh\n')
job_file.write('set -x\n')
job_file.write('\n')
# Write environment variables

@DavidHuber-NOAA
Copy link
Contributor

That's great! Would you mind opening a PR? I'd be happy to review it and try turning the parallel runs back on in the global-workflow.

DWesl added a commit to DWesl/EMC_verif-global that referenced this issue Sep 11, 2024
Fixes NOAA-EMC#132 

I'll be glad to get someone else's eyes on this.

Whether this is enough will depend on how long it takes the disk to read in METplus: there's enough caching that if the first process takes too much time the later ones might catch up, but we can tweak how long things sleep if that becomes an issue.
@DWesl
Copy link
Contributor Author

DWesl commented Sep 11, 2024

Created #135

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants