-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Race condition in METplus 3.1 #132
Comments
Yes, these are annoying. For the time being in the global-workflow, we are setting the number of tasks to 1. Unfortunately, verif-global will not work with newer versions of METplus and there won't be a newer release of METplus v3.x. If you are adventurous, you can install your own METplus version and make the changes locally. Eventually, verif-global will be replaced by EVS (as has already be done in production), which uses a newer version of METplus (6.0). |
Closed by NOAA-EMC/global-workflow#2804, which changes the number of threads to one and thereby forces verif-global to run serially. |
Adding EMC_verif-global/ush/create_METplus_job_scripts.py Lines 174 to 178 in 92904d2
|
That's great! Would you mind opening a PR? I'd be happy to review it and try turning the parallel runs back on in the global-workflow. |
Fixes NOAA-EMC#132 I'll be glad to get someone else's eyes on this. Whether this is enough will depend on how long it takes the disk to read in METplus: there's enough caching that if the first process takes too much time the later ones might catch up, but we can tweak how long things sleep if that becomes an issue.
Created #135 |
When running as standalone with multiple times or multiple models, verif-global reports a crash here:
metplus/util/metplus_util.py in preprocess_file:
verif-global launches multiple instances of METplus, which all reach the test at the same time, verify that the destination directory does not currently exist, then proceed to the next line, at which point all but one instance complains that it can't create the directory because it already exists.
Changing the call to
os.makedirs(..., exist_ok=True)
should work, as would enclosing themakedirs
in aconstruct. METplus 5 fixed this (METplus 4.1 has the same problem), but I suspect there's reasons this project hasn't switched already.
Given that, the simplest resolution within this project would likely be staggering the start of the METplus runs by five-to-ten seconds so the later scripts don't get to the check until the first has finished creating the directory. Unfortunately, I can't figure out where this would be so I can test it.
EDIT: I suspect the jobs are started as part of the same
srun
invocation, which doesn't have an obvious delay-start-by-task-number feature. The next option would be modifyingush/create_METplus_job_scripts.py
to put a'sleep {:d}; '.format(iproc-1)
before the job script to achieve the delays that way.The text was updated successfully, but these errors were encountered: