Race condition in METplus 3.1 #132

DWesl · 2024-08-29T17:13:25Z

When running as standalone with multiple times or multiple models, verif-global reports a crash here:
metplus/util/metplus_util.py in preprocess_file:

if not os.path.exists(dirname):
    os.makedirs(dirname, mode=0o0775)

verif-global launches multiple instances of METplus, which all reach the test at the same time, verify that the destination directory does not currently exist, then proceed to the next line, at which point all but one instance complains that it can't create the directory because it already exists.
Changing the call to os.makedirs(..., exist_ok=True) should work, as would enclosing the makedirs in a

try:
   ...
except FileExistsError:
    pass

construct. METplus 5 fixed this (METplus 4.1 has the same problem), but I suspect there's reasons this project hasn't switched already.

Given that, the simplest resolution within this project would likely be staggering the start of the METplus runs by five-to-ten seconds so the later scripts don't get to the check until the first has finished creating the directory. Unfortunately, I can't figure out where this would be so I can test it.

EDIT: I suspect the jobs are started as part of the same srun invocation, which doesn't have an obvious delay-start-by-task-number feature. The next option would be modifying ush/create_METplus_job_scripts.py to put a 'sleep {:d}; '.format(iproc-1) before the job script to achieve the delays that way.

The text was updated successfully, but these errors were encountered:

DavidHuber-NOAA · 2024-08-30T11:34:59Z

Yes, these are annoying. For the time being in the global-workflow, we are setting the number of tasks to 1. Unfortunately, verif-global will not work with newer versions of METplus and there won't be a newer release of METplus v3.x. If you are adventurous, you can install your own METplus version and make the changes locally. Eventually, verif-global will be replaced by EVS (as has already be done in production), which uses a newer version of METplus (6.0).

DWesl · 2024-08-30T11:36:47Z

Closed by NOAA-EMC/global-workflow#2804, which changes the number of threads to one and thereby forces verif-global to run serially.

DWesl · 2024-09-11T13:16:05Z

Adding job_file.write('sleep {:d}\n\n'.format(njob)) on line 178 of ush/create_METplus_job_scripts.py also seems to space out the execution enough to avoid the race condition.

EMC_verif-global/ush/create_METplus_job_scripts.py

Lines 174 to 178 in 92904d2

    
           job_file = open(job_filename, 'w') 
        
           job_file.write('#!/bin/sh\n') 
        
           job_file.write('set -x\n') 
        
           job_file.write('\n') 
        
           # Write environment variables

DavidHuber-NOAA · 2024-09-11T14:10:02Z

That's great! Would you mind opening a PR? I'd be happy to review it and try turning the parallel runs back on in the global-workflow.

Fixes NOAA-EMC#132 I'll be glad to get someone else's eyes on this. Whether this is enough will depend on how long it takes the disk to read in METplus: there's enough caching that if the first process takes too much time the later ones might catch up, but we can tweak how long things sleep if that becomes an issue.

DWesl · 2024-09-11T14:18:30Z

Created #135

DWesl closed this as completed Aug 30, 2024

DWesl mentioned this issue Sep 11, 2024

Workaround: Sleep each METplus job by jobnum seconds to avoid race #135

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition in METplus 3.1 #132

Race condition in METplus 3.1 #132

DWesl commented Aug 29, 2024 •

edited

Loading

DavidHuber-NOAA commented Aug 30, 2024

DWesl commented Aug 30, 2024

DWesl commented Sep 11, 2024 •

edited

Loading

DavidHuber-NOAA commented Sep 11, 2024

DWesl commented Sep 11, 2024

Race condition in METplus 3.1 #132

Race condition in METplus 3.1 #132

Comments

DWesl commented Aug 29, 2024 • edited Loading

DavidHuber-NOAA commented Aug 30, 2024

DWesl commented Aug 30, 2024

DWesl commented Sep 11, 2024 • edited Loading

DavidHuber-NOAA commented Sep 11, 2024

DWesl commented Sep 11, 2024

DWesl commented Aug 29, 2024 •

edited

Loading

DWesl commented Sep 11, 2024 •

edited

Loading