Commit
Fixing bug that causes job scripts on Mira to fail after successful completion of parallel job but before postprocessing has completed

For historical (?) reasons, some of the CESM mkbatch scripts include a 'wait' command after the parallel job launch (e.g. aprun). To collect checkpoint data (guaranteeing that some data are archived even if the job fails), a background job is spawned before the parallel job launch. This background job must be explicitly killed after the parallel job completes; otherwise the job script hangs on the wait command.

This kill command was added to mkbatch.mira even though mkbatch.mira does not have a wait after the runjob command. The background script often dies on Mira before the parallel job finishes. When that happens, the kill command fails and the $CASE.run script dies before finishing its postprocessing tasks.

The simple fix is to delete the kill of the background script, since it is not needed on Mira, but it seems to be good policy to clean up background jobs anyway. Code is added to test whether the background job has already disappeared before trying to kill it. Code is also added to the background job script to eliminate the primary reason that it dies before the parallel application is complete. (The issue is that parsing the output from qstat to determine the amount of time remaining for the run yields numbers with a leading 0, which the script interprets as octal. Thus '09' is rejected because it is not a legal octal number.)

Note that either of these changes is sufficient to solve the problem, but both are included in case new issues arise with the background script in the future.
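The two changes described above can be sketched roughly as follows. This is a minimal illustration in POSIX shell, not the code from this commit: the function names (to_decimal, hms_to_seconds, kill_if_running) and the HH:MM:SS time format are assumptions, and the actual scripts may use a different shell and a different qstat output layout.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; none of these names come from
# mkbatch.mira or the background job script.

# Strip leading zeros so that a field such as "09" is treated as decimal 9
# rather than as an (invalid) octal literal in shell arithmetic.
# (In bash, $((10#$1)) achieves the same thing.)
to_decimal() {
    # ${1%%[!0]*} is the run of leading zeros; remove it from the front.
    n=${1#"${1%%[!0]*}"}
    # An all-zero field such as "00" becomes empty, so default to 0.
    echo "${n:-0}"
}

# Convert an HH:MM:SS string (assumed format) to a number of seconds.
hms_to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split on ':' into $1 $2 $3
    IFS=$old_ifs
    echo $(( $(to_decimal "$1") * 3600 + $(to_decimal "$2") * 60 + $(to_decimal "$3") ))
}

# Kill a background process only if it still exists, and never fail.
# "kill -0" sends no signal; it only checks whether the PID can be signalled.
kill_if_running() {
    if kill -0 "$1" 2>/dev/null; then
        kill "$1" 2>/dev/null || true
        wait "$1" 2>/dev/null || true   # reap it if it is our child
    fi
    return 0
}
```

With helpers like these, a job script would record the background job's PID with `$!` when it is spawned and call `kill_if_running` on that PID after the parallel job returns, so that a background job that has already exited cannot abort the postprocessing steps.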
c4c15fe
Changes are specific to Mira and just change scripts, not code.