Fixing bug that causes job scripts on Mira to fail after successful completion of parallel job but before postprocessing has completed

For historical (?) reasons, some of the CESM mkbatch scripts include a
'wait' command after the parallel job launch (e.g. aprun). For
collecting checkpoint data (to guarantee that some data are archived
even if the job fails) a background job is spawned before the parallel
job launch. This must be explicitly killed after the parallel job
completes; otherwise the job script hangs on the wait command.

This kill command was added to mkbatch.mira even though mkbatch.mira
does not have a wait after the runjob command. The background script
often dies on Mira before the parallel job finishes. Because of this,
the kill command fails and the $CASE.run script dies before finishing
its postprocessing tasks.

The simple fix is to just delete the kill of the background script as
it is not needed on Mira, but it seems to be good policy to clean up
background jobs anyway.

Code is added to test whether the background job has already
disappeared before trying to kill it.
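
The check described above can be sketched as follows. This is a Bourne-shell analog, not the csh code in the commit; the function name stop_if_running is hypothetical. It relies on the fact that signal 0 delivers nothing and only reports whether the process exists, and it keeps the original guard against a PID of 0, which in the job script means that no background job was started.

```sh
#!/bin/sh
# Hypothetical helper: kill a background process only if it still exists.
# $1 is the PID recorded when the background job was spawned
# (0 is assumed to mean "no background job", as in the job script).
stop_if_running() {
    pid=$1
    # "kill -0" sends no signal; it succeeds only if the process exists.
    if [ "$pid" != 0 ] && kill -0 "$pid" 2>/dev/null; then
        kill "$pid"
    fi
    # Always report success so the caller continues to postprocessing.
    return 0
}
```

Calling stop_if_running with the PID of a process that has already exited does nothing and still returns 0, so the calling script continues.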

Code is also added to the background job script to eliminate the
primary reason that it dies before the parallel application is
complete. (Parsing the qstat output to determine the time remaining
for the run yields fields with a leading 0, which the script
interprets as octal. A field such as '09' is therefore rejected,
because 9 is not a valid octal digit.)
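
A minimal sketch of the leading-zero fix, written as a Bourne-shell analog of the csh logic rather than the code in the commit. The function name hms_to_seconds is hypothetical, and the input is assumed to be an HH:MM:SS string like the TimeRemaining value taken from qstat. Each field has its leading zeros stripped with sed, and a field that was all zeros (and is therefore empty after stripping) is reset to 0 before the arithmetic.

```sh
#!/bin/sh
# Hypothetical helper: convert HH:MM:SS to seconds without letting
# fields such as "08" or "09" be parsed as (invalid) octal numbers.
hms_to_seconds() {
    # The 0* before each group discards leading zeros in that field.
    pat='^0*\([0-9]*\):0*\([0-9]*\):0*\([0-9]*\)$'
    hours=$(printf '%s\n' "$1" | sed "s/$pat/\1/")
    mins=$(printf '%s\n' "$1" | sed "s/$pat/\2/")
    secs=$(printf '%s\n' "$1" | sed "s/$pat/\3/")
    # An all-zero field is now empty, so default it to 0.
    hours=${hours:-0}
    mins=${mins:-0}
    secs=${secs:-0}
    echo $((3600 * $hours + 60 * $mins + $secs))
}
```

For example, hms_to_seconds 00:09:08 prints 548, whereas passing the unstripped field 09 directly to shell arithmetic would be rejected as an invalid octal number.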

Note that either of these changes is sufficient to solve the problem,
but both are included in case new issues arise with the background
script in the future.
Patrick Worley committed Aug 29, 2014
1 parent 426d74b commit c4c15fe
Showing 2 changed files with 17 additions and 8 deletions.
6 changes: 5 additions & 1 deletion scripts/ccsm_utils/Machines/mkbatch.mira
@@ -198,7 +198,11 @@ runjob --label short -p \${procs} -n \${ntasks} \${LOCARGS} --envs BG_THREADLAYO
else
\$EXEROOT/cesm.exe >&! cesm.log.\$LID
endif
-if (\$syslog_id != 0) kill \$syslog_id
+if (\$syslog_id != 0) then
+if { kill -0 \$syslog_id } then
+kill \$syslog_id
+endif
+endif
echo "\`date\` -- CSM EXECUTION HAS FINISHED"
set sdate = \`date +"%Y-%m-%d %H:%M:%S"\`
19 changes: 12 additions & 7 deletions scripts/ccsm_utils/Machines/syslog.mira
@@ -18,10 +18,12 @@ while ($outlth < 1)
set outlth = `wc \-l $run/cesm.log.$lid | sed 's/ *\([0-9]*\) *.*/\1/' `
end

-set remaining = 0
-set rem_hours = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\1/' `
-set rem_mins = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\2/' `
-set rem_secs = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\3/' `
+set rem_hours = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\1/' `
+set rem_mins = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\2/' `
+set rem_secs = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\3/' `
+if ("X$rem_hours" == "X") set rem_hours = 0
+if ("X$rem_mins" == "X") set rem_mins = 0
+if ("X$rem_secs" == "X") set rem_secs = 0
@ remaining = 3600 * $rem_hours + 60 * $rem_mins + $rem_secs
cat > $run/Walltime.Remaining <<EOF1
$remaining $sample_interval
@@ -38,9 +40,12 @@ while (1)
cp -p -u $timing/* $dir
chmod a+r $dir/*
sleep $sample_interval
-set rem_hours = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\1/' `
-set rem_mins = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\2/' `
-set rem_secs = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *\([0-9]*\):*\([0-9]*\):*\([0-9]*\) */\3/' `
+set rem_hours = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\1/' `
+set rem_mins = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\2/' `
+set rem_secs = `qstat -lf $jid | grep TimeRemaining | sed 's/ *TimeRemaining *: *0*\([0-9]*\):*0*\([0-9]*\):*0*\([0-9]*\) */\3/' `
+if ("X$rem_hours" == "X") set rem_hours = 0
+if ("X$rem_mins" == "X") set rem_mins = 0
+if ("X$rem_secs" == "X") set rem_secs = 0
@ remaining = 3600 * $rem_hours + 60 * $rem_mins + $rem_secs
cat > $run/Walltime.Remaining << EOF2
$remaining $sample_interval

1 comment on commit c4c15fe

@agsalin
Changes are specific to Mira and just change scripts, not code.