Unable to use DDT with aprun on Titan #1384

Closed
mrnorman opened this issue Apr 19, 2017 · 14 comments

@mrnorman
Contributor

mrnorman commented Apr 19, 2017

We determined the cause of the problem when trying to use DDT on Titan. For DDT to work, we need to call "ddt --connect aprun" instead of "aprun". I tried to do this through env_mach_specific.xml, changing line 122 from:

    aprun

to:

    ddt --connect aprun
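
For context, the element being edited looks roughly like this before and after the change (a sketch; the args attribute is taken from the snippet quoted later in this thread):

    <executable args="default">aprun</executable>
    <executable args="default">ddt --connect aprun</executable>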

But when I do this, the flags (e.g., -S8 -N16 -n16 -d1) all get dropped. The output from the standard "aprun" is:

run command is aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1

But the output from "ddt --connect aprun" is:

run command is ddt --connect aprun /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1

The DDT job successfully connects to my debugger but, of course, fails because it's only running one MPI task. I need some way of specifying "ddt --connect aprun" instead of "aprun" that keeps those flags in play.

I should also mention that this is pretty critical because PGI said they'd make much better progress if they could use a debugger on Titan to diagnose what's going on with their compiler bugs.

@mrnorman
Contributor Author

@rljacob @jgfouca

@jedwards4b
Contributor

This is related to issue #643, and we would prefer a generic debug option rather than one specific to a single system. Currently the best way to launch a debugger is outside of the cime scripts.

@rljacob
Member

rljacob commented Apr 19, 2017

How does one launch a debugger outside the cime scripts but with a model built by create_newcase?

@jedwards4b
Contributor

  1. Build the model executable.
  2. Set the environment by sourcing .env_mach_specific.[sh,csh].
  3. Launch ddt from the command line (see the sketch below); the rest is done in the ddt GUI.
  4. Set the run directory to the model run dir.
  5. Set the executable.
  6. Set the proper number of tasks and threads.
  7. Start the job.
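
In shell terms, the first three steps might look roughly like this (a sketch, assuming a bash shell and a CIME5-style case directory created by create_newcase; the settings in steps 4-7 are then done interactively in the GUI):

    cd $CASEDIR
    ./case.build                    # 1. build the model executable
    source ./.env_mach_specific.sh  # 2. load the machine environment for the case
    ddt &                           # 3. start the DDT GUI and configure the run there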

@mrnorman
Contributor Author

It would be significantly easier for everyone if we could just alter the aprun command without it breaking CIME. We need some way of supporting the reverse connect feature of DDT, especially for large debug jobs.

@ekluzek
Contributor

ekluzek commented Apr 19, 2017

@jedwards4b the reverse connect method is what the Allinea people recommend, and they may start deprecating the method you describe above. In practice I've found that approach problematic for CESM: it ties the debugging very closely to the specific case and to getting all the options set exactly right, especially if you're using a compiler other than Intel.

Reverse connect can set the right number of tasks and threads, as shown in #643, and makes it much easier to set up DDT.
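
For reference, the reverse connect workflow looks roughly like this (a sketch of my understanding; exact invocations may differ by DDT version):

    # On the login node (or via the DDT remote client): start the GUI and
    # leave it waiting to accept a reverse connection.
    ddt &

    # In the batch job, prefix the normal launcher with "ddt --connect";
    # with the fix discussed below, the usual aprun flags are preserved.
    ddt --connect aprun -S 8 -n 16 -N 16 -d 1 ./acme.exe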

@mrnorman
Contributor Author

I'd like to try the method @ekluzek recommended in the other issue, but my aprun parameters on Titan will be different from that example. The aprun parameters for Titan are commented out in cime_config/acme/machines/config_machines.xml, so it's not clear to me where these flags are actually coming from, and I don't know where to copy and paste from.

@jedwards4b
Contributor

Okay, I'm up to speed now. @mrnorman the problem is on lines 152 and 1122 of case.py:

       if executable == "aprun":

should be

       if "aprun" in executable:

@mrnorman
Contributor Author

mrnorman commented Apr 19, 2017

Thanks Jim! I have an older version of CIME, but I found the correct line and changed it. This does restore the aprun flags. However, I still have the problem that it starts the aprun command with "aprun" rather than:

<executable args="default">ddt --connect aprun</executable>

Is there an environment variable in the python scripts that can reference the above XML executable name instead of just "aprun"?

@jedwards4b
Contributor

It looks like aprun.py is hardcoding "aprun" in the argument string it builds. Instead of returning "aprun" plus the arguments, it should return just the arguments and use executable for the command.
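
A minimal sketch of that idea (names come from the existing code; the diff posted below implements essentially the same thing):

    # case.py: "executable" holds whatever the <executable> element in
    # env_mach_specific.xml contains, e.g. "ddt --connect aprun"
    executable, args = env_mach_specific.get_mpirun(self, mpi_attribs, job=job)

    # aprun.py should build only the argument string (no leading "aprun"),
    # and case.py should prepend whatever executable the XML specifies
    if "aprun" in executable:
        return executable + " " + get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix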

@jedwards4b
Contributor

@mrnorman Here is a branch off the current master with a fix; I don't have access to Titan to test it:
To git@github.com:jedwards4b/cime.git

  • [new branch] aprun_fix -> aprun_fix

@mrnorman
Contributor Author

mrnorman commented Apr 19, 2017

I tested the following diff, and it succeeded on Titan. Is it basically the same as what you implemented?

diff --git a/cime/utils/python/CIME/aprun.py b/cime/utils/python/CIME/aprun.py
index a01d6fb..95acda3 100755
--- a/cime/utils/python/CIME/aprun.py
+++ b/cime/utils/python/CIME/aprun.py
@@ -64,7 +64,7 @@ def _get_aprun_cmd_for_case_impl(ntasks, nthreads, rootpes, pstrids,
 
     # Compute task and thread settings for batch commands
     tasks_per_node, task_count, thread_count, max_thread_count, total_node_count, aprun = \
-        0, 1, maxt[0], maxt[0], 0, "aprun"
+        0, 1, maxt[0], maxt[0], 0, ""
     for c1 in xrange(1, total_tasks):
         if maxt[c1] != thread_count:
             tasks_per_node = min(pes_per_node, max_tasks_per_node / thread_count)
diff --git a/cime/utils/python/CIME/case.py b/cime/utils/python/CIME/case.py
index 33d589e..b078349 100644
--- a/cime/utils/python/CIME/case.py
+++ b/cime/utils/python/CIME/case.py
@@ -1098,8 +1098,8 @@ class Case(object):
         executable, args = env_mach_specific.get_mpirun(self, mpi_attribs, job=job)
 
         # special case for aprun
-        if executable == "aprun":
-            return get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
+        if "aprun" in executable:
+            return executable + " " + get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
         else:
             mpi_arg_string = " ".join(args.values())
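
With both changes applied, the composed run command on Titan should come out roughly like this (the expected shape, not an actual log line; flags and paths taken from the outputs quoted above):

    ddt --connect aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1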

@jedwards4b
Contributor

yes, that's basically the same thing

@mrnorman
Contributor Author

Sweet. So, I'm not sure how this works with the CIME merges into ACME, but I assume we'll need to patch this in ACME's CIME for now until we get your CIME patch downwind at a later point?

ghost assigned jedwards4b Apr 19, 2017
ghost added the in progress label Apr 19, 2017
ghost removed the in progress label Apr 20, 2017
jgfouca pushed a commit that referenced this issue on Jun 2, 2017 (and again on Feb 23, 2018 and Mar 13, 2018):
Add entry for ne120np4_oRRS18to6v3_ICG to config_grids.xml

This PR adds an entry for the ne120np4_oRRS18to6v3_ICG grid to
config_grids.xml, which was inadvertently left out from the high-res
configuration. It also includes a domain definition for the
corresponding ocn/ice grid, oRRS18to6v3_ICG.

[BFB]