Unable to use DDT with aprun on Titan #1384

Closed
mrnorman opened this issue Apr 19, 2017 · 14 comments

@mrnorman
Contributor

mrnorman commented Apr 19, 2017

We determined the cause of the problem when trying to use DDT on Titan. For DDT to work, we need to call "ddt --connect aprun" instead of "aprun". I tried to do this through env_mach_specific.xml, changing line 122 from:

    aprun

to:

    ddt --connect aprun
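
For context, the element being edited looks roughly like this before and after the change (a sketch; the args attribute is taken from the snippet quoted later in this thread):

    <executable args="default">aprun</executable>
    <executable args="default">ddt --connect aprun</executable>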

But when I do this, the flags (e.g., -S8 -N16 -n16 -d1) all get dropped. The output from the standard "aprun" is:

run command is aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1

But the output from "ddt --connect aprun" is:

run command is ddt --connect aprun /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1

The DDT job successfully connects to my debugger but, of course, fails because it's only running one MPI task. I need some way of specifying "ddt --connect aprun" instead of "aprun" that keeps those flags in play.

I should also mention that this is pretty critical because PGI said they'd make much better progress if they could use a debugger on Titan to diagnose what's going on with their compiler bugs.

@mrnorman
Contributor Author

@rljacob @jgfouca

@jedwards4b
Contributor

This is related to issue #643, and we would prefer a generic debug option rather than one specific to a single system. Currently the best way to launch a debugger is outside of the cime scripts.

@rljacob
Member

rljacob commented Apr 19, 2017

How does one launch a debugger outside the cime scripts but with a model built by create_newcase?

@jedwards4b
Contributor

  1. Build the model executable.
  2. Set the environment by sourcing .env_mach_specific.[sh,csh].
  3. Launch ddt from the command line (see the sketch below); the rest is done in the ddt GUI.
  4. Set the run directory to the model run dir.
  5. Set the executable.
  6. Set the proper number of tasks and threads.
  7. Start the job.
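
In shell terms, the first three steps might look roughly like this (a sketch, assuming a bash shell and a CIME5-style case directory created by create_newcase; the settings in steps 4-7 are then done interactively in the GUI):

    cd $CASEDIR
    ./case.build                    # 1. build the model executable
    source ./.env_mach_specific.sh  # 2. load the machine environment for the case
    ddt &                           # 3. start the DDT GUI and configure the run there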

@mrnorman
Contributor Author

It would be significantly easier for everyone if we could just alter the aprun command without it breaking CIME. We need some way of supporting the reverse connect feature of DDT, especially for large debug jobs.

@ekluzek
Contributor

ekluzek commented Apr 19, 2017

@jedwards4b the reverse connect method is what the Allinea people recommend, and they may start deprecating the method you describe above. In practice I've found that approach problematic for CESM: it ties the debugging very closely to the specific case and to getting all the options set exactly right, especially if you're using a compiler other than Intel.

Reverse connect can set the right number of tasks and threads, as shown in #643, and makes it much easier to set up DDT.
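
For reference, the reverse connect workflow looks roughly like this (a sketch of my understanding; exact invocations may differ by DDT version):

    # On the login node (or via the DDT remote client): start the GUI and
    # leave it waiting to accept a reverse connection.
    ddt &

    # In the batch job, prefix the normal launcher with "ddt --connect";
    # with the fix discussed below, the usual aprun flags are preserved.
    ddt --connect aprun -S 8 -n 16 -N 16 -d 1 ./acme.exe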

@mrnorman
Contributor Author

I'd like to try the method @ekluzek recommended in the other issue, but my aprun parameters on Titan will be different from that example. The aprun parameters for Titan are commented out in cime_config/acme/machines/config_machines.xml, so it's not clear to me where these flags are actually coming from, and I don't know where to copy and paste from.

@jedwards4b
Contributor

Okay, I'm up to speed now. @mrnorman the problem is on lines 152 and 1122 of case.py:

       if executable == "aprun":

should be

       if "aprun" in executable:

@mrnorman
Contributor Author

mrnorman commented Apr 19, 2017

Thanks Jim! I have an older version of CIME, but I found the correct line and changed it. This does restore the aprun flags. However, I still have the problem that it starts the aprun command with "aprun" rather than:

<executable args="default">ddt --connect aprun</executable>

Is there an environment variable in the python scripts that can reference the above XML executable name instead of just "aprun"?

@jedwards4b
Contributor

It looks like aprun.py is hardcoding "aprun" in the argument string it builds. Instead of returning "aprun" plus the arguments, it should return just the arguments and use executable for the command.
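
A minimal sketch of that idea (names come from the existing code; the diff posted below implements essentially the same thing):

    # case.py: "executable" holds whatever the <executable> element in
    # env_mach_specific.xml contains, e.g. "ddt --connect aprun"
    executable, args = env_mach_specific.get_mpirun(self, mpi_attribs, job=job)

    # aprun.py should build only the argument string (no leading "aprun"),
    # and case.py should prepend whatever executable the XML specifies
    if "aprun" in executable:
        return executable + " " + get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix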

@jedwards4b
Contributor

@mrnorman Here is a branch off the current master with a fix; I don't have access to Titan to test it:
To git@github.com:jedwards4b/cime.git

  • [new branch] aprun_fix -> aprun_fix

@mrnorman
Contributor Author

mrnorman commented Apr 19, 2017

I tested the following diff, and it succeeded on Titan. Is it basically the same as what you implemented?

diff --git a/cime/utils/python/CIME/aprun.py b/cime/utils/python/CIME/aprun.py
index a01d6fb..95acda3 100755
--- a/cime/utils/python/CIME/aprun.py
+++ b/cime/utils/python/CIME/aprun.py
@@ -64,7 +64,7 @@ def _get_aprun_cmd_for_case_impl(ntasks, nthreads, rootpes, pstrids,
 
     # Compute task and thread settings for batch commands
     tasks_per_node, task_count, thread_count, max_thread_count, total_node_count, aprun = \
-        0, 1, maxt[0], maxt[0], 0, "aprun"
+        0, 1, maxt[0], maxt[0], 0, ""
     for c1 in xrange(1, total_tasks):
         if maxt[c1] != thread_count:
             tasks_per_node = min(pes_per_node, max_tasks_per_node / thread_count)
diff --git a/cime/utils/python/CIME/case.py b/cime/utils/python/CIME/case.py
index 33d589e..b078349 100644
--- a/cime/utils/python/CIME/case.py
+++ b/cime/utils/python/CIME/case.py
@@ -1098,8 +1098,8 @@ class Case(object):
         executable, args = env_mach_specific.get_mpirun(self, mpi_attribs, job=job)
 
         # special case for aprun
-        if executable == "aprun":
-            return get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
+        if "aprun" in executable:
+            return executable + " " + get_aprun_cmd_for_case(self, run_exe)[0] + " " + run_misc_suffix
         else:
             mpi_arg_string = " ".join(args.values())
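
With both changes applied, the composed run command on Titan should come out roughly like this (the expected shape, not an actual log line; flags and paths taken from the outputs quoted above):

    ddt --connect aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1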

@jedwards4b
Contributor

yes, that's basically the same thing

@mrnorman
Contributor Author

Sweet. So, I'm not sure how this works with the CIME merges into ACME, but I assume we'll need to patch this in ACME's CIME for now until we get your CIME patch downwind at a later point?

ghost assigned jedwards4b Apr 19, 2017
ghost added the in progress label Apr 19, 2017
ghost removed the in progress label Apr 20, 2017
jgfouca pushed a commit that referenced this issue on Jun 2, 2017 (and again on Feb 23, 2018 and Mar 13, 2018):
Add entry for ne120np4_oRRS18to6v3_ICG to config_grids.xml

This PR adds an entry for the ne120np4_oRRS18to6v3_ICG grid to
config_grids.xml, which was inadvertently left out from the high-res
configuration. It also includes a domain definition for the
corresponding ocn/ice grid, oRRS18to6v3_ICG.

[BFB]