Unable to use DDT with aprun on Titan #1384
This is related to issue #643, and we would prefer to have a generic debug option rather than one specific to a single system. Currently the best way to launch a debugger is outside of the CIME scripts.
How does one launch a debugger outside the CIME scripts but with a model built by create_newcase?
It would be significantly easier for everyone if we could just alter the aprun command without it breaking CIME. We need some way of supporting the reverse connect feature of DDT, especially for large debug jobs.
@jedwards4b the reverse connect method is what the Allinea people recommend; they may start deprecating the method you give above. In practice I've found the above approach problematic for CESM: it ties the debugging very closely to the specific case and to getting all the options set exactly right, especially if you aren't just using the Intel compiler. Reverse connect can set the right number of tasks and threads, as shown in #643, and can make it much easier to set up DDT.
I'd like to try the method @ekluzek recommended in the other issue, but my parameters for aprun on Titan will be different from that example. The aprun parameters for Titan are commented out in cime_config/acme/machines/config_machines.xml, so it's not clear to me where these flags actually come from, and I don't know where to copy and paste from.
Okay, I'm up to speed now. @mrnorman the problem is on lines 152 and 1122 of case.py, which should be changed so that the aprun arguments are not dropped when the run command is built.
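A hedged sketch of the kind of change being described (the names below are hypothetical, not the actual case.py source): the symptom is that the aprun flags vanish once the launcher string is customized, which is what happens if the argument lookup is keyed on the literal string "aprun" rather than on the machine's configured launcher.

```python
# Hypothetical illustration only -- not the real CIME case.py code.
# Problematic pattern: the aprun flags are only attached when the
# executable string is exactly "aprun", so any override
# (e.g. "ddt --connect aprun") silently loses them.
def get_mpirun_args_broken(mpirun_exe, machine_args):
    if mpirun_exe == "aprun":
        return machine_args
    return ""

# Sketch of the fix: resolve the machine arguments unconditionally and
# let the executable string be whatever env_mach_specific.xml defines.
def get_mpirun_args_fixed(mpirun_exe, machine_args):
    return machine_args
```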
Thanks Jim! I have an older version of CIME, but I found the correct line and changed it. This does restore the aprun flags. However, I still have the problem that the run command starts with "aprun" rather than with:
<executable args="default">ddt --connect aprun</executable>
Is there an environment variable in the Python scripts that can reference the above XML executable name instead of just "aprun"?
It looks like aprun.py is hardcoding "aprun" in its argument string. Instead of returning aprun plus the arguments, it should just return the aprun arguments and use the executable for the command.
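A minimal sketch of that split, using hypothetical names rather than the real aprun.py interface: the helper builds only the argument string, and the caller prepends whatever executable env_mach_specific.xml defines (plain aprun, or the ddt --connect aprun wrapper).

```python
# Hypothetical sketch -- names and signatures are illustrative, not the
# actual CIME aprun.py API.

def get_aprun_args(tasks_per_numa, total_tasks, tasks_per_node, threads_per_task):
    """Return only the aprun argument string, with no launcher name."""
    return "-S {} -n {} -N {} -d {}".format(
        tasks_per_numa, total_tasks, tasks_per_node, threads_per_task)

def build_run_command(executable, aprun_args, model_exe):
    """Prepend the configured executable from env_mach_specific.xml."""
    return " ".join([executable, aprun_args, model_exe])

# Example: with executable = "ddt --connect aprun" the flags are preserved:
# ddt --connect aprun -S 8 -n 16 -N 16 -d 1 bld/acme.exe
```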
@mrnorman Here is a branch from the current master with a fix. I don't have access to Titan to test this.
I tested the following diff, and it succeeded on Titan. Is it basically the same as what you implemented?
Yes, that's basically the same thing.
Sweet. So, I'm not sure how this works with the CIME merges into ACME, but I assume we'll need to patch this in ACME's CIME for now until we get your CIME patch downwind at a later point?
We determined the cause of the problem when trying to use DDT on Titan. For DDT to work, we need to call "ddt --connect aprun" instead of "aprun". I tried to do this through env_mach_specific.xml, changing line 122 from:
<executable args="default">aprun</executable>
to:
<executable args="default">ddt --connect aprun</executable>
But when I do this, the flags (e.g., -S8 -N16 -n16 -d1) all get dropped. The output from the standard "aprun" is:
run command is aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1
But the output from "ddt --connect aprun" is:
run command is ddt --connect aprun /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe >> acme.log.$LID 2>&1
The DDT job successfully connects to my debugger but, of course, fails because it's only running one MPI task. I need some way of specifying "ddt --connect aprun" instead of "aprun" that keeps those flags in play.
I should also mention that this is pretty critical because PGI said they'd make much better progress if they could use a debugger on Titan to diagnose what's going on with their compiler bugs.
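For reference, a minimal illustration (hypothetical variable names) of the composition this issue is asking for: take the executable from env_mach_specific.xml, keep the resolved aprun flags, and only then append the model executable, so the working command above gains the DDT wrapper without losing its flags.

```python
# Illustrative only; the values are taken from the run commands shown above.
executable = "ddt --connect aprun"  # from <executable args="default"> in env_mach_specific.xml
aprun_args = "-S 8 -n 16 -N 16 -d 1"
model_exe = "/lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe"

run_cmd = " ".join([executable, aprun_args, model_exe])
print(run_cmd)
# ddt --connect aprun -S 8 -n 16 -N 16 -d 1 /lustre/atlas1/stf006/scratch/imn/enable_debugger17.3/bld/acme.exe
```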