-
Notifications
You must be signed in to change notification settings - Fork 383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix physgrid issues with PE layout and aqua restarts #3204
Conversation
modified: components/cam/src/dynamics/se/inidat.F90
modified: components/cam/src/dynamics/se/dyn_grid.F90 modified: components/cam/src/dynamics/se/fv_physics_coupling_mod.F90 modified: components/cam/src/dynamics/se/inidat.F90
modified: components/cam/src/dynamics/se/inidat.F90
modified: components/cam/src/dynamics/se/fv_physics_coupling_mod.F90
modified: cime/config/e3sm/allactive/config_pesall.xml
modified: components/cam/src/dynamics/se/inidat.F90
modified: cime/config/e3sm/allactive/config_pesall.xml
@@ -475,6 +471,13 @@ subroutine read_inidat( ncid_ini, ncid_topo, dyn_in) | |||
end if | |||
|
|||
if (fv_nphys == 0) then | |||
! Map phis data to FV physics grid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This appears not to be right. The conditional at line 473 is fv_nphys == 0
, but lines 474 to 479 then perform physgrid operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, shoot, good catch. That's a pretty dumb mistake.
Now I'm curious how my restart tests even passed...
else | ||
fieldname = 'PHIS' | ||
tmp(:,1,:) = 0.0_r8 | ||
if (fv_nphys > 0) then | ||
! Copy phis field to GLL grid | ||
! Load phis field to GLL grid first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this comment is incorrect when separated from the fv_phys_to_dyn_topo function calls. Here, infld is loading physgrid topo data, not GLL data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I'll correct that.
modified: components/cam/src/dynamics/se/inidat.F90
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't comment on the config_pesall.xml changes; I can guess why the underscore is needed but I don't have sufficiently detailed knowledge of the pes mechanics to be sure of the changes. The inidat changes look fine now. The other F90 changes are specific to the LO remap code, and I know you've tested these a fair bit already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. Do you check and make sure there were no other locations where the phys_tmp
variable should be changed to phis_tmp
?
@AaronDonahue , I couldn't find any other instances of phys_tmp |
modified: components/cam/src/dynamics/se/inidat.F90
modified: components/cam/src/dynamics/se/inidat.F90
@whannah1 I see some recent commits, are you still working on this? |
@oksanaguba, I noticed a couple minor things that don't seem to have any effect (when testing on Cori at least). I think this branch is still good to go, but I was gonna ask @AaronDonahue to give it a quick look over again. I just need another day or two to verify these fixes also work for the ECP fork, because there's a (perhaps different?) restart issue there as well that I'm struggling to understand. |
modified: components/cam/src/dynamics/se/inidat.F90
Ok, so the problem that I thought I had appears to be a non-issue. @AaronDonahue, can you take a quick look at the recent changes? Afterwards this PR should be ready to go. |
@AaronDonahue this needs another review from you. |
modified: components/cam/bld/namelist_files/namelist_defaults_cam.xml
2nding @rljacob request for another review from @AaronDonahue ! |
My apologies, I've been in and out of travel and missed this pinging of me. I'll get on this right away. |
@oksanaguba , what's the run length of this test? |
memory output, master and then PR:
|
Are the number of processes the same in master and PR runs? Maybe the PE XML changes are somehow (erroneously) changing that, giving fewer processes with consequently more memory per process. |
Some sanity checks: both clones are from yesterday's master. Ranks: Both have global communicator with 144 ranks, both homme outputs have 96 ranks, 1 element per rank. |
@whannah1 tests ERS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2 ran for < 2 minutes. |
@oksana I meant the run length, since it's not specified in the test name. What's the default run length for an ERS test? |
11 days for that ERS test. |
This might be a dumb question, but could a change in the submodules explain the change in memory use? Maybe one of the clones didn't sync those properly? |
Both runs were done on anvil in the same terminal window. But I will resubmit PR run now. I also looked at some timers and all looks ok. |
I misread your question. submodules. I will resync clones. |
Baselines clone had older commits in submodules, tests are running now. |
ok, let's hope that's the explanation. |
Here are two runs from master, (first) yesterday with older commits for mpp, mpas, and scorpio, and today's the same clone with synced modules (second):
|
If I'm reading these data correctly, (1) the second has >2x larger "max memory highwater" and >5x highwater on rank 0 but (2) the only difference is submodule SHA1s. Do others agree? If I have those two points right, then there seems to be either noise or a real issue due to a submodule change but likely nothing related to this PR. |
yes, has nothing to do with this pr. I will make an issue. |
Reran tests, e3sm_atm_integration, against master with updated clone. all baselines pass, memory issues are not from this pr. |
that's good, thanks for all the extra effort! |
… (PR #3204) This fixes an issues where aqua planet cases were crashing on restart due to not correctly setting the topography to zero. Instead the realistic topography was added on restart, which caused the run to crash. This PR resolves this issue by explicitly resetting the topography to zero when restarting a physgrid run. This also fixes an issue running certain physgrid configurations, such as ne4pg2, because the default NTASK was set higher than the number of physics columns. To get the physgrid specification to match correctly in config_pesall.xml an underscore was to be added to other grid query strings. This should not affect non-physgrid cases. The underscore is used because it always separates the atmosphere grid from the other grids. For example, A typical ne4 grid string is a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 But the string for ne4pg2 is a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 so in order to differentiate the grids we use "ne4np4_" and "ne4np4.pg2" to get different default PE layouts. Using "ne4np4" without the underscore leads to an error when running physgrid cases. After some discussion we decided to also include a change to the default physgrid mapping algorithm to use the high-order option. This will make all pg2 tests non-BFB. [BFB] except for SMS_Ln5.ne4pg2_ne4pg2.FC5AV1C-L
merged to next |
melvin does not allow 96 cores for pg2 runs, limit is 60. a fix was pushed, I need to wait for another day. |
… (PR #3204) This fixes an issues where aqua planet cases were crashing on restart due to not correctly setting the topography to zero. Instead the realistic topography was added on restart, which caused the run to crash. This PR resolves this issue by explicitly resetting the topography to zero when restarting a physgrid run. This also fixes an issue running certain physgrid configurations, such as ne4pg2, because the default NTASK was set higher than the number of physics columns. To get the physgrid specification to match correctly in config_pesall.xml an underscore was to be added to other grid query strings. This should not affect non-physgrid cases. The underscore is used because it always separates the atmosphere grid from the other grids. For example, A typical ne4 grid string is a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 But the string for ne4pg2 is a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 so in order to differentiate the grids we use "ne4np4_" and "ne4np4.pg2" to get different default PE layouts. Using "ne4np4" without the underscore leads to an error when running physgrid cases. After some discussion we decided to also include a change to the default physgrid mapping algorithm to use the high-order option. This will make all pg2 tests non-BFB. [BFB] except for SMS_Ln5.ne4pg2_ne4pg2.FC5AV1C-L tests
remerged to next |
…er (PR #3204) This fixes an issues where aqua planet cases were crashing on restart due to not correctly setting the topography to zero. Instead the realistic topography was added on restart, which caused the run to crash. This PR resolves this issue by explicitly resetting the topography to zero when restarting a physgrid run. This also fixes an issue running certain physgrid configurations, such as ne4pg2, because the default NTASK was set higher than the number of physics columns. To get the physgrid specification to match correctly in config_pesall.xml an underscore was to be added to other grid query strings. This should not affect non-physgrid cases. The underscore is used because it always separates the atmosphere grid from the other grids. For example, A typical ne4 grid string is a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 But the string for ne4pg2 is a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240 so in order to differentiate the grids we use "ne4np4_" and "ne4np4.pg2" to get different default PE layouts. Using "ne4np4" without the underscore leads to an error when running physgrid cases. After some discussion we decided to also include a change to the default physgrid mapping algorithm to use the high-order option. This will make all pg2 tests non-BFB. [BFB] except for SMS_Ln5.ne4pg2_ne4pg2.FC5AV1C-L tests
…id-new-topo-file-format Conflicts: components/cam/src/dynamics/se/inidat.F90 Merge master to get PR E3SM-Project#3204; resolve the conflict in inidat concerning reading PHIS(_d) from the topo file.
This fixes an issues where aqua planet cases were crashing on restart due to not correctly setting the topography to zero. Instead the realistic topography was added on restart, which caused the run to crash. This PR resolves this issue by explicitly resetting the topography to zero when restarting a physgrid run.
This also fixes an issue running certain physgrid configurations, such as ne4pg2, because the default NTASK was set higher than the number of physics columns. To get the physgrid specification to match correctly in config_pesall.xml an underscore was to be added to other grid query strings. This should not affect non-physgrid cases. The underscore is used because it always separates the atmosphere grid from the other grids. For example, A typical ne4 grid string is
a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240
But the string for ne4pg2 is
a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240
so in order to differentiate the grids we use "ne4np4_" and "ne4np4.pg2" to get different default PE layouts. Using "ne4np4" without the underscore leads to an error when running physgrid cases.
After some discussion we decided to also include a change to the default physgrid mapping algorithm to use the high-order option. This will make all pg2 tests non-BFB.
[BFB] except for SMS_Ln5.ne4pg2_ne4pg2.FC5AV1C-L