
Fix physgrid issues with PE layout and aqua restarts #3204

Merged
merged 18 commits into master from whannah/atm/fix-physgrid-aqua-restart
Nov 11, 2019

Conversation

Contributor

@whannah1 whannah1 commented Sep 25, 2019

This fixes an issue where aqua planet cases were crashing on restart because the topography was not being reset to zero; instead, the realistic topography was added back in on restart, which caused the run to crash. This PR resolves the issue by explicitly resetting the topography to zero when restarting a physgrid run.
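As a rough sketch of the fix, the control flow looks something like the following. This is a minimal, self-contained illustration: the names aqua_planet, fv_nphys, phis_tmp, infld, and fv_phys_to_dyn_topo mirror the discussion in this PR, but the restart_run flag, the exact condition, and the placeholder branches are assumptions, not the actual inidat.F90 diff.

! Illustrative only -- not the real read_inidat code.
program aqua_restart_topo_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  logical  :: aqua_planet = .true.     ! aqua-planet configuration
  logical  :: restart_run = .true.     ! restart rather than initial run
  integer  :: fv_nphys    = 2          ! >0 means the FV physics grid (pg2) is active
  real(r8) :: phis_tmp(10,1,6)         ! stand-in for the surface-geopotential work array

  phis_tmp = 1.0e3_r8                  ! pretend realistic topography was just read in

  if (aqua_planet .and. restart_run) then
     ! The bug: on restart, realistic topography was added back in instead of
     ! staying at zero.  The fix explicitly forces PHIS back to zero.
     phis_tmp = 0.0_r8
  else if (fv_nphys > 0) then
     ! physgrid case: read PHIS on the pgN grid, then map it to the GLL grid
     ! (infld + fv_phys_to_dyn_topo in the real code)
  else
     ! standard case: read PHIS directly on the GLL grid (infld)
  end if

  print *, 'max |PHIS| after restart handling =', maxval(abs(phis_tmp))
end program aqua_restart_topo_sketch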

This also fixes an issue with running certain physgrid configurations, such as ne4pg2, where the default NTASKS was set higher than the number of physics columns. To get the physgrid specification to match correctly in config_pesall.xml, an underscore had to be added to the other grid query strings; this should not affect non-physgrid cases. The underscore is used because it always separates the atmosphere grid from the other grids. For example, a typical ne4 grid string is
a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240
But the string for ne4pg2 is
a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240
so in order to differentiate the grids we use "ne4np4_" and "ne4np4.pg2" to get different default PE layouts. Using "ne4np4" without the underscore leads to an error when running physgrid cases.
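To see why the trailing underscore matters, here is a small stand-alone Fortran check against the two long names above. A plain substring test stands in for CIME's actual grid-matching logic, and the patterns are anchored with the a% prefix of the atmosphere grid for clarity, so treat the details as illustrative rather than the exact config_pesall.xml entries.

! Illustrative only -- CIME's real matching in config_pesall.xml is more involved.
program pe_layout_match_demo
  implicit none
  ! Full grid long names copied from the description above.
  character(len=*), parameter :: np4_name = &
       'a%ne4np4_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240'
  character(len=*), parameter :: pg2_name = &
       'a%ne4np4.pg2_l%ne4np4_oi%ne4np4_r%null_g%null_w%null_z%null_m%oQU240'

  ! Without the underscore the pattern hits both long names, so ne4pg2 could
  ! pick up the np4 PE layout (whose NTASKS exceeds the pg2 column count).
  print *, 'a%ne4np4      matches np4, pg2: ', hit(np4_name, 'a%ne4np4'),     hit(pg2_name, 'a%ne4np4')

  ! With the underscore (or the .pg2 suffix) each pattern selects exactly one grid.
  print *, 'a%ne4np4_     matches np4, pg2: ', hit(np4_name, 'a%ne4np4_'),    hit(pg2_name, 'a%ne4np4_')
  print *, 'a%ne4np4.pg2  matches np4, pg2: ', hit(np4_name, 'a%ne4np4.pg2'), hit(pg2_name, 'a%ne4np4.pg2')

contains

  logical function hit(longname, pattern)
    character(len=*), intent(in) :: longname, pattern
    hit = index(longname, pattern) > 0
  end function hit

end program pe_layout_match_demo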

After some discussion we also decided to change the default physgrid mapping algorithm to the high-order option. This will make all pg2 tests non-BFB.

[BFB] except for SMS_Ln5.ne4pg2_ne4pg2.FC5AV1C-L

	modified:   components/cam/src/dynamics/se/inidat.F90
	modified:   components/cam/src/dynamics/se/dyn_grid.F90
	modified:   components/cam/src/dynamics/se/fv_physics_coupling_mod.F90
	modified:   components/cam/src/dynamics/se/inidat.F90
	modified:   components/cam/src/dynamics/se/inidat.F90
	modified:   components/cam/src/dynamics/se/fv_physics_coupling_mod.F90
	modified:   cime/config/e3sm/allactive/config_pesall.xml
	modified:   components/cam/src/dynamics/se/inidat.F90
	modified:   cime/config/e3sm/allactive/config_pesall.xml
@@ -475,6 +471,13 @@ subroutine read_inidat( ncid_ini, ncid_topo, dyn_in)
end if

if (fv_nphys == 0) then
! Map phis data to FV physics grid
Member

This appears not to be right. The conditional at line 473 is fv_nphys == 0, but lines 474 to 479 then perform physgrid operations.

Contributor Author

@whannah1 whannah1 Sep 25, 2019

oh, shoot, good catch. That's a pretty dumb mistake.
Now I'm curious how my restart tests even passed...

else
fieldname = 'PHIS'
tmp(:,1,:) = 0.0_r8
if (fv_nphys > 0) then
! Copy phis field to GLL grid
! Load phis field to GLL grid first
Member

I believe this comment is incorrect when separated from the fv_phys_to_dyn_topo function calls. Here, infld is loading physgrid topo data, not GLL data.

Contributor Author

Right, I'll correct that.

	modified:   components/cam/src/dynamics/se/inidat.F90
Member

@ambrad ambrad left a comment

I can't comment on the config_pesall.xml changes; I can guess why the underscore is needed but I don't have sufficiently detailed knowledge of the pes mechanics to be sure of the changes. The inidat changes look fine now. The other F90 changes are specific to the LO remap code, and I know you've tested these a fair bit already.

@rljacob rljacob assigned oksanaguba and unassigned rljacob Sep 25, 2019
Contributor

@AaronDonahue AaronDonahue left a comment

It looks good to me. Did you check to make sure there were no other locations where the phys_tmp variable should be changed to phis_tmp?

@whannah1
Contributor Author

@AaronDonahue, I couldn't find any other instances of phys_tmp.

	modified:   components/cam/src/dynamics/se/inidat.F90
	modified:   components/cam/src/dynamics/se/inidat.F90
@oksanaguba
Contributor

@whannah1 I see some recent commits, are you still working on this?

@whannah1
Contributor Author

whannah1 commented Oct 1, 2019

@oksanaguba, I noticed a couple minor things that don't seem to have any effect (when testing on Cori at least). I think this branch is still good to go, but I was gonna ask @AaronDonahue to give it a quick look over again. I just need another day or two to verify these fixes also work for the ECP fork, because there's a (perhaps different?) restart issue there as well that I'm struggling to understand.

	modified:   components/cam/src/dynamics/se/inidat.F90
@whannah1
Contributor Author

whannah1 commented Oct 3, 2019

Ok, so the problem that I thought I had appears to be a non-issue. @AaronDonahue, can you take a quick look at the recent changes? Afterwards this PR should be ready to go.

@rljacob
Member

rljacob commented Oct 10, 2019

@AaronDonahue this needs another review from you.

	modified:   components/cam/bld/namelist_files/namelist_defaults_cam.xml
@mt5555
Contributor

mt5555 commented Oct 31, 2019

2nding @rljacob's request for another review from @AaronDonahue!

@AaronDonahue
Contributor

My apologies, I've been in and out of travel and missed the pings. I'll get on this right away.

@whannah1
Contributor Author

whannah1 commented Nov 6, 2019

@oksanaguba , what's the run length of this test?
ERS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2

@oksanaguba
Contributor

memory output, master and then PR:

(base) [onguba@blueslogin4 run]$ gunzip cpl.log.88244.191105-170925.gz
(base) [onguba@blueslogin4 run]$ grep highwater cpl.log.88244.191105-170925
 memory_write: model date =   00010102       0 memory =  156081.60 MB (highwater)      58845.44 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010103       0 memory =  156081.92 MB (highwater)      59142.72 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010104       0 memory =  156081.92 MB (highwater)      59145.92 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010105       0 memory =  156081.92 MB (highwater)      59150.08 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010106       0 memory =  156081.92 MB (highwater)      59324.48 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
(seq_mct_drv): ===============  pes min memory highwater  (MB)    1120.967  ===============
(seq_mct_drv): ===============  pes max memory highwater  (MB)  465381.000  ===============
(base) [onguba@blueslogin4 run]$
(base) [onguba@blueslogin4 run]$ cd ../../SMS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2.C.20191105_172101_ewo6rl/run/
(base) [onguba@blueslogin4 run]$ gunzip cpl.log.88278.191105-174828.gz
(base) [onguba@blueslogin4 run]$ grep highwater cpl.log.88278.191105-174828
 memory_write: model date =   00010102       0 memory =  185810.67 MB (highwater)      71104.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010103       0 memory =  185811.05 MB (highwater)      71373.33 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010104       0 memory =  185811.05 MB (highwater)      71435.81 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010105       0 memory =  185811.05 MB (highwater)      71438.48 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010106       0 memory =  185811.05 MB (highwater)      71691.05 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
(seq_mct_drv): ===============  pes min memory highwater  (MB)    1109.342  ===============
(seq_mct_drv): ===============  pes max memory highwater  (MB)  248205.867  ===============

@ambrad
Member

ambrad commented Nov 6, 2019

Is the number of processes the same in the master and PR runs? Maybe the PE XML changes are somehow (erroneously) changing that, giving fewer processes and consequently more memory per process.

@oksanaguba
Contributor

Some sanity checks: both clones are from yesterday's master.

Ranks: both runs have a global communicator with 144 ranks, and both homme outputs show 96 ranks, 1 element per rank.

@oksanaguba
Contributor

@whannah1 the ERS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2 test ran for < 2 minutes.

@whannah1
Contributor Author

whannah1 commented Nov 6, 2019

@oksana I meant the run length, since it's not specified in the test name. What's the default run length for an ERS test?

@oksanaguba
Contributor

11 days for that ERS test.

@whannah1
Contributor Author

whannah1 commented Nov 6, 2019

This might be a dumb question, but could a change in the submodules explain the change in memory use? Maybe one of the clones didn't sync those properly?

@oksanaguba
Contributor

Both runs were done on anvil in the same terminal window, but I will resubmit the PR run now. I also looked at some timers and all looks ok.

@oksanaguba
Contributor

I misread your question: submodules. I will resync the clones.

@oksanaguba
Contributor

The baselines clone had older commits in its submodules; tests are running now.

@whannah1
Contributor Author

whannah1 commented Nov 6, 2019

ok, let's hope that's the explanation.

@oksanaguba
Contributor

oksanaguba commented Nov 6, 2019

Here are two runs from master: the first from yesterday with older commits for mpp, mpas, and scorpio, and the second from today using the same clone with synced submodules:

(base) [onguba@blueslogin2 run]$ cd ../../SMS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2.G.20191105_164323_gq0oc1/run/
(base) [onguba@blueslogin2 run]$ grep highwater cpl.log.*
 memory_write: model date =   00010102       0 memory =  156081.60 MB (highwater)      58845.44 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010103       0 memory =  156081.92 MB (highwater)      59142.72 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010104       0 memory =  156081.92 MB (highwater)      59145.92 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010105       0 memory =  156081.92 MB (highwater)      59150.08 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010106       0 memory =  156081.92 MB (highwater)      59324.48 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
(seq_mct_drv): ===============  pes min memory highwater  (MB)    1120.967  ===============
(seq_mct_drv): ===============  pes max memory highwater  (MB)  465381.000  ===============
(base) [onguba@blueslogin2 run]$ cd ../../SMS.ne4_ne4.FC5AV1C-L.anvil-centos7_intel.cam-thetanh_ftype2.G.20191106_101124_ih5vi9/run/
(base) [onguba@blueslogin2 run]$ grep highwater cpl.log.*
 memory_write: model date =   00010102       0 memory =  975524.00 MB (highwater)     370904.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010103       0 memory =  975526.00 MB (highwater)     372420.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010104       0 memory =  975526.00 MB (highwater)     372646.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010105       0 memory =  975526.00 MB (highwater)     372672.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
 memory_write: model date =   00010106       0 memory =  975526.00 MB (highwater)     373930.00 MB (usage)  (pe=    0 comps= cpl ATM ICE GLC WAV IAC ESP)
(seq_mct_drv): ===============  pes min memory highwater  (MB)    1111.784  ===============
(seq_mct_drv): ===============  pes max memory highwater  (MB) 1249578.667  ===============

@ambrad
Member

ambrad commented Nov 6, 2019

If I'm reading these data correctly, (1) the second has >2x larger "max memory highwater" and >5x highwater on rank 0 but (2) the only difference is submodule SHA1s. Do others agree? If I have those two points right, then there seems to be either noise or a real issue due to a submodule change but likely nothing related to this PR.

@oksanaguba
Contributor

Yes, this has nothing to do with this PR. I will open an issue.

@oksanaguba
Contributor

Reran the e3sm_atm_integration tests against master with the updated clone. All baselines pass; the memory issues are not from this PR.

@whannah1
Contributor Author

whannah1 commented Nov 6, 2019

that's good, thanks for all the extra effort!

oksanaguba added a commit that referenced this pull request Nov 7, 2019
… (PR #3204)

@oksanaguba
Contributor

merged to next

@oksanaguba
Contributor

Melvin does not allow 96 cores for pg2 runs; the limit is 60. A fix was pushed, so I need to wait another day.

oksanaguba added a commit that referenced this pull request Nov 8, 2019
… (PR #3204)

@oksanaguba
Contributor

remerged to next

oksanaguba added a commit that referenced this pull request Nov 11, 2019
…er (PR #3204)

@oksanaguba oksanaguba merged commit 01ee915 into master Nov 11, 2019
@oksanaguba oksanaguba deleted the whannah/atm/fix-physgrid-aqua-restart branch November 11, 2019 17:54
ambrad added a commit to ambrad/E3SM that referenced this pull request Nov 11, 2019
…id-new-topo-file-format

Conflicts:
	components/cam/src/dynamics/se/inidat.F90

Merge master to get PR E3SM-Project#3204; resolve the conflict in inidat concerning reading
PHIS(_d) from the topo file.
@oksanaguba oksanaguba mentioned this pull request Dec 5, 2019