A_WCYCL2000 ne120_oRRS15: mapping error (Cori, Mira, and Titan) #864

Closed
worleyph opened this issue Apr 24, 2016 · 67 comments

Comments

@worleyph
Contributor

I've been trying to find feasible PE layouts for

 -compset A_WCYCL2000 -res ne120_oRRS15

on Cori. I started with a small (1024x1, stacked, noHT) layout, which failed. I then tried (2048x1, stacked, noHT), and most recently 3600x1 for the atmosphere, with the coupler and land at 3616x1 and the other components on their own compute nodes using a 2048x1 decomposition. Again, this is all noHT:

 <entry id="MAX_TASKS_PER_NODE"   value="32"  />

I am getting the identical error for all three of these. From cesm.log:

 0000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
 0000: 000.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
 0000: Rank 0 [Sat Apr 23 19:39:16 2016] [c1-0c1s13n1] application called MPI_Abort(MPI_COMM_WORLD, 2) - process 0
 0000: forrtl: error (76): Abort trap signal
 ...
 0000: cesm.exe           0000000002F2670F  m_dropdead_mp_die          87  m_dropdead.F90
 0000: cesm.exe           0000000002F258DF  m_die_mp_die2__           165  m_die.F90
 0000: cesm.exe           0000000002EA8106  m_globalsegmap_mp        2433  m_GlobalSegMap.F90
 0000: cesm.exe           0000000002EF9A08  m_router_mp_initp         364  m_Router.F90
 0000: cesm.exe           0000000002EECE6C  m_rearranger_mp_i         153  m_Rearranger.F90
 0000: cesm.exe           0000000002EE3FDF  m_sparsematrixplu         522  m_SparseMatrixPlus.F90
 0000: cesm.exe           0000000002C52CF3  shr_mct_mod_mp_sh         355  shr_mct_mod.F90
 0000: cesm.exe           00000000004B94DD  seq_map_mod_mp_se         191  seq_map_mod.F90
 0000: cesm.exe           000000000044BC91  prep_ocn_mod_mp_p         259  prep_ocn_mod.F90
 0000: cesm.exe           0000000000411C7A  cesm_comp_mod_mp_        1582  cesm_comp_mod.F90

cpl.log ends with

 (seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, glc, wav
 (component_init_cc:mct) : Initialize component atm
 (component_init_cc:mct) : Initialize component lnd
 (component_init_cc:mct) : Initialize component rof
 (component_init_cc:mct) : Initialize component ocn
 (component_init_cc:mct) : Initialize component ice
 (component_init_cc:mct) : Initialize component glc
 (component_init_cc:mct) : Initialize component wav

 ...

 (prep_ocn_init) : Initializing mapper_Sa2o
 (seq_map_init_rcfile)  called for mapper_Sa2o initialization

 (shr_mct_sMatPInitnc) Initializing SparseMatrixPlus
 (shr_mct_sMatPInitnc) SmatP mapname /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatPInitnc) SmatP maptype X
 (shr_mct_sMatReaddnc) reading mapping matrix data decomposed...
 (shr_mct_sMatReaddnc) * file name                  : /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502
 (shr_mct_sMatReaddnc) ... done reading file

I'll try another increase in compute nodes for the ocean, but if anyone has any other suggestions, I'd appreciate it. Note that I (personally) do not have this compset working anywhere yet. On Titan there is a failure in ice or ocean initialization, so earlier in the execution than this.

@mt5555
Contributor

mt5555 commented Apr 25, 2016

Using pure MPI does require more memory, and when trying to fit 1/4 degree onto this many nodes, memory is one of the concerns. Hence, what about an x4 or x8 (threaded) configuration? I think @amametjanov may have experience with this on Mira (trying to find working configurations that use low thread counts but still fit into memory).

@worleyph
Contributor Author

worleyph commented Apr 25, 2016

Thanks. I increased to 5400x1 (noHT), with exactly the same error. I think that @ndkeen indicated that he had run an F case with this decomposition on Cori. I'll try even larger when I get the chance, but will also continue debugging at 5400x1 (since a job seems to get through the queue at least once per day).
@rljacob , latest information is from a debug statement added to initd_ in m_GlobalSegMap.F90. Here

0000: NGSEG = 0
0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0

so ngseg is zero (and not negative) after coming out of the loop:

      ! sum the per-process segment counts into ngseg and
      ! build the corresponding displacement offsets
      ngseg = 0
      do i=0,npes-1
         ngseg = ngseg + counts(i)
         if(i == 0) then
            displs(i) = 0
         else
            displs(i) = displs(i-1) + counts(i-1)
         endif
      end do

I assume that this means that counts(:) == 0, but I'll verify as well.
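
(A minimal way to verify that, assuming MCT can be rebuilt with a debug print: dump the gathered counts right after the loop. The snippet below is illustrative; ngseg, counts, npes, and i are the variables from the loop above.)

      if (ngseg == 0) then
         ! illustrative debug print: show each per-process segment count
         do i = 0, npes-1
            write(6,*) 'initp_ debug: counts(', i, ') = ', counts(i)
         end do
      endif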

@jonbob
Contributor

jonbob commented Apr 25, 2016

@worleyph The maps for ne120np4_oRRS15to5 are largely untested. I also ended up with a map for a different resolution that turned out to be bad, even though the mapping tools gave no warning or error when creating it. So just a heads up...

@worleyph
Contributor Author

@jonbob, how would I determine whether this is the source of my problems? @rljacob , has anyone run this compset and grid resolution successfully?

@jonbob
Contributor

jonbob commented Apr 25, 2016

@worleyph : we might have to build up to the full A_WCYCL compset. If you don't figure out the problem, maybe first make sure atm/lnd compsets work, and ocn/ice as well. And then we could try all the active components together...

@amametjanov
Member

Yes, I'd increase thread counts and also increase pio stride; it looks like a re-arranger problem.

@rljacob
Member

rljacob commented Apr 25, 2016

No one has run this compset/resolution yet. @worleyph try just the F-case first on Titan. See https://acme-climate.atlassian.net/browse/CSG-163

@worleyph
Contributor Author

worleyph commented Apr 25, 2016

@rljacob - already ran an F case on Titan (successfully), a couple of weeks ago.

@amametjanov
Member

I'm also getting the same error on Mira:

5218: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
5218: ***.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
5218: Abort(2) on node 5218 (rank 5218 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5218

Tried 3 different PE layouts and PIO settings and all show the ngseg error.

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

@rljacob
Member

rljacob commented Apr 27, 2016

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

@worleyph
Contributor Author

@amametjanov, since you are seeing this also, I assume that we can eliminate memory problems (if only because memory problems tend not to have the same signature on Mira and Cori).

@rljacob ,

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

Does this imply a bad map then? Should I keep trying to debug this, or can this be addressed some other way? I've tracked it into the call to

 lsize_ = AttrVect_lsize(sMat%data)

where AttrVect_lsize has

   List_allocated(aV%iList) = .true.
   associated(aV%iAttr) = .true.
   size(aV%iAttr,2) = 0

   List_allocated(aV%rList) = .true.
   associated(aV%rAttr) = .true.
   size(aV%rAttr,2) = 0

I'm trying to work backwards from the sMat for this call, and it is taking some time (waiting in the Cori queue).
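
(For context on those values: in MCT, AttrVect_lsize essentially returns the number of columns of the attribute arrays, so a simplified sketch of what the call above computes is just the size query below; size(aV%rAttr,2) = 0 therefore means the sparse matrix holds zero elements on that process.)

    ! simplified sketch, not the actual MCT source
    lsize_ = size(sMat%data%rAttr, 2)   ! = 0 here, i.e. no local matrix entries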

@rljacob
Member

rljacob commented Apr 27, 2016

Yes it likely implies a bad map.

@worleyph
Contributor Author

And how do we figure this out? Is there a way to do this outside of running the model? It sounds like I am wasting my time continuing with my current approach.

@worleyph worleyph changed the title A_WCYCL2000 ne120_oRRS15 on Cori: m_GlobalSegMap error A_WCYCL2000 ne120_oRRS15: m_GlobalSegMap error (Cori, Mira, and Titan) Apr 27, 2016
@rljacob
Member

rljacob commented Apr 27, 2016

Actually, from the cpl.log you pasted, it read the basic parameters of the map correctly:

 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

@worleyph
Contributor Author

My latest debug writes made the ocn/ice init error on Titan disappear. It then died in the same location as I saw on Cori and @amametjanov saw on Mira. The Titan PE layout was 2700x4. So, this is persistent across architectures.

@worleyph
Contributor Author

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

What is a node here? lsize is zero for all processes for this map.

@jonbob
Contributor

jonbob commented Apr 27, 2016

@worleyph let me take another look at these maps -- we had another one that I made around the same time turn out to be bad -- despite getting no errors or warnings from the tools that generated them

@worleyph
Contributor Author

@jonbob, thanks. I'll keep poking, as a background activity.

@douglasjacobsen
Member

@amametjanov:

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

You can do:

-compset GMPAS -res T62_oQU120

To test ocn/ice only.
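
(For anyone wanting to try that, case creation would look roughly like the line below, run from the scripts directory; the case name and machine are placeholders:)

 ./create_newcase -case GMPAS_T62_oQU120_test -compset GMPAS -res T62_oQU120 -mach edison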

@jonbob
Contributor

jonbob commented Apr 27, 2016

@worleyph - I think at the very least we have a bad domain file for the ocean. I'll try to regenerate it and see if I can get something rational. In the meantime, I don't think there's any point to continued testing.

@jonbob
Contributor

jonbob commented Apr 27, 2016

@amametjanov - I know you're also trying to work on this resolution. I have not yet made any maps for the data models to oRRS15to5 -- so nothing like T62_oRRS15to5. I can do that if it would be helpful, but let me try to figure out this domain file issue first.

@rljacob rljacob assigned jonbob and unassigned rljacob May 16, 2016
@amametjanov
Member

Not yet, got another error at the same location with a different PE configuration on 2K nodes, trying on 4K nodes.

The stack trace is similar to yours:

remap_q_ppm
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:527

remap1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:107

vertical_remap
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_advection_mod_base.F90:2142

prim_run_subcycle
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_driver_mod.F90:1507

__dyn_comp_NMOD_dyn_run$$OL$$1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:406

dyn_run
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:392

@jonbob
Contributor

jonbob commented May 27, 2016

@amametjanov : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get necessary model configuration changes into the scripts. I was using "next" from the repo, to pick up a fix to rtm...

@worleyph
Contributor Author

@jonbob, would you advise waiting until you get the scripts updated, or can you tell me how to repeat the experiment with the current master or next? Thanks.

@jonbob
Contributor

jonbob commented May 27, 2016

@worleyph : I can point you to my modifications on edison, or just list the namelist changes and pe-layout, whichever is easier. And depending on whether or not you intend to work over this holiday weekend.

@singhbalwinder
Contributor

I tried to run this case on Cori and found a bug which is fixed in #903. I tried running it again on EOS with the bug fix and ran out of time. I have resubmitted it again on Cori and EOS to see if it runs there (debug flags with Intel compiler). @jonbob : Are you using the code post #903 fix?

@jonbob
Contributor

jonbob commented May 27, 2016

@singhbalwinder : yes, I was running with next from yesterday

@worleyph
Contributor Author

@jonbob , I'll wait until next week. I'll bother you again then. Thanks.

@jonbob
Contributor

jonbob commented May 27, 2016

@worleyph sounds good -- I hope that means you're getting a real holiday weekend. I'm going to keep pushing a little, at least get it to run a 5-day smoke test successfully on a couple of different platforms.

@worleyph
Contributor Author

@jonbob: "I hope that means you're getting a real holiday weekend."

H'mm - has my spouse been talking to you? :-). Thanks for continuing to push this.

@singhbalwinder
Contributor

Tagging: @amametjanov, @jonbob, @worleyph
Update: I ran a test on Cori with full debugging options; the run crashed with the following error:

1599: forrtl: error (72): floating overflow
1599: Image              PC                Routine            Line        Source
1599: cesm.exe           000000000ADF0AB5  Unknown               Unknown  Unknown
1599: cesm.exe           000000000ADEE877  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD9EF34  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD9ED46  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD21F86  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD2D2A0  Unknown               Unknown  Unknown
1599: cesm.exe           000000000A6EDE50  Unknown               Unknown  Unknown
1599: cesm.exe           0000000003A62F5A  prim_advance_mod_        2815  prim_advance_mod.F90
1599: cesm.exe           000000000398BF0B  prim_advance_mod_         440  prim_advance_mod.F90
1599: cesm.exe           0000000002F18AB3  prim_driver_mod_m        1730  prim_driver_mod.F90
1599: cesm.exe           0000000002F0E727  prim_driver_mod_m        1485  prim_driver_mod.F90

@worleyph
Contributor Author

So, there seem to be two different issues here: the original coupled-model problem due to bad mapping and domain files, and a different issue in the atmosphere for ne120. A long time ago (early April) I got this F case to work with ne120 on Titan, but did not have DEBUG enabled. The above problem appears to be repeatable on Cori and Mira at the moment, so it may be something new. It is definitely something different. Should there be a separate GitHub issue for this? Or is there already one? (The above seems familiar, from more than just @amametjanov's earlier comment.)

@jonbob
Contributor

jonbob commented May 31, 2016

@worleyph I know the default timesteps are wrong for virtually all components, as well as the coupling time intervals. I'm trying to modify the scripts to produce the correct settings, but in the meantime, this is what I have been using successfully:

 coupler:   ATM_NCPL = 288 (and the same for lnd, ice, ocn, wav)
            ROF_NCPL = 8
 CAM:       dtime = 300
 MPAS-CICE: config_dt = 300.0
 MPAS-O:    config_dt = '00:02:30'
            config_write_output_on_startup = .false.

My first tests on edison were just on a 4800x1 sequential layout with all components sharing the same PEs. It's abysmally slow, like 2.5 hours per simulated day. But it does run. On titan, it required more like 8192x1 to run, so I'm testing some different layouts there.
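
(One way to apply these by hand, assuming the usual xmlchange interface of that era -- exact option spellings may vary between script versions -- is to set the coupling frequencies in env_run.xml:

 ./xmlchange -file env_run.xml -id ATM_NCPL -val 288
 ./xmlchange -file env_run.xml -id LND_NCPL -val 288
 ./xmlchange -file env_run.xml -id ICE_NCPL -val 288
 ./xmlchange -file env_run.xml -id OCN_NCPL -val 288
 ./xmlchange -file env_run.xml -id WAV_NCPL -val 288
 ./xmlchange -file env_run.xml -id ROF_NCPL -val 8

and put the MPAS timesteps in user_nl_mpas-cice (config_dt = 300.0) and user_nl_mpas-o (config_dt = '00:02:30', config_write_output_on_startup = .false.), as discussed further below.)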

@amametjanov
Member

Update: @singhbalwinder, I tried the most recent PR #903 on Mira and the run failed in the same way as before (#864 (comment)).
Trying out @jonbob's timestep mods to see if the run succeeds on Mira too.

@mt5555
Contributor

mt5555 commented Jun 1, 2016

In a separate email, @amametjanov had this configuration getting past initialization and crashing in the atmosphere, with what looked like a stability/spin-up issue. So @jonbob's results are consistent with this, in that he was able to run longer by reducing the timestep in the atmosphere by a factor of 3.

In the past, whenever starting a high-res coupled simulation using an atmosphere initial condition from an AMIP simulation, we have always had to do some work to spin up a new initial condition file; usually ~5 days with a small timestep is sufficient. If you set inithist='DAILY', the atmosphere will write an initial condition file every model day.
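
(In practice this is a one-line addition to user_nl_cam; as a sketch:)

 inithist = 'DAILY'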

@amametjanov
Member

6-hour 2K-node prod-short job timed out while still initializing. Trying the max of 12 hours in prod-long.
PIO stride was 128 and this had 56 PIO tasks for 7200 ATM tasks. Trying with 64; strides of 32 and 256 failed previously.
@jonbob, are you using bilin maps or default aave maps?
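
(For reference, the I/O task count follows from the stride: 7200 ATM tasks with a PIO stride of 128 gives 7200 / 128 ≈ 56 PIO tasks.)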

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

Try changing (in env_run.xml) to

and see if this helps. If this blows out MPI memory, try something at least larger than 64, say 1024.

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, I can't translate your suggestions to settings in env_run.xml for the GMPAS case. What I see is

  <entry id="NCPL_BASE_PERIOD"   value="hour"  />
  <entry id="ATM_NCPL"   value="1"  />
  ...
  <entry id="OCN_NCPL"   value="$ATM_NCPL"  />
  ....
  <entry id="CPL_SEQ_OPTION"   value="RASM_OPTION1"  />

What should I change these to, and do I also need to change user_nl_mpas-o and user_nl_mpas-cice (or any of the other user_nl files)?

@jonbob
Contributor

jonbob commented Jun 1, 2016

@worleyph Sorry about that. For the fully-coupled case, the NCPL_BASE_PERIOD is set to "day" instead of "hour". Do the settings make any more sense in that context? I have meetings for the next three hours, so I apologize if it takes time to get back to you....

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, maybe. If the ocean timestep is already correct, then the only other active component is sea ice; the atmosphere and land should be irrelevant here, correct? So the problem to be addressed for the GMPAS run is a bad cice model timestep? How does the coupling frequency come into this then? I'll try setting the following in the appropriate user_nl_XXX files

 MPAS-CICE: config_dt = 300.0
 MPAS-O: config_dt = '00:02:30'

and setting the coupling frequency to 288 and see what happens.

(Where would
config_write_output_on_startup = .false.
go?)
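
(The arithmetic, for anyone checking: with NCPL_BASE_PERIOD = 'day', a coupling frequency of 288 gives 86400 s / 288 = 300 s per coupling interval, i.e. one 300 s MPAS-CICE step and two 150 s ('00:02:30') MPAS-O steps per interval.)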

@amametjanov
Member

Pat, the MPAS namelist changes take effect only if you modify the XML files at:

  • components/mpas-o/bld/namelist_files/namelist_defaults_mpas-o.xml
  • components/mpas-cice/bld/namelist_files/namelist_defaults_mpas-cice.xml

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

Thanks. @douglasjacobsen , any chance that this will ever change? We should at least add a guard so that changing things in user_nl_mpas-o, user_nl_mpas-cice, or user_nl_mpas-li generates an error message.

@jonbob
Contributor

jonbob commented Jun 1, 2016

@worleyph: It is fine to make changes in user_nl_XXX files for the mpas components. That's how I've gotten my changes in, not by changing the defaults. At some point, we'll want to do that but it's not necessary right now. For what it's worth,
config_write_output_on_startup = .false.
would go in user_nl_mpas-o; it just keeps the model from writing out a large history file with the initial conditions.

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, thanks for the clarification. I'll go back to this in my future experiments; I changed the defaults in my current experiment (in the queue).

@jonbob
Contributor

jonbob commented Jun 3, 2016

@worleyph : my test with a 7.5-minute coupling interval worked fine on both titan and edison -- and got better performance. Though maybe the correct adjective is "less horrendous". In any case, I think that's a better default to get into the scripts, and we'll just have to see if it can spin-up and run for any significant time.

@amametjanov
Member

Progress: had a run on Mira that went out to timestep 0001-01-01_12:55:00 (or step 154 for ATM) in 6 hours on 2K nodes, before timing out. This was with the old 2.5 minute coupling interval. I'll try the PR with new intervals, when it's available.

@amametjanov
Member

Fixed by #924.
