A_WCYCL2000 ne120_oRRS15: mapping error (Cori, Mira, and Titan) #864

Closed
worleyph opened this issue Apr 24, 2016 · 67 comments

Comments

@worleyph
Contributor

I've been trying to find feasible PE layouts for

 -compset A_WCYCL2000 -res ne120_oRRS15

on Cori. I started with a small (1024x1, stacked, noHT) layout, which failed. I then tried (2048x1, stacked, noHT), and most recently 3600x1 for the atmosphere, with the coupler and land at 3616x1 and the other components on their own compute nodes using a 2048x1 decomposition. Again, this is all noHT:

 <entry id="MAX_TASKS_PER_NODE"   value="32"  />

I am getting the identical error for all three of these. From cesm.log:

 0000: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
 0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
 0000: 000.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
 0000: Rank 0 [Sat Apr 23 19:39:16 2016] [c1-0c1s13n1] application called MPI_Abort(MPI_COMM_WORLD, 2) - process 0
 0000: forrtl: error (76): Abort trap signal
 ...
 0000: cesm.exe           0000000002F2670F  m_dropdead_mp_die          87  m_dropdead.F90
 0000: cesm.exe           0000000002F258DF  m_die_mp_die2__           165  m_die.F90
 0000: cesm.exe           0000000002EA8106  m_globalsegmap_mp        2433  m_GlobalSegMap.F90
 0000: cesm.exe           0000000002EF9A08  m_router_mp_initp         364  m_Router.F90
 0000: cesm.exe           0000000002EECE6C  m_rearranger_mp_i         153  m_Rearranger.F90
 0000: cesm.exe           0000000002EE3FDF  m_sparsematrixplu         522  m_SparseMatrixPlus.F90
 0000: cesm.exe           0000000002C52CF3  shr_mct_mod_mp_sh         355  shr_mct_mod.F90
 0000: cesm.exe           00000000004B94DD  seq_map_mod_mp_se         191  seq_map_mod.F90
 0000: cesm.exe           000000000044BC91  prep_ocn_mod_mp_p         259  prep_ocn_mod.F90
 0000: cesm.exe           0000000000411C7A  cesm_comp_mod_mp_        1582  cesm_comp_mod.F90

cpl.log ends with

 (seq_mct_drv) : Initialize each component: atm, lnd, rof, ocn, ice, glc, wav
 (component_init_cc:mct) : Initialize component atm
 (component_init_cc:mct) : Initialize component lnd
 (component_init_cc:mct) : Initialize component rof
 (component_init_cc:mct) : Initialize component ocn
 (component_init_cc:mct) : Initialize component ice
 (component_init_cc:mct) : Initialize component glc
 (component_init_cc:mct) : Initialize component wav

 ...

 (prep_ocn_init) : Initializing mapper_Sa2o
 (seq_map_init_rcfile)  called for mapper_Sa2o initialization

 (shr_mct_sMatPInitnc) Initializing SparseMatrixPlus
 (shr_mct_sMatPInitnc) SmatP mapname /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatPInitnc) SmatP maptype X
 (shr_mct_sMatReaddnc) reading mapping matrix data decomposed...
 (shr_mct_sMatReaddnc) * file name                  : /project/projectdirs/acme/inputdata/cpl/gridmaps/ne120np4/map_ne120np4_to_oRRS15to5_patch.160203.nc
 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502
 (shr_mct_sMatReaddnc) ... done reading file

I'll try another increase in compute nodes for the ocean, but if anyone has any other suggestions, I'd appreciate it. Note that I (personally) do not have this compset working anywhere yet. On Titan there is a failure in ice or ocean initialization, so earlier in the execution than this.

@mt5555
Contributor

mt5555 commented Apr 25, 2016

Using pure MPI does require more memory, and when trying to fit 1/4 degree onto this many nodes, memory is one of the concerns. Hence, what about an x4 or x8 (threaded) configuration? I think @amametjanov may have experience with this on Mira (trying to find working configurations that use low thread counts but still fit into memory).

@worleyph
Contributor Author

worleyph commented Apr 25, 2016

Thanks. I increased to 5400x1 (noHT), with exactly the same error. I think that @ndkeen indicated that he had run an F case with this decomposition on Cori. I'll try even larger when I get the chance, but will also continue debugging at 5400x1 (since a job seems to get through the queue at least once per day).
@rljacob , latest information is from a debug statement added to initd_ in m_GlobalSegMap.F90. Here

0000: NGSEG = 0
0000: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0

so ngseg is zero (and not negative) after coming out of the loop:

      ! sum the per-process segment counts into ngseg and
      ! build the corresponding displacement offsets
      ngseg = 0
      do i=0,npes-1
         ngseg = ngseg + counts(i)
         if(i == 0) then
            displs(i) = 0
         else
            displs(i) = displs(i-1) + counts(i-1)
         endif
      end do

I assume that this means that counts(:) == 0, but I'll verify as well.
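
(A minimal way to verify that, assuming MCT can be rebuilt with a debug print: dump the gathered counts right after the loop. The snippet below is illustrative; ngseg, counts, npes, and i are the variables from the loop above.)

      if (ngseg == 0) then
         ! illustrative debug print: show each per-process segment count
         do i = 0, npes-1
            write(6,*) 'initp_ debug: counts(', i, ') = ', counts(i)
         end do
      endif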

@jonbob
Contributor

jonbob commented Apr 25, 2016

@worleyph The maps for ne120np4_oRRS15to5 are largely untested. I also ended up with a map for a different resolution that turned out to be bad, even though the mapping tools gave no warning or error when creating it. So just a heads up...

@worleyph
Contributor Author

@jonbob, how would I determine whether this is the source of my problems? @rljacob , has anyone run this compset and grid resolution successfully?

@jonbob
Contributor

jonbob commented Apr 25, 2016

@worleyph : we might have to build up to the full A_WCYCL compset. If you don't figure out the problem, maybe first make sure atm/lnd compsets work, and ocn/ice as well. And then we could try all the active components together...

@amametjanov
Member

Yes, I'd increase thread counts and also increase pio stride; it looks like a re-arranger problem.

@rljacob
Member

rljacob commented Apr 25, 2016

No one has run this compset/resolution yet. @worleyph try just the F-case first on Titan. See https://acme-climate.atlassian.net/browse/CSG-163

@worleyph
Contributor Author

worleyph commented Apr 25, 2016

@rljacob - already ran an F case on Titan (successfully), a couple of weeks ago.

@amametjanov
Member

I'm also getting the same error on Mira:

5218: m_GlobalSegMap::initp_: non-positive value of ngseg error, stat =0
5218: ***.MCT(MPEU)::die.: from m_GlobalSegMap::initp_()
5218: Abort(2) on node 5218 (rank 5218 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 2) - process 5218

Tried 3 different PE layouts and PIO settings and all show the ngseg error.

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

@rljacob
Member

rljacob commented Apr 27, 2016

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

@worleyph
Contributor Author

@amametjanov, since you are seeing this also, I assume that we can eliminate memory problems (if only because memory problems tend not to have the same signature on Mira and Cori).

@rljacob ,

One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO.

Does this imply a bad map then? Should I keep trying to debug this, or can this be addressed some other way? I've tracked it into the call to

 lsize_ = AttrVect_lsize(sMat%data)

where AttrVect_lsize has

   List_allocated(aV%iList) = .true.
   associated(aV%iAttr) = .true.
   size(aV%iAttr,2) = 0

   List_allocated(aV%rList) = .true.
   associated(aV%rAttr) = .true.
   size(aV%rAttr,2) = 0

I'm trying to work backwards from the sMat for this call, and it is taking some time (waiting in the Cori queue).
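
(For context on those values: in MCT, AttrVect_lsize essentially returns the number of columns of the attribute arrays, so a simplified sketch of what the call above computes is just the size query below; size(aV%rAttr,2) = 0 therefore means the sparse matrix holds zero elements on that process.)

    ! simplified sketch, not the actual MCT source
    lsize_ = size(sMat%data%rAttr, 2)   ! = 0 here, i.e. no local matrix entries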

@rljacob
Member

rljacob commented Apr 27, 2016

Yes it likely implies a bad map.

@worleyph
Contributor Author

And how do we figure this out? Is there a way to do this outside of running the model? It sounds like I am wasting my time continuing with my current approach.

@worleyph worleyph changed the title A_WCYCL2000 ne120_oRRS15 on Cori: m_GlobalSegMap error A_WCYCL2000 ne120_oRRS15: m_GlobalSegMap error (Cori, Mira, and Titan) Apr 27, 2016
@rljacob
Member

rljacob commented Apr 27, 2016

Actually, from the cpl.log you pasted, it read the basic parameters of the map correctly:

 (shr_mct_sMatReaddnc) * matrix dims src x dst      :     777602 x   5778136
 (shr_mct_sMatReaddnc) * number of non-zero elements:   92387502

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

@worleyph
Contributor Author

My latest debug writes made the ocn/ice init error on Titan disappear. It then died in the same location as I saw on Cori and @amametjanov saw on Mira. The Titan PE layout was 2700x4. So, this is persistent across architectures.

@worleyph
Contributor Author

But one of the nodes has no mapping data (lsize = 0). I'm not sure how that can happen.

What is a node here? lsize is zero for all processes for this map.

@jonbob
Contributor

jonbob commented Apr 27, 2016

@worleyph let me take another look at these maps -- we had another one that I made around the same time turn out to be bad -- despite getting no errors or warnings from the tools that generated them

@worleyph
Contributor Author

@jonbob, thanks. I'll keep poking, as a background activity.

@douglasjacobsen
Member

@amametjanov:

What is the ocn/ice only -compset and -res? Need to rule out inputdata problems.

You can do:

-compset GMPAS -res T62_oQU120

To test ocn/ice only.
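
(For anyone wanting to try that, case creation would look roughly like the line below, run from the scripts directory; the case name and machine are placeholders:)

 ./create_newcase -case GMPAS_T62_oQU120_test -compset GMPAS -res T62_oQU120 -mach edison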

@jonbob
Contributor

jonbob commented Apr 27, 2016

@worleyph - I think at the very least we have a bad domain file for the ocean. I'll try to regenerate it and see if I can get something rational. In the meantime, I don't think there's any point to continued testing.

@jonbob
Contributor

jonbob commented Apr 27, 2016

@amametjanov - I know you're also trying to work on this resolution. I have not yet made any maps for the data models to oRRS15to5 -- so nothing like T62_oRRS15to5. I can do that if it would be helpful, but let me try to figure out this domain file issue first.

@rljacob rljacob assigned jonbob and unassigned rljacob May 16, 2016
@amametjanov
Member

Not yet, got another error at the same location with a different PE configuration on 2K nodes, trying on 4K nodes.

The stack trace is similar to yours:

remap_q_ppm
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:527

remap1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/vertremap_mod_base.F90:107

vertical_remap
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_advection_mod_base.F90:2142

prim_run_subcycle
/gpfs/mira-home/azamatm/repos/ACME-integration/components/homme/src/share/prim_driver_mod.F90:1507

__dyn_comp_NMOD_dyn_run$$OL$$1
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:406

dyn_run
/gpfs/mira-home/azamatm/repos/ACME-integration/components/cam/src/dynamics/se/dyn_comp.F90:392

@jonbob
Contributor

jonbob commented May 27, 2016

@amametjanov : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get necessary model configuration changes into the scripts. I was using "next" from the repo, to pick up a fix to rtm...

@worleyph
Contributor Author

@jonbob, would you advise waiting until you get the scripts updated, or can you tell me how to repeat the experiment with the current master or next? Thanks.

@jonbob
Contributor

jonbob commented May 27, 2016

@worleyph : I can point you to my modifications on edison, or just list the namelist changes and pe-layout, whichever is easier. And depending on whether or not you intend to work over this holiday weekend.

@singhbalwinder
Contributor

I tried to run this case on Cori and found a bug which is fixed in #903. I tried running it again on EOS with the bug fix and ran out of time. I have resubmitted it again on Cori and EOS to see if it runs there (debug flags with Intel compiler). @jonbob : Are you using the code post #903 fix?

@jonbob
Contributor

jonbob commented May 27, 2016

@singhbalwinder : yes, I was running with next from yesterday

@worleyph
Contributor Author

@jonbob , I'll wait until next week. I'll bother you again then. Thanks.

@jonbob
Contributor

jonbob commented May 27, 2016

@worleyph sounds good -- I hope that means you're getting a real holiday weekend. I'm going to keep pushing a little, at least get it to run a 5-day smoke test successfully on a couple of different platforms.

@worleyph
Contributor Author

@jonbob: "I hope that means you're getting a real holiday weekend."

H'mm - has my spouse been talking to you? :-). Thanks for continuing to push this.

@singhbalwinder
Contributor

Tagging: @amametjanov, @jonbob, @worleyph
Update: I ran a test on Cori with full debugging options; the run crashed with the following error:

1599: forrtl: error (72): floating overflow
1599: Image              PC                Routine            Line        Source
1599: cesm.exe           000000000ADF0AB5  Unknown               Unknown  Unknown
1599: cesm.exe           000000000ADEE877  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD9EF34  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD9ED46  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD21F86  Unknown               Unknown  Unknown
1599: cesm.exe           000000000AD2D2A0  Unknown               Unknown  Unknown
1599: cesm.exe           000000000A6EDE50  Unknown               Unknown  Unknown
1599: cesm.exe           0000000003A62F5A  prim_advance_mod_        2815  prim_advance_mod.F90
1599: cesm.exe           000000000398BF0B  prim_advance_mod_         440  prim_advance_mod.F90
1599: cesm.exe           0000000002F18AB3  prim_driver_mod_m        1730  prim_driver_mod.F90
1599: cesm.exe           0000000002F0E727  prim_driver_mod_m        1485  prim_driver_mod.F90

@worleyph
Contributor Author

So, there seem to be two different issues here: the original coupled-model problem due to bad mapping and domain files, and a different issue in the atmosphere for ne120. A long time ago (early April) I got this F case to work with ne120 on Titan, but did not have DEBUG enabled. The above problem appears to be repeatable on Cori and Mira at the moment, so it may be something new. It is definitely something different. Should there be a separate GitHub issue for this? Or is there already one? (The above seems familiar, from more than just @amametjanov's earlier comment.)

@jonbob
Contributor

jonbob commented May 31, 2016

@worleyph I know the default timesteps are wrong for virtually all components, as well as the coupling time intervals. I'm trying to modify the scripts to produce the correct settings, but in the meantime, this is what I have been using successfully:

 coupler:   ATM_NCPL = 288 (and the same for lnd, ice, ocn, wav)
            ROF_NCPL = 8
 CAM:       dtime = 300
 MPAS-CICE: config_dt = 300.0
 MPAS-O:    config_dt = '00:02:30'
            config_write_output_on_startup = .false.

My first tests on edison were just on a 4800x1 sequential layout with all components sharing the same PEs. It's abysmally slow, like 2.5 hours per simulated day. But it does run. On titan, it required more like 8192x1 to run, so I'm testing some different layouts there.
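
(One way to apply these by hand, assuming the usual xmlchange interface of that era -- exact option spellings may vary between script versions -- is to set the coupling frequencies in env_run.xml:

 ./xmlchange -file env_run.xml -id ATM_NCPL -val 288
 ./xmlchange -file env_run.xml -id LND_NCPL -val 288
 ./xmlchange -file env_run.xml -id ICE_NCPL -val 288
 ./xmlchange -file env_run.xml -id OCN_NCPL -val 288
 ./xmlchange -file env_run.xml -id WAV_NCPL -val 288
 ./xmlchange -file env_run.xml -id ROF_NCPL -val 8

and put the MPAS timesteps in user_nl_mpas-cice (config_dt = 300.0) and user_nl_mpas-o (config_dt = '00:02:30', config_write_output_on_startup = .false.), as discussed further below.)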

@amametjanov
Member

Update: @singhbalwinder, I tried the most recent PR #903 on Mira and the run failed in the same way as before (#864 (comment)).
Trying out @jonbob's timestep mods to see if the run succeeds on Mira too.

@mt5555
Contributor

mt5555 commented Jun 1, 2016

In a separate email, @amametjanov had this configuration getting past initialization and crashing in the atmosphere, with what looked like a stability/spin-up issue. So @jonbob's results are consistent with this, in that he was able to run longer by reducing the timestep in the atmosphere by a factor of 3.

In the past, whenever starting a high-res coupled simulation using an atmosphere initial condition from an AMIP simulation, we have always had to do some work to spin up a new initial condition file; usually ~5 days with a small timestep is sufficient. If you set inithist='DAILY', the atmosphere will write an initial condition file every model day.
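
(In practice this is a one-line addition to user_nl_cam; as a sketch:)

 inithist = 'DAILY'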

@amametjanov
Member

6-hour 2K-node prod-short job timed out while still initializing. Trying the max of 12 hours in prod-long.
PIO stride was 128 and this had 56 PIO tasks for 7200 ATM tasks. Trying with 64; strides of 32 and 256 failed previously.
@jonbob, are you using bilin maps or default aave maps?
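
(For reference, the I/O task count follows from the stride: 7200 ATM tasks with a PIO stride of 128 gives 7200 / 128 ≈ 56 PIO tasks.)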

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

Try changing (in env_run.xml) to

and see if this helps. If this blows out MPI memory, try something at least larger than 64, say 1024.

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, I can't translate your suggestions to settings in env_run.xml for the GMPAS case. What I see is

  <entry id="NCPL_BASE_PERIOD"   value="hour"  />
  <entry id="ATM_NCPL"   value="1"  />
  ...
  <entry id="OCN_NCPL"   value="$ATM_NCPL"  />
  ....
  <entry id="CPL_SEQ_OPTION"   value="RASM_OPTION1"  />

What should I change these to, and do I also need to change user_nl_mpas-o and user_nl_mpas-cice (or any of the other user_nl files)?

@jonbob
Contributor

jonbob commented Jun 1, 2016

@worleyph Sorry about that. For the fully-coupled case, the NCPL_BASE_PERIOD is set to "day" instead of "hour". Do the settings make any more sense in that context? I have meetings for the next three hours, so I apologize if it takes time to get back to you....

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, maybe. If the ocean timestep is already correct, then the only other active component is sea ice; the atmosphere and land should be irrelevant here, correct? So the problem to be addressed for the GMPAS run is a bad cice model timestep? How does the coupling frequency come into this then? I'll try setting the following in the appropriate user_nl_XXX files

 MPAS-CICE: config_dt = 300.0
 MPAS-O: config_dt = '00:02:30'

and setting the coupling frequency to 288 and see what happens.

(Where would
config_write_output_on_startup = .false.
go?)
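
(The arithmetic, for anyone checking: with NCPL_BASE_PERIOD = 'day', a coupling frequency of 288 gives 86400 s / 288 = 300 s per coupling interval, i.e. one 300 s MPAS-CICE step and two 150 s ('00:02:30') MPAS-O steps per interval.)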

@amametjanov
Member

Pat, the MPAS namelist changes take effect only if you modify the XML files at:

  • components/mpas-o/bld/namelist_files/namelist_defaults_mpas-o.xml
  • components/mpas-cice/bld/namelist_files/namelist_defaults_mpas-cice.xml

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

Thanks. @douglasjacobsen , any chance that this will ever change? We should at least add a guard so that changing things in user_nl_mpas-o, user_nl_mpas-cice, or user_nl_mpas-li generates an error message.

@jonbob
Contributor

jonbob commented Jun 1, 2016

@worleyph: It is fine to make changes in user_nl_XXX files for the mpas components. That's how I've gotten my changes in, not by changing the defaults. At some point, we'll want to do that but it's not necessary right now. For what it's worth,
config_write_output_on_startup = .false.
would go in user_nl_mpas-o; it just keeps the model from writing out a large history file with the initial conditions.

@worleyph
Contributor Author

worleyph commented Jun 1, 2016

@jonbob, thanks for the clarification. I'll go back to this in my future experiments; I changed the defaults in my current experiment (in the queue).

@jonbob
Contributor

jonbob commented Jun 3, 2016

@worleyph : my test with a 7.5-minute coupling interval worked fine on both titan and edison -- and got better performance. Though maybe the correct adjective is "less horrendous". In any case, I think that's a better default to get into the scripts, and we'll just have to see if it can spin-up and run for any significant time.

@amametjanov
Member

Progress: had a run on Mira that went out to timestep 0001-01-01_12:55:00 (or step 154 for ATM) in 6 hours on 2K nodes, before timing out. This was with the old 2.5 minute coupling interval. I'll try the PR with new intervals, when it's available.

@amametjanov
Member

Fixed by #924.
