A_WCYCL2000 ne120_oRRS15: mapping error (Cori, Mira, and Titan) #864
Using pure MPI does require more memory, and when trying to fit 1/4 degree on this many nodes, memory is one of the concerns. Hence, what about an x4 or x8 configuration? I think @amametjanov may have experience with this on Mira (trying to find working configurations which use low thread counts but still fit into memory). |
Thanks. I increased to 5400x1 (noHT), with exactly the same error. I think that @ndkeen indicated that he had run an F case with this decomposition on Cori. I'll try even larger when I get the chance, but will also continue debugging at 5400x1 (since it seems to run at least once per day). The debug write shows `0000: NGSEG = 0`, so ngseg is zero (and not negative) after coming out of the loop:
I assume that this means that counts(:) == 0, but I'll verify as well. |
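(As an aside, a quick way to confirm that the zero-segment diagnostic appears on every rank is to tally the debug writes from the logs; a minimal sketch, assuming the writes land in the usual cesm.log.* files in the run directory:)

```
# Tally the NGSEG debug writes across MPI ranks; the cesm.log.* naming is an
# assumption about where the debug output ends up in the run directory.
grep -h "NGSEG" cesm.log.* | sort | uniq -c
```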
@worleyph The maps for ne120np4_oRRS15to5 are largely untested, just as a warning. And I ended up with a map for a different resolution that was bad but the mapping tools gave no warning or error creating it. So just a heads up... |
@worleyph : we might have to build up to the full A_WCYCL compset. If you don't figure out the problem, maybe first make sure atm/lnd compsets work, and ocn/ice as well. And then we could try all the active components together... |
Yes, I'd increase thread counts and also increase pio stride; it looks like a re-arranger problem. |
No one has run this compset/resolution yet. @worleyph try just the F-case first on Titan. See https://acme-climate.atlassian.net/browse/CSG-163 |
@rljacob - already ran an F case on Titan (successfully), a couple of weeks ago. |
I'm also getting the same error on Mira:
Tried 3 different PE layouts and PIO settings, and all show the same error. What is the ocn/ice-only compset? |
One of the GSMaps created as part of the ocean-coupler interaction is getting bad data (ngseg=0). I don't think this has anything to do with PIO. |
@amametjanov, since you are seeing this also, I assume that we can eliminate memory problems (if only because memory problems tend not to have the same signature on Mira and Cori). @rljacob,
Does this imply a bad map then? Should I keep trying to debug this, or can this be addressed some other way? I've tracked it into the call to
where AttrVect_Isize has
I'm trying to work backwards from the sMat for this call, and it is taking some time (waiting in the Cori queue). |
Yes it likely implies a bad map. |
And how do we figure this out? Is there a way to do this outside of running the model? It sounds like I am wasting my time continuing with my current approach. |
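(As an aside, one way to sanity-check a map outside of the model is to inspect the SCRIP/ESMF-format mapping file directly with the NetCDF utilities; a minimal sketch, with the file name below being only a placeholder:)

```
# File name is a placeholder. n_a, n_b, and n_s are the source grid size,
# destination grid size, and number of sparse-matrix entries in the map file.
ncdump -h map_ne120np4_to_oRRS15to5.nc
# Spot-check weights and fractions for zeros or garbage (variables S, frac_a, frac_b):
ncdump -v frac_b map_ne120np4_to_oRRS15to5.nc | tail -n 20
```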
Actually, from the cpl.log you pasted, it read the basic parameters of the map correctly: |
My latest debug writes made the ocn/ice init error on Titan disappear. It then died in the same location as I saw on Cori and @amametjanov saw on Mira. The Titan PE layout was 2700x4. So, this is persistent across architectures. |
What is a node here? lsize is zero for all processes for this map. |
@worleyph let me take another look at these maps -- we had another one that I made around the same time turn out to be bad -- despite getting no errors or warnings from the tools that generated them |
@jonbob, thanks. I'll keep poking, as a background activity. |
You can do:
To test ocn/ice only. |
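(A minimal sketch of standing up such an ocn/ice-only case with the CESM/ACME-style scripts; the GMPAS compset name is taken from later in this thread, while the grid alias, machine, and case name are placeholders to be checked against the scripts:)

```
# Compset name (GMPAS) is from later in this thread; grid alias, machine, and
# case name are placeholders to check against the scripts' supported lists.
./create_newcase -case gmpas_test -compset GMPAS -res T62_oRRS15to5 -mach titan
cd gmpas_test
./cesm_setup           # ./case.setup on newer script versions
./gmpas_test.build
```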
@worleyph - I think at the very least we have a bad domain file for the ocean. I'll try to regenerate it and see if I can get something rational. In the meantime, I don't think there's any point to continued testing. |
@amametjanov - I know you're also trying to work on this resolution. I have not yet made any maps for the data models to oRRS15to5 -- so nothing like T62_oRRS15to5. I can do that if it would be helpful, but let me try to figure out this domain file issue first. |
Not yet, got another error at the same location with a different PE configuration on 2K nodes, trying on 4K nodes. The stack-trace is similar to yours: |
@amametjanov : I did get the A_WCYCL2000 ne120_oRRS15 to run last night on edison, using both the intel and gnu compilers. My tests were under debug mode and only ran a limited number of timesteps, but all components did initialize and run successfully. I'll try today in optimized mode, and work to get necessary model configuration changes into the scripts. I was using "next" from the repo, to pick up a fix to rtm... |
@jonbob, would you advise waiting until you get the scripts updated, or can you tell me how to repeat the experiment with the current master or next? Thanks. |
@worleyph : I can point you to my modifications on edison, or just list the namelist changes and pe-layout, whichever is easier. And depending on whether or not you intend to work over this holiday weekend. |
@singhbalwinder : yes, I was running with next from yesterday |
@jonbob , I'll wait until next week. I'll bother you again then. Thanks. |
@worleyph sounds good -- I hope that means you're getting a real holiday weekend. I'm going to keep pushing a little, at least get it to run a 5-day smoke test successfully on a couple of different platforms. |
@jonbob: "I hope that means you're getting a real holiday weekend." H'mm - has my spouse been talking to you? :-). Thanks for continuning to push this. |
Tagging: @amametjanov, @jonbob, @worleyph |
So, there seem to be two different issues here - the original coupled model problem due to bad mapping and domain files, and a different issue in the atmosphere for ne120. A long time ago (early April) I got this F case to work with ne120 on Titan, but did not have DEBUG enabled. The above problem appears to be repeatable on Cori and Mira at the moment, so may be something new. It is definitely something different. Should there be a separate github issue for this? Or is there already one? (The above seems familiar, from more than @amametjanov's earlier comment.) |
@worleyph I know the default timesteps are wrong for virtually all components, as are the coupling time intervals. I'm trying to modify the scripts to produce the correct settings, but in the meantime, this is what I have been using successfully: |
Update: @singhbalwinder, I tried the most recent PR #903 on Mira and the run failed in the same way as before (#864 (comment)). |
In a separate email, @amametjanov had this configuration getting past initialization and crashing in the atmosphere, with what looked like a stability/spinup issue. So @jonbob's results are consistent with this - in that he was able to run longer by reducing the timestep in the atmosphere by a factor of 3. In the past, whenever starting a high-res coupled simulation using an atmosphere initial condition from an AMIP simulation, we have always had to do some work to spin up a new initial condition file - usually ~5 days with a small timestep is sufficient. If you set "inithist='DAILY'", the atmosphere will write an initial condition file every model day. |
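(A minimal sketch of that namelist change, assuming the usual user_nl_cam override file in the case directory:)

```
# Ask CAM to write an initial-condition file every model day, as suggested above:
cat >> user_nl_cam << 'EOF'
 inithist = 'DAILY'
EOF
```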
6-hour 2K-node prod-short job timed out while still initializing. Trying the max of 12 hours in prod-long. |
Try changing (in env_run.xml) to and see if this helps. If this blows out MPI memory, try something at least larger than 64, say 1024. |
@jonbob, I can't translate your suggestions to settings in env_run.xml for the GMPAS case. What I see is
What should I change these to, and do I also need to change user_nl_mpas-o and user_nl_mpas-cice (or any of the other user_nl files)? |
@worleyph Sorry about that. For the fully-coupled case, the NCPL_BASE_PERIOD is set to "day" instead of "hour". Do the settings make any more sense in that context? I have meetings for the next three hours, so I apologize if it takes time to get back to you.... |
@jonbob, maybe. If the ocean timestep is already correct, then the only other active component is sea ice - atmosphere and land should be irrelevant here, correct? So the problem to be addressed for the GMPAS run is a bad cice model timestep? How does the coupling frequency come into this then? I'll try setting the following in the appropriate user_nl_XXX files (MPAS-CICE: config_dt = 300.0), set the coupling frequency to 288, and see what happens. (Where would
Pat, the MPAS namelist changes take effect only if you modify the XML files at: |
Thanks. @douglasjacobsen , any chance that this will ever change? We should at least add a guard so that changing things in user_nl_mpas-o, user_nl_mpas-cice, or user_nl_mpas-li generates an error message. |
@worleyph: It is fine to make changes in user_nl_XXX files for the mpas components. That's how I've gotten my changes in, not by changing the defaults. At some point, we'll want to do that but it's not necessary right now. For what it's worth, |
@jonbob, thanks for the clarification. I'll go back to this in my future experiments - I changed the defaults in my current experiment (in the queue). |
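(A minimal sketch of the changes being discussed, using the user_nl override files that @jonbob describes plus xmlchange for the coupling frequency; the variable ids and values shown are illustrative and should be checked against the case's env_run.xml:)

```
# Sea-ice timestep via the MPAS-CICE namelist override file in the case directory:
cat >> user_nl_mpas-cice << 'EOF'
 config_dt = 300.0
EOF
# Coupling frequency: with NCPL_BASE_PERIOD = "day", 288 couplings per day is a
# 300 s coupling interval (variable ids illustrative; verify against env_run.xml):
./xmlchange -file env_run.xml -id ICE_NCPL -val 288
./xmlchange -file env_run.xml -id OCN_NCPL -val 288
```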
@worleyph : my test with a 7.5-minute coupling interval worked fine on both titan and edison -- and got better performance. Though maybe the correct adjective is "less horrendous". In any case, I think that's a better default to get into the scripts, and we'll just have to see if it can spin-up and run for any significant time. |
Progress: had a run on Mira that went out to timestep 0001-01-01_12:55:00 (or step 154 for ATM) in 6 hours on 2K nodes, before timing out. This was with the old 2.5 minute coupling interval. I'll try the PR with new intervals, when it's available. |
Fixed by #924. |
I've been trying to find feasible PE layouts for A_WCYCL2000 ne120_oRRS15 on Cori. I started with a small (1024x1, stacked, noHT) layout, which failed. I then tried (2048x1, stacked, noHT), and most recently 3600x1 for atmosphere, coupler, and land (3616x1), with the other components on their own compute nodes using a 2048x1 decomposition. Again, this is all noHT:
I am getting the identical error for all three of these. From cesm.log:
cpl.log ends with
I'll try another increase in compute nodes for the ocean, but if anyone has any other suggestions, I'd appreciate it. Note that I (personally) do not have this compset working anyplace yet. On Titan there is a failure in ice or ocean initialization, i.e. earlier in the execution than this.
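(For reference, a minimal sketch of how a layered layout like the one described above would be expressed via env_mach_pes.xml; the values are illustrative and the remaining components would be set the same way:)

```
# Roughly the layout described above: atm/cpl/lnd share the first ~3600 MPI tasks,
# and the ocean gets its own nodes by starting its root PE beyond them.
# Values are illustrative; repeat for the remaining components.
./xmlchange -file env_mach_pes.xml -id NTASKS_ATM -val 3600
./xmlchange -file env_mach_pes.xml -id NTHRDS_ATM -val 1
./xmlchange -file env_mach_pes.xml -id NTASKS_OCN -val 2048
./xmlchange -file env_mach_pes.xml -id ROOTPE_OCN -val 3616
```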