Vertical thermo error in ICE #1194
Comments
Is this run with CICE or MPAS-CICE? Is this a clean build? CICE needs to know the layout at compile time.
As it is an F case, I thought there was no ice? It was a clean build.
If you look at the F compsets, you will see that most of them use CICE and CLM (exceptions are the aquaplanet and ideal compsets).
Ah OK, so there is ice. This is FC5AV1C-04P.
This is an issue we've encountered on a regular basis with both the high-resolution (0.1 degree POP/CICE) v0 B-cases and the RRS18to6km MPAS G-cases. We're still not sure what causes the problem, but small changes in the setup can cause it to go away. For example, the first time we saw this was on Mira; we moved the run to Titan and the problem went away. You can change the optimization level and it will often go away, or change the dynamics subcycling in the sea ice model. We should probably figure out the root cause at some point.
Sorry, hit the wrong button.
Thanks @maltrud. That's why I posted it -- maybe someone has seen it before, or maybe it will help someone verify it's "real". Not causing me a problem at the moment. I could run in some other ways if it helps to investigate.
The first thing to note is that failures in other parts of the system are often caught in the sea ice vertical thermodynamics, since the thermodynamics is iterative with a convergence criterion. Unphysical values generated elsewhere will often propagate until they cause the sea ice vertical thermodynamics not to converge. Here we see a sea ice surface temperature of -1000 C, which suggests a problem with the atmospheric fluxes. Jon has found places in the atmosphere model where the model continues after a failure with -999 added to fields (#1292). This may have happened here.
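To make the propagation mechanism concrete, here is a minimal sketch, not CICE code: the toy energy balance, the Newton solver, the bounds, and all numbers below are illustrative assumptions. It shows how a flux contaminated upstream (for example with -999 folded in) drives an iterative surface-temperature solve to an absurd value, so the failure is reported by the thermodynamics rather than where the bad value originated.

```python
SIGMA = 5.67e-8  # Stefan-Boltzmann constant, W m-2 K-4 (illustrative use only)

def net_flux(absorbed, t_celsius):
    """Toy energy balance: absorbed flux minus outgoing blackbody emission."""
    return absorbed - SIGMA * (t_celsius + 273.15) ** 4

def solve_surface_temperature(flux_of_t, t_guess=-10.0, tol=1e-4, max_iter=100):
    """Newton iteration for the surface temperature where the net flux vanishes."""
    t = t_guess
    for _ in range(max_iter):
        f = flux_of_t(t)
        dfdt = (flux_of_t(t + 0.01) - f) / 0.01   # numerical derivative
        t_new = t - f / dfdt
        # sanity bound standing in for the model's own checks (illustrative)
        if not -100.0 < t_new < 50.0:
            raise RuntimeError(f"vertical thermo error: Tsf = {t_new:.1f} C")
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    raise RuntimeError("vertical thermo iteration did not converge")

# Healthy forcing: converges to a plausible surface temperature near -3 C.
print(solve_surface_temperature(lambda t: net_flux(300.0, t)))

# Forcing contaminated upstream (a flux field with -999 folded in): the abort
# happens in the thermodynamics solver, far from where the bad value arose.
try:
    solve_surface_temperature(lambda t: net_flux(300.0 - 999.0, t))
except RuntimeError as err:
    print(err)
```

In this sketch the clean forcing converges in a few iterations, while the contaminated one aborts with an unphysical surface temperature, which is the same signature as the -1000 C Tsf reported here.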
I have not seen this again in a while and I have been running various ne120 F cases.
Well, I just did see this again, using the beta release of Intel v18 on cori-knl.
Is this error reproducible? If yes, we should try to find its root cause. We might not be using CICE in the future, but, as @mt5555 mentioned, the error might be coming from somewhere else. We can see if debug mode reveals more info about this error. @ndkeen: can you please try running this in debug mode?
I realize I'm using a beta version of the compiler, but because I stumbled upon this with an acme_dev test, I thought it might make it easier to track down. I started the debug (_D) versions of the tests. Note: I'm using a branch where I've added this intel18 option for cori-knl, starting from master as of yesterday. I'm ready to make a PR to get it into the repo as it's only an option. #1685
Those two _D tests did pass.
OK. If it passes in debug mode, it seems like some kind of memory issue, but it could also be something else (a compiler bug?). Unless there is a better alternative, I think we can proceed as follows: if the error is reproducible in a non-debug run, it might be useful to compile the code again with only the -g flag (to produce debugging information) and use a debugger. But first we need to make sure that the code still fails when we add only -g to the compiler options. If debugging takes too much time, we should first evaluate whether it is worth debugging at all. @mt5555, @philrasch and @rljacob: any thoughts on this?
This error is preventing me from using intel18 with a high-res G case. Is there anything we can do to debug with the small test that I noted above? For example, I just ran this again with a recent master and I get the following:
@ndkeen - these issues are unrelated. The error in the DTEST compset is from CICE, while the high-res G-case should be MPAS-CICE. They are different models, though they share some coding. Can you point me at the high-res issue separately?
OK, I see the difference. I was using the test noted above (which uses the DTEST compset).
@ndkeen - is this still an issue? If not, can you please close it? Thanks.
I'm fine closing this as I don't have an easy way to reproduce it and have not seen it (when using MPAS) in a while.
Trying to run the ne120 problem with different layouts to get the best performance on cori-knl. With 424 nodes, I successfully used 7200 MPI tasks with 1, 8, and 16 threads (all components). But going to 14400 MPI tasks and 8 threads, I hit this strange error:
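As a rough sanity check on these layouts, here is a back-of-the-envelope sketch. It assumes Cori-KNL nodes with 68 cores and 272 hardware threads each; those numbers are not from this thread and are used only for illustration.

```python
# Hypothetical per-node footprint of the two layouts described above.
# Assumes 68 cores / 272 hardware threads per Cori-KNL node (illustrative).
HW_THREADS_PER_NODE = 272
NODES = 424

for tasks, threads in [(7200, 16), (14400, 8)]:
    tasks_per_node = tasks / NODES
    hw_threads_used = tasks_per_node * threads
    print(f"{tasks} tasks x {threads} threads: "
          f"{tasks_per_node:.1f} tasks/node, "
          f"~{hw_threads_used:.0f} of {HW_THREADS_PER_NODE} hardware threads/node")
```

Under those assumptions, both the 7200x16 and 14400x8 layouts land at roughly the same hardware-thread footprint per node (about 272), so the failure at 14400x8 would not look like simple oversubscription.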