History Crash in UFS w/ newer Intel compilers #2213
Comments
Hmm. The code is really doing something pretty boring at the line that is failing - basically copying a string. My one thought is that perhaps the string argument If you have the ability to recompile and try again, I would suggest adding the following line just before line 2336 in
But hopefully @bena-nasa is back from vacation today and can provide a more complete diagnosis before we trudge down the debugging path ... |
Wait. I think Intel confused me once again, @ulmononian. I believe Intel oneAPI 2022.1.0 is actually ifort 2021.6.0 which is what we use operationally:
Thus, this should work. It's ifort 2021.7 that had problems (and I think 2021.9 on my mac failed as well). |
Even odder, I'm not sure that can ever not be filled. From #941 this code came in:
call ESMF_ConfigFindLabel(cfg,trim(string)//'positive:',isPresent=isPresent,_RC)
if (isPresent) then
   call ESMF_ConfigGetAttribute(cfg,value=list(n)%positive,_RC)
   _ASSERT(list(n)%positive=='down'.or.list(n)%positive=='up',"positive value for collection must be down or up")
else
   list(n)%positive = 'down'
end if
From what I see of your history, you don't specify a positive, so you must be in the |
I suppose that is something you could try. Add:
to your HISTORY and force the subject? Note: I just tried your inst_aod collection in a GEOS run here with and without that line and no difference. Both ran just fine! |
That string on the right-hand side is always filled with "up" or "down": either the user provided it in the History (and if it is not one of those, one should die), or it defaults to "down". There's no way you get to that point in the code without it being one of those two |
Does this fail in the same place with optimization? |
@weiyuan-jiang I think you are going to need to attempt reproducing this on Orion unfortunately. Please work with our NOAA counterparts to get the details. Once you can reproduce, pull @bena-nasa to debug. |
with
if 2021.7.x is the problematic ifort, perhaps this could be a clue |
can i ask what optimization you are referring to? |
thanks for this suggestion! since we build mapl as part of a larger stack using spack, it would be a bit of an endeavor to re-compile w/ this code change, but i can certainly do it. is this something i should pursue or hold off until the team has a chance to look into this a bit more? |
Based on guidance from @bena-nasa I would not bother. At this point the focus should be on getting my team to reproduce the problem on Orion. We can then play with flags and such to see where that takes us. |
@ulmononian Do you have instructions for me to reproduce this on Orion? |
i am not sure this is reproducible on orion... the same ufs model config (w/ the same mapl & gocart version) runs fine on orion, with the only difference being the compiler/mpi version there (and obviously machine architecture/cpu config/etc. between hercules-orion or c5-orion). the specific branches we are testing are for porting the weather model to hercules and gaea c5, and the problem only arises on these two machines. |
Our ability to troubleshoot this is very limited if there is not an environment where we can reproduce this. I thought that was the point of Orion. My only other suggestion would then be "pair debugging", where someone on the NOAA end drives the keyboard and someone on our end screenshares and suggests next steps. It would work, but ... |
i completely understand. i think that anyone with orion access should also be able to access hercules, though, as they are both msu machines and share a filesystem. @weiyuan-jiang are you able to use hercules by chance? to log-in, it is the same ssh command as for orion, but with the |
It seems I can log in to hercules. So please give me instructions to reproduce the issue. |
@weiyuan-jiang awesome! to reproduce there, please do the following:
if you want to use debug versions of esmf and mapl for your test, please edit ufs-weather-model/modulefiles/ufs_hercules.intel.lua and add the following lines below the loading of
|
It turned out my access to Hercules is short-lived. I am asking for help now |
The crash happened earlier in my run than in your run. Should I change something to reproduce your error message? @ulmononian
150: WARNING from PE 0: Unused line in INPUT/MOM_input : ODA_INCUPD_NHOURS = 6 |
@ulmononian Do you have instructions to build MAPL on Hercules? |
interesting. there is nothing more you should need to do. can you share your run directory? i will take a look. as for building mapl on hercules: we are using spack to build it as part of the full stack for the ufs-wm and other applications. i can provide you with instructions on how to install the stack, but i have not installed mapl manually on hercules. the script that spack uses to build mapl is: https://github.com/JCSDA/spack/tree/158dada02ce08a0b42606f82059c51e8f9f02ef0/var/spack/repos/builtin/packages/mapl/package.py. |
Never mind, I think my error message is the same as yours because history init is within the cap init. I will need to build MAPL first so I can insert something and get more information about the crash. |
some follow-up on how mapl is built: these variants are applied:
for the variants: ~ indicates a setting is turned off; + indicates it is turned on. the variants listed above can be found in the package.py script i linked to in my previous comment. they essentially correspond to cmake flags that are set for the build. |
ok -- yes it is indeed the same error i received when i ran with the NON-debug version of mapl (as in this log https://github.com/ufs-community/ufs-weather-model/files/11682589/hercules_err.txt). re-compiling & running the WM should yield the full error log that points to the history crash... |
I believe it is a problem with the compiler. For example, it reports the error " 0x00000000074e08a8 do_alloc_assign() for_alloc_copy.c:0..." at this line |
that's great news that you can get it to pass by making this adjustment. did the model run to completion with these changes, by chance? if these changes would be possible on the mapl level, it would be most appreciated. is there anything we can do on our side? thank you!! |
No, the model didn't complete because of the same assignment error at other locations. I think there are many assignments like that. |
ok -- makes sense. it is not clear to me -- based on your example, in what file is this happening? further: are these assignments restricted to this single file or is this throughout the code? |
We need to collect this information on compiler bugs and feed it back to the Intel compiler team. I meet with them every Friday, so I can do that if I get all the information. They will want us to try the latest compiler (which we can do next week …)
|
Yeah - this is an odd regression for Intel. Hopefully they have lots of other customers with similar bugs, as it can be a bit tricky to boil down a standalone reproducer for these. Clearly most of the allocate-on-assignment statements are working just fine. There is a chance that a small reproducer that just brings in the VerticalData class can reproduce, so that should be attempted. (@weiyuan-jiang ) |
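For anyone attempting the small reproducer mentioned above, a starting point might look like the sketch below. It assumes the failing statements assign a string to a deferred-length allocatable character component, which is what the do_alloc_assign frame in the traceback suggests. The type names (vertical_data_t, collection_t), the fixed length of 64, and the program itself are hypothetical stand-ins, not code taken from MAPL, and there is no claim that this triggers the crash; the thread reports that a simple reproducer was hard to isolate.

```fortran
! Hypothetical standalone sketch; names are illustrative, not MAPL's.
program alloc_assign_sketch
   implicit none

   ! Stand-in for a class with a deferred-length string component
   type :: vertical_data_t
      character(len=:), allocatable :: positive
   end type vertical_data_t

   ! Stand-in for a History collection entry (fixed length is an assumption)
   type :: collection_t
      character(len=64) :: positive
      type(vertical_data_t) :: vdata
   end type collection_t

   type(collection_t), allocatable :: list(:)
   integer :: n

   allocate(list(2))
   do n = 1, size(list)
      ! Default used when no 'positive:' label is present, as in the snippet from #941
      list(n)%positive = 'down'
      ! Allocate-on-assignment: the unallocated left-hand side is allocated
      ! to the length of the right-hand side (Fortran 2003 semantics)
      list(n)%vdata%positive = trim(list(n)%positive)
      print '(a,i0,2a)', 'collection ', n, ' positive = ', list(n)%vdata%positive
   end do
end program alloc_assign_sketch
```

Building it once with something like `ifort -g -O0 -traceback` and once with optimization on the affected compiler version would show whether the failure depends on flags, which is the question raised earlier in the thread.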
@weiyuan-jiang just wanted to touch base and see if you would be able to create the reproducer patch/bugfix to address this intel compiler issue. it would be most appreciated and help us get the weather model running on hercules (and potentially gaea c5 too)! thank you! |
@ulmononian I was unable to create a simple reproducer. |
thanks for trying. given that, is there anything else that can be done on the mapl side? otherwise, we may have to take this to intel and/or the hercules/gaea sys admins. |
@weiyuan-jiang if you didn't find it, it should be at the top of the
|
this was resolved by upgrading the intel compiler version on hercules and c5 (see JCSDA/spack-stack#673). i believe this issue can be closed. thanks for all the help! |
when testing the ufs weather model in coupled mode (w/ waves and aerosols; i.e. S2SWA) on msu's hercules and gfdl's gaea c5, the model compiles successfully but fails at runtime with what seems to be a mapl-related error (it happens during the gocart run step). note that the model runs successfully if aerosols are turned off (and mapl in turn is not used).
the tests on both machines are w/ newer intel compilers (2022.2.1). the mpi's used are intel-oneapi-mpi/2021.7.1 and cray-mpich/8.1.25, respectively. we are using mapl/2.35.2 and esmf/8.4.2 (can provide full lib. stack if useful). the aerosol model (gocart) hashes we've tested w/ are c485cbc and b94145f; the results are the same with each.
the model fails at the same place on each machine. for example, on hercules the err file (when using esmf + mapl debug versions) shows:
and the out file shows:
@mathomp4 pointed out that the model seems to be dying in History, though he and @bena-nasa did not notice anything particularly wrong with the history file being used (ufs-community/ufs-weather-model#1791 (comment)). @bena-nasa suggested it may be a compiler or memory bug, which was then followed by comments by @mathomp4 about potential issues with mapl & intel compilers newer than 2021.7.x (again, we are using 2022.2.1).
i was just wondering if there is perhaps any further information regarding mapl's compatibility with newer intel compilers? we don't have access to sys admin installed intel compilers older than 2022.2.1 on either hercules or gaea c5 at this time, so we are hoping to find a solution using the available compiler version.
some additional details can be found in the ufs weather model issue #1791.
thank you!!!