calculation with NaN in CLM4_5 with PGI compiler on Titan #165
Comments
A possible culprit is the values returned by BandDiagonal() in SoilTemperatureMod.F90. Can @worleyph or @daliwang or @acme-y9s provide notes on how to create a case (I or B) to reproduce this error?
This "works" (i.e. doesn't work) for me: create_newcase -case ne30_ICLM45BGC_pgi_test -compset ICLM45BGC -res ne30_g16 -mach titan -compiler pgi -project cli115
@bishtgautam - it appears that the values coming out of BandDiagonal do have problems - t_soisno(c,j) contains NaNs (level 1, so 'soil'). Endrun is called after the first appearance, so others may be NaNs as well.
My guess is that diagonal entries of While we are at this, possibly initialization of (I'm on travel today, but can take a look at it tomorrow).
I traced it back to (at least) rt_snow in SetRHSVec (rt_ssw and rt_soil seem okay?). rt_snow is initialized to NaN. I checked within the routines that calculate rt_snow and do not see any problems there. However, rt_snow is set only when a number of if-tests are satisfied. It is almost as if rt_snow is not being calculated for all of the indices at which it is referenced. If so, this would be a real bug, not a PGI problem.
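The failure pattern described above (a field initialized to NaN and then set only under some conditions) can be sketched in isolation. This is a minimal illustration, not code from CLM: the module, the routine names (init_to_nan, count_unset) and the is_snow mask are hypothetical, and the conditional assignment merely stands in for the if-tests mentioned in the comment.

```fortran
module nan_sentinel_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
contains
  ! Fill an array with quiet NaNs so that entries never assigned
  ! afterwards can be detected (hypothetical helper).
  subroutine init_to_nan(values)
    real, intent(out) :: values(:)
    values = ieee_value(1.0, ieee_quiet_nan)
  end subroutine init_to_nan

  ! Count entries that still hold the NaN sentinel (hypothetical helper).
  function count_unset(values) result(n_unset)
    real, intent(in) :: values(:)
    integer :: n_unset
    n_unset = count(ieee_is_nan(values))
  end function count_unset
end module nan_sentinel_demo

program find_unset
  use nan_sentinel_demo
  implicit none
  integer, parameter :: n = 6
  real :: rt(n)
  logical :: is_snow(n)   ! stand-in for the if-tests guarding the assignment
  integer :: i

  call init_to_nan(rt)
  is_snow = [.true., .true., .false., .true., .false., .true.]

  ! Only some indices are assigned; the rest keep the NaN sentinel.
  do i = 1, n
     if (is_snow(i)) rt(i) = real(i)
  end do

  ! Report every index that is referenced later but was never set.
  do i = 1, n
     if (ieee_is_nan(rt(i))) print *, 'index', i, 'was never set'
  end do
  print *, 'entries left as NaN:', count_unset(rt)
end program find_unset
```

Scanning for the sentinel right after the assignment loop, rather than after the solver, narrows down whether the NaNs come from unset inputs or from the computation itself.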
My latest NaN tests are also reporting NaNs when running with the Intel compiler, but not the ones that "matter". So my diagnosis as to when the NaNs first show up is not accurate. I'll communicate with Gautam directly until we can get this worked out.
I tried the gnu compiler on blues with CLM45 and it worked (although I haven't tried recently).
I suspect the issue is the Instead of debugging this error on a global CLM grid, I'm trying to see if I can reproduce this issue on a smaller 1x1 grid (
What I am finding at the moment is that the PGI compiler is finding NaNs in
(1) Is
I will give it a try with the NAG compiler to see if it provides us with
With help from @bishtgautam and @singhbalwinder I found a workaround for the problem (for this one case - it will need to be tested more extensively to determine whether it is sufficient). It is a small change, but does appear to be a PGI compiler bug. Someone else will have to decide whether it is worth reporting. According to @bishtgautam , the clm4_5_r097 tag, being brought in for consideration for V2, works fine for this case with PGI. In the routine InitCold in the file TemperatureType.F90, the input parameters are declared
If these are instead declared
then the code works. Note that immediately following this are statements to the effect SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(__FILE__, __LINE__)). These assert statements are tested only when compiled in DEBUG mode. @singhbalwinder tried doing just this with the NAG compiler, but got a seg. fault, so that wasn't very informative. This style of code is used throughout this part of CLM45: leave off upper bounds on input parameters and, immediately following, test with SHR_ASSERT_ALL that the upper bounds are the expected value. I have no idea what the history of this coding style is. In this particular failure mode, if you do not specify the upper bound, 'ubound(em_roof_lun)' returns 1701611158. Probing what is going on later in the code leads to a seg. fault. Not 'probing' results in certain values not being set (leaving the original NaN initialization). I am reassigning this to @bishtgautam to test and fix.
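The two declaration styles contrasted above can be compared in a small standalone sketch. It assumes the difference is between an assumed-shape dummy with only a lower bound and a dummy with both bounds stated; the module, function names and bound values below are hypothetical and are not taken from TemperatureType.F90.

```fortran
module bounds_demo
  implicit none
contains
  ! Assumed-shape dummy: only the lower bound is declared, so the upper
  ! bound is derived from the extent of the actual argument.
  function upper_assumed(begl, arr) result(ub)
    integer, intent(in) :: begl
    real, intent(in) :: arr(begl:)
    integer :: ub
    ub = ubound(arr, 1)
  end function upper_assumed

  ! Explicit-shape dummy: both bounds are stated in the declaration.
  function upper_explicit(begl, endl, arr) result(ub)
    integer, intent(in) :: begl, endl
    real, intent(in) :: arr(begl:endl)
    integer :: ub
    ub = ubound(arr, 1)
  end function upper_explicit
end module bounds_demo

program check_bounds
  use bounds_demo
  implicit none
  integer, parameter :: begl = 5, endl = 12   ! illustrative bounds only
  real :: em_roof(begl:endl)

  em_roof = 0.9
  print *, 'assumed-shape ubound: ', upper_assumed(begl, em_roof)
  print *, 'explicit-shape ubound:', upper_explicit(begl, endl, em_roof)

  ! Plays the role of the SHR_ASSERT_ALL check: stop if the derived
  ! upper bound is not the expected one.
  if (upper_assumed(begl, em_roof) /= endl) error stop 'unexpected upper bound'
end program check_bounds
```

With a standard-conforming compiler both functions should report endl, so a garbage value from the assumed-shape version in a reduced case like this would be evidence for a compiler problem rather than a source bug.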
Also, our version of InitCold appears identical to that in clm4_5_r097, so the source of the problem is elsewhere. This workaround is just that. A further investigation might find a better fix, but I'll leave that to the Land Group to decide if they want to pursue this.
I verified that
After adding Q) Is it worth making code changes in ACME Land Model (ALM) that would also work for ED code given that we are going to create a fork for ED development? If yes, there were additional ED-related changes made in
I think these fixes are worth bringing into the mainline ALM development. I thought @thorntonpe was making some major changes in TemperatureType.F90, but I have not seen these come in yet. To get past compiler issues, we should bring in these workarounds.
This decision is clearly the Land Group's. I just want to note that the NCAR commit that eliminated the problem was "inadvertent". It was not the result of a diagnosis and targeted change. We similarly have a modification that is equally effective (and equally a workaround). The decision should be based on whether any of the mods in r88 are justified on their own. Perhaps @thorntonpe's changes will inadvertently eliminate this problem as well? Or we can rip out any 'use_ed' code blocks, since that worked for @bishtgautam, as I don't think that we are using this in V1?
A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan, when compiled with PGI. This commit resolves #165. [BFB]
After further testing I realized that
I enabled all tests of the form
Update mpas-source submodule to pick up indexing fix This PR brings in a new mpas-source submodule that fixes an indexing issue on an array that was causing debug tests to fail in the online time-averaging mpas analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean surface, so is always zero. It was using some uninitialized values when using k=1, so start loop at k=2. See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165) Fixes #2768 [BFB]
Recent experiments with CLM4_5 (both I and B cases) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) do not exhibit this problem.
I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.