Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

Closed
worleyph opened this issue Apr 1, 2015 · 19 comments · Fixed by #178
Closed

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

worleyph opened this issue Apr 1, 2015 · 19 comments · Fixed by #178

Comments

@worleyph
Copy link
Contributor

worleyph commented Apr 1, 2015

Recent experiments with CLM4_5 (both I and B case) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) does not exhibit this problem.

I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.

@bishtgautam
Copy link
Contributor

The possibly culprit could be the values returned by BandDiagonal() in SoilTemperatureMod.F90. Can @worleyph or @daliwang or @acme-y9s provide notes on how to create a case (I or B ) to reproduce this error?

@worleyph
Copy link
Contributor Author

worleyph commented Apr 1, 2015

This "works" (i.e. doesn't work) for me

create_newcase -case ne30_ICLM45BGC_pgi_test -compset ICLM45BGC -res ne30_g16 -mach titan -compiler pgi -project cli115

@worleyph
Copy link
Contributor Author

worleyph commented Apr 1, 2015

@bishtgautam - it appears that the values coming out of BandDiagonal do have problems - t_soisno(c,j) contains NaNs (level 1, so 'soil'). Endrun is called after the first appearance, so others may be NaNs as well.

@bishtgautam
Copy link
Contributor

My guess is that diagonal entries of bmatrix need to be initialize as 1.d0. And, diagonal entries of bmatrix are assembled from bamatrix_snow, bamatrix_ssw, and bamatrix_soil. Possible initialization modifications should happen at line no. 3128, 3777, and 4522

While we are at this, possibly initialization of tvector should be set to 0.d0 at line no. 387

(I'm on travel today, but can take a look at it tomorrow).

@worleyph
Copy link
Contributor Author

worleyph commented Apr 1, 2015

I traced it back to (at least) rt_snow in SetRHSVec (rt_ssw and rt_soil seem okay?) rt_snow is initialized to nan. I checked within the routines that calculate rt_snow and do not see any problems there. However, rt_snow is set only when a number of if-tests are satisfied. It is almost as if rt_snow is not being calcualted for all of the indices that it is referenced for. If so, this would be a real bug, not a pgi problem.
I've run out of time and simple ideas at the moment. If you start in on this tomorrow, please send me a note to that effect.

@worleyph
Copy link
Contributor Author

worleyph commented Apr 2, 2015

My latest NaN tests are also reporting NaNs when running with the Intel compiler, but not the ones that "matter". So my diagnosis as to when the NaNs first show up is not accurate. I'll communicate with Gautam directly until we can get this worked out.

@rljacob
Copy link
Member

rljacob commented Apr 2, 2015

I tried the gnu compiler on blues with CLM45 and it worked (although I haven't tried recently).

@bishtgautam
Copy link
Contributor

I suspect the issue is the product division or/and multiplication of NaN and zero, which is being handled differently by different compilers within the BandDiagonal() solve. I believe ideally for an inactive layer (corresponding to snow or standing water layer), the diagonal term in the matrix should be set to 1._r8 and off-diagonal term set to 0._r8; while corresponding entry in the rvector should be set to 0._r8.

Instead of debugging this error on a global CLM grid, I'm trying to see if I can reproduce this issue on a smaller 1x1 grid (-res 1x1_urbanc_alpha).

@worleyph
Copy link
Contributor Author

worleyph commented Apr 2, 2015

What I am finding at the moment is that the PGI compiler is finding NaNs in
result(:)=r(ci,jtop(ci):jbot(ci))
in BandDiagonal while the Intel compiler is not. (Here r(:,:) is an unmodified input parameter. )This seems to imply that there are more NaNs coming out of the call to SetRHSVec using PGI than using Intel. Unfortunately there are NaNs for both (since this is how the fields are initialized) and I don't know which indices are valid and which are not.

@bishtgautam
Copy link
Contributor

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
(2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal() be attributed to differences in rvector from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different number of 0._r8 for PGI and Intel.

@singhbalwinder
Copy link
Contributor

I will give it a try with the NAG compiler to see if it provides us with
more info.

On Wed, Apr 1, 2015 at 9:07 PM, Gautam Bisht notifications@github.com
wrote:

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
(2) Even if answer to (1) is yes, how can difference in number of NaNs
from the output of BandDiagonal() be attributed to differences in rvector
from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different
number of 0._r8 for PGI and Intel.


Reply to this email directly or view it on GitHub
#165 (comment).

@worleyph
Copy link
Contributor Author

worleyph commented Apr 2, 2015

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
It clearly is not exactly the same, since PGI has some NaNs and Intel does not.

@worleyph
Copy link
Contributor Author

worleyph commented Apr 3, 2015

With help from @bishtgautam and @singhbalwinder I found a workaround for the problem (for this one case - it will need to be tested more extensively to determine whether it is sufficient). It is a small change, but does appear to be a PGI compiler bug. Someone else will have to decide whether it is worth reporting. According to @bishtgautam , the clm4_5_r097 tag, being brought in for consideration for V2, works fine for this case with PGI.

In the routine InitCold in the file TemperatureType.F90, the input parameters are declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:)

If these are instead declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:bounds%endl)

then the code works. Note that immediately following this are statements to the effect

SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_wall_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_improad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_perroad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))

These assert statements are tested only when compiled in DEBUG mode. @singhbalwinder trying doing just this with the NAG compiler, but got a seg. fault, so that wasn't very informative.

This style of code is used through this part of CLM45 - leave off upper bounds on input paramters and immediately following test, using SHR_ASSERT_ALL, that the upper bounds are the expected value. I have no idea what the history of this coding style is.

In this particular failure mode, if you do not specify the upper bound, 'ubound{em_roof_lun)' returns 1701611158 . Probing what is going on later in the code leads to a seg. fault. Not 'probing' results in certain values not being set (leaving the original NaN initialization).

I am reassigning this to @bishtgautam to test and fix.

@worleyph worleyph assigned bishtgautam and unassigned worleyph Apr 3, 2015
@worleyph
Copy link
Contributor Author

worleyph commented Apr 3, 2015

Also, our version of InitCold appears identical to that in clm4_5_r097, so the source of the problem is elsewhere. This workaround is just that. A further investigation might find a better fix, but I'll leave that to the Land Group to decide if they want to pursue this.

@bishtgautam
Copy link
Contributor

I verified that clm4_5_r088 works with PGI on Titan for the same ICLM45BGC case that fails with ACME master.

Tag name: clm4_5_1_r088
Originator(s): muszala (Stefan Muszala)
Date: Wed Oct 1 09:24:43 MDT 2014
One-line Summary: Pull out ED deps. in TemperatureTypeMod, can now compile with pgi 14.7

Purpose of changes: Pull out the dependency on EDBioType in TemperatureType.F90. The ED
variables related to phenology now reside in EDPhenologyMod.F90. This refactor also had
the effect of getting past a PGI 14.7 ICE which looks like it was due to the use of EDbio_vars
in TemperatureType.F90. When I pulled out lines 1227 and 1226 of biogeophys/TemperatureType.F90
(in clm4_5_1_r087) and passed the two EDbio_vars variables through the argument list the ICE
went away.

This tag breaks ED restart tests. We went ahead with the tag because we had to fix a more
general problem with the CESM and CAM builds and PGI 14.7. the ED v0.1.0 branch does not
have these modifications and may be used as an alternative. A new clm tag will shortly
follow that addresses any remaining problems.

After adding #if 0 and #endif for code between line no. 1227-1267 of TemperatureType.F90, the ICLM45BGC ran successfully on Titan.

Q) Is it worth making code changes in ACME Land Model (ALM) that would also work for ED code given that we are going create a fork for ED development? If yes, there were additional ED related changes made in clm4_5_r086 that we may have to bring in ALM. (@climate-dude, others: Any thoughts?)

@climate-dude
Copy link
Contributor

I think these fixes are worth bringing into the mainline ALM development. I thought @thorntonpe was making some major changes in TemperatureType.F90, but I have not seen these come in yet. To get past compiler issues, we should bring in these workarounds.

@worleyph
Copy link
Contributor Author

worleyph commented Apr 4, 2015

This decision is clearly the Land Group's. I just want to note that the NCAR commit that eliminated the problem was "inadvertent". It was not the result of a diagnosis and targeted change. We similarly have a modification that is equally effective (and equally a workaround). The decision should be based on whether any of the mods in r88 are justified on their own. Perhaps @thorntonpe 's changes will inadvertently eliminate this problem as well? Or we can rip out any 'use_ed' code blocks, since that worked for @bishtgautam, as I don't think that we are using this in V1?

bishtgautam pushed a commit that referenced this issue Apr 4, 2015
A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan,
when compiled with PGI. This commit resolves #165.

[BFB]
@bishtgautam
Copy link
Contributor

After further testing I realized that #ifdef 0 doesn't fixes the problem and Pat's solution of adding bounds%endl is the only thing that does work.

@worleyph
Copy link
Contributor Author

worleyph commented Apr 4, 2015

I enabled all tests of the form
SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
throughout the code and the only failures were in InitCold . (It died with the check on em_roof_lun - I haven't checked the others yet.) So, good news I guess. Making the change that @bishtgautam submitted in his PR addresses the only known impact of this compiler bug.

bishtgautam pushed a commit that referenced this issue Apr 5, 2015
A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan,
when compiled with PGI. This commit resolves #165.

[BFB]
jonbob added a commit that referenced this issue Feb 27, 2019
)

Update mpas-source submodule to pick up indexing fix

This PR brings in a new mpas-source submodule that fixes an indexing issue on an
array that was causing debug tests to fail in the online time-averaging mpas
analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean
surface, so is always zero. It was using some uninitialized values when using
k=1, so start loop at k=2.
See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165)

Fixes #2768

[BFB]
jonbob added a commit that referenced this issue Feb 27, 2019
Update mpas-source submodule to pick up indexing fix

This PR brings in a new mpas-source submodule that fixes an indexing issue on an
array that was causing debug tests to fail in the online time-averaging mpas
analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean
surface, so is always zero. It was using some uninitialized values when using
k=1, so start loop at k=2.
See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165)

Fixes #2768

[BFB]
jonbob added a commit that referenced this issue Feb 28, 2019
Update mpas-source submodule to pick up indexing fix

This PR brings in a new mpas-source submodule that fixes an indexing issue on an
array that was causing debug tests to fail in the online time-averaging mpas
analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean
surface, so is always zero. It was using some uninitialized values when using
k=1, so start loop at k=2.
See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165)

Fixes #2768

[BFB]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment