calculation with NaN in CLM4_5 with PGI compiler on Titan #165

worleyph · 2015-04-01T13:18:08Z

Recent experiments with CLM4_5 (both I and B case) using pgi/14.10 on Titan are failing with "urban net longwave radiation error: no convergence". Identical experiments using the Intel compiler (with all of the recent Intel-specific fixes) does not exhibit this problem.

I tracked this down to computation with NaNs. The field in question is initialized to NaN, but, at first glance, it appears to later be set to a non-NaN value. I am still investigating, but may hand this off to someone if I run out of time.

bishtgautam · 2015-04-01T13:57:40Z

The possibly culprit could be the values returned by BandDiagonal() in SoilTemperatureMod.F90. Can @worleyph or @daliwang or @acme-y9s provide notes on how to create a case (I or B ) to reproduce this error?

worleyph · 2015-04-01T14:28:40Z

This "works" (i.e. doesn't work) for me

create_newcase -case ne30_ICLM45BGC_pgi_test -compset ICLM45BGC -res ne30_g16 -mach titan -compiler pgi -project cli115

worleyph · 2015-04-01T14:39:02Z

@bishtgautam - it appears that the values coming out of BandDiagonal do have problems - t_soisno(c,j) contains NaNs (level 1, so 'soil'). Endrun is called after the first appearance, so others may be NaNs as well.

bishtgautam · 2015-04-01T15:36:11Z

My guess is that diagonal entries of bmatrix need to be initialize as 1.d0. And, diagonal entries of bmatrix are assembled from bamatrix_snow, bamatrix_ssw, and bamatrix_soil. Possible initialization modifications should happen at line no. 3128, 3777, and 4522

While we are at this, possibly initialization of tvector should be set to 0.d0 at line no. 387

(I'm on travel today, but can take a look at it tomorrow).

worleyph · 2015-04-01T20:49:08Z

I traced it back to (at least) rt_snow in SetRHSVec (rt_ssw and rt_soil seem okay?) rt_snow is initialized to nan. I checked within the routines that calculate rt_snow and do not see any problems there. However, rt_snow is set only when a number of if-tests are satisfied. It is almost as if rt_snow is not being calcualted for all of the indices that it is referenced for. If so, this would be a real bug, not a pgi problem.
I've run out of time and simple ideas at the moment. If you start in on this tomorrow, please send me a note to that effect.

worleyph · 2015-04-02T01:46:51Z

My latest NaN tests are also reporting NaNs when running with the Intel compiler, but not the ones that "matter". So my diagnosis as to when the NaNs first show up is not accurate. I'll communicate with Gautam directly until we can get this worked out.

rljacob · 2015-04-02T02:29:52Z

I tried the gnu compiler on blues with CLM45 and it worked (although I haven't tried recently).

bishtgautam · 2015-04-02T03:28:33Z

I suspect the issue is the ~~product~~ division or/and multiplication of NaN and zero, which is being handled differently by different compilers within the BandDiagonal() solve. I believe ideally for an inactive layer (corresponding to snow or standing water layer), the diagonal term in the matrix should be set to 1._r8 and off-diagonal term set to 0._r8; while corresponding entry in the rvector should be set to 0._r8.

Instead of debugging this error on a global CLM grid, I'm trying to see if I can reproduce this issue on a smaller 1x1 grid (-res 1x1_urbanc_alpha).

worleyph · 2015-04-02T03:52:54Z

What I am finding at the moment is that the PGI compiler is finding NaNs in
result(:)=r(ci,jtop(ci):jbot(ci))
in BandDiagonal while the Intel compiler is not. (Here r(:,:) is an unmodified input parameter. )This seems to imply that there are more NaNs coming out of the call to SetRHSVec using PGI than using Intel. Unfortunately there are NaNs for both (since this is how the fields are initialized) and I don't know which indices are valid and which are not.

bishtgautam · 2015-04-02T04:07:48Z

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
(2) Even if answer to (1) is yes, how can difference in number of NaNs from the output of BandDiagonal() be attributed to differences in rvector from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different number of 0._r8 for PGI and Intel.

singhbalwinder · 2015-04-02T04:28:52Z

I will give it a try with the NAG compiler to see if it provides us with
more info.

On Wed, Apr 1, 2015 at 9:07 PM, Gautam Bisht notifications@github.com
wrote:

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
(2) Even if answer to (1) is yes, how can difference in number of NaNs
from the output of BandDiagonal() be attributed to differences in rvector
from SetRHSVec(). Maybe the bmatrix from SetMatrix() also has different
number of 0._r8 for PGI and Intel.

—
Reply to this email directly or view it on GitHub
#165 (comment).

worleyph · 2015-04-02T13:21:24Z

(1) Is jtop(ci):jbot(ci) same for PGI and Intel?
It clearly is not exactly the same, since PGI has some NaNs and Intel does not.

worleyph · 2015-04-03T17:01:38Z

With help from @bishtgautam and @singhbalwinder I found a workaround for the problem (for this one case - it will need to be tested more extensively to determine whether it is sufficient). It is a small change, but does appear to be a PGI compiler bug. Someone else will have to decide whether it is worth reporting. According to @bishtgautam , the clm4_5_r097 tag, being brought in for consideration for V2, works fine for this case with PGI.

In the routine InitCold in the file TemperatureType.F90, the input parameters are declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:)

If these are instead declared

real(r8)          , intent(in) :: em_roof_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_wall_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_improad_lun(bounds%begl:bounds%endl)
real(r8)          , intent(in) :: em_perroad_lun(bounds%begl:bounds%endl)

then the code works. Note that immediately following this are statements to the effect

SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_wall_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_improad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
SHR_ASSERT_ALL((ubound(em_perroad_lun) == (/bounds%endl/)), errMsg(FILE, LINE))

These assert statements are tested only when compiled in DEBUG mode. @singhbalwinder trying doing just this with the NAG compiler, but got a seg. fault, so that wasn't very informative.

This style of code is used through this part of CLM45 - leave off upper bounds on input paramters and immediately following test, using SHR_ASSERT_ALL, that the upper bounds are the expected value. I have no idea what the history of this coding style is.

In this particular failure mode, if you do not specify the upper bound, 'ubound{em_roof_lun)' returns 1701611158 . Probing what is going on later in the code leads to a seg. fault. Not 'probing' results in certain values not being set (leaving the original NaN initialization).

I am reassigning this to @bishtgautam to test and fix.

worleyph · 2015-04-03T17:09:35Z

Also, our version of InitCold appears identical to that in clm4_5_r097, so the source of the problem is elsewhere. This workaround is just that. A further investigation might find a better fix, but I'll leave that to the Land Group to decide if they want to pursue this.

bishtgautam · 2015-04-04T01:35:13Z

I verified that clm4_5_r088 works with PGI on Titan for the same ICLM45BGC case that fails with ACME master.

Tag name: clm4_5_1_r088
Originator(s): muszala (Stefan Muszala)
Date: Wed Oct 1 09:24:43 MDT 2014
One-line Summary: Pull out ED deps. in TemperatureTypeMod, can now compile with pgi 14.7

Purpose of changes: Pull out the dependency on EDBioType in TemperatureType.F90. The ED
variables related to phenology now reside in EDPhenologyMod.F90. This refactor also had
the effect of getting past a PGI 14.7 ICE which looks like it was due to the use of EDbio_vars
in TemperatureType.F90. When I pulled out lines 1227 and 1226 of biogeophys/TemperatureType.F90
(in clm4_5_1_r087) and passed the two EDbio_vars variables through the argument list the ICE
went away.

This tag breaks ED restart tests. We went ahead with the tag because we had to fix a more
general problem with the CESM and CAM builds and PGI 14.7. the ED v0.1.0 branch does not
have these modifications and may be used as an alternative. A new clm tag will shortly
follow that addresses any remaining problems.

After adding #if 0 and #endif for code between line no. 1227-1267 of TemperatureType.F90, the ICLM45BGC ran successfully on Titan.

Q) Is it worth making code changes in ACME Land Model (ALM) that would also work for ED code given that we are going create a fork for ED development? If yes, there were additional ED related changes made in clm4_5_r086 that we may have to bring in ALM. (@climate-dude, others: Any thoughts?)

climate-dude · 2015-04-04T02:00:39Z

I think these fixes are worth bringing into the mainline ALM development. I thought @thorntonpe was making some major changes in TemperatureType.F90, but I have not seen these come in yet. To get past compiler issues, we should bring in these workarounds.

worleyph · 2015-04-04T03:24:10Z

This decision is clearly the Land Group's. I just want to note that the NCAR commit that eliminated the problem was "inadvertent". It was not the result of a diagnosis and targeted change. We similarly have a modification that is equally effective (and equally a workaround). The decision should be based on whether any of the mods in r88 are justified on their own. Perhaps @thorntonpe 's changes will inadvertently eliminate this problem as well? Or we can rip out any 'use_ed' code blocks, since that worked for @bishtgautam, as I don't think that we are using this in V1?

A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan, when compiled with PGI. This commit resolves #165. [BFB]

bishtgautam · 2015-04-04T15:47:32Z

After further testing I realized that #ifdef 0 doesn't fixes the problem and Pat's solution of adding bounds%endl is the only thing that does work.

worleyph · 2015-04-04T23:12:19Z

I enabled all tests of the form
SHR_ASSERT_ALL((ubound(em_roof_lun) == (/bounds%endl/)), errMsg(FILE, LINE))
throughout the code and the only failures were in InitCold . (It died with the check on em_roof_lun - I haven't checked the others yet.) So, good news I guess. Making the change that @bishtgautam submitted in his PR addresses the only known impact of this compiler bug.

A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan, when compiled with PGI. This commit resolves #165. [BFB]

) Update mpas-source submodule to pick up indexing fix This PR brings in a new mpas-source submodule that fixes an indexing issue on an array that was causing debug tests to fail in the online time-averaging mpas analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean surface, so is always zero. It was using some uninitialized values when using k=1, so start loop at k=2. See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165) Fixes #2768 [BFB]

Update mpas-source submodule to pick up indexing fix This PR brings in a new mpas-source submodule that fixes an indexing issue on an array that was causing debug tests to fail in the online time-averaging mpas analysis. Vertical index k=1 on activeTracerVerticalAdvectionTopFlux is the ocean surface, so is always zero. It was using some uninitialized values when using k=1, so start loop at k=2. See [MPAS-Model PR #165](MPAS-Dev/MPAS-Model#165) Fixes #2768 [BFB]

worleyph added bug Land Critical labels Apr 1, 2015

worleyph self-assigned this Apr 1, 2015

worleyph assigned bishtgautam and unassigned worleyph Apr 3, 2015

bishtgautam pushed a commit that referenced this issue Apr 4, 2015

Fixed a bug in CLM related to NaNs when using PGI on Titan.

acb0fd4

A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan, when compiled with PGI. This commit resolves #165. [BFB]

bishtgautam mentioned this issue Apr 4, 2015

Fixed a bug in CLM related to NaNs when using PGI on Titan. #176

Closed

bishtgautam pushed a commit that referenced this issue Apr 5, 2015

Fixed a bug in CLM related to NaNs when using PGI on Titan.

87c99e3

A B1850C5L45BGC and ICLM45BGC case ran successfully after this fix on Titan, when compiled with PGI. This commit resolves #165. [BFB]

bishtgautam mentioned this issue Apr 5, 2015

Fixed a bug in CLM related to NaNs when using PGI on Titan. #178

Merged

daliwang closed this as completed in #178 Apr 6, 2015

bishtgautam mentioned this issue Jul 7, 2015

FC5PM compset fails to buil on Titan with PGI compiler #263

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

worleyph commented Apr 1, 2015

bishtgautam commented Apr 1, 2015

worleyph commented Apr 1, 2015

worleyph commented Apr 1, 2015

bishtgautam commented Apr 1, 2015

worleyph commented Apr 1, 2015

worleyph commented Apr 2, 2015

rljacob commented Apr 2, 2015

bishtgautam commented Apr 2, 2015

worleyph commented Apr 2, 2015

bishtgautam commented Apr 2, 2015

singhbalwinder commented Apr 2, 2015

worleyph commented Apr 2, 2015

worleyph commented Apr 3, 2015

worleyph commented Apr 3, 2015

bishtgautam commented Apr 4, 2015

climate-dude commented Apr 4, 2015

worleyph commented Apr 4, 2015

bishtgautam commented Apr 4, 2015

worleyph commented Apr 4, 2015

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

calculation with NaN in CLM4_5 with PGI compiler on Titan #165

Comments

worleyph commented Apr 1, 2015

bishtgautam commented Apr 1, 2015

worleyph commented Apr 1, 2015

worleyph commented Apr 1, 2015

bishtgautam commented Apr 1, 2015

worleyph commented Apr 1, 2015

worleyph commented Apr 2, 2015

rljacob commented Apr 2, 2015

bishtgautam commented Apr 2, 2015

worleyph commented Apr 2, 2015

bishtgautam commented Apr 2, 2015

singhbalwinder commented Apr 2, 2015

worleyph commented Apr 2, 2015

worleyph commented Apr 3, 2015

worleyph commented Apr 3, 2015

bishtgautam commented Apr 4, 2015

climate-dude commented Apr 4, 2015

worleyph commented Apr 4, 2015

bishtgautam commented Apr 4, 2015

worleyph commented Apr 4, 2015