Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the regional_spp_sppt_shum_skeb regression test because it uses uninitialized memory #1151

Conversation

SamuelTrahanNOAA
Copy link
Collaborator

@SamuelTrahanNOAA SamuelTrahanNOAA commented Apr 1, 2022

The regional_spp_sppt_shum_skeb uses uninitialized memory. (See NCAR/ccpp-physics#891). It is unreasonable to expect the code to be reproducible, or even work, when it has that bug.

This is holding up other work (NOAA-GSL/ccpp-physics#143) which changes a totally unrelated files. That unrelated change is just enough to put unphysical data in the uninitialized variables. Valid development work should not be held up by a broken regression test.

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.

  • New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.

Instructions: All subsequent sections of text should be filled in as appropriate.

The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.

Description

The regional_spp_sppt_shum_skeb test in rt.conf is commented out and surrounded by comments explaining why.

Issue(s) addressed

This is a temporary solution to NCAR/ccpp-physics#891 to allow other development to continue, until the stochastic thompson scheme developers can fix the bug.

Testing

This simply removes a regression test, so there is no need to run regression tests.

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • opnReqTest for newly added/changed feature
  • CI

Dependencies

None.

@SamuelTrahanNOAA SamuelTrahanNOAA changed the title Remove the regional_spp_sppt_shum_skeb because it uses uninitialied memory Remove the regional_spp_sppt_shum_skeb regression test because it uses uninitialied memory Apr 1, 2022
@SamuelTrahanNOAA SamuelTrahanNOAA changed the title Remove the regional_spp_sppt_shum_skeb regression test because it uses uninitialied memory Remove the regional_spp_sppt_shum_skeb regression test because it uses uninitialized memory Apr 1, 2022
@junwang-noaa
Copy link
Collaborator

@JeffBeck-NOAA is the test owner of regional_spp_sppt_shum_skeb. @JeffBeck-NOAA please let us know if you are OK to not run the test in RT or if you want to have a fix for this test.

@JeffBeck-NOAA
Copy link
Contributor

@junwang-noaa, I would like to fix this test. There is a declared but uninitialized variable (min_rand) being used in module_mp_thompson.F90 that came from the original WRF SPP code. This part of the code was inactivated until SPP was added to FV3, so we were unaware of the non-reproducibility until now. It is possible that this was done on purpose by the original code developer to produce a random number using an uninitialized variable, but we are not sure at the moment. I'm investigating and will have a fix shortly.

@JeffBeck-NOAA
Copy link
Contributor

Can we temporarily ignore the results from this RT until a fix is applied? That will be coming ASAP as I've now found the original intent of the "min_rand" variable and will open a PR shortly to fix it.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Yes. To temporarily ignore results, we disable the regression test in one PR (mine) and then re-enable it in a later one (yours).

@JeffBeck-NOAA
Copy link
Contributor

JeffBeck-NOAA commented Apr 4, 2022

@junwang-noaa @SamuelTrahanNOAA, I have opened PR #1152, which successfully fixes this problem. I ran the RT suite on Hera which has passed for all normal and debug tests, except for the regional_spp_sppt_shum_skeb RT test, which will now have a new baseline result. Note that the test ran to completion, so the initialization of the abs_min_rand variable also resolves the large temperature value crash that @SamuelTrahanNOAA was experiencing.

@SamuelTrahanNOAA
Copy link
Collaborator Author

If the regional_spp_sppt_shum_skeb does not pass in debug mode, you have not fixed the problem.

@JeffBeck-NOAA
Copy link
Contributor

The regional_spp_sppt_shum_skeb test also passes in debug mode.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Maybe we should add a debug mode regression test?

I can give you the changes to rt.conf and rt_gnu.conf for that if you want.

@JeffBeck-NOAA
Copy link
Contributor

@SamuelTrahanNOAA, sure, we can add that. Please feel free to provide the changes and I'll incorporate it in my PR. Thanks.

@SamuelTrahanNOAA
Copy link
Collaborator Author

The test still fails in debug mode on hera.intel. It reaches the 0 hour forecast output but aborts in the post with "floating overflow" which usually means bad data in the model input.

180:  post_fname=PRSLEV.GrbF00
191: forrtl: error (72): floating overflow
191: Image              PC                Routine            Line        Source
191: fv3.exe            000000000BD484DE  Unknown               Unknown  Unknown
191: libpthread-2.17.s  00002AAD4BD7A630  Unknown               Unknown  Unknown
191: fv3.exe            000000000B528628  surfce_                  5424  SURFCE.f
191: fv3.exe            000000000B4EEA15  process_                  104  PROCESS.f
191: fv3.exe            00000000032D7154  post_regional_mp_         205  post_regional.F90
191: fv3.exe            00000000031B77BF  inline_post_mp_in          50  inline_post.F90
191: fv3.exe            0000000003150F96  module_wrt_grid_c        1539  module_wrt_grid_comp.F90

In particular, it aborts when multiplying the potential evaporation by 1000:

         IF (IGET(137)>0) THEN
            GRID1=SPVAL   
            DO J=JSTA,JEND
              DO I=1,IM
 HERE      ==> if(POTEVP(I,J)/=spval) GRID1(I,J) = POTEVP(I,J)*1000. <==
              ENDDO
            ENDDO
            ID(1:25) = 0
            ITPREC     = NINT(TPREC)

The potential evaporation is a model variable (Diag%ep) sent to the post, but the model only calculates it when lssav is true (in GFS_surface_generic.F90). The namelist does not specify lssav, and it defaults to false, so that variable is never calculated. However, GFS_typedefs.F90 initializes it to zero. I can only see two ways ep could have invalid data:

  1. Something is overwriting ep with invalid data.
  2. The model never sends ep to the post, and the post's potevp variable is filled with junk data as a result.
  3. I'm missing something.

Unfortunately, the model crashes before it can write any NetCDF output, so there's no way to check the potential evaporation (pevr_ave in the diag_table).

@JeffBeck-NOAA
Copy link
Contributor

@SamuelTrahanNOAA, my debug run with the fix succeeded on Hera, and post produced the output files without a problem.

You can find the output directory here:

/scratch2/BMC/fv3lam/HIWT/expt_dirs/MP/2019061500

The model source code is here:

/scratch2/BMC/fv3lam/HIWT/ufs-srweather-app_new_code/src

The model executables are here:

/scratch2/BMC/fv3lam/HIWT/ufs-srweather-app_new_code/bin

The code was built with the SRW App devbuild.sh script (devbuild.sh --build-type=DEBUG).

@SamuelTrahanNOAA
Copy link
Collaborator Author

Unfortunately, that isn't good enough. The ufs-weather-model regression test uses the build system in ufs-weather-model, which has a different environment than the one in the srweather app. You may be using a different version of esmf or the post, or using different compilation options.

You can run the test by putting this in a file in tests/badjob.conf

COMPILE | -DAPP=ATM -DDEBUG=ON -DCCPP_SUITES=FV3_RAP,FV3_RAP_sfcdiff,FV3_HRRR,FV3_RRFS_v1beta,FV3_RRFS_v1nssl -D32BIT=ON                     |                                         | fv3 |
RUN     | regional_spp_sppt_shum_skeb                                                                                             |                                         | fv3 |

Then run rt.sh in the tests/ directory:

./rt.sh -l badjob.conf

You can add a "-e" to that line for ecFlow or -r for Rocoto.

Eventually the job will fail and you'll get the error I showed you.

@SamuelTrahanNOAA
Copy link
Collaborator Author

By the way, this has been a persistent problem. The ufs-srweather-app compiles ufs-weather-model differently than ufs-weather-model does itself. Because of that, some tests that fail for ufs-weather-model will succeed for ufs-srweather-app, and vice-versa. For a long time, the HRRR configuration of RRFS would not run at all on Jet unless you built it with ufs-srweather-app.

In other words, if you're using ufs-srweather-app to test whether ufs-weather-model works, you're doing it wrong.

@SamuelTrahanNOAA
Copy link
Collaborator Author

The problem may also be in the regression test case that was created for regional_spp_sppt_shum_skeb. Someone may have updated the workflow without updating the regression test inputs. You can look at my output here:

HERA: /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-PR/pr1152/ABORT

@JeffBeck-NOAA
Copy link
Contributor

@SamuelTrahanNOAA, I was able to reproduce the error. I took a look at pevr_ave (phyf000.nc file) of the non-debug run of the regional_spp_sppt_shum_skeb RT test and compared it to a non-SPP test. They're identical:
Screen Shot 2022-04-04 at 12 55 49 PM

@SamuelTrahanNOAA
Copy link
Collaborator Author

Maybe the difference is in the post requested fields? Perhaps one of them does not request the potential evaporation as output?

@SamuelTrahanNOAA
Copy link
Collaborator Author

The post only reaches the failing line if the potential evaporation is requested as an output. That's what this line means:

         IF (IGET(137)>0) THEN

@JeffBeck-NOAA
Copy link
Contributor

JeffBeck-NOAA commented Apr 4, 2022

@SamuelTrahanNOAA, I just reran with SPP, SPPT, SHUM, and SKEB turned off and produced the same error. Therefore, the resulting failure is not a problem with the stochastic physics code or the fix in my PR but an apparently unrelated and undiagnosed problem due to the lack of DEBUG tests with FV3_HRRR on the regional grid.

@SamuelTrahanNOAA
Copy link
Collaborator Author

In my opinion, fixing that is beyond the scope of this PR and its related issues.

@JeffBeck-NOAA
Copy link
Contributor

JeffBeck-NOAA commented Apr 4, 2022

In my opinion, fixing that is beyond the scope of this PR and its related issues.

Correct. However, PR #1152 will fix the regional_spp_sppt_shum_skeb RT as it stands, to make it reproducible.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Your bug fix introduces another bug. It won't be consistent across different thread counts:

abs_min_rand = ABS(MINVAL(rand_pert(:,1)))

The rand_pert only contains the values from the current block. For this to be consistent between different thread counts, it needs to be the minimum across all gridpoints (or at least all points in the tile).

@JeffBeck-NOAA
Copy link
Contributor

All blocks are consolidated into a single 1D array in the stochastic physics wrapper (stochastic_physics_wrapper.F90) before spp_wts_mp is passed to module_mp_thompson.F90 as rand_pert:

do nb=1,Atm_block%nblks
      GFS_Data(nb)%Coupling%spp_wts_mp(:,:) = spp_wts(nb,1:GFS_Control%blksz(nb),:,n)
end do

@SamuelTrahanNOAA
Copy link
Collaborator Author

It will still depend on MPI decomposition (layout_x and layout_y) since different ranks will have different parts of the domain.

@SamuelTrahanNOAA
Copy link
Collaborator Author

Closing in favor of proper bugfix in #1152

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants