-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove the regional_spp_sppt_shum_skeb regression test because it uses uninitialized memory #1151
Conversation
@JeffBeck-NOAA is the test owner of regional_spp_sppt_shum_skeb. @JeffBeck-NOAA please let us know if you are OK to not run the test in RT or if you want to have a fix for this test. |
@junwang-noaa, I would like to fix this test. There is a declared but uninitialized variable (min_rand) being used in module_mp_thompson.F90 that came from the original WRF SPP code. This part of the code was inactivated until SPP was added to FV3, so we were unaware of the non-reproducibility until now. It is possible that this was done on purpose by the original code developer to produce a random number using an uninitialized variable, but we are not sure at the moment. I'm investigating and will have a fix shortly. |
Can we temporarily ignore the results from this RT until a fix is applied? That will be coming ASAP as I've now found the original intent of the "min_rand" variable and will open a PR shortly to fix it. |
Yes. To temporarily ignore results, we disable the regression test in one PR (mine) and then re-enable it in a later one (yours). |
@junwang-noaa @SamuelTrahanNOAA, I have opened PR #1152, which successfully fixes this problem. I ran the RT suite on Hera which has passed for all normal and debug tests, except for the regional_spp_sppt_shum_skeb RT test, which will now have a new baseline result. Note that the test ran to completion, so the initialization of the abs_min_rand variable also resolves the large temperature value crash that @SamuelTrahanNOAA was experiencing. |
If the regional_spp_sppt_shum_skeb does not pass in debug mode, you have not fixed the problem. |
The regional_spp_sppt_shum_skeb test also passes in debug mode. |
Maybe we should add a debug mode regression test? I can give you the changes to rt.conf and rt_gnu.conf for that if you want. |
@SamuelTrahanNOAA, sure, we can add that. Please feel free to provide the changes and I'll incorporate it in my PR. Thanks. |
The test still fails in debug mode on hera.intel. It reaches the 0 hour forecast output but aborts in the post with "floating overflow" which usually means bad data in the model input.
In particular, it aborts when multiplying the potential evaporation by 1000:
The potential evaporation is a model variable (Diag%ep) sent to the post, but the model only calculates it when lssav is true (in GFS_surface_generic.F90). The namelist does not specify lssav, and it defaults to false, so that variable is never calculated. However, GFS_typedefs.F90 initializes it to zero. I can only see two ways ep could have invalid data:
Unfortunately, the model crashes before it can write any NetCDF output, so there's no way to check the potential evaporation (pevr_ave in the diag_table). |
@SamuelTrahanNOAA, my debug run with the fix succeeded on Hera, and post produced the output files without a problem. You can find the output directory here: /scratch2/BMC/fv3lam/HIWT/expt_dirs/MP/2019061500 The model source code is here: /scratch2/BMC/fv3lam/HIWT/ufs-srweather-app_new_code/src The model executables are here: /scratch2/BMC/fv3lam/HIWT/ufs-srweather-app_new_code/bin The code was built with the SRW App devbuild.sh script (devbuild.sh --build-type=DEBUG). |
Unfortunately, that isn't good enough. The ufs-weather-model regression test uses the build system in ufs-weather-model, which has a different environment than the one in the srweather app. You may be using a different version of esmf or the post, or using different compilation options. You can run the test by putting this in a file in tests/badjob.conf
Then run rt.sh in the tests/ directory: ./rt.sh -l badjob.conf You can add a "-e" to that line for ecFlow or -r for Rocoto. Eventually the job will fail and you'll get the error I showed you. |
By the way, this has been a persistent problem. The ufs-srweather-app compiles ufs-weather-model differently than ufs-weather-model does itself. Because of that, some tests that fail for ufs-weather-model will succeed for ufs-srweather-app, and vice-versa. For a long time, the HRRR configuration of RRFS would not run at all on Jet unless you built it with ufs-srweather-app. In other words, if you're using ufs-srweather-app to test whether ufs-weather-model works, you're doing it wrong. |
The problem may also be in the regression test case that was created for regional_spp_sppt_shum_skeb. Someone may have updated the workflow without updating the regression test inputs. You can look at my output here: HERA: /scratch2/BMC/wrfruc/Samuel.Trahan/gsl-PR/pr1152/ABORT |
@SamuelTrahanNOAA, I was able to reproduce the error. I took a look at pevr_ave (phyf000.nc file) of the non-debug run of the regional_spp_sppt_shum_skeb RT test and compared it to a non-SPP test. They're identical: |
Maybe the difference is in the post requested fields? Perhaps one of them does not request the potential evaporation as output? |
The post only reaches the failing line if the potential evaporation is requested as an output. That's what this line means:
|
@SamuelTrahanNOAA, I just reran with SPP, SPPT, SHUM, and SKEB turned off and produced the same error. Therefore, the resulting failure is not a problem with the stochastic physics code or the fix in my PR but an apparently unrelated and undiagnosed problem due to the lack of DEBUG tests with FV3_HRRR on the regional grid. |
In my opinion, fixing that is beyond the scope of this PR and its related issues. |
Correct. However, PR #1152 will fix the regional_spp_sppt_shum_skeb RT as it stands, to make it reproducible. |
Your bug fix introduces another bug. It won't be consistent across different thread counts:
The rand_pert only contains the values from the current block. For this to be consistent between different thread counts, it needs to be the minimum across all gridpoints (or at least all points in the tile). |
All blocks are consolidated into a single 1D array in the stochastic physics wrapper (stochastic_physics_wrapper.F90) before
|
It will still depend on MPI decomposition (layout_x and layout_y) since different ranks will have different parts of the domain. |
Closing in favor of proper bugfix in #1152 |
The regional_spp_sppt_shum_skeb uses uninitialized memory. (See NCAR/ccpp-physics#891). It is unreasonable to expect the code to be reproducible, or even work, when it has that bug.
This is holding up other work (NOAA-GSL/ccpp-physics#143) which changes a totally unrelated files. That unrelated change is just enough to put unphysical data in the uninitialized variables. Valid development work should not be held up by a broken regression test.
PR Checklist
This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.
This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR
An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes in contained in the PR. The Issue and the dependent sub-component PR
are specified below.
Results for one or more of the regression tests change and the reasons for the changes are understood and explained below.New or updated input data is required by this PR. If checked, please work with the code managers to update input data sets on all platforms.Instructions: All subsequent sections of text should be filled in as appropriate.
The information provided below allows the code managers to understand the changes relevant to this PR, whether those changes are in the ufs-weather-model repository or in a subcomponent repository. Ufs-weather-model code managers will use the information provided to add any applicable labels, assign reviewers and place it in the Commit Queue. Once the PR is in the Commit Queue, it is the PR owner's responsiblity to keep the PR up-to-date with the develop branch of ufs-weather-model.
Description
The regional_spp_sppt_shum_skeb test in rt.conf is commented out and surrounded by comments explaining why.
Issue(s) addressed
This is a temporary solution to NCAR/ccpp-physics#891 to allow other development to continue, until the stochastic thompson scheme developers can fix the bug.
Testing
This simply removes a regression test, so there is no need to run regression tests.
Dependencies
None.