HAFS nested tests failed on several platforms #1085

Closed
junwang-noaa opened this issue Mar 7, 2022 · 11 comments · Fixed by #1227

@junwang-noaa
Collaborator

Description

The RT test hafs_regional_telescopic_2nests_atm failed a couple of times on Orion when a new baseline was created. The baseline itself was created successfully, but the RT test against that baseline failed with this error:

Comparing atmf006.nc .........OK
Comparing sfcf006.nc .........OK
Comparing atm.nest02.f006.nc ............ALT CHECK......ERROR

It turned out that compare_ncfile.py failed when comparing atm.nest02.f006.nc from the baseline with the one from the RT test:

compare_ncfile.py atm.nest02.f006.nc /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/rt_410503/hafs_regional_telescopic_2nests_atm/atm.nest02.f006.nc
Traceback (most recent call last):
  File "/work/noaa/nems/emc.nemspara/autort/pr/867250832/20220304164615/ufs-weather-model/tests/compare_ncfile.py", line 14, in <module>
    if np.shape(nc1[varname][:])!=np.shape(nc2[varname][:]):
  File "netCDF4/_netCDF4.pyx", line 4408, in netCDF4._netCDF4.Variable.__getitem__
  File "netCDF4/_netCDF4.pyx", line 5352, in netCDF4._netCDF4.Variable._get
  File "netCDF4/_netCDF4.pyx", line 1887, in netCDF4._netCDF4._ensure_nc_success

When the test is rerun without baseline creation, the file comparison shows no issue.
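
The traceback shows the failure occurring while a variable is read in full (`nc1[varname][:]`). To see which variables in a suspect file cannot be read, a small helper along the following lines might be useful. This is only a sketch and not part of compare_ncfile.py: the function name `find_unreadable` is hypothetical, and it assumes only that each variable object supports `var[:]`, as netCDF4 variables do.

```python
from typing import Any, Dict, Mapping


def find_unreadable(variables: Mapping[str, Any]) -> Dict[str, str]:
    """Try a full read of every variable; return {name: error text} for failures."""
    failures: Dict[str, str] = {}
    for name, var in variables.items():
        try:
            var[:]  # the same kind of full read that fails in the traceback
        except Exception as exc:  # broad on purpose: the exact exception type is not shown in the thread
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical usage, assuming the netCDF4 package is installed:
# from netCDF4 import Dataset
# with Dataset("atm.nest02.f006.nc") as nc:
#     print(find_unreadable(nc.variables))
```

An empty result would mean every variable could be read; any entries name the variables whose data could not be retrieved, together with the error raised.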

To Reproduce:

On Orion, check out the model code, then:
cd ufs-weather-model/tests
./rt.sh -c -e
./rt.sh -m -e

The Orion log file shows:
FAILED TESTS:
Test hafs_regional_telescopic_2nests_atm 104 failed in run_test failed

@junwang-noaa added the bug label (Something isn't working) on Mar 7, 2022
@junwang-noaa
Collaborator Author

@jkbk2004 Would you please take a look at this issue on Orion? Thanks

@jkbk2004
Collaborator

jkbk2004 commented Mar 7, 2022

@junwang-noaa I will look into that on Orion.

@jkbk2004
Collaborator

jkbk2004 commented Mar 7, 2022

@junwang-noaa It seems ./rt.sh -c -e went through OK on Orion: /work/noaa/epic-ps/jongkim/rt-blcheck/stmp/jongkim/FV3_RT/REGRESSION_TEST_INTEL

@junwang-noaa
Collaborator Author

@jkbk2004 Yes, the baseline creation is OK. Can you test if "./rt.sh -m -e" can also run successfully? Thanks

@jkbk2004
Collaborator

@junwang-noaa I used the debug queue to test it. It seems to run OK: /work/noaa/epic-ps/jongkim/rt-blcheck/stmp/jongkim/FV3_RT/rt_53203/hafs_regional_telescopic_2nests_atm
Test 001 hafs_regional_telescopic_2nests_atm PASS Tries: 2

@junwang-noaa
Collaborator Author

@jkbk2004 Can you try the full RT to see if this is still an issue? Thanks

@jkbk2004
Collaborator

@junwang-noaa Full RT tests are also successful: /work/noaa/epic-ps/jongkim/UFS-RT-tests/rt-blcheck/tests. We can close the PR. If we see the issue later, we can re-open then.

@junwang-noaa
Collaborator Author

hafs_regional_storm_following_1nest_atm failed on Hera in PR #909.

@junwang-noaa changed the title from "RT test hafs_regional_telescopic_2nests_atm failed on Orion" to "HAFS nested tests failed on several platforms" on Apr 15, 2022
@BinLiu-NOAA
Contributor

@junwang-noaa, it looks to me like these HAFS nesting RT failures are related to corrupted/incomplete atm/sfc.nest??.fhhh.nc files (even though the file sizes look normal). Is it possible that the write grid component did not properly close the netCDF files? Any comments, suggestions, or ideas on what might cause these corrupted/incomplete history output files would be much appreciated. Thanks!

@junwang-noaa
Collaborator Author

@BinLiu-NOAA Code managers have been tracking this error for a while. This is what we observed:

  1. The error only happened when a new baseline was created; we haven't seen it in PRs run against an existing baseline. Every time, the baseline creation finished successfully and all the output files, including those created after atm.nest02.f006.nc, were copied to the baseline location. The error happened at the RT verification step, which showed that the atm.nest02.f006.nc file had some issue. When the baseline atm.nest02.f006.nc is regenerated, the RT test against it runs successfully. It is confirmed that moving the baseline is not the issue.
  2. It only happened to the atm.nest02.f006.nc file, and it could be any of the HAFS tests with this output.
  3. It happened on several platforms.
  4. It happened randomly.

Currently we suspect it might be a netCDF issue. It would be good to test with netCDF 4.8.1 (the current version is 4.7.4).
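
Item 1 above states that moving the baseline was ruled out, but the thread does not say how that was checked. One possible way to verify that a copied file is byte-identical to its source is a checksum comparison. The sketch below uses only the Python standard library; the function names and the paths in the usage comment are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if the two files have identical bytes (compared by SHA-256 digest)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Hypothetical usage: compare the file in the run directory with its baseline copy.
# print(same_content("run_dir/atm.nest02.f006.nc", "baseline_dir/atm.nest02.f006.nc"))
```

Matching digests would only show that the copy step did not change the file; they say nothing about whether the file was written correctly in the first place.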

@DeniseWorthen
Collaborator

The hafs_regional_storm_following_1nest_atm test on hera.intel was turned off in PR #909 (comment). We need a follow-up PR to switch from netcdf-parallel to netcdf in all the nested HAFS tests and then turn the test back on.
