[develop] Round 2 of overhaul to WE2E test suites (and other test improvements!) #732
…ush/ directory to list of valid tests via symlinks in new directory test_configs/default_configs/. Also include these new tests in the comprehensive suite.
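Registering the sample configs as tests via symlinks can be sketched as follows. This assumes a layout where ush/ and tests/WE2E/test_configs/default_configs/ share one repository root; the helper name link_default_configs and the use of relative links are illustrative assumptions, not the code from this PR.

```python
import os
from pathlib import Path
from typing import Iterable, List


def link_default_configs(
    repo_root: Path,
    names: Iterable[str] = ("config.community.yaml", "config.nco.yaml"),
) -> List[Path]:
    """Hypothetical helper: symlink sample configs from ush/ into default_configs/."""
    src_dir = repo_root / "ush"
    dest_dir = repo_root / "tests" / "WE2E" / "test_configs" / "default_configs"
    dest_dir.mkdir(parents=True, exist_ok=True)

    links = []
    for name in names:
        link = dest_dir / name
        # Relative target, so the link keeps working if the clone is moved
        target = os.path.relpath(src_dir / name, start=dest_dir)
        if not link.is_symlink():  # idempotent: skip links that already exist
            link.symlink_to(target)
        links.append(link)
    return links
```

Because the tests are links rather than copies, edits to the sample configs in ush/ are exercised by the WE2E suite automatically.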
…rectory, standardize function docstrings
…enerate_FV3LAM_wflow.py should always output descriptive error messages if invalid config.yaml is provided
- Fix chdir bug in test_retrieve_data.py
- Relax timeout and delay times for wget commands in retrieve_data.py
- Various minor code fixes
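Relaxing wget's timeout and delay times amounts to passing more forgiving network limits on the command line. A minimal sketch, assuming retrieve_data.py shells out to wget; the helper names and default values here are hypothetical placeholders, not the values chosen in this PR. The options themselves (--timeout, --tries, --waitretry, -O) are standard GNU Wget flags.

```python
import shlex
import subprocess
from typing import List


def build_wget_cmd(url: str, dest: str, timeout_s: int = 60,
                   tries: int = 3, waitretry_s: int = 10) -> List[str]:
    """Hypothetical helper: assemble a wget call with relaxed network limits."""
    return [
        "wget",
        f"--timeout={timeout_s}",      # DNS, connect and read timeout, in seconds
        f"--tries={tries}",            # number of attempts before giving up
        f"--waitretry={waitretry_s}",  # back off up to this many seconds between retries
        "-O", dest,
        url,
    ]


def download(url: str, dest: str, **kwargs) -> bool:
    """Run wget and report success; a non-zero exit code means failure."""
    cmd = build_wget_cmd(url, dest, **kwargs)
    print("Running:", shlex.join(cmd))
    return subprocess.run(cmd, check=False).returncode == 0
```

Keeping command construction separate from execution makes the limits easy to unit-test without network access.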
- Add test date for early RAP data with ICS
- Retrieve RAP 09z out to 45 hours
…V3GFS_suite_WoFS_v0 test
…like object, keep it as a string. This fixes NOMADS test error (and any other test using the "days_ago" template)
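Why keeping a rendered date as a string matters can be illustrated with a small sketch. It assumes that downstream code treats the rendered value as text (for example, slicing out YYYYMMDD); days_ago, postprocess and DATE_FMT are hypothetical stand-ins, not the actual config_parser.py code or template helper.

```python
from datetime import datetime, timedelta

DATE_FMT = "%Y%m%d%H"  # assumed cycle-date format, for illustration only


def days_ago(n: int) -> str:
    """Hypothetical stand-in for a "days_ago" template helper: returns a string."""
    return (datetime.now() - timedelta(days=n)).strftime(DATE_FMT)


def postprocess(rendered: str, keep_dates_as_str: bool = True):
    """Sketch of post-processing a rendered template value.

    With keep_dates_as_str=False this mimics the pre-fix behavior:
    a date-like string is converted into a datetime object.
    """
    if keep_dates_as_str:
        return rendered
    try:
        return datetime.strptime(rendered, DATE_FMT)
    except ValueError:
        return rendered  # not date-like: leave untouched


cycle = days_ago(1)
new_value = postprocess(cycle)                           # still a str
old_value = postprocess(cycle, keep_dates_as_str=False)  # a datetime
print("gfs." + new_value[:8])                            # string slicing works
try:
    print("gfs." + old_value[:8])
except TypeError as err:
    print("pre-fix behavior breaks string handling:", err)
```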
…"wontfix" test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16
…uite to its original purpose (small set of cheap tests to run on any machine)
…ite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8
…under 100 core hours
… for now, it is not working; see issue 731
…n_post tasks to use 12 nodes to resolve ufs-community#705
…occasional walltime-related failures
@MichaelLueken Thanks for keeping me updated. I have pushed a fix to the NOMADS test (I'm not sure why it was also including checks for HPSS and AWS). Note that I am not sure whether the NOMADS test should work on Gaea (since I cannot test it myself), so if it fails again I can move that test to another machine in the coverage set.
@mkavulich The majority of the Jenkins issues have been addressed. The EPIC role account has also been granted access to HPSS, but we don't currently have rstprod access. This is causing the On other machines, replacing
… fix errors in wget due to the built-in wget on Orion being quite old
@MichaelLueken Thanks for kicking off the tests again. Note that there may still be some failures on Orion: I discovered some more static data directories that do not have read permissions. Once that is fixed, the tests should succeed. I also added "wget" to the list of modules to load on Orion; this was a suggestion from the Orion helpdesk to solve some wget errors for the NOMADS test. This shouldn't affect the results, since the NOMADS test is not run on Orion for the coverage tests, but I thought I would mention it in case other failures pop up.
@mkavulich The GSMGFS test that is failing due to attempting to pull restricted data is On Gaea,
It might be better to replace the
…capability on Gaea; remove get_from_HPSS_ics_GSMGFS_lbcs_GSMGFS test due to data restrictions; developers have decided that this legacy capability is not worth fretting over
@MichaelLueken The changes are now in for running different tests (I also included a fix for a bug in the plotting script in #742); unless there is a failure I missed, the only coverage tests that should need re-running prior to merge are Gaea and Hera/GNU.
@mkavulich PR #736 was merged earlier this morning. This PR updated the test_retrieve_data.py unittest script. Please merge the latest develop into your branch to resolve the conflict. Thanks!
@mkavulich The latest testing has completed. All tests pass on both Gaea and Hera GNU. A rerun of Cheyenne Intel, however, is still showing two persistent failures (these failures were previously noted, but the cause of the error was blamed on directory naming). The
The Jenkins directory for this experiment is - The
The Jenkins directory for this experiment is - I can see that there is no
@mkavulich It looks like moving the
…improvement_round_2
…meoffset_suite_GFS_v16 test, data needs to be staged
…to Hera, intel due to bad libraries problem on Cheyenne
@MichaelLueken I have merged in the latest changes and updated the test files so the coverage tests should now succeed. I cannot figure out why the test_retrieve_data unit tests are failing. They run successfully on Hera (albeit very slowly due to the large file size). Do you have any insight into this? Without being able to replicate it, I don't know how to solve this problem.
@mkavulich Interestingly, while attempting to run the failing unit test on Hera using your branch, I'm seeing the same failures that I have noted below regarding the UFS-CASE-STUDY ICs and LBCs from AWS. Looking at the details for the failed
and
It appears as though the unit test is failing due to the inability to find the files that are supposed to be pulled from AWS. Looking at the file naming convention for the CAPE 2020 case study, the data being pulled should be correct:
Unfortunately, the DEBUG level prints aren't available in the log for the failed unit test, so there are no messages like:
At this point, my only guess is that the files the unit test is attempting to pull don't include
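Since this PR converts print_info_msg messages to logging.debug calls, whether DEBUG-level messages show up in a log depends entirely on the configured logging level. A minimal standard-library sketch; the logger name "we2e_sketch", the helper make_logger and the sample messages are hypothetical.

```python
import io
import logging


def make_logger(debug: bool, stream) -> logging.Logger:
    """Hypothetical setup: DEBUG messages are emitted only when debug=True."""
    logger = logging.getLogger("we2e_sketch")  # hypothetical logger name
    logger.handlers.clear()                    # avoid duplicate handlers on re-setup
    logger.propagate = False                   # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    return logger


for debug in (False, True):
    buf = io.StringIO()
    log = make_logger(debug, buf)
    log.debug("Checking for file on AWS: %s", "example.grib2")
    log.info("Retrieved all requested files")
    print(f"debug={debug}:\n{buf.getvalue()}")
```

With debug=False only the INFO line is written, which is why a log captured at the default level contains none of the DEBUG-level detail.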
@mkavulich The unit tests are now passing. I will retest Cheyenne Intel and Hera Intel, then merge this work. Thank you very much!
Okay, I think I figured out what was happening (though I still don't understand it). The new tests didn't have the "chdir" commands requested by @danielabdi-noaa, and for some reason those tests specifically were failing without the chdir commands (even though the others worked). The problem didn't appear for me because I was running those tests individually rather than as part of the full set of tests. I pushed a change to include those lines, and it seems to be working now. Regarding the older failures, the I have merged in the latest changes and swapped the
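Tests that only fail as part of the full set, and that end up running in nested subdirectories, are a typical symptom of a test calling os.chdir() without changing back. A sketch of the usual pattern with unittest, assuming each test works in its own scratch directory; the class and directory names are hypothetical and this is not the actual test_retrieve_data.py code.

```python
import os
import tempfile
import unittest


class RetrieveDataTestSketch(unittest.TestCase):
    """Hypothetical test case showing a chdir-safe setUp/tearDown pair."""

    def setUp(self):
        # Remember where we started, then work inside a fresh scratch directory
        self.orig_cwd = os.getcwd()
        self.tmp = tempfile.TemporaryDirectory()
        os.chdir(self.tmp.name)

    def tearDown(self):
        # Return to the starting directory first, so the next test does not
        # begin inside (and nest under) this test's directory; then clean up
        os.chdir(self.orig_cwd)
        self.tmp.cleanup()

    def test_download_dir(self):
        # Relative paths now resolve inside this test's own scratch directory
        os.makedirs("data", exist_ok=True)
        self.assertTrue(os.path.isdir(os.path.join(self.tmp.name, "data")))
```

An equivalent alternative is calling self.addCleanup(os.chdir, os.getcwd()) in setUp, which restores the directory even if setUp fails partway through.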
The unit tests have passed, and reruns of Cheyenne Intel and Hera Intel show that the tests pass (with the exception of the verification tests, outlined in issue #688). Merging this PR to develop now.
Renamed the
Note to code manager: The label run_we2e_fundamental_tests should be renamed to run_we2e_coverage_tests after this PR is merged.

DESCRIPTION OF CHANGES:
This PR continues the overhaul of WE2E test suites as described in Issue #587 (specifically stage 3 and parts of stage 4, as outlined in a comment on that issue). The changes are summarized below, roughly in order of importance.
- comprehensive.<platform>[.<compiler>] files are included to automatically run only the tests expected to succeed
- In config_parser.py, when populating a Jinja template, keep dates in string format rather than converting them to a datetime object (this fixes the problem with get_from_NOMADS_ics_FV3GFS_lbcs_FV3GFS)
- In the retrieve_data.py unit tests (test_retrieve_data.py), there was a bug causing all tests to run in nested subdirectories, which eventually led to failure when running all tests including HPSS retrieval
- test_retrieve_data.yaml
- The sample config files in the ush/ directory (config.community.yaml and config.nco.yaml) are now included as WE2E tests (symbolically linked in the tests/WE2E/test_configs/default_configs/ directory)
- Removed the "wontfix" test grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16 (WE2E test "grid_RRFS_CONUS_25km_ics_GSMGFS_lbcs_GSMGFS_suite_GFS_v16" fails with segmentation fault at run_fcst step #359). This is now an old capability with only legacy support (the global spectral model was retired in 2019), and there are no immediate plans to fix the bug.
- WE2E_summary*.txt files are now written to the experiment directory rather than tests/WE2E
- Changed test grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v16 to grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v17_p8 for coverage reasons
- Converted print_info_msg messages to logging.debug calls to allow suppression of superfluous output if desired

Notes on current test limitations
- run_envir="nco") results in random failures for WE2E tests #652 (as inherited from previous "fundamental" testing, the coverage tests for Hera with Intel are all run in NCO mode). These should be run with caution until these issues are resolved.
- grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta is currently a known failure, so it has been removed from the pool of tests for now. This problem is described in issue WE2E test grid_RRFS_NA_3km_ics_FV3GFS_lbcs_FV3GFS_suite_RRFS_v1beta fails at the run_fcst step #731

Type of change
TESTS CONDUCTED:
test_retrieve_data.py successfully

DEPENDENCIES:
None.
DOCUMENTATION:
Documentation for WE2E tests, including the table of test descriptions, has been updated. You can review the built documentation here: https://ufs-srweather-app-mkavulich.readthedocs.io/en/latest/WE2Etests.html
ISSUE:
retrieve_data.py will not run successfully on Jet, Hera #727

CHECKLIST