[develop] feat: predefined grid support for smoke/dust component #1190

benkozi
Collaborator

@benkozi benkozi commented Feb 3, 2025

DESCRIPTION OF CHANGES:

  • Adds support for SRW predefined grids to the smoke/dust component. NA grids currently fail in the prep tasks because the IC source is incorrect (they need GFS, not RAP/HRRR). Currently supported grids: CONUS 13km, CONUS 3km, CONUS 25km. Note that predefined grid data is currently staged only on Hera.
  • Smoke/dust task runs in parallel.
  • Adds unit tests.
  • Adds support for T1 platforms.
  • Scripts refactored to use an initialize-run-finalize strategy. Scripts have been broken into modules more reminiscent of a library.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

TESTS CONDUCTED:

  • derecho.intel
  • gaea.intel
  • gaea-c6.intel
  • hera.gnu
  • hera.intel
  • hercules.intel
  • jet.intel
  • orion.intel
  • wcoss2.intel
  • NOAA Cloud (indicate which platform)
  • Jenkins
  • fundamental test suite
  • comprehensive tests (specify which if a subset was used)

DOCUMENTATION:

ISSUE:

CHECKLIST

  • My code follows the style guidelines in the Contributor's Guide
  • I have performed a self-review of my own code using the Code Reviewer's Guide
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation. I have made corresponding changes to the documentation
  • My changes do not require updates to the documentation (explain).
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

CONTRIBUTORS (optional):

@chan-hoo @JohanaRomeroAlvarez @gspetro-NOAA

chan-hoo and others added 30 commits January 23, 2025 13:48
[feature/add_sd] Update TechDocs/ush to allow for the TechDocs to pass
…oncile' into feat/predefined-grid-support-reconcile
@MichaelLueken
Collaborator

@benkozi -

In order to add support to all Tier-1 platforms, please add:

load(pathJoin("nco", os.getenv("nco_ver") or "5.0.6"))
load(pathJoin("prod_util", os.getenv("prod_util_ver") or "2.1.1"))

to the following modulefiles:

  • build_gaea_intel.lua

These modifications are already present in build_hera_intel.lua, build_gaea-c6_intel.lua, build_hercules_intel.lua, and build_orion_intel.lua. In PR #1195, I have updated build_hera_gnu.lua and @EdwardSnyder-NOAA has updated build_noaacloud_intel.lua.
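
For readers less familiar with Lmod, the os.getenv("nco_ver") or "5.0.6" idiom in the load lines above means: take the version from the environment variable when it is set, otherwise fall back to the pinned default. A rough Python equivalent is sketched below. module_spec is a hypothetical helper used only for illustration; it is not part of the SRW App or of Lmod.

```python
import os


def module_spec(name, env_var, default, environ=None):
    """Return "<name>/<version>", reading the version from env_var if it is set.

    Hypothetical helper that mirrors
    load(pathJoin(name, os.getenv(env_var) or default)).
    """
    env = os.environ if environ is None else environ
    # As in the Lua expression, the default applies only when the variable is unset.
    version = env.get(env_var, default)
    return f"{name}/{version}"


# With an empty environment, both modules fall back to the pinned versions.
print(module_spec("nco", "nco_ver", "5.0.6", environ={}))              # nco/5.0.6
print(module_spec("prod_util", "prod_util_ver", "2.1.1", environ={}))  # prod_util/2.1.1
```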

It also looks like you will need to apply the changes that you made to the ush/machine/gaea-c6.yaml file to the other machine files as well. On Derecho, I'm seeing this while attempting to run generate_emissions.py:

+ 24 + mpirun -n '$nprocs' python /glade/derecho/scratch/mlueken/ufs-srweather-app/derecho/ush/smoke_dust/generate_emissions.py --staticdir /glade/work/epicufsrt/contrib/UFS_SRW_data/develop/fix/fix_smoke/RRFS_CONUS_3km --ravedir /glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs/../nco_dirs/test_smoke_dust/tmp/smoke_dust.2019072200.7820263.desched1 --intp-dir /glade/derecho/scratch/mlueken/ufs-srweather-app/expt_dirs/../nco_dirs/test_smoke_dust/tmp/DATA_SHARE/RAVE_fire_intp --predef-grid RRFS_CONUS_3km --ebb-dcycle 1 --restart-interval '6 12 18 24' --persistence False --rave-qa-filter none --exit-on-error True --log-level info

It fails without any message following this command.

On Gaea-C5, the error message is as follows:

+ 103 + mpirun -n '$nprocs' python /gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/ush/smoke_dust/generate_emissions.py --staticdir /gpfs/f5/epic/world-shared/UFS_SRW_data/develop/fix/fix_smoke/RRFS_CONUS_3km --ravedir /gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/expt_dirs/../nco_dirs/test_smoke_dust/tmp/smoke_dust.2019072200.135378377 --intp-dir /gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/expt_dirs/../nco_dirs/test_smoke_dust/tmp/DATA_SHARE/RAVE_fire_intp --predef-grid RRFS_CONUS_3km --ebb-dcycle 1 --restart-interval '6 12 18 24' --persistence False --rave-qa-filter none --exit-on-error True --log-level info
/gpfs/f5/epic/scratch/Michael.Lueken/ufs-srweather-app/gaeac5/scripts/exsrw_smoke_dust.sh: line 110: mpirun: command not found

These tests were run using ush/config.smoke_dust.yaml. I also need to run the WE2E tests to see how they behave on the various platforms.

@benkozi
Collaborator Author

benkozi commented Feb 21, 2025

@MichaelLueken - Regarding derecho and gaea-c5, it appears that the srw_sd conda environment is not loaded appropriately. For example, I get these paths for mpirun/mpiexec and python on derecho (I assume the same is happening on c5 too):

+ 23 + which mpirun
/opt/cray/pe/pals/1.2.11/bin/mpirun
+ 23 + which python
/glade/derecho/scratch/benkoz/sandbox/srw/ufs-srweather-app/conda/envs/srw_app/bin/python
+ 23 + which mpiexec
/opt/cray/pe/pals/1.2.11/bin/mpiexec

Do you know why this may be happening on these platforms before I dig further?

@MichaelLueken
Collaborator

@benkozi -

I think I might have an idea of what is happening. For most machines, setting:

RUN_CMD_SMOKE_DUST: mpirun -n $nprocs python

in ush/config_defaults.yaml is fine. However, Gaea-C5 and Derecho don't use mpirun. Gaea-C5 uses srun and Derecho uses mpiexec.

It's not clear to me how Gaea-C6 is able to successfully run the mpirun command, since it also uses srun rather than mpirun.
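
One way to avoid hard-coding a single launcher would be to pick whichever one is found on PATH. The following is only a sketch of that idea, not existing SRW code: build_run_cmd and its preference order are hypothetical, and it assumes that -n sets the task count for all three launchers.

```python
import shutil

# Hypothetical launcher templates; the launcher a platform really uses is site-specific.
LAUNCHERS = {
    "mpirun": ["mpirun", "-n", "{nprocs}"],
    "mpiexec": ["mpiexec", "-n", "{nprocs}"],
    "srun": ["srun", "-n", "{nprocs}"],
}


def build_run_cmd(nprocs, which=shutil.which, order=("mpirun", "mpiexec", "srun")):
    """Return the argument list for the first launcher in `order` found on PATH."""
    for name in order:
        if which(name) is not None:
            prefix = [arg.format(nprocs=nprocs) for arg in LAUNCHERS[name]]
            return prefix + ["python"]
    raise RuntimeError("no MPI launcher found on PATH: " + ", ".join(order))


if __name__ == "__main__":
    try:
        print(build_run_cmd(4))
    except RuntimeError as err:
        print(err)
```

Note that this only checks that a launcher exists; it does not check that the launcher belongs to the MPI library the Python environment was built against.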

@benkozi
Collaborator Author

benkozi commented Feb 24, 2025

@MichaelLueken - After a little investigation, it looks like we need to load python_srw_sd on the problem platforms, similar to https://github.com/ufs-community/ufs-srweather-app/blob/develop/modulefiles/tasks/orion/smoke_dust.local.lua#L1. I'll look into it. A minimal reproducer indicated that, with the correct conda env loaded, mpirun works as expected on derecho.
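
The check behind that diagnosis can be sketched in a few lines: resolve each executable and test whether it lives under the expected conda environment. This is an illustrative sketch only. resolves_inside is a hypothetical helper, the example paths are shortened stand-ins for the ones printed earlier in this thread, and symlinks are deliberately not resolved.

```python
import shutil
from pathlib import PurePosixPath


def resolves_inside(executable, env_prefix, which=shutil.which):
    """Return True if `executable` is found on PATH under `env_prefix`."""
    found = which(executable)
    if found is None:
        return False
    try:
        # relative_to raises ValueError when `found` is not below env_prefix.
        PurePosixPath(found).relative_to(PurePosixPath(env_prefix))
    except ValueError:
        return False
    return True


# Stand-in lookup table shaped like the `which` output seen on derecho.
seen = {
    "mpirun": "/opt/cray/pe/pals/1.2.11/bin/mpirun",
    "python": "/srw/conda/envs/srw_app/bin/python",
}
for exe in ("mpirun", "python"):
    ok = resolves_inside(exe, "/srw/conda/envs/srw_sd", which=seen.get)
    print(exe, "inside srw_sd env:", ok)
```

Both lookups report False for these stand-in paths, which matches the symptom: neither mpirun nor python comes from the srw_sd environment.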

@MichaelLueken
Collaborator

@benkozi -

It looks like you will need to make the following changes to allow smoke and dust to run on all Tier-1 platforms:

  1. Please add both smoke_dust.local.lua and prepstart.local.lua files from one of the currently supported modulefiles/tasks directories into both derecho and gaeac5.
  2. Please add load(pathJoin("nco", os.getenv("nco_ver") or "5.0.6")) to modulefiles/build_derecho_intel.lua:
load("srw_common")

load(pathJoin("nco", os.getenv("nco_ver") or "5.0.6"))
load(pathJoin("prod_util", os.getenv("prod_util_ver") or "2.1.1"))

setenv("CMAKE_Platform","derecho.intel")

With these changes, the smoke and dust tests work on Derecho:

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907220000               make_grid                     7842850           SUCCEEDED                   0         1          78.0
201907220000               make_orog                     7843054           SUCCEEDED                   0         1         363.0
201907220000          make_sfc_climo                     7843234           SUCCEEDED                   0         1         110.0
201907220000              smoke_dust                     7843294           SUCCEEDED                   0         1         192.0
201907220000               prepstart                     7843356           SUCCEEDED                   0         1          69.0
201907220000           get_extrn_ics                     7842852           SUCCEEDED                   0         1          58.0
201907220000          get_extrn_lbcs                     7842854           SUCCEEDED                   0         1          57.0
201907220000         make_ics_mem000                     7843295           SUCCEEDED                   0         1         334.0
201907220000        make_lbcs_mem000                     7843297           SUCCEEDED                   0         1         181.0
201907220000         run_fcst_mem000                     7843387           SUCCEEDED                   0         1        1235.0
201907220000    run_post_mem000_f000                     7843674           SUCCEEDED                   0         1         182.0
201907220000    run_post_mem000_f001                     7843673           SUCCEEDED                   0         1         197.0
201907220000    run_post_mem000_f002                     7843678           SUCCEEDED                   0         1         189.0
201907220000    run_post_mem000_f003                     7843675           SUCCEEDED                   0         1         211.0
201907220000    run_post_mem000_f004                     7843676           SUCCEEDED                   0         1         214.0
201907220000    run_post_mem000_f005                     7843677           SUCCEEDED                   0         1         210.0
201907220000    run_post_mem000_f006                     7843679           SUCCEEDED                   0         1         214.0
================================================================================================================================
201907220600              smoke_dust                     7843680           SUCCEEDED                   0         1         128.0
201907220600               prepstart                     7843826           SUCCEEDED                   0         1         166.0
201907220600           get_extrn_ics                     7842853           SUCCEEDED                   0         1          56.0
201907220600          get_extrn_lbcs                     7842855           SUCCEEDED                   0         1          57.0
201907220600         make_ics_mem000                     7843296           SUCCEEDED                   0         1         320.0
201907220600        make_lbcs_mem000                     7843298           SUCCEEDED                   0         1         132.0
201907220600         run_fcst_mem000                     7843900           SUCCEEDED                   0         1        1255.0
201907220600    run_post_mem000_f000                     7844299           SUCCEEDED                   0         1         186.0
201907220600    run_post_mem000_f001                     7844300           SUCCEEDED                   0         1         209.0
201907220600    run_post_mem000_f002                     7844301           SUCCEEDED                   0         1         217.0
201907220600    run_post_mem000_f003                     7844302           SUCCEEDED                   0         1         214.0
201907220600    run_post_mem000_f004                     7844304           SUCCEEDED                   0         1         214.0
201907220600    run_post_mem000_f005                     7844303           SUCCEEDED                   0         1         217.0
201907220600    run_post_mem000_f006                     7844305           SUCCEEDED                   0         1         221.0

and Gaea-C5:

       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907220000               make_grid                   135380312           SUCCEEDED                   0         1          45.0
201907220000               make_orog                   135380343           SUCCEEDED                   0         1         429.0
201907220000          make_sfc_climo                   135380360           SUCCEEDED                   0         1         131.0
201907220000              smoke_dust                   135380368           SUCCEEDED                   0         1         144.0
201907220000               prepstart                   135380374           SUCCEEDED                   0         1         101.0
201907220000           get_extrn_ics                    68779831           SUCCEEDED                   0         1          51.0
201907220000          get_extrn_lbcs                    68779832           SUCCEEDED                   0         1          58.0
201907220000         make_ics_mem000                   135380369           SUCCEEDED                   0         1         154.0
201907220000        make_lbcs_mem000                   135380370           SUCCEEDED                   0         1         112.0
201907220000         run_fcst_mem000                   135380381           SUCCEEDED                   0         1        1131.0
201907220000    run_post_mem000_f000                   135380389           SUCCEEDED                   0         1         158.0
201907220000    run_post_mem000_f001                   135380390           SUCCEEDED                   0         1         178.0
201907220000    run_post_mem000_f002                   135380391           SUCCEEDED                   0         1         173.0
201907220000    run_post_mem000_f003                   135380392           SUCCEEDED                   0         1         176.0
201907220000    run_post_mem000_f004                   135380393           SUCCEEDED                   0         1         186.0
201907220000    run_post_mem000_f005                   135380394           SUCCEEDED                   0         1         199.0
201907220000    run_post_mem000_f006                   135380395           SUCCEEDED                   0         1         207.0
================================================================================================================================
201907220600              smoke_dust                   135380396           SUCCEEDED                   0         1         133.0
201907220600               prepstart                   135380401           SUCCEEDED                   0         1          96.0
201907220600           get_extrn_ics                    68779833           SUCCEEDED                   0         1          78.0
201907220600          get_extrn_lbcs                    68779834           SUCCEEDED                   0         1          45.0
201907220600         make_ics_mem000                   135380378           SUCCEEDED                   0         1         135.0
201907220600        make_lbcs_mem000                   135380372           SUCCEEDED                   0         1         129.0
201907220600         run_fcst_mem000                   135380402           SUCCEEDED                   0         1        1124.0
201907220600    run_post_mem000_f000                   135380419           SUCCEEDED                   0         1         190.0
201907220600    run_post_mem000_f001                   135380421           SUCCEEDED                   0         1         204.0
201907220600    run_post_mem000_f002                   135380420           SUCCEEDED                   0         1         196.0
201907220600    run_post_mem000_f003                   135380422           SUCCEEDED                   0         1         201.0
201907220600    run_post_mem000_f004                   135380423           SUCCEEDED                   0         1         179.0
201907220600    run_post_mem000_f005                   135380424           SUCCEEDED                   0         1         181.0
201907220600    run_post_mem000_f006                   135380425           SUCCEEDED                   0         1         195.0
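
Status tables like the two above can also be checked mechanically rather than by eye. Below is a minimal sketch, assuming the whitespace-separated, seven-column row layout shown here. unfinished_tasks is a hypothetical helper and is not part of the workflow tooling.

```python
def unfinished_tasks(table_text):
    """Return (cycle, task, state) for every data row whose state is not SUCCEEDED."""
    bad = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have exactly 7 fields and begin with a numeric cycle;
        # the header and the "=====" separators do not.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        cycle, task, _jobid, state = fields[:4]
        if state != "SUCCEEDED":
            bad.append((cycle, task, state))
    return bad


SAMPLE = """\
       CYCLE                    TASK                       JOBID               STATE         EXIT STATUS     TRIES      DURATION
================================================================================================================================
201907220000              smoke_dust                     7843294           SUCCEEDED                   0         1         192.0
201907220000               prepstart                     7843356           SUCCEEDED                   0         1          69.0
"""

print(unfinished_tasks(SAMPLE))  # []
```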

@benkozi
Collaborator Author

benkozi commented Feb 24, 2025

That's great news @MichaelLueken! I pushed your recommended changes here: benkozi@8ecd735. Thanks for testing and identifying the necessary changes.

Collaborator

@MichaelLueken MichaelLueken left a comment

Thanks, @benkozi!

The retests have successfully passed on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250224134111        COMPLETE            1710.77
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1710.77

and Gaea-C5:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250224164212        COMPLETE            1907.48
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            1907.48

Approving now.

@benkozi
Collaborator Author

benkozi commented Feb 25, 2025

Thank you @MichaelLueken and @chan-hoo for the approvals!

@MichaelLueken MichaelLueken added the run_we2e_coverage_tests Run the coverage set of SRW end-to-end tests label Feb 25, 2025
@MichaelLueken
Collaborator

The Jenkins runner is currently down for Gaea-C6. Manually ran the Jenkins scripts on that platform. All coverage WE2E tests successfully passed:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
community_20250225140325                                           COMPLETE              22.41
custom_ESGgrid_NewZealand_3km_20250225140325                       COMPLETE              73.62
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              36.41
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20250225140  COMPLETE              39.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2025022514  COMPLETE              37.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             528.24
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2025022  COMPLETE              30.06
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             422.00
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              10.88
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250225140330        COMPLETE            1178.06
2020_CAPE_20250225140331                                           COMPLETE              39.94
2020_easter_storm_20250225140331                                   COMPLETE              39.91
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2458.87
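
As a cross-check, the per-experiment core hours in the summary above do add up to the reported total. A small sketch of that check follows, assuming each data row is "name status hours" separated by whitespace. core_hours is a hypothetical helper, not part of the WE2E tooling, and Decimal is used so the sum is exact.

```python
from decimal import Decimal, InvalidOperation


def core_hours(summary_text):
    """Return ({experiment: hours}, reported_total) parsed from a summary table."""
    rows, total = {}, None
    for line in summary_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # header and dashed separator lines
        name, _status, hours = fields
        try:
            value = Decimal(hours)
        except InvalidOperation:
            continue
        if name == "Total":
            total = value
        else:
            rows[name] = value
    return rows, total


SUMMARY = """\
community_20250225140325                                           COMPLETE              22.41
custom_ESGgrid_NewZealand_3km_20250225140325                       COMPLETE              73.62
grid_RRFS_CONUScompact_13km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_2  COMPLETE              36.41
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_RAP_20250225140  COMPLETE              39.42
grid_RRFS_CONUS_13km_ics_FV3GFS_lbcs_FV3GFS_suite_HRRR_2025022514  COMPLETE              37.92
grid_RRFS_CONUS_3km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15_thompson  COMPLETE             528.24
grid_RRFS_CONUScompact_25km_ics_HRRR_lbcs_HRRR_suite_HRRR_2025022  COMPLETE              30.06
grid_RRFS_CONUScompact_3km_ics_HRRR_lbcs_RAP_suite_RRFS_v1beta_20  COMPLETE             422.00
grid_SUBCONUS_Ind_3km_ics_RAP_lbcs_RAP_suite_RRFS_v1beta_plot_202  COMPLETE              10.88
smoke_dust_grid_RRFS_CONUS_3km_suite_HRRR_gf_20250225140330        COMPLETE            1178.06
2020_CAPE_20250225140331                                           COMPLETE              39.94
2020_easter_storm_20250225140331                                   COMPLETE              39.91
Total                                                              COMPLETE            2458.87
"""

rows, total = core_hours(SUMMARY)
print(sum(rows.values()) == total)  # True
```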

Awaiting the rest of the Jenkins automated tests before moving forward.

@MichaelLueken
Collaborator

The UFS_FIRE WE2E tests have successfully passed via Jenkins srw-fire-aqm pipeline on Hercules.

The AQM WE2E test, however, is failing in both nexus_emission and point_source on Hercules. In point_source, the failure is due to necessary emissivity fixed files no longer present in the EPIC fix space:

FileNotFoundError: [Errno 2] No such file or directory: '/work/noaa/epic/role-epic/contrib/UFS_SRW_data/develop/fix/fix_emis/NEI2016v1'

For nexus, the failure appears to be associated with loading the AQM nexus_emission task modulefile:

+ 28 + eval srun --export=ALL /work/noaa/epic/mlueken/ufs-srweather-app/hercules/exec/nexus -c NEXUS_Config.rc -r grid_spec.nc -o NEXUS_Expt_split.nc
+ 82 + postamble load_modules_run_task.sh 1740517759 137
+ 82 + set +x
End load_modules_run_task.sh at Tue Feb 25 21:10:41 UTC 2025 with error code 137 (time elapsed: 00:01:22)
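
As a side note on the error code: by the usual shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 137 corresponds to signal 9. The sketch below decodes that on a POSIX system. describe_exit_code is a hypothetical helper, and the decoding says nothing about who sent the signal or why.

```python
import signal


def describe_exit_code(code):
    """Describe a shell-style exit status (assumes the 128 + signal convention)."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not defined on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with status {code}"


print(describe_exit_code(137))  # terminated by SIGKILL (on Linux)
```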

We can't move forward with these changes until the issues with the AQM WE2E test have been resolved.

@benkozi
Collaborator Author

benkozi commented Feb 25, 2025

The AQM WE2E test, however, is failing in both nexus_emission and point_source on Hercules. In point_source, the failure is due to necessary emissivity fixed files no longer present in the EPIC fix space:

@MichaelLueken - I reached out to the US/SI team for help creating the links on orion/hercules. I also reached out to US/SI about derecho's fix_emis staging.

@benkozi
Collaborator Author

benkozi commented Feb 25, 2025

@MichaelLueken - @EdwardSnyder-NOAA updated links on hercules and derecho. Let me know if you encounter any more issues with the AQM test.

@MichaelLueken
Collaborator

The Jenkins automated tests have successfully completed for all machines.

The AQM WE2E tests have successfully passed on Derecho:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used 
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20250226072021                   COMPLETE            3366.13
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            3366.13

and Hera:

----------------------------------------------------------------------------------------------------
Experiment name                                                  | Status    | Core hours used
----------------------------------------------------------------------------------------------------
aqm_grid_AQM_NA13km_suite_GFS_v16_20250225220912                   COMPLETE            2903.74
----------------------------------------------------------------------------------------------------
Total                                                              COMPLETE            2903.74

Hercules and Orion are undergoing maintenance, so tests can't be run on those platforms.

Moving forward with merging this work.

@MichaelLueken MichaelLueken merged commit 33f4587 into ufs-community:develop Feb 26, 2025
5 of 6 checks passed