NRT with ecflow #54

Merged
merged 9 commits into from
Sep 17, 2024

@aerorahul (Contributor) commented Sep 13, 2024

This PR:

  • adds a suite definition template for running WAFS 4xdaily (wafs_nrt.def.tmpl)
  • adds cycle-specific time triggers, set as follows: 00z (0335), 06z (0935), 12z (1535), and 18z (2135)
  • updates the usage of dev/ecf/setup_ecf.sh to allow running either a single date (PDYcyc) or NRT.
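
The trigger times in the list above all fall 3 hours 35 minutes after the cycle time. A minimal sketch of that arithmetic, assuming the offset is the same for every cycle; the name `trigger_time` and the constant `NRT_OFFSET` are hypothetical and not part of this PR:

```python
from datetime import datetime, timedelta

# Offset inferred from the trigger times listed in the PR description
# (00z -> 0335, and so on); this is an assumption, not a value from the suite.
NRT_OFFSET = timedelta(hours=3, minutes=35)


def trigger_time(cyc: int, offset: timedelta = NRT_OFFSET) -> str:
    """Return the HHMM clock time at which a cycle would be released."""
    base = datetime(2024, 9, 13, cyc)  # the date is arbitrary; only the time matters
    return (base + offset).strftime("%H%M")


for cyc in (0, 6, 12, 18):
    print(f"{cyc:02d}z -> {trigger_time(cyc)}")
```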

Additionally, with this PR, the triggering of the UPP, GCIP, and GRIB jobs is determined by the existence of the atm, sfc, and master files from GFS and not just the log file. Without this, there were failures in the NRT run because some files were copied from the primary to the backup machine before others.
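
The idea of releasing a job only when every required GFS file is present can be sketched as below. This is an illustration only, not the logic in the manager script; the function names and the placeholder file names are assumptions.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def missing_inputs(paths: Iterable[Path]) -> List[Path]:
    """Return the required input files that are not on disk yet."""
    return [p for p in paths if not p.is_file()]


def ready_to_release(paths: Iterable[Path]) -> bool:
    """Release a job only when every required input file exists."""
    return not missing_inputs(paths)


# Demonstration in a scratch directory; the file names are placeholders,
# not the real GFS file names.
with tempfile.TemporaryDirectory() as tmp:
    required = [Path(tmp) / name for name in ("atm.f006", "sfc.f006", "master.f006")]
    required[0].touch()                  # only the atm file has arrived
    print(ready_to_release(required))    # sfc and master are still missing
    for p in required[1:]:
        p.touch()                        # the remaining files arrive
    print(ready_to_release(required))    # all three are present
```

Checking the full set, rather than a single log file, avoids releasing a job while part of its input is still in transit between machines.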

I started the parallel at the 12z cycle today 09/13/2024 on Dogwood.
The details are:

PACKAGEHOME: /lfs/h2/emc/eib/noscrub/rahul.mahajan/EE2/wafs.v7.0.0
COM: /lfs/h2/emc/ptmp/rahul.mahajan/wafsNRT/com/wafs/v7.0
OUTPUT: /lfs/h2/emc/stmp/rahul.mahajan/wafsNRT/output/
DATAROOT: /lfs/h2/emc/stmp/rahul.mahajan/wafsNRT/tmp/

Screenshot 2024-09-13 at 4 51 04 PM

I did hit some wallclock failures (UPP, GCIP jobs), but they succeeded on rewinding the jobs. We will likely need to examine why. The logs will be in the output directory above, if you wish to take a look.

@aerorahul (Contributor, Author) commented Sep 14, 2024

I am getting failures in realtime runs for the 18Z gcip f003 job that have this message:

*************************************************************
*** WARNING !! COULD NOT FIND GLOBCOMPVIS Satellite Data  ***
*************************************************************

One or more GLOBCOMPVIS Satellite Data files are missing, including
   /lfs/h1/ops/prod/dcom/20240913/mcidas/GLOBCOMPSIR.2024091321

wafs_gcip_f003_18 will gracfully exit

When this job was run much later, it succeeded. This can be seen in the timeline of this job, which ran 3 times. The second run was resubmitted immediately after the first failure. The third run (the successful one) came much later.
Screenshot 2024-09-13 at 10 05 29 PM
Is it possible that the DCOM data is not synced from the primary to the backup machine before the GFS data gets synced?

This should not be a problem in operations, when DCOM and GFS data are on the same primary machine at the same time.

If that is not the case, we will need a different mechanism to trigger the gcip jobs based on availability of dcom data.
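
One possible shape for such a mechanism is a bounded wait on the DCOM file. The sketch below is an assumption about how that could look, not something implemented in this PR; `wait_for_file` and its parameters are hypothetical names.

```python
import time
from pathlib import Path
from typing import Callable


def wait_for_file(path: Path, timeout_s: float, poll_s: float = 60.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until `path` exists or `timeout_s` seconds have elapsed.

    Returns True if the file appeared, False if the wait timed out.
    `clock` and `sleep` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if path.is_file():
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_s, max(0.0, deadline - clock())))
```

A caller would pass the satellite file path and a timeout, then decide whether to proceed or exit gracefully based on the return value.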

@YaliMao-NOAA (Collaborator) commented Sep 15, 2024

As described in /lfs/h1/ops/prod/packages/gfs.v16.3.17/ecf/defs/gfs_v16_3.def, GRIB1 is triggered by the GFS data itself; all other jobs have a time trigger:
gcip: 04:40
others: 04:30

@aerorahul (Contributor, Author) commented

Thanks @YaliMao-NOAA
It wasn't really clear, or explained anywhere, why that time trigger exists and for what purpose.
Slides 6 and 7 of the WAFS EE2 kickoff deck have only this information:
Screenshot 2024-09-15 at 11 26 55 PM
The release notes should capture this information, and the manual section of the ecf scripts should note it as well.

@aerorahul (Contributor, Author) commented

Until this morning's 06z cycle, the realtime parallel was running with a 1-day offset (i.e., PDY=PDYm1) to compensate for the failures in the GCIP jobs.
This is now reverted, and the GCIP jobs will be triggered at the appropriate time (0440 for the 00z cycle). It is not clear whether the GCIP data will be available on the backup system at this time. If it is not, the 0440 trigger will need to be deferred further.
The 12z run will be monitored and reported in this PR.
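
For reference, the 1-day offset mentioned above amounts to simple date arithmetic. A minimal sketch, assuming PDY is a YYYYMMDD string and PDYm1 is the previous calendar day; the helper name `pdy_minus` is hypothetical:

```python
from datetime import datetime, timedelta


def pdy_minus(pdy: str, days: int = 1) -> str:
    """Return the YYYYMMDD date that lies `days` days before `pdy`."""
    day = datetime.strptime(pdy, "%Y%m%d") - timedelta(days=days)
    return day.strftime("%Y%m%d")


print(pdy_minus("20240916"))  # the day before 16 Sep 2024
```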

@aerorahul (Contributor, Author) commented Sep 16, 2024

The 12z realtime run left the following message in
wafsNRT/com/wafs/v7.0/wafs.20240916/12/grib2/0p25/blending/wmo/wifs_0p25_admin_msg:

NOXX10 KKCI 161200
WAFC WASHINGTON ADVISES ALL USERS OF CB CLOUD, ICING AND TURBULENCE WAFS FORECASTS IN GRIB2 FORMAT AT 0.25 DEGREE THAT PRODUCTION PROBLEMS HAVE TRIGGERED CONTINGENCY MEASURES AND THE ISSUANCE OF NON-HARMONIZED FORECASTS.

STANDARD WAFS FORECAST PARAMETERS IN GRIB2 FORMAT (WIND, TEMPERATURE, HUMIDITY, TROP HEIGHT, MAX WIND, MAX WIND HEIGHT) ARE UNAFFECTED, AND ARE AVAILABLE AS NORMAL.

WAFC WASHINGTON APOLOGIZES FOR ANY INCONVENIENCE CAUSED DUE TO THIS ISSUE.

This happened because
/lfs/h1/ops/dev/dcom/test/20240916/wgrbbul/ukmet_wafs/egrr_wafshzds_unblended_turb_0p25_2024-09-16T12:00Z_t030.grib2 was not found during the run of the blending job (wafs_grib2_0p25_blending_f030_12.o189606243).

I checked for the above file on the primary machine, and it was not present there either.

@YaliMao-NOAA (Collaborator) commented

@aerorahul I didn't really know ecflow before you taught me. I am adding the time triggers to the kickoff slides. Could you please let me know your thoughts on UPP and GRIB1? Does the manager need a time trigger too?
https://docs.google.com/presentation/d/1yhdTfTHoBvV7K6jR2nfvkNAWn_eDJ2lTvDueRp9C89w/edit#slide=id.g2eeab8aa817_0_0

@YaliMao-NOAA (Collaborator) commented

Please take the time trigger into account. The blending process should start at T+4:30 and then waits up to 25 minutes, so it won't stop waiting until T+4:55. It is now 16:45, which is T+4:45 for 12z.
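
The window described in this comment can be checked with a short calculation. This is a sketch only; the constants repeat the T+4:30 start and 25-minute wait quoted above, and the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

START_OFFSET = timedelta(hours=4, minutes=30)  # blending starts at T+4:30
WAIT_WINDOW = timedelta(minutes=25)            # then waits up to 25 minutes


def blending_window(pdy: str, cyc: int) -> Tuple[datetime, datetime]:
    """Return (start, give_up) times of the blending wait for a cycle."""
    cycle = datetime.strptime(f"{pdy}{cyc:02d}", "%Y%m%d%H")
    start = cycle + START_OFFSET
    return start, start + WAIT_WINDOW


def minutes_left(now: datetime, pdy: str, cyc: int) -> int:
    """Whole minutes remaining before the blending job stops waiting."""
    _, give_up = blending_window(pdy, cyc)
    return max(0, int((give_up - now).total_seconds() // 60))


start, give_up = blending_window("20240916", 12)
print(start.strftime("%H:%M"), give_up.strftime("%H:%M"))
print(minutes_left(datetime(2024, 9, 16, 16, 45), "20240916", 12))
```

For the 12z cycle this gives a window from 16:30 to 16:55, so at 16:45 the job is still inside its waiting period.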

@aerorahul (Contributor, Author) commented

The 12z cycle finished with the above noted exceptions.
I will now shut down the near realtime parallel as it has served its purpose.

The new information about trigger timings should be addressed in a separate PR.

@aerorahul (Contributor, Author) commented

A capture of the 12z completion
Screenshot 2024-09-16 at 12 55 26 PM

The paths to the package, output, and com directories for this run are in the PR description.

@aerorahul (Contributor, Author) commented

WAFS_GFS_MANAGER does not need a time trigger in ops. (The time trigger is added for NRT because we cannot connect to the operational ecflow suite.) In ops, it will be triggered based on the JGFS_FORECAST job.
The WAFS_UPP and WAFS_GRIB jobs do not need time triggers. They are released based on the WAFS_GFS_MANAGER job.
The WAFS_GCIP job is released by an event trigger from WAFS_GFS_MANAGER AND a time trigger (the time trigger has been added to allow the data from DCOM to arrive).
WAFS_GRIB2_1P25 does not need a time trigger. It is triggered based on WAFS_UPP.
WAFS_GRIB2_0P25 does not need a time trigger. It is triggered based on WAFS_UPP.
WAFS_GRIB2_0P25_BLENDING may need a time trigger to account for data arriving from the UK, in addition to being triggered based on the status of WAFS_GRIB2_0P25.
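
The release rules in this comment can be summarized as a small lookup table. The sketch below only restates the comment as data; it is not the suite definition, and the `Release` type and `jobs_with_time_trigger` helper are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Release(NamedTuple):
    upstream: str      # job whose status or event releases this job
    time_trigger: str  # "no", "yes", or "maybe", as stated in the comment


# Operational view; in NRT the manager itself is released on a time trigger.
TRIGGERS: Dict[str, Release] = {
    "WAFS_GFS_MANAGER": Release("JGFS_FORECAST", "no"),
    "WAFS_UPP": Release("WAFS_GFS_MANAGER", "no"),
    "WAFS_GRIB": Release("WAFS_GFS_MANAGER", "no"),
    "WAFS_GCIP": Release("WAFS_GFS_MANAGER", "yes"),  # waits for DCOM data
    "WAFS_GRIB2_1P25": Release("WAFS_UPP", "no"),
    "WAFS_GRIB2_0P25": Release("WAFS_UPP", "no"),
    "WAFS_GRIB2_0P25_BLENDING": Release("WAFS_GRIB2_0P25", "maybe"),  # UK data
}


def jobs_with_time_trigger(include_maybe: bool = False) -> List[str]:
    """List the jobs that carry a time trigger in addition to their upstream."""
    wanted = {"yes", "maybe"} if include_maybe else {"yes"}
    return [job for job, rel in TRIGGERS.items() if rel.time_trigger in wanted]


print(jobs_with_time_trigger())
print(jobs_with_time_trigger(include_maybe=True))
```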

@aerorahul (Contributor, Author) commented

@YaliMao-NOAA
I added the blending time trigger based on the updated EE2 kickoff slide.

@YaliMao-NOAA (Collaborator) commented

Sorry, I didn't see that you were going to open a new PR to address the time triggers. I am merging this PR.

@YaliMao-NOAA YaliMao-NOAA merged commit 9396f2c into release/wafs.v7 Sep 17, 2024
@aerorahul aerorahul deleted the feature/nrt branch September 17, 2024 19:01
aerorahul added a commit that referenced this pull request Oct 8, 2024
* remove rdhpcs options (#42)

* remove hera/orion modulefiles. rename drivers without wcoss2 and remove detect_machine.sh ush scripts

* cleanup versions and make fix files not exec

* Update README.md

* EE2 review updates (#44)

* update wafs_upp to EE2

* update upp job per EE2 standards

* fix scripting errors

* ignore the dirty upp.fd directory as it creates build artifacts that are not captured in its .gitignore

* itag is not a namelist in this version of UPP.  Go Figure!

* remove copying of analysis master file, and move setting of some variables to exscript

* EE2 mods for grib2 1p25 and 0p25 (no blending)

* update blending scripts for EE2

* fix grib1 jobs

* apply EE2 fixes to gcip

* some more updates on gcip

* Bugfixes on previous PR that was merged prior to testing (#45)

* remove unnecessary hours for grib, the offline UPP executable should match EE2 convention, setting up ecflow for development use with multiple expids

* revert changes .gitmodules

* move upp.fd to wafs_upp.fd per EE2

* ensure git submodule update is performed in the right directory

* fix a couple of COMIN bugs

* update experiment paths

* update JWAFS_GFS_MANAGER so it is similar to all other jjobs

* bugfixes discovered while testing

* Copy the folder of upp parm to WAFS/parm after copying gtg.config.gfs from GTG repository to upp parm

* UPP didn't generate WAFS master file correctly. To fix it,
add a line (even blank) between 'flxfile' and '&nampgb' to UPP itag.

* Made the non-ecflow version back to work and added HOMEwafs flexibility

* only copy relevant UPP parm files to WAFS vertical structure

* update doco

* add draft of release_notes

---------

Co-authored-by: yali mao <yali.mao@clogin03.cactus.wcoss2.ncep.noaa.gov>

* Update script document blocks, bug fixes of previous PRs (#48)

* Change all command with "``" to "$()"

* GCIP doesn't need SENDDBN.

* Don't need wmo folder since GRIB2_0P25 products are not added WMO headers.

* Change back to {EXECwafs}/${pgm} from {DATA}/${pgm}

* For UPP, move environment variables from scripts to jobs

* Add SENDDBN_NTC to jobs and correct dbn_alert for SENDDBN_NTC and SENDDBN

* Update document blocks of the scripts

* For WAFS GRIB1 scripts, move definition of jobsuffix from ush/mkwfsgbl.sh to
script/exwafs_grib.sh since fhr doesn't have the same value.

* Add descriptions of JWAFS_GFS_MANAGER

---------

Co-authored-by: yali mao <yali.mao@dlogin08.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin07.dogwood.wcoss2.ncep.noaa.gov>

* Extend waiting time window of UK data to 25 minutes (#49)

Co-authored-by: yali mao <yali.mao@dlogin07.dogwood.wcoss2.ncep.noaa.gov>

* Adjust forecast hours up to 36 for the additional levels per AWC request (#50)

* AWC needs extra levels up to F036

* Update branch of UPP in .gitmodules

---------

Co-authored-by: yali mao <yali.mao@dlogin09.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin06.dogwood.wcoss2.ncep.noaa.gov>

* Update UPP tag to upp_wafs_v7.0.0 (#52)

* update UPP code revision to upp_wafs_v7.0.0

* Update UPP tag in .gitmodules

---------

Co-authored-by: yali mao <yali.mao@dlogin06.dogwood.wcoss2.ncep.noaa.gov>

* AWC request adjusted, blending wall time extended (#53)

* Extend the wall time of the job card for the additional 5 minutes of waiting UK data

* Modified scripts for the additional levels on the second request from AWC

---------

Co-authored-by: yali mao <yali.mao@dlogin06.dogwood.wcoss2.ncep.noaa.gov>

* NRT with ecflow (#54)

* add possibility of running in NRT

* depend on all GFS data, not just log files

* fix extensions to atm and sfc files

* GCIP jobs in addition to JWAFS_GFS_MANAGER, have a time trigger in NRT

* gcip time trigger can be anytime after the time specified

* remove GFS forecast job triggers for NRT and rely on time

* remove GFS job triggers in experimental runs.

* add time triggers for blending jobs based on PR review comments

* First version of Release Note for WAFS.v7.0.0 (#55)

* First version of Release Note for WAFS.v7.0.0

* Update docs/Release_Notes.md

Co-authored-by: Kate Friedman <kate.friedman@noaa.gov>

* Update docs/Release_Notes.md

Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>

* Update docs/Release_Notes.md

Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>

* Modified Release Notes from feedback from Rahul and Huiya

* Adjust a table in Release Notes

* Update Release Notes according to the WAFS separation kickout slides

---------

Co-authored-by: yali mao <yali.mao@dlogin01.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: Kate Friedman <kate.friedman@noaa.gov>
Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>
Co-authored-by: yali mao <yali.mao@dlogin07.dogwood.wcoss2.ncep.noaa.gov>

* Add ecflow manual text to .ecf files (#58)

Co-authored-by: yali mao <yali.mao@dlogin08.dogwood.wcoss2.ncep.noaa.gov>

* Remove processing for fhrs = 1,2,3,4,5 for UPP in WAFS (#59)

* remove processing of forecast hours 1-5 for UPP in WAFS

* unindent the task

* update exwafs_gfs_manager.sh for hrs 1-5 in upp

* Update release note and .ecf manuals (#60)

* Add more details to .ecf manuals of upp and grib2_0p25

* Update Release Notes of stopping WAFS master files when FFF is between [001-005]

* Update UPP com size after removing WAFS master files for forecast hours between [001-005]

---------

Co-authored-by: yali mao <yali.mao@clogin01.cactus.wcoss2.ncep.noaa.gov>

* Update blending script to send email when UK data is missing (#61)

* Update blending script to send email when UK data is missing
1. usonly.emailbody is differentiated for each forecast hour with missing UK data
2. Remove the condition of sending UK unblended data if US unblended data is missing. It won't happen because the job itself won't get triggered if US unblended data is missing

* Added an ecflow client test script

* Update dev/ecf/README.md

* Update ecf README.md

---------

Co-authored-by: yali mao <yali.mao@clogin03.cactus.wcoss2.ncep.noaa.gov>

* make the NRT suite repeat daily (#62)

* To fix bugzilla 1370 and 1371 for WAFS blending job, (#68)

1. change variable name 'maillist' to 'MAILTO'
2. assign the value in job cards instead of in scripts

Co-authored-by: yali mao <yali.mao@clogin05.cactus.wcoss2.ncep.noaa.gov>

* Change blending job to MPMD to fix bugzilla 1593. Fix bugzilla 1226 (#69)

* Change blending job to MPMD to fix bugzilla 1593, meanwhile fix bugzilla 1226

The MPMD change for bugzilla 1593 is for NCO who wants to receive one single email
combining all forecast hours with missing UK data

For bugzilla 1226, AWC is fine with dbn_alert of US unblended data earlier in JWAFS_GRIB2_0P25 job

Bugzilla 1593 -	Improve email notification for missing UK WAFS data
Bugzilla 1226 - Eliminate the duplicated dbn_alert for unblended gfs wafs data

* Update release note and modify the driver

* 1. If US unblended data is missing, don't quit silently, instead send out email and dbn_alert.
2. Add not-blended email and dbn_alert if both UK and US unblended files are missing
3. Change fhours from a string to an array
4. Bug fix and code improvement

* Update blending scripts

* Bug fix

* Update the way of handling err and removing np variable for MPIRUN

* Update scripts/exwafs_grib2_0p25_blending.sh

Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>

---------

Co-authored-by: yali mao <yali.mao@clogin09.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin03.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin07.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin05.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>

* Update ecflow after switching blending to MPMD parallel run. (#72)

* Update ecflow after switching blending to MPMD parallel run.
1. Don't need to setup ecflow links for blending
2. In ecflow definitions, change event triggers of each forecast hour to f048 of the upstream completion
Change COMROOT from 'com' to '%ENVIR%/com'

* Remove ecf/scripts/grib2/0p25/blending/jwafs_grib2_0p25_blending_f*.ecf from .gitignore

---------

Co-authored-by: yali mao <yali.mao@clogin05.cactus.wcoss2.ncep.noaa.gov>

---------

Co-authored-by: Rahul Mahajan <aerorahul@users.noreply.github.com>
Co-authored-by: yali mao <yali.mao@clogin03.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin08.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin07.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin09.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin06.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@dlogin01.dogwood.wcoss2.ncep.noaa.gov>
Co-authored-by: Kate Friedman <kate.friedman@noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin01.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin05.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin09.cactus.wcoss2.ncep.noaa.gov>
Co-authored-by: yali mao <yali.mao@clogin07.cactus.wcoss2.ncep.noaa.gov>