GFS v16.3 realtime parallel for implementation #952
The realtime parallel started on August 3rd on Dogwood.
On the evening of August 3rd, the HPSS transfer speed became very slow and many archive jobs were waiting in queue. There is a scheduled Dogwood outage on Aug. 4th. Therefore, the realtime parallel is paused on CDATE=2022062112 to allow all HPSS transfer jobs to complete. The parallel will be resumed/recovered when the machine returns from the scheduled outage.
On the morning of Aug. 4th, NCO performed an emergency production switch; Dogwood is now the production machine. This parallel is halted on CDATE=2022062112.
Tag: @emilyhcliu @dtkleist @junwang-noaa
Multiple archive jobs failed due to a system issue on the evening of Aug. 5th.
This parallel was resumed on the morning of Aug. 8th.
Management decided on Aug. 8th to rerun this parallel starting with CDATE=2022073118. All jobs have been killed and cleanup is in progress.
The realtime run with the new starting CDATE=2022073118 started on Dogwood (8/9 3:00p).
As of 8/10 9:00a, the parallel is on CDATE=2022080218, so current performance is around 2:30 per cycle, or 9 cycles a day.
There are many transfer job failures caused by system issues. Example of a job that failed on a system issue:
The transfer speed remains slow, with over 60 archive jobs in queue. This parallel is now paused on CDATE=2022080318 to let the archive jobs finish.
The post is now using updated tag upp_v8.2.0 (02086a8) and wafs is using tag gfs_wafs.v6.3.1 (da909f), effective CDATE=2022080318. The wafs will remain turned off for this parallel.
A Dogwood system issue caused jobs to fail with:
Reran the 55 archive jobs that failed with the hpss_WriteList error -5000 system error; the failed jobs can be picked out of the logs as sketched below. Currently on CDATE=2022080418 as of 10AM Aug. 12th.
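For the record, a minimal sketch of how the failed archive jobs can be identified from their logs, assuming the log naming convention under the parallel's output directory (the `*arch*.o*` glob is an assumption):

```bash
#!/bin/bash
# List archive jobs whose logs contain the HPSS write error, so they can be requeued.
# LOGDIR comes from the "log" path in the description; the *arch*.o* glob is an assumption.
LOGDIR=/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today

grep -l "hpss_WriteList error -5000" "${LOGDIR}"/*arch*.o* 2>/dev/null |
while read -r logfile; do
  # Strip the directory and the PBS job-id suffix (.o<jobid>) to recover the job name.
  basename "${logfile}" | sed 's/\.o[0-9]*$//'
done | sort -u
```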
As of 5:00p on Sunday 8/14, this parallel is on CDATE=2022080712. Estimated performance is around 3 model days per calendar day.
The realtime parallel is paused on CDATE=2022080918 due to a high archive queue and transfer job slowness.
The parallel is now resumed on CDATE=2022080918.
Emergency failover of production to Cactus. This parallel is now paused in preparation to run on the dev machine.
This parallel resumed last night, but a zombie job caused it to halt. The condition is now resolved and the parallel is resumed.
efcs grp 21 failed with a system issue. A rerun is in progress.
This parallel is paused due to a production switch. An archive job rerun is in progress.
The EUPD job for CDATE=2022081612 failed. Debugging and a rerun are ongoing. The parallel is paused as of 8/20.
The gdas analysis job for CDATE=2022081612 failed. Log file: /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today/gdas_atmos_analysis_12.o16190817
The log file indicates:
A check confirms that neither oznstat nor radstat files are present. Suggest a rewind and rerun of the 2022081612 enkfgdas select_obs, diag, and update jobs.
A rewind of the 2022081612 enkfgdas select_obs, diag, and update was done. The job failed with the same issue.
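For reference, a rewind of this kind is typically done through ecflow_client; a minimal sketch, assuming a suite layout like the hypothetical node path below:

```bash
#!/bin/bash
# Rewind the 2022081612 enkfgdas tasks: suspend, force-requeue, then resume.
# The family path is hypothetical; substitute the suite's actual enkfgdas node.
FAMILY=/rt1-v16-ecf/enkfgdas/12

for task in select_obs diag update; do
  ecflow_client --suspend "${FAMILY}/${task}"
  ecflow_client --requeue=force "${FAMILY}/${task}"  # force requeues even aborted tasks
  ecflow_client --resume "${FAMILY}/${task}"
done
```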
Thanks @lgannoaa. This is indeed interesting. Upon closer examination, the enkfgdas_select_obs_12 job log files indicate a mismatch between the assimilation window and the dump time for several GDA dump files.
A comparison of the GDA and operational dumps, using omi as the example, shows an inconsistency. enkfgdas_select_obs_12 points at the correct omi dump file:
The rt1-v16-ecf file links to the correct GDA file:
However, the GDA file is not correct. A comparison of GDA and operations shows an inconsistency:
As enkfgdas_select_obs indicates, what GDA labels as
I wonder if this finding has any bearing on the gdas_atmos_analysis_12 failure. @KateFriedman-NOAA, the 2022081612 gdas GDA files are not correct. At least some are actually dump files for 2022081712. I'm not sure if this issue only affected 2022081612 gdas or also other dumps and/or cycles.
@KateFriedman-NOAA: other gdas cycles are also corrupted.
The temporal sequence of the date/time stamps for these files looks odd. A quick comparison approach is sketched below.
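One way to surface these mismatches is to compare names, sizes, and checksums of GDA files against the operational dump; a minimal sketch (both directory roots are assumptions, the GDA root patterned on the EMC dump archive path mentioned later in this thread):

```bash
#!/bin/bash
# Compare GDA dump files against the operational dump for one cycle.
# Both roots are assumptions; point them at the real GDA and ops dump directories.
GDA=/lfs/h2/emc/global/noscrub/emc.global/dump/gdas.20220816/12
OPS=/lfs/h1/ops/prod/com/gdas.20220816/12   # hypothetical ops location

for f in "${GDA}"/*omi*; do
  base=$(basename "${f}")
  # Differing checksums, or an mtime newer than the cycle, suggest a mislabeled file.
  ls -l --time-style=full-iso "${f}" "${OPS}/${base}"
  cksum "${f}" "${OPS}/${base}"
done
```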
I do not have production wcoss2 access. @lgannoaa kindly made three log files (enkfgdas_select_obs, enkfgdas_update, and enkfdiag) available on the development machine for me. In the enkfgdas_select_obs log file, I found the following:
This is just an example for ATMS. In this cycle, all read_obs_check calls failed (see the list attached below; a sketch for regenerating it from the log follows).
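A failure list like the attached one can be regenerated directly from the log; a minimal sketch, assuming a hypothetical name for the log copy staged on the dev machine:

```bash
#!/bin/bash
# Pull every read_obs_check failure out of the select_obs log and count duplicates,
# giving a per-message summary of which observation types were affected.
# The log file name is hypothetical; use the copy staged on the dev machine.
LOG=enkfgdas_select_obs_12.log

grep -i "read_obs_check" "${LOG}" | grep -i "fail" | sort | uniq -c | sort -rn
```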
Many archive jobs failed with a system issue:
The HPSS transfer slowness has finally come to an end. All archive jobs from the previous cycles were completed.
The HPSS speed improvement now looks solid on WCOSS2. This parallel is modified to write restart files to HPSS every day, effective CDATE=2022083006. The archiving step is sketched below.
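For reference, a minimal sketch of a daily restart push with htar; the COM and HPSS roots come from the description, while the RESTART glob and tar name are assumptions:

```bash
#!/bin/bash
# Archive one day's gdas restart directories to HPSS as a single tar file.
# COM and HPSSDIR are from the description; the glob and tar name are assumptions.
PDY=20220830
COM=/lfs/h2/emc/ptmp/Lin.Gan/rt1-v16-ecf/para/com/gfs/v16.3
HPSSDIR=/NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/rt1-v16-ecf

htar -cvf "${HPSSDIR}/gdas_restart_${PDY}.tar" "${COM}"/gdas.${PDY}/*/atmos/RESTART
```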
This parallel reached true realtime on CDATE=2022090106.
We still see impacts during the night when production transfer jobs take higher priority; some of our transfer jobs get cancelled by the HPSS system due to slow transfer speed. The HPSS helpdesk acknowledged the ticket. Therefore, the issue with failed transfer jobs is here (on Dogwood) to stay.
Management requested to run a full cycle with the library updates in GFSv16.3.0. In preparation, the following modifications are planned:
As of the morning of Sep. 7th, the full cycle test is completed.
Management has decided to update the GSI and model packages. The GSI package is ready; the model package is still pending. This parallel is paused on CDATE=2022090806 to check out and build the GSI package.
During the switch between using the library updates (bufr_ver only) and the updated GSI, the crtm version update was left out. The old crtm 2.3.0 is now updated to crtm 2.4.0, and GSI has been rebuilt with crtm 2.4.0. This parallel is being rerun from 2022090800.
For the real-time run, we will rewind 4 days and restart on the 2022090800 cycle.
This parallel has been on realtime since CDATE=2022091500.
There was an emergency production switch on the morning of 9/21. Thirty archive jobs failed, and some other jobs failed due to the switch. Debugging/rerun/recovery is in progress. The impacted jobs are in CDATE=2022092100 and 2022092106.
NCO executed a production switch on 9/22. Dogwood is now back to being the prod machine.
A safety check is now in place to stop the parallel if there is an unplanned production switch. This approach will reduce the chance of cycle corruption. The check is sketched below.
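A minimal sketch of such a safety check, assuming an NCO-style prodmachinefile; the file path, its primary:&lt;machine&gt; format, and the suite node path are all assumptions:

```bash
#!/bin/bash
# Suspend the suite if the production machine is no longer the one we expect.
# PRODFILE path/format and the suite node path are assumptions.
EXPECTED=cactus   # machine expected to hold production while this parallel runs
PRODFILE=/lfs/h1/ops/prod/config/prodmachinefile

primary=$(grep '^primary:' "${PRODFILE}" | cut -d: -f2)
if [[ "${primary}" != "${EXPECTED}" ]]; then
  echo "Unplanned production switch detected (primary=${primary}); suspending suite."
  ecflow_client --suspend /rt1-v16-ecf
fi
```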
Question for @lgannoaa and @emilyhcliu: Should we find ... We have ...
An online change to the ecflow workflow is in place to copy the gdas_atmos_enkf_chgres_${cyc}.o$$ log file to /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}
Thank you, @lgannoaa. It will be good to have the ...
A check of rt1-v16-ecf ...
Effective CDATE=2022092806
As indicated in an email exchange between obsproc (@ilianagenkova) and NCO, a group decision has been made for the following change to the ecflow workflow obsproc prep jobs: continue to point at the EMC dump archive /lfs/h2/emc/global/noscrub/emc.global/dump/gdas and gfs (not the gdasx/gfsx) locations for this parallel. The parallel is resumed.
It was discovered that the nsstbufr file from GDA gdasx/gfsx is not present for CDATE=2022100600. This cycle will be rerun.
VIIRS radiances are missing from the real-time run because NESDIS discontinued the VIIRS brightness temperatures (BTs) without prior notification to users. NESDIS began providing the VIIRS BTs product again starting on October 5, 2022. This leaves us very little time to test to ensure the product's quality. Therefore, we decided to turn the VIIRS BTs data to monitoring mode. Based on the decision made above, we are working on the following three things:
Notes:
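Separately, for reference: in GSI, moving a radiance data set to monitoring mode is done by setting its iuse flag to -1 in the satinfo file. A minimal sketch with awk; the file name and the "viirs" entry prefix are assumptions, so check the actual entry names:

```bash
#!/bin/bash
# Set VIIRS radiance entries to monitoring mode in the GSI satinfo file.
# In the satinfo layout (sensor/sat, channel, iuse, error, ...), iuse = -1 means monitor.
# File name and the "viirs" name prefix are assumptions.
awk '$1 ~ /^viirs/ { $3 = -1 } { print }' global_satinfo.txt > global_satinfo.monitor.txt
```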
Effective CDATE=2022101112, this parallel will point the obsproc directory at the NCO /lfs/h1/ops/para/com/obsproc location instead of the EMC dump archive.
This parallel was paused on CDATE=2022101112 to help NCO warm start the 30-day parallel, and it resumed on realtime after that. The wafs bufr.t00z gempak wmo files in COM were mirrored to Cactus for the code manager to review, from CDATE 2022101112 to 2022101406.
Due to missing NCO prepbufr files, this parallel has been paused since CDATE=2022102100.
Dogwood dbqs is currently suffering a system issue (errno=111). The realtime parallel is halted on CDATE=2022103118.
The METplus gplot will be a bit late because the control stat file for 20221101 is missing. This delay is due to the parallel production test.
The implementation is delayed until 11/30. The Dogwood white space jobs will be stopped after the implementation is successful.
This parallel is done as the GFS implementation is taking place. The last full cycle completed is CDATE=2022113000.
Description
This issue documents the GFS v16.3 realtime parallel for implementation. Reference: #776
@emilyhcliu is the implementation POC.
The configuration for this parallel is:
First full cycle starting CDATE is retro 2022073118
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: rt1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/rt1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/rt1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/rt1-v16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/rt1-v16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/rt1-v16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/rt1-v16-ecf
FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)
obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)
prepobs:
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)
HOMEMET:
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3
METplus:
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1
verif_global:
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)