GFS v16.3 retro parallel for implementation #951
Status update from DA: issues, diagnostics, solutions, and moving forward

Background: The gfs.v16.3.0 retrospective parallel started from 2021101518z on Cactus. So far we have about 3-4 weeks of results. The overall forecast skill shows degradation in the NH. The DA team investigated to look for possible causes and solutions. The run configured and maintained by @lgannoaa has been very helpful to the DA team in spotting a couple of issues in the gfsda.v16.3.0 package.

Issues, diagnostics, bug fixes, and tests:
(1) An initialization problem for satellite bias correction coefficients was found for sensors whose coefficients are initialized from zero. The quasi-mode initialization procedure was skipped due to a bug merged from GSI develop into gfs.v16.3.0. The issue and diagnostics are documented in GSI issue #438. A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix.
(2) Increasing NSST biases and RMS of O-F (no bias correction) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels of the hyperspectral sensors (IASI, CrIS). Foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from GSI was not being passed into the global cycle properly. The issue and diagnostics are documented in detail in GSI issue #449, and the bug fix in GSI PR #448.

Test: A short gfs.v16.3.0 real-time parallel (starting from 2022061918z; v163ctl) with the bug fixes from (1) and (2) is currently running on Dogwood to verify them; we will keep it running for a few days. Here is the link to the verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/

We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.
NCO announced that Cactus will become the dev machine on the evening of Aug 4th. The retro will start with CDATE=2021101518.
The retro started on the evening of Aug 4th.
The retro paused on CDATE=2021101900 on the morning of Aug 5th due to HPSS transfer slowness, which caused high COM usage.
Tag: @emilyhcliu @dtkleist @junwang-noaa |
Cactus had multiple system issues on the evening of Aug 5th, including job submission problems, missing jobs, zero-size files, and disappearing archive jobs. Multiple reruns and cleanups were performed. Resumed on CDATE=2021101900.
A Cactus file system issue caused the para check job to fail. Example message:
This parallel resumed on the morning of Aug 8th.
8/9: Increased the eupd job wall-clock limit by 10 minutes because the job failed multiple times on wall clock.
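In an ecflow/PBS Pro setup like this one, a wall-clock bump is typically a one-line change to the job card's resource request. The values below are illustrative only, not the actual eupd limits:

```shell
# Illustrative PBS job-card fragment (hypothetical values, not the real eupd card):
# extend the wall-clock limit by editing the walltime resource request.
#PBS -l walltime=00:40:00
```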
Transfer speed remained slow overnight on 8/9. By the morning of 8/10, PTMP reached its critical limit because archive jobs could not finish. The parallel is paused on CDATE=2021102512 until the transfer jobs finish.
The parallel resumed on CDATE=2021102518 for one cycle. It will be paused on CDATE=2021102600 in preparation for WAFS testing. Starting with CDATE=2021102518, post uses the new tag upp_v8.2.0 (02086a8).
The parallel paused on CDATE=2021102806 due to system errors in archive jobs and high PTMP usage.
Reran a few zombie archive jobs to keep the parallel going; PTMP cleanup continues. As of 10:00 EST 8/13, the current CDATE=2021103106.
PTMP filled up last night and the parallel paused for a few hours. It resumed at CDATE=2021110406.
WAFS testing is now complete. The code manager checked the output and log and found no issues.
The GEMPAK and AWIPS downstream code managers checked the output and log of a 00Z test and found no issues.
The BUFR sounding code manager checked the output and log of a 00Z test and found no issues.
The parallel paused for a few hours due to a transfer job system issue. After rerunning 34 jobs, it resumed on CDATE=2021110612.
Emergency failover of production to cactus. This parallel is now paused in preparation to run on white space. |
This parallel is resumed. |
A zombie gfs fcst job was found. Reran it using restart.
This parallel is paused due to a production switch. Archive job reruns are in progress.
The gfs_wave_post_bndpntbll job has continued to hit its wall-clock limit since late August 17th, impacting all 4 cycles of PDY=20211107.
@lgannoaa I will look into these jobs, but they are known to be highly reactive to file system issues and in general have longer run times for us than for NCO. I'm looking to see if there are any red flags, but likely the wall-clock time just needs to be extended, and these jobs should re-run to completion within the longer limit. @emilyhcliu @JessicaMeixner-NOAA This parallel is on pause because this job continues to fail for all cycles.
@lgannoaa @emilyhcliu The parallel is now resumed on CDATE=2021110812 with the gfs_wave_post_bndpntbll jobs turned off for all four cycles.
Sounds like a good plan @emilyhcliu |
The parallel resumed on CDATE=2021112100 after the PTMP-full issue was resolved.
The gfs_wave_post_bndpntbll jobs for CDATE=2021112212 and 2021112218 were turned on for a test requested by the helpdesk to help debug the job failure issue.
We still see impacts during the night when production transfer jobs take higher priority. Some of our transfer jobs get cancelled by the HPSS system due to slow transfer speed. The HPSS helpdesk acknowledged the ticket, so the failed-transfer-job issue is here (on Cactus) to stay.
@XuLi-NOAA It looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.
It has been moved to #952 . |
Reran 35 archive jobs due to previously known system issues.
Management requested running a full cycle with the library updates in GFSv16.3.0. In preparation, the following modifications are planned:
As of the morning of Sep 7th, the full cycle test is complete. The one exception is the gempak job, which does not have canned data.
Management has decided to update the GSI and model packages. The GSI package is ready; the model package is still pending. This parallel is paused on CDATE=2022010700 to check out and build the GSI package.
During the switchover between using only the library/bufr_ver updates and the updated GSI, the CRTM version update was left out. CRTM is now updated from the old 2.3.0 to 2.4.0, and GSI has been rebuilt with CRTM 2.4.0. This parallel is rerunning from 2022010600.
For the retrospective run, we will rewind 14 days and restart on the 2022010600 cycle. |
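For bookkeeping, a rewound restart cycle can be computed directly from a CDATE in the YYYYMMDDHH convention used throughout this parallel. A minimal sketch, assuming GNU date (the `rewind_cdate` helper name is hypothetical):

```shell
#!/bin/bash
# Compute the CDATE obtained by rewinding a parallel N days.
# Assumes GNU date; CDATE uses the YYYYMMDDHH convention.
rewind_cdate() {
  local cdate=$1 ndays=$2
  local pdy=${cdate:0:8} cyc=${cdate:8:2}
  date -u -d "${pdy} ${cyc}:00 ${ndays} days ago" +%Y%m%d%H
}

# Example: a parallel at the 2022012000 cycle rewound 14 days
rewind_cdate 2022012000 14   # prints 2022010600
```

The example cycle 2022012000 is illustrative; the source only states that the restart lands on 2022010600.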
There was an emergency production switch on the morning of 9/21. Fifteen archive jobs, METplus jobs, and regular jobs failed due to the switch; debug/rerun/recovery is in progress. The affected cycles are CDATE=2022020100, 2022020106, and 2022020112. The ecen job for 2022020112 failed; debugging traced the failure to the previous cycle's 2022020106 job, which was corrupted by the production switch. Therefore this parallel was rewound two cycles and rerun from 2022020106, which resolved the issue.
NCO executed a production switch on 9/22. Cactus is now back to the dev machine. |
A safety check is now in place to stop the parallel if there is an unplanned production switch. This approach will reduce the chance of cycle corruption.
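The source does not show the actual check, but the idea can be sketched as a pre-cycle guard that halts the workflow when the machine it runs on has become production. Everything below is an assumption for illustration: the `check_not_prod` helper name and the idea of a site file that names the current production machine.

```shell
#!/bin/bash
# Sketch of a pre-cycle safety check (hypothetical, not the actual script):
# refuse to continue if this machine is now the production machine.
check_not_prod() {
  local prodfile=$1   # assumed file naming the current production machine
  local here=$2       # short name of the machine we are running on
  if grep -qi "$here" "$prodfile"; then
    echo "ERROR: $here is now production; halting parallel" >&2
    return 1
  fi
  return 0
}
```

A workflow would call this at the top of each cycle and suspend the suite on a non-zero return, so an unplanned switch stops jobs before they can corrupt a cycle.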
Question for @lgannoaa and @emilyhcliu Should we find We have |
An online change to the ecflow workflow is in place to copy the gdas_atmos_enkf_chgres_${cyc}.o$$ log file to /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}
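The log-copy step described above amounts to a short shell fragment. A minimal sketch, assuming the paths from the comment; the `copy_chgres_log` helper and its parameterized destination are illustrative, not the actual ecflow edit:

```shell
#!/bin/bash
# Sketch (hypothetical helper): copy a chgres job log into the parallel's
# per-cycle COM logs directory, creating the directory if needed.
copy_chgres_log() {
  local logfile=$1    # e.g. gdas_atmos_enkf_chgres_${cyc}.o$$
  local comlogs=$2    # e.g. .../retro1-v16-ecf/para/com/gfs/v16.3/logs
  local pdy=$3 cyc=$4
  local dest="${comlogs}/${pdy}${cyc}"
  mkdir -p "$dest"
  cp -p "$logfile" "$dest/"
}
```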
Thank you, @lgannoaa . It will be good to have the |
This parallel is paused on CDATE=2022021606. This cycle will be rerun with three updates:
NOAA-EMC/GSI tag |
Note: 48-h HPSS downtime Oct 4-5 will likely negatively impact all parallels for many days as NCO will need to catch up on their own transfers once service is restored. |
A check of retro1-v16-ecf |
Effective CDATE=2022021606
As indicated in an email exchange between obsproc (@ilianagenkova) and NCO, a group decision has been made on the following change to the ecflow workflow obsproc prep jobs:
Parallel is resumed. |
This parallel is paused on CDATE=2022031212 because the HPSS outage caused PTMP to reach 95% full.
The /lfs/h2/emc/stmp usage on Cactus is 96.7%. This parallel will be paused on CDATE=2022032400 until stmp space is cleaned up by the developers.
Many jobs failed over the weekend of 10/16 because the group STMP filled up. Recovery is in progress. Currently on CDATE=2022060412.
Management has decided to stop the retro parallel. The last cycle will be CDATE=2022041400.
This parallel reached CDATE=2022041400. The METplus cron is removed.
Renamed COM in preparation for deletion:
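Renaming before deleting follows a common cleanup pattern: `mv` is near-instant within a filesystem, so the slow `rm -rf` then runs on a path no job or user is still watching. A minimal sketch; the `retire_com` helper and the `.delete` suffix are assumptions for illustration:

```shell
#!/bin/bash
# Sketch (hypothetical helper): retire a COM tree by renaming it out of the
# way first, then deleting the renamed copy.
retire_com() {
  local com=$1
  mv "$com" "${com}.delete" && rm -rf "${com}.delete"
}
```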
COM removed. Closing this ticket.
Description
This issue documents the GFS v16.3 retro parallel for implementation. Reference: #776
@emilyhcliu is the implementation POC.
The configuration for this parallel is:
First full cycle starting CDATE (retro): 2021101518
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: retro1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/retro1-v16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/retro1-v16-ecf
FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)
obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)
prepobs
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)
HOMEMET
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3
METplus
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1
verif_global
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)