
GFS v16.3 retro parallel for implementation #951

Closed
lgannoaa opened this issue Aug 2, 2022 · 67 comments

@lgannoaa
Contributor

lgannoaa commented Aug 2, 2022

Description

This issue documents the GFS v16.3 retro parallel for implementation. Reference: #776
@emilyhcliu is the implementation POC.

The configuration for this parallel is:
First full cycle starting CDATE (retro): 2021101518
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: retro1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/retro1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/retro1-v16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/retro1-v16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/retro1-v16-ecf

FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)

obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)

prepobs
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)

HOMEMET
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3

METplus
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1

verif_global
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)
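
For reference, version stamps like the ones above can be reproduced by asking git for the checked-out commit in each package directory; a minimal sketch using the paths already listed:

    for pkg in /lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5 \
               /lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2 \
               /lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1 \
               /lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd; do
      # print the checked-out commit with any tags/branches pointing at it
      git -C "$pkg" log -1 --oneline --decorate
    done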

@lgannoaa lgannoaa self-assigned this Aug 2, 2022
@emilyhcliu
Contributor

emilyhcliu commented Aug 2, 2022

Status update from DA: issues, diagnostics, solutions, and moving forward

Background

The gfs.v16.3.0 retrospective parallel started from 2021101518z on Cactus. So far, we have about 3-4 weeks of results. The overall forecast skill shows degradation in the NH. The DA team investigated possible causes and solutions. The run configured and maintained by @lgannoaa has been very helpful in allowing the DA team to spot a couple of issues in the gfsda.v16.3.0 package.

Issues, diagnostics, bug fixes, and tests

(1) An initialization problem with satellite bias correction coefficients was found for sensors whose coefficients are initialized from zero. The quasi-mode initialization procedure was skipped due to a bug merged from the GSI develop branch into gfs.v16.3.0.

The issue and diagnostics are documented in GSI Issue #438
The bug fix is provided in GSI PR #439
The bug fix has been merged into gfsda.v16.3.0

A short gfs.v16.3.0 parallel test (v163t) was performed to verify the bug fix

(2) Increasing NSST biases and RMS of O-F (no bias correction) are observed in the time series of AVHRR MetOp-B channel 3 and the window channels from hyperspectral sensors (IASI, CrIS). Foundation temperature bias and RMS compared to the operational GFS and OSTIA increase with time. It was found that the NSST increment file from GSI was not being passed into the global cycle properly.

The issue and diagnostics in detail are documented in GSI Issue #449

The bug fix is documented in GSI PR #448

Test

A short gfs.v16.3.0 real-time parallel (starting from 2022061918z; v163ctl) with the bug fixes from (1) and (2) is currently running on Dogwood to verify the bug fixes.

We will keep this running for a few days....

Here is the link to the Verification page: https://www.emc.ncep.noaa.gov/gc_wmb/eliu/v163ctl/

We should stop the retrospective parallel on Cactus and re-run it with the bug fixes.

@lgannoaa
Contributor Author

lgannoaa commented Aug 4, 2022

NCO announced that Cactus will become the dev machine on the evening of Aug 4th. The retro will start with CDATE=2021101518.

@lgannoaa
Contributor Author

lgannoaa commented Aug 4, 2022

The retro started on the evening of Aug. 4th.

@lgannoaa
Contributor Author

lgannoaa commented Aug 5, 2022

The retro paused on CDATE=2021101900 on the morning of Aug. 5th due to HPSS transfer slowness, which caused high COM usage.
On the evening of Aug. 5th, the transfer speed remained slow. The parallel remains paused.

@lgannoaa
Contributor Author

lgannoaa commented Aug 5, 2022

Tag: @emilyhcliu @dtkleist @junwang-noaa
@emilyhcliu and @dtkleist decided today to modify this parallel to write restart files and archive them to HPSS every 7 days. This change is now in place.
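
Conceptually, the change limits restart archiving to cycles that fall on a 7-day boundary from the start of the retro. A minimal sketch of that cadence check (the ARCHIVE_RESTARTS flag and the date arithmetic below are illustrative only, not the actual workflow variables):

    SDATE=2021101518                      # first full cycle of the retro (from the description above)
    days_since() { echo $(( ( $(date -ud "${1:0:8}" +%s) - $(date -ud "${2:0:8}" +%s) ) / 86400 )); }
    if (( $(days_since "$CDATE" "$SDATE") % 7 == 0 )); then
      ARCHIVE_RESTARTS="YES"              # write restart files and send them to HPSS this cycle
    else
      ARCHIVE_RESTARTS="NO"
    fi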

@lgannoaa
Contributor Author

lgannoaa commented Aug 6, 2022

Cactus had multiple system issues on the evening of Aug 5th, including job submission problems, missing jobs, zero-size files, and disappearing archive jobs. Multiple reruns and cleanups were performed. Resumed on CDATE=2021101900.

@lgannoaa
Contributor Author

lgannoaa commented Aug 6, 2022

Cactus had a file system issue that caused the para check job to fail. Example message:
mkdir: cannot create directory '/retro1-v16-ecf2021101818check': Permission denied
Cactus also had an HPSS transfer system issue that caused multiple archive jobs to fail. Example error message:
Cannot send after transport endpoint shutdown
The ecen and efcs jobs became zombie jobs.
Archive jobs continued to fail after several attempts to recover the parallel. Therefore, this parallel is paused on CDATE=2021101906 for the remainder of the weekend.

@lgannoaa
Contributor Author

lgannoaa commented Aug 8, 2022

This parallel resumed on the morning of Aug. 8th.
Cactus archive jobs continue to be impacted by the system issue "Cannot send after transport endpoint shutdown".
Helpdesk ticket sent: Ticket#2022080810000045
NCO fixed the system issue. The parallel is now resumed.
However, due to the system issue some files had already been cleaned out of PTMP, causing incomplete archive jobs.
Impacted CDATEs: 2021101518 to 2021101718, 2021101800, 2021101806, 2021101906, 2021102012, and 2021102018.
@emilyhcliu agreed to continue the parallel as-is in a meeting on Aug 8th.

@lgannoaa
Contributor Author

lgannoaa commented Aug 9, 2022

8/9: increased the eupd job wall clock by 10 minutes because it failed multiple times by hitting the wall clock limit.

@lgannoaa
Contributor Author

lgannoaa commented Aug 10, 2022

The transfer speed remained slow overnight on 8/9; by the morning of 8/10, PTMP reached its critical limit because archive jobs could not finish. The parallel is paused on CDATE=2021102512 until the transfer jobs finish.
Tag: @emilyhcliu @dtkleist @junwang-noaa

@lgannoaa
Contributor Author

lgannoaa commented Aug 10, 2022

Parallel resumed on CDATE=2021102518 for one cycle. It will be paused on CDATE=2021102600 in preparation for WAFS testing. Starting with CDATE=2021102518, post is now using the updated tag upp_v8.2.0 (02086a8) and WAFS is using tag gfs_wafs.v6.3.1 (da909f).

@lgannoaa
Contributor Author

lgannoaa commented Aug 12, 2022

Parallel paused on CDATE=2021102806 due to system errors in archive jobs and high PTMP usage.
Disk quota exceeded on the group PTMP.

@lgannoaa
Contributor Author

Reran a few zombie archive jobs to keep the parallel going and allow the PTMP cleanup to continue. At 10:00 EST on 8/13 the current CDATE is 2021103106.
Tested the WAFS GCIP job on CDATE=2021103100; it failed. An email has been sent to the developer.

@lgannoaa
Contributor Author

PTMP filled up last night and the parallel was paused for a few hours. It resumed at CDATE=2021110406.

@lgannoaa
Contributor Author

WAFS testing is now complete. The code manager checked the output and logs and found no issues.

@lgannoaa
Contributor Author

The GEMPAK and AWIPS downstream code managers checked the output and logs of a 00Z test and found no issues.

@lgannoaa
Contributor Author

The BUFR sounding code manager checked the output and log of a 00Z test and found no issues.

@lgannoaa
Contributor Author

Parallel paused for a few hours due to a transfer job system issue. After rerunning 34 jobs, the parallel is now resumed on CDATE=2021110612.

@lgannoaa
Contributor Author

Emergency failover of production to Cactus. This parallel is now paused in preparation for running in the white space.
Cactus is now the production machine, effective immediately at CDATE=2021110618.

@lgannoaa
Contributor Author

This parallel is resumed.

@lgannoaa
Contributor Author

A zombie gfs fcst job was found. Rerunning using the restart file
RERUN_RESTART/20211111.060000.coupler.res

@lgannoaa
Contributor Author

This parallel is paused due to a production switch. Archive job reruns are in progress.

@lgannoaa
Contributor Author

The gfs_wave_post_bndpntbll job has continued to hit the wall clock limit since late August 17th, impacting all 4 cycles in PDY=20211107.
Debugging is in progress.

@JessicaMeixner-NOAA
Contributor

JessicaMeixner-NOAA commented Aug 18, 2022

The gfs_wave_post_bndpntbll job has continued to hit the wall clock limit since late August 17th, impacting all 4 cycles in PDY=20211107.
Debugging is in progress.

@lgannoaa I will look into these jobs, but they are known to be highly sensitive to file system issues and in general have longer run times for us than for NCO. I'm looking to see if there are any red flags, but most likely the wall clock just needs to be extended, and these jobs should re-run to completion within the longer wall clock limit.

@emilyhcliu @JessicaMeixner-NOAA
May I know who is looking at the output of this job at this time? I ask because it looks to me like the outputs of this job, gfswave.t00z.ibpcbull_tar and gfswave.t00z.ibpbull_tar, are not being archived to HPSS. Can this job be turned off for this parallel?

This parallel is on pause because this job continues to fail for all cycles.

@emilyhcliu
Contributor

emilyhcliu commented Aug 18, 2022

@lgannoaa
Since the failing post job is a known problem in WAVES and its outputs are not used in the following cycles, let's skip these jobs and move the parallel forward.

@emilyhcliu The parallel is now resumed on CDATE=2021110812 with the gfs_wave_post_bndpntbll jobs turned off for all four cycles.

@JessicaMeixner-NOAA
Contributor

Sounds like a good plan @emilyhcliu

@lgannoaa
Contributor Author

Parallel resumed on CDATE=2021112100 after the PTMP full issue was resolved.

@lgannoaa
Contributor Author

lgannoaa commented Aug 23, 2022

The gfs_wave_post_bndpntbll jobs for CDATE=2021112212 and 2021112218 were turned on for a test requested by the helpdesk to help debug the job failure issue.
Both of these jobs completed in 40 minutes. These jobs are now resumed in this parallel.

@lgannoaa
Contributor Author

lgannoaa commented Sep 1, 2022

We still see impacts during the night when production transfer jobs take higher priority. Some of our transfer jobs get cancelled by the HPSS system due to slow transfer speed. The HPSS helpdesk responded by acknowledging the ticket. Therefore, the issue with failed transfer jobs is here (on Cactus) to stay.

@emilyhcliu
Contributor

@XuLi-NOAA It looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.

@XuLi-NOAA
Contributor

XuLi-NOAA commented Sep 1, 2022

@XuLi-NOAA It looks like the SH performs better than the NH. These plots should be posted in the issue for the real-time parallel.

It has been moved to #952.

@lgannoaa
Contributor Author

lgannoaa commented Sep 3, 2022

Reran 35 archive jobs due to previously known system issues.
Reran 32 archive jobs due to previously known system issues.

@lgannoaa
Contributor Author

lgannoaa commented Sep 5, 2022

Management requested a full-cycle run with the library updates for GFSv16.3.0. In preparation, the following modifications are planned:

  • Current HOMEgfs is preserved
  • Checkout GFSv16.3.0 and apply library updates
  • Build executable
  • Modify ecflow workflow to pause on CDATE=2022010400
  • Resume parallel with the library updates package going forward

As of the morning of Sep. 7th, the full-cycle test is complete. The one exception is the gempak job, which does not have canned data.
Management has decided to update only the bufr_ver module, to 11.7.0. All other libraries remain the same as before this full-cycle run. Therefore, on Sep. 7th the HOMEgfs was updated with this change and rebuilt. The current parallel is resumed on CDATE=2022010406.
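
A minimal sketch of a bufr_ver-only update and rebuild, assuming the usual GFS v16 package layout (the versions/build.ver and versions/run.ver file names and the build_all.sh entry point are assumptions here):

    cd "$HOMEgfs"                                           # HOMEgfs as listed in the issue description
    # bump only the BUFR library version; leave all other library versions untouched
    sed -i 's/^export bufr_ver=.*/export bufr_ver=11.7.0/' versions/build.ver versions/run.ver
    cd sorc && ./build_all.sh                               # rebuild the executables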

@lgannoaa
Contributor Author

lgannoaa commented Sep 8, 2022

Management has decided to update the GSI and model packages. The GSI package is ready; the model package is still pending. This parallel is paused on CDATE=2022010700 to check out and build the GSI package.

@lgannoaa
Contributor Author

In the process of switching between the library updates, the bufr_ver-only change, and the GSI update, the crtm version update was left out. The old crtm 2.3.0 is now updated to crtm 2.4.0, and GSI has been rebuilt with crtm 2.4.0. This parallel is in the process of rerunning from 2022010600.

@emilyhcliu
Contributor

emilyhcliu commented Sep 12, 2022

For the retrospective run, we will rewind 14 days and restart from the 2022010600 cycle.
With Lin's revised and improved global-workflow with ecflow and the better HPSS transfer rate, it is not a setback to rewind the parallel run. The most important thing is that we caught the issue, fixed it, and are moving forward.

@lgannoaa
Contributor Author

lgannoaa commented Sep 21, 2022

There was an emergency production switch on the morning of 9/21. Fifteen archive, METplus, and regular jobs failed due to the switch. Debug/rerun/recovery is in progress. The impacted jobs are in CDATE=2022020100, 2022020106, and 2022020112.

The ecen job for 2022020112 failed. Debugging traced it to the previous cycle: the 2022020106 job was corrupted by the production switch. Therefore, this parallel is rewound two cycles and rerun from 2022020106.

The rerun from 2022020106 resolved the issue.

@lgannoaa
Contributor Author

NCO executed a production switch on 9/22. Cactus is now back to being the dev machine.
This parallel will resume on CDATE=2022020206.

@XuLi-NOAA
Contributor

XuLi-NOAA commented Sep 22, 2022

[Figures: RMS and bias of foundation temperature relative to OSTIA for 4 experiments, 2021101600-2022013118: Global, N. Pole, N. Mid, Tropics, S. Mid, S. Pole]
Update on the NSST foundation temperature analysis performance monitoring in the GFSv16.3 retrospective run (retro1-v16-ecf). This is an extension of the figure reported 28 days ago, and 5 more areas are included this time: N. Pole, N. Mid, Tropics, S. Mid, and S. Pole, in addition to Global.
From the figures we can see that the RMS has improved across the whole period (about three and a half months). However, there is a concern: the bias is getting worse. In the global panel, the bias improved at the beginning (about 10 days) and then became even colder than operations. From the smaller-area panels we can see the issue occurs mainly in the Tropics and S. Mid areas. The NSST package had been tested, but never over this long a period. At the least, this is an alert.

@lgannoaa
Contributor Author

A safety check is now in place to stop the parallel if there is an unplanned production switch. This approach will reduce the chance of cycle corruption.
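
A minimal sketch of such a safety check, assuming an NCO-maintained prodmachinefile identifies the current production machine (the file path, its format, and the suite name are assumptions; the actual ecflow hook may differ):

    PRODFILE=/lfs/h1/ops/prod/config/prodmachinefile        # assumed location/format, e.g. "primary:cactus"
    if grep -qi 'primary:cactus' "$PRODFILE" 2>/dev/null; then
      # Cactus has become production: suspend the suite so no new cycles are submitted
      ecflow_client --suspend=/retro1-v16-ecf               # suite name assumed from pslot
    fi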

@RussTreadon-NOAA
Contributor

Question for @lgannoaa and @emilyhcliu

Should we find gdas_atmos_enkf_chgres job log files in /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}?

We have gdas_atmos_enkf_chgres_${cyc}.o* in /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/output/prod/today. It seems these log files should be copied to the appropriate ${PDY}${cyc} directory in /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3/logs.

@lgannoaa
Contributor Author

An online change to the ecflow workflow is in place to copy the gdas_atmos_enkf_chgres_${cyc}.o$$ log file to /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}
Effective CDATE=2022020712
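
A minimal sketch of the added copy step, using the paths quoted above (the exact placement inside the ecflow job wrapper is not shown):

    LOGDIR=/lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}
    mkdir -p "$LOGDIR"
    # copy the chgres job log from the "today" output directory into the per-cycle logs directory
    cp /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/output/prod/today/gdas_atmos_enkf_chgres_${cyc}.o* \
       "$LOGDIR/"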

@RussTreadon-NOAA
Contributor

Thank you, @lgannoaa. It will be good to have the gdas_atmos_enkf_chgres job log file with the other log files in /logs.

@lgannoaa
Contributor Author

This parallel is paused on CDATE=2022021606. This cycle will be rerun with three updates:

  1. Rebuild GSI using the solution in GSI#478 (Incorrect analysis date in calc_analysis.x atmanl.nc file) to address the pgrb2.1p00.anl file containing the wrong analysis time.
  2. Update the fv3gfs.fd package to #ec31f35b9a
  3. Update post to #cc4d3c2ff

The parallel will resume after the review process is completed.

@RussTreadon-NOAA
Contributor

NOAA-EMC/GSI tag gfsda.v16.3.0 recreated at e05d692.

@WalterKolczynski-NOAA
Contributor

Note: 48-h HPSS downtime Oct 4-5 will likely negatively impact all parallels for many days as NCO will need to catch up on their own transfers once service is restored.

@RussTreadon-NOAA
Contributor

A check of retro1-v16-ecf gfs_atmos_analysis_calc and gdas_atmos_analysis_calc log files and output for 2022021606 confirms that the correct analysis date is written to atmanl.nc. Cycles for 2022021600 and before write the wrong analysis date to atmanl.nc.

@lgannoaa
Contributor Author

lgannoaa commented Sep 30, 2022

Effective CDATE=2022021606
This parallel has been updated with the following (see the checkout sketch after the list):

  1. GSI issue #478 - Incorrect analysis date in calc_analysis.x atmanl.nc file
    The GSI package has been checked out at #e05d6923
  2. FV3 - Updated to support upp_v8.2.0 file generation
    The fv3gfs.fd package has been checked out at #ec31f35
  3. gfs_post.fd - Updated to upp_v8.2.0 with crtm 2.4.0
    The gfs_post.fd package has been checked out at #cc4d3c2f
  4. The gplot starting date was modified to 20220101 to allow precip step 2 to fully complete.
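
A minimal sketch of those component checkouts (the sorc/ subdirectory names and the build_all.sh entry point are assumed from the standard global-workflow layout; the hashes are the ones listed above):

    cd "$HOMEgfs/sorc"
    git -C gsi.fd      checkout e05d6923    # GSI with the calc_analysis.x analysis-date fix
    git -C fv3gfs.fd   checkout ec31f35     # FV3 updated to support upp_v8.2.0 file generation
    git -C gfs_post.fd checkout cc4d3c2f    # UPP upp_v8.2.0 with crtm 2.4.0
    ./build_all.sh                          # rebuild the executables (script name assumed)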

@lgannoaa
Contributor Author

lgannoaa commented Oct 3, 2022

As indicated in an email exchange between obsproc @ilianagenkova and NCO,
this parallel will be updated to use the new obsproc packages:
obsproc v1.1.0, bufr-dump v1.1.0, and prepobs v1.0.1
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.1.0 (package location for obsproc v1.1.0)
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1 (package location for prepobs v1.0.1)
Effective CDATE=2022030600
The ecflow workflow obsproc COMOUT will be located in /lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/obsproc/v1.1

A group decision has been made on the following changes to the ecflow workflow obsproc prep jobs (see the sketch after the list):

  • Continue to point to the EMC dump archive /lfs/h2/emc/global/noscrub/emc.global/dump/gdas and gfs (not the gdasx/gfsx) locations for this parallel.
  • Copy the ${RUN}.${CYCLE}.nsstbufr file from the EMC dump archive gdasx/gfsx location into the obsproc COMOUT to replace the output from the prep jobs. The resulting nsstbufr file is bit-identical:
    cdecflow02:/lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/obsproc/v1.1/gdas.20220306/00/atmos> cmp gdas.t00z.nsstbufr /lfs/h2/emc/global/noscrub/emc.global/dump/gdasx.20220306/00/atmos/gdas.t00z.nsstbufr
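
A minimal sketch of that nsstbufr replacement for the gdas 2022030600 case shown in the cmp line above (the general change would loop over RUN and cycle):

    DUMPX=/lfs/h2/emc/global/noscrub/emc.global/dump/gdasx.20220306/00/atmos
    COMOBS=/lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/obsproc/v1.1/gdas.20220306/00/atmos
    # overwrite the prep-job nsstbufr with the one from the gdasx dump archive
    cp "${DUMPX}/gdas.t00z.nsstbufr" "${COMOBS}/gdas.t00z.nsstbufr"
    cmp "${COMOBS}/gdas.t00z.nsstbufr" "${DUMPX}/gdas.t00z.nsstbufr"   # exit status 0 => bit-identical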

Parallel is resumed.

@lgannoaa
Contributor Author

lgannoaa commented Oct 6, 2022

This parallel is paused on CDATE=2022031212 because the HPSS outage caused PTMP to be 95% full.
Overnight and into the early morning of 10/6, the HPSS transfer rate remained slow (50-75% of normal) and many transfers failed. Archive job recovery is still ongoing, so this parallel remains paused.
Depending on archive job progress and HPSS transfer speed today, COM may be touched to keep this parallel healthy.
As of 11:30a on 10/6, HPSS transfer network performance is above 90%. Archive jobs are going through and PTMP usage is down to 83%.
This parallel is resumed.

@lgannoaa
Contributor Author

lgannoaa commented Oct 10, 2022

The /lfs/h2/emc/stmp usage on Cactus is 96.7%. This parallel will be paused on CDATE=2022032400 until stmp space is cleaned up by the developers.
This parallel resumed on 10/10 at 7:50a. The stmp usage has been reduced to 78.6%.

@lgannoaa
Contributor Author

Many jobs failed over the weekend of 10/16 due to the group STMP being full. Recovery is in progress. Currently on CDATE=2022060412.

@lgannoaa
Contributor Author

lgannoaa commented Oct 20, 2022

Management has decided to stop the retro parallel. The last cycle will be CDATE=2022041400.
Currently PTMP usage is 94%. The COM for this parallel will be removed from PTMP after 5:00p on Oct. 21 to free up space.

@lgannoaa
Contributor Author

lgannoaa commented Oct 21, 2022

This parallel reached CDATE=2022041400. The METplus cron has been removed.
The last METplus stats will be for PDY=20220413.
The last gplot will be for PDY=20220412.
The last FIT2OBS will be for CDATE=2022041300.

@lgannoaa
Contributor Author

Renamed COM in preparation for deletion:
/lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3 is now
/lfs/h2/emc/ptmp/lin.gan/retro1-v16-ecf/para/com/gfs/v16.3-TOBEDELETED

@lgannoaa
Contributor Author

COM removed. Closing this ticket.
