GFS v16.3 realtime parallel for implementation #952

Closed · lgannoaa opened this issue Aug 2, 2022 · 65 comments

@lgannoaa commented Aug 2, 2022

Description

This issue documents the GFS v16.3 realtime parallel for implementation. Reference: #776.
@emilyhcliu is the implementation POC.

The configuration for this parallel is:
First full cycle starting CDATE is retro 2022073118
HOMEgfs: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0
pslot: rt1-v16-ecf
EXPDIR: /lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config
COM: /lfs/h2/emc/ptmp/Lin.Gan/rt1-v16-ecf/para/com/gfs/v16.3
log: /lfs/h2/emc/ptmp/Lin.Gan/rt1-v16-ecf/para/com/output/prod/today
on-line archive: /lfs/h2/emc/global/noscrub/lin.gan/archive/rt1-v16-ecf
METPlus stat files: /lfs/h2/emc/global/noscrub/lin.gan/archive/metplus_data
FIT2OBS: /lfs/h2/emc/global/noscrub/lin.gan/archive/rt1-v16-ecf/fits
Verification Web site: https://www.emc.ncep.noaa.gov/gmb/Lin.Gan/metplus/rt1-v16-ecf
(Updated daily at 14:00 UTC on PDY-1)
HPSS archive: /NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/rt1-v16-ecf

FIT2OBS:
/lfs/h2/emc/global/save/emc.global/git/Fit2Obs/newm.1.5
df1827cb (HEAD, tag: newm.1.5, origin/newmaster, origin/HEAD)

obsproc:
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.0.2
83992615 (HEAD, tag: OT.obsproc.v1.0.2_20220628, origin/develop, origin/HEAD)

prepobs:
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1
5d0b36fba (HEAD, tag: OT.prepobs.v1.0.1_20220628, origin/develop, origin/HEAD)

HOMEMET:
/apps/ops/para/libs/intel/19.1.3.304/met/9.1.3

METplus:
/apps/ops/para/libs/intel/19.1.3.304/metplus/3.1.1

verif_global:
/lfs/h2/emc/global/noscrub/lin.gan/para/packages/gfs.v16.3.0/sorc/verif-global.fd
1aabae3aa (HEAD, tag: verif_global_v2.9.4)

@lgannoaa commented Aug 3, 2022

The realtime parallel started on August 3rd on Dogwood.

@lgannoaa commented Aug 4, 2022

On the evening of August 3rd, the HPSS transfer speed became very slow and many archive jobs were waiting in the queue. There is a scheduled Dogwood outage on Aug. 4th, so the realtime parallel is paused at CDATE=2022062112 to allow all HPSS transfer jobs to complete. The parallel will be resumed/recovered when the machine returns from the scheduled outage.

@lgannoaa commented Aug 4, 2022

On the morning of Aug 4th, NCO performed an emergency production switch; Dogwood is now the production machine. This parallel is halted at CDATE=2022062112.

@lgannoaa commented Aug 5, 2022

Tag: @emilyhcliu @dtkleist @junwang-noaa
@emilyhcliu and @dtkleist decided today to modify this parallel to write restart files and archive them to HPSS every 7 days. This change is now in place.
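
For reference, a hedged sketch of how such an archive-cadence change is typically expressed in the experiment configuration; the variable names below are assumptions based on config.arch in global-workflow and may not match this tag exactly.

```bash
# Hedged sketch of the configuration change, assuming the archive-frequency
# variables in ${EXPDIR}/config.arch used by GFS v16 global-workflow
# (ARCH_WARMICFREQ / ARCH_FCSTICFREQ, in days); names may differ in this tag.
EXPDIR=/lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0/parm/config

# In ${EXPDIR}/config.arch (assumed names):
export ARCH_WARMICFREQ=7   # archive warm-start restart ICs to HPSS every 7 days
export ARCH_FCSTICFREQ=1   # forecast-only ICs left at the assumed default
```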

@lgannoaa commented Aug 6, 2022

Multiple archive jobs failed due to a system issue on the evening of Aug 5th, for example:
ERROR: hpss_WriteList error -5000, file offset 0 - aborting
###WARNING htar returned non-zero exit status.
This parallel is paused at CDATE=2022062206 for the remainder of the weekend.
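
When HPSS recovers, the failed bundles can be retried by hand; a minimal sketch, assuming placeholder paths for the tarball and source directory (the real archive jobs build these from pslot and CDATE):

```bash
#!/bin/bash
# Hedged sketch: manually retry an htar bundle that failed with
# "hpss_WriteList error -5000". HPSS_TARBALL and SRC_DIR are placeholders;
# the real archive job derives these names from pslot and CDATE.
HPSS_TARBALL=/NCEPDEV/emc-global/5year/lin.gan/WCOSS2/scratch/rt1-v16-ecf/example.tar  # placeholder
SRC_DIR=/path/to/local/cycle/directory                                                 # placeholder

for attempt in 1 2 3; do
  if htar -cvf "${HPSS_TARBALL}" "${SRC_DIR}"; then
    echo "htar succeeded on attempt ${attempt}"
    break
  fi
  echo "htar attempt ${attempt} failed (status $?); waiting before retry"
  sleep 600
done
```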

@lgannoaa commented Aug 8, 2022

This parallel was resumed on the morning of Aug. 8th.

@lgannoaa commented Aug 8, 2022

Management decided on Aug. 8th to rerun this parallel starting with CDATE=2022073118. All jobs have been killed and clean-up is in progress.
Tag: @emilyhcliu @dtkleist @junwang-noaa @aerorahul

@lgannoaa commented Aug 9, 2022

The realtime run with the new starting CDATE=2022073118 started on Dogwood (8/9 3:00 p.m.).

@lgannoaa commented:

As of 8/10 9:00 a.m. the parallel is on CDATE=2022080218, so current throughput is roughly 2.5 hours per cycle, about 9 cycles per day.

@lgannoaa commented Aug 10, 2022

Many transfer jobs have failed due to system issues (for example, zombie jobs and jobs killed by system problems). Rerun and debugging are ongoing.

Example of a job that failed on a system issue:
Job 13982095.dbqs01
ERROR: hpss_WriteList error -5000, file offset 0 - aborting
Example of a zombie job:
Job Id: 13927854.dbqs01
Job_Name = gdas_HPSS_ARCHIVE_gdas_2022080100
This job was submitted but never executed and disappeared from the system queue.
Transfer job slowness:
There are currently 74 archive jobs in the queue waiting to run on Dogwood.
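
A quick way to gauge the archive backlog is to count this user's HPSS archive jobs still on the PBS server; a sketch, assuming the job-name pattern shown above:

```bash
# Count this user's HPSS archive jobs still on the PBS server.
# "-w" widens the output so full job names such as
# gdas_HPSS_ARCHIVE_gdas_2022080100 are not truncated; adjust the
# grep pattern if the job naming differs.
qstat -wa -u "${USER}" | grep -c "HPSS_ARCHIVE"
```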

@lgannoaa commented Aug 10, 2022

Transfer speed remains slow, with over 60 archive jobs in the queue. This parallel is now paused at CDATE=2022080318 for the archive jobs to finish.
@emilyhcliu @dtkleist @junwang-noaa

@lgannoaa commented Aug 10, 2022

post is now using updated tag upp_v8.2.0 (02086a8) and wafs is using tag gfs_wafs.v6.3.1 (da909f), effective CDATE=2022080318. The wafs jobs remain turned off for this parallel.
An additional 30 archive jobs failed with system error 141. This parallel remains paused.
Overnight from 8/10 to the morning of 8/11, a total of 51 archive jobs failed with system error:
ERROR: hpss_WriteList error -5000, file offset 6442450944 - aborting
status=141

@lgannoaa commented:

Dogwood has a system issue that caused a job to fail with:
sed: can't read /var/spool/pbs/aux/14343353.dbqs01: No such file or directory
sed: can't read /var/spool/pbs/aux/14343353.dbqs01: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
grep: /tmp/qstat.14343353: No such file or directory
000 - nid001325 : Job 14343353.dbqs01 - DEBUG-DMESG: Unable to find NFS stats file: /tmp/nfsstats.14343353.dbqs01
000 - nid001325 Job 14343353.dbqs01 - DEBUG-DMESG: Unable to find Mount stats file: /tmp/mntstats.begin.14343353.dbqs01
Epilogue: Enabling ASLR...

@lgannoaa commented Aug 12, 2022

Reran 55 archive jobs that failed with the hpss_WriteList error -5000 system error. Currently on CDATE=2022080418 as of 10 a.m. Aug. 12th.
Transfer speed remains slow and the archive queue is long; archive jobs in the transfer queue are disappearing due to system errors. The parallel is now paused at CDATE=2022080506.
There are still 87 archive jobs waiting in the queue, so this parallel will stay paused.
As of 10:00 EST 8/13, Dogwood still has over 70 archive jobs not done. This parallel remains paused.

@lgannoaa commented:

As of 5:00 p.m. on Sunday 8/14, this parallel is on CDATE=2022080712. Estimated throughput is around 3 model days per calendar day.

@lgannoaa commented Aug 15, 2022

The realtime parallel is paused at CDATE=2022080918 due to the long archive queue and transfer-job slowness.
As of the night of Aug. 15th, archive jobs to HPSS are still backed up; this parallel remains halted.

@lgannoaa commented Aug 16, 2022

Parallel is now resumed on CDATE=2022080918.

@lgannoaa commented:

Emergency failover of production to Cactus. This parallel is now paused in preparation for running on the dev machine.
Dogwood is being prepared as the dev machine; the parallel will resume when the machine is returned to developers.
Effective immediately at CDATE=2022080918.

@lgannoaa commented:

This parallel resumed last night but a zombie job caused it to halt. The condition is now resolved and the parallel has resumed.

@lgannoaa commented Aug 17, 2022

efcs group 21 failed with a system issue:
sed: can't read /var/spool/pbs/aux/15767924.dbqs01: No such file or directory
sed: can't read /var/spool/pbs/aux/15767924.dbqs01: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
grep: /tmp/qstat.15767924: No such file or directory
000 - nid001433 : Job 15767924.dbqs01 - DEBUG-DMESG: Unable to find NFS stats file: /tmp/nfsstats.15767924.dbqs01
000 - nid001433 Job 15767924.dbqs01 - DEBUG-DMESG: Unable to find Mount stats file: /tmp/mntstats.begin.15767924.dbqs01

Rerun is in progress.
eupd and the gfs forecast were also hit by the same issue: enkfgdas_update_06.o15771263, gfs_forecast_06.o15771791

@lgannoaa commented:

This parallel is paused due to a production switch. Archive job reruns are in progress.

@lgannoaa commented:

The EUPD job for CDATE=2022081612 failed. Debugging and rerun are ongoing. The parallel is paused as of 8/20.
Failed job log: Dogwood:/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today/enkfgdas_update_12.o16191080

@lgannoaa commented:

The gdas analysis job for CDATE=2022081612 failed with:
/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/gdas.20220816/12/atmos/
gdas.t12z.seaice.5min.nid001160.dogwood.wcoss2.ncep.noaa.gov 0: blend.grb
warning:cycl terminating search and and setting gdata to -999
range max= 15
ice concentration analysis read error
nid001160.dogwood.wcoss2.ncep.noaa.gov 0: abort:
nid001165.dogwood.wcoss2.ncep.noaa.gov: rank 4 exited with code 134

Log file: /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today/gdas_atmos_analysis_12.o16190817
Debugging is in progress.

@RussTreadon-NOAA commented:

The EUPD job for CDATE=2022081612 failed. Debug and rerun is on-going. Parallel is paused as of 8/20. Failed job log: Dogwood:/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today/enkfgdas_update_12.o16191080

The log file indicates

nid001122.dogwood.wcoss2.ncep.noaa.gov 0: tar: /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/enkfgdas.20220816/12/atmos/gdas.t12z.oznstat.ensmean: Cannot open: No such file or directory

A check confirms neither oznstat nor radstat files are present in /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/enkfgdas.20220816/12/atmos/. The 12Z eobs job log file shows that no ozone or radiance data was processed. This is odd.

A check of /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.0/gdas.20220816/12/atmos/ shows the dump files exist. The links point to valid files.

Suggest a rewind and rerun of the 2022081612 enkfgdas select_obs, diag, and update.
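
Since this parallel is driven by ecflow, such a rewind would amount to requeueing the three tasks; a hedged sketch, assuming a placeholder node path (the actual rt1-v16-ecf suite layout is not shown in this thread):

```bash
# Hedged sketch: requeue the failed 2022081612 enkfgdas tasks under ecflow.
# ECF_HOST/ECF_PORT must already point at the rt1-v16-ecf server, and
# SUITE_PATH is a placeholder for the real node path in the suite definition.
SUITE_PATH=/rt1-v16-ecf/enkfgdas   # placeholder

for task in select_obs diag update; do
  ecflow_client --requeue=force "${SUITE_PATH}/${task}"
done
```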

@lgannoaa commented:

A rewind of the 2022081612 enkfgdas select_obs, diag, and update was done. The job failed with the same issue:
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 02:01 enkfgdas_update_12.o16160361
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 02:02 enkfgdas_update_12.o16160623 (retry)
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 02:11 enkfgdas_update_12.o16162227
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 02:12 enkfgdas_update_12.o16162467 (retry)
A complete rewind and rerun from the previous cycle's (2022081606) GDAS/enkf still resulted in the same failure:
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 05:22 enkfgdas_update_12.o16191080
-rw-r--r-- 1 lin.gan emc 1.1M Aug 20 05:22 enkfgdas_update_12.o16191237 (retry)

@RussTreadon-NOAA commented:

Thanks @lgannoaa. This is indeed interesting. Upon closer examination, the enkfgdas_select_obs_12 job log files indicate a mismatch between the assimilation window and the dump time for several GDA dump files:

 Analysis start  :  2022081609
 Analysis end    :  2022081615
 Observation time:  2022081712
nid001049.dogwood.wcoss2.ncep.noaa.gov 20:  read_obs_check: bufr file omi       aura       not available omibufr
nid001089.dogwood.wcoss2.ncep.noaa.gov 46:  ***read_obs_check*** incompatable analysis and observation date/timeompsnpbufr
 ompsnp

A comparison of the GDA and operational dumps, using omi as the example, shows an inconsistency. enkfgdas_select_obs_12 points at the correct omi dump file:

1 + OMIBF=/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.0/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d
1 + /bin/ln -sf /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.0/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d omibufr

The rt1-v16-ecf file links to the correct GDA file

lrwxrwxrwx 1 lin.gan emc 91 Aug 20 05:13 /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.0/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d -> /lfs/h2/emc/global/noscrub/emc.global/dump/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d

However, the GDA file is not correct. Comparison of GDA and operations shows an inconsistency

-rw-rw-r-- 1 ops.prod   prod    4395016 Aug 16 17:52 /lfs/h1/ops/prod/com/obsproc/v1.0/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d
-rw-r--r-- 1 emc.global global 10108936 Aug 17 17:52 /lfs/h2/emc/global/noscrub/emc.global/dump/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d

As enkfgdas_select_obs indicates, what GDA labels as gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d is actually gdas.20220817/12/atmos/gdas.t12z.omi.tm00.bufr_d.

-rw-rw-r-- 1 ops.prod   prod   10108936 Aug 17 17:52 /lfs/h1/ops/prod/com/obsproc/v1.0/gdas.20220817/12/atmos/gdas.t12z.omi.tm00.bufr_d
-rw-r--r-- 1 emc.global global 10108936 Aug 17 17:52 /lfs/h2/emc/global/noscrub/emc.global/dump/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d
russ.treadon@dlogin04:/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today> cmp /lfs/h2/emc/global/noscrub/emc.global/dump/gdas.20220816/12/atmos/gdas.t12z.omi.tm00.bufr_d /lfs/h1/ops/prod/com/obsproc/v1.0/gdas.20220817/12/atmos/gdas.t12z.omi.tm00.bufr_d

I wonder if this finding has any bearing on the gdas_atmos_analysis_12 failure.

@KateFriedman-NOAA , the 2022081612 gdas GDA files are not correct. At least some are actually dump files for 2022081712. I'm not sure if this issue only affected 2022081612 gdas or also other dumps and/or cycles.

@RussTreadon-NOAA commented:

@KateFriedman-NOAA: other gdas cycles are also corrupted.

cmp shows that dump files on consecutive days are identical (cmp produces no output when the files match):

russ.treadon@dlogin04:/lfs/h2/emc/global/noscrub/emc.global/dump> cmp gdas.20220816/12/atmos/gdas.t12z.crisf4.tm00.bufr_d gdas.20220817/12/atmos/gdas.t12z.crisf4.tm00.bufr_d

russ.treadon@dlogin04:/lfs/h2/emc/global/noscrub/emc.global/dump> cmp gdas.20220816/18/atmos/gdas.t18z.crisf4.tm00.bufr_d gdas.20220817/18/atmos/gdas.t18z.crisf4.tm00.bufr_d                                                                                                                                            

russ.treadon@dlogin04:/lfs/h2/emc/global/noscrub/emc.global/dump> cmp gdas.20220817/00/atmos/gdas.t00z.crisf4.tm00.bufr_d gdas.20220818/00/atmos/gdas.t00z.crisf4.tm00.bufr_d 

The temporal sequence of the date/time stamps for these files looks odd:

-rw-r--r-- 1 emc.global global 680586904 Aug 17 17:52 gdas.20220816/12/atmos/gdas.t12z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 653250648 Aug 17 23:52 gdas.20220816/18/atmos/gdas.t18z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 681129800 Aug 18 05:52 gdas.20220817/00/atmos/gdas.t00z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 672099224 Aug 17 11:52 gdas.20220817/06/atmos/gdas.t06z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 680586904 Aug 17 17:52 gdas.20220817/12/atmos/gdas.t12z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 653250648 Aug 17 23:52 gdas.20220817/18/atmos/gdas.t18z.crisf4.tm00.bufr_d
-rw-r--r-- 1 emc.global global 681129800 Aug 18 05:52 gdas.20220818/00/atmos/gdas.t00z.crisf4.tm00.bufr_d
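
A simple scan can flag other dump files that are byte-identical to the same-named file one day later (the corruption signature above); a sketch using the dump directory quoted in this thread:

```bash
#!/bin/bash
# Hedged sketch: flag GDA dump files that are byte-identical to the file with
# the same name one day later, which is the corruption signature shown above.
DUMPDIR=/lfs/h2/emc/global/noscrub/emc.global/dump
PDY=20220816
NEXT=$(date -d "${PDY} + 1 day" +%Y%m%d)

for f in "${DUMPDIR}/gdas.${PDY}"/*/atmos/*.bufr_d; do
  g=${f/gdas.${PDY}/gdas.${NEXT}}        # same cycle and filename, next day
  [[ -f $g ]] || continue
  if cmp -s "$f" "$g"; then
    echo "IDENTICAL: $f == $g"
  fi
done
```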

@emilyhcliu commented Aug 20, 2022

I do not have production WCOSS2 access. @lgannoaa kindly made three log files (enkfgdas_select_obs, enkfgdas_update, and enkfdiag) available for me on the development machine.

In the enkfgdas_select_obs log file, I found the following:

nid001024.dogwood.wcoss2.ncep.noaa.gov 29:  ***read_obs_check*** incompatable analysis and observation date/timeatmsbufr
 atms
 Analysis start  :  2022081609
 Analysis end    :  2022081615
 Observation time:  2022081712
 read_obs_check: bufr file atms      npp        not available atmsbufr

This is just an example for ATMS.
In GSI there is a routine, read_obs_check, which checks the consistency between the analysis time and the time stamps in the observation files.

In this cycle, all read_obs_check calls failed (see the list below and the grep sketch after it).
I do not have the log file from the analysis step; if we check the gsistat file, the number of observations assimilated should be zero.

read_obs_check: bufr file q                    not available hdobbufr
read_obs_check: bufr file sndrd2    g15        not available gsnd1bufr
read_obs_check: bufr file uv                   not available hdobbufr
read_obs_check: bufr file amsua     aqua       not available airsbufr
read_obs_check: bufr file t                    not available hdobbufr
read_obs_check: bufr file sndrd4    g15        not available gsnd1bufr
read_obs_check: bufr file sndrd3    g15        not available gsnd1bufr
read_obs_check: bufr file airs      aqua       not available airsbufr
read_obs_check: bufr file sndrd1    g15        not available gsnd1bufr
read_obs_check: bufr file uv                   not available oscatbufr
read_obs_check: bufr file uv                   not available rapidscatbufr
read_obs_check: bufr file saphir    meghat     not available saphirbufr
read_obs_check: bufr file uv                   not available satwndbufr
read_obs_check: bufr file ompstc8   n20        not available ompstcbufr
read_obs_check: bufr file ompstc8   npp        not available ompstcbufr
read_obs_check: bufr file avhrr     metop-a    not available avhambufr
read_obs_check: bufr file ompsnp    npp        not available ompsnpbufr
read_obs_check: bufr file avhrr     metop-b    not available avhambufr
read_obs_check: bufr file avhrr     n18        not available avhpmbufr
read_obs_check: bufr file atms      npp        not available atmsbufr
read_obs_check: bufr file avhrr     n19        not available avhpmbufr
read_obs_check: bufr file avhrr     metop-c    not available avhambufr
read_obs_check: bufr file amsua     metop-c    not available amsuabufr
read_obs_check: bufr file amsua     n15        not available amsuabufr
read_obs_check: bufr file amsua     metop-b    not available amsuabufr
read_obs_check: bufr file atms      n20        not available atmsbufr
read_obs_check: bufr file amsua     n18        not available amsuabufr
read_obs_check: bufr file amsua     n19        not available amsuabufr
read_obs_check: bufr file amsua     metop-a    not available amsuabufr
read_obs_check: bufr file sst       nsst       not available nsstbufr
read_obs_check: bufr file omi       aura       not available omibufr
read_obs_check: bufr file gps_bnd              not available gpsrobufr
read_obs_check: bufr file ahi       himawari8  not available ahibufr
read_obs_check: bufr file seviri    m11        not available seviribufr
read_obs_check: bufr file seviri    m08        not available seviribufr
read_obs_check: bufr file iasi      metop-b    not available iasibufr
read_obs_check: bufr file mhs       metop-c    not available mhsbufr
read_obs_check: bufr file mhs       metop-c    not available mhsbufrears
read_obs_check: bufr file mhs       metop-c    not available mhsbufr_db
read_obs_check: bufr file cris-fsr  n20        not available crisfsbufr
read_obs_check: bufr file cris-fsr  n20        not available crisfsbufrears
read_obs_check: bufr file amsua     metop-c    not available amsuabufrears
read_obs_check: bufr file amsua     metop-c    not available amsuabufr_db
read_obs_check: bufr file iasi      metop-c    not available iasibufr
read_obs_check: bufr file amsua     n18        not available amsuabufrears
read_obs_check: bufr file amsua     metop-a    not available amsuabufrears
read_obs_check: bufr file amsua     n19        not available amsuabufrears
read_obs_check: bufr file mhs       n19        not available mhsbufr
read_obs_check: bufr file amsua     metop-b    not available amsuabufrears
read_obs_check: bufr file mhs       n19        not available mhsbufrears
read_obs_check: bufr file amsua     n18        not available amsuabufr_db
read_obs_check: bufr file amsua     metop-a    not available amsuabufr_db
read_obs_check: bufr file amsua     n19        not available amsuabufr_db
read_obs_check: bufr file mhs       n19        not available mhsbufr_db
read_obs_check: bufr file amsua     metop-b    not available amsuabufr_db
read_obs_check: bufr file mhs       metop-b    not available mhsbufr
read_obs_check: bufr file mhs       metop-b    not available mhsbufrears
read_obs_check: bufr file mhs       metop-b    not available mhsbufr_db
read_obs_check: bufr file abi       g16        not available abibufr
read_obs_check: bufr file ssmis     f17        not available ssmisbufr
read_obs_check: bufr file atms      n20        not available atmsbufrears
read_obs_check: bufr file atms      npp        not available atmsbufrears
read_obs_check: bufr file iasi      metop-b    not available iasibufrears
read_obs_check: bufr file iasi      metop-c    not available iasibufrears
read_obs_check: bufr file cris-fsr  n20        not available crisfsbufr_db
read_obs_check: bufr file atms      n20        not available atmsbufr_db
read_obs_check: bufr file atms      npp        not available atmsbufr_db
read_obs_check: bufr file iasi      metop-b    not available iasibufr_db
read_obs_check: bufr file iasi      metop-c    not available iasibufr_db

@lgannoaa commented:

Many archive jobs failed with a system issue:
Connection timed out
Rerun is in progress.

@lgannoaa commented:

The dust from the HPSS transfer slowness has finally settled. All archive jobs from previous cycles have completed.

@lgannoaa commented:

The HPSS speed improvement now looks solid on WCOSS2. This parallel has been modified to write restart files to HPSS every day. The change is in place, effective CDATE=2022083006.

@lgannoaa commented Sep 1, 2022

Congratulations: this parallel reached true realtime at CDATE=2022090106.

@lgannoaa commented Sep 1, 2022

We still see impacts during the night when production transfer jobs take higher priority. Some of our transfer jobs get cancelled by the HPSS system due to slow transfer speed. The HPSS helpdesk has acknowledged the ticket. The issue with failed transfer jobs is therefore here (on Dogwood) to stay.

@XuLi-NOAA commented Sep 1, 2022

[Attached plots: RMS and bias relative to OSTIA for opr, rt1, and cmc, 2022080100-2022083118, for the S Pole, S Mid, Tropics, N Mid, N Pole, and Global regions.]
The same monitoring has been done for rt1-v16-ecf, in the same way as for retro1-v16-ecf (see #951), on the NSST foundation temperature (Tf) analysis for the period 2022080100 to 2022083018. There is a warning/alert: in terms of RMS relative to OSTIA, globally the RMS is smaller/better in the first few days, but after a few days it grows and becomes larger than that of the operational NSST Tf analysis. By latitude band, the larger RMS occurs in the N. Pole and N. Mid areas; in the other bands (Tropics, S. Mid, and S. Pole), the RMS is significantly smaller than the operational NSST Tf, as seen in retro1-v16-ecf and in my experiments with the operational GFS. This indicates that the VIIRS observations are critical with the current GSI and that, possibly, the thinning is an issue as well. More diagnosis is needed.

@lgannoaa commented Sep 5, 2022

Management requested a full-cycle run with the library updates planned for GFSv16.3.0. In preparation, the following modifications are planned:

  • Current HOMEgfs is preserved
  • Checkout GFSv16.3.0 and apply library updates
  • Build executable
  • Modify ecflow workflow to pause on CDATE=2022090600
  • Resume parallel with the library updates package going forward

As of the morning of Sep. 7th, the full-cycle test is complete.
Management decided to update only bufr_ver, to 11.7.0; all other libraries remain the same as before this full-cycle run. Therefore, on Sep. 7th, HOMEgfs was updated with this change and rebuilt. The current parallel is resumed at CDATE=2022090606.
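
For the record, a hedged sketch of the kind of change applied; the versions-file names (build.ver, run.ver) and the build script are assumptions for this tag and should be verified against the actual HOMEgfs checkout:

```bash
# Hedged sketch of the bufr-only library update, assuming the versions/ files
# used by GFS v16 on WCOSS2 (build.ver and run.ver) and the sorc build script.
HOMEgfs=/lfs/h2/emc/global/noscrub/lin.gan/git/gfsda.v16.3.0

sed -i 's/^export bufr_ver=.*/export bufr_ver=11.7.0/' \
    "${HOMEgfs}/versions/build.ver" "${HOMEgfs}/versions/run.ver"

# then rebuild the executables (assumed build entry point)
cd "${HOMEgfs}/sorc" && ./build_all.sh
```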

@lgannoaa commented Sep 8, 2022

Management has decided to update the GSI and model packages. The GSI package is ready; the model package is still pending. This parallel is paused at CDATE=2022090806 to check out and build the GSI package.

@lgannoaa commented:

During the switching between the library updates, the bufr_ver-only change, and the GSI update, the crtm version update was left out. The old crtm 2.3.0 has now been updated to crtm 2.4.0 and GSI has been rebuilt with crtm 2.4.0. This parallel is being rerun from 2022090800.

@emilyhcliu commented:

For the real-time run, we will rewind 4 days and restart on the 2022090800 cycle.
With Lin's revised and improved global-workflow with ecflow, and the better HPSS transfer rate, it is not a setback to rewind the parallel run. The most important thing is that we caught the issue, fixed it, and moved forward.

@lgannoaa commented:

This parallel has been running in realtime since CDATE=2022091500.

@lgannoaa commented:

There was an emergency production switch on the morning of 9/21. Thirty archive jobs failed, and some other jobs also failed due to the switch. Debugging/rerun/recovery is in progress. The impacted jobs are in CDATE=2022092100 and 2022092106.

@lgannoaa commented:

NCO executed a production switch on 9/22; Dogwood is now back to being the prod machine.
This parallel will resume at CDATE=2022092200.

@lgannoaa commented:

A safety check is now in place to stop the parallel if there is an unplanned production switch. This reduces the chance of cycle corruption.
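
The actual check lives in the ecflow suite and is not shown here; the sketch below is only an illustration of the idea, with a hypothetical indicator file for the current production machine:

```bash
#!/bin/bash
# Hedged, illustrative guard: suspend the parallel suite if this host has
# become the production machine. PROD_MACHINE_FILE is hypothetical; the real
# ecflow suite uses a site-specific indicator not documented in this issue.
PROD_MACHINE_FILE=/path/to/prod_machine_indicator   # hypothetical
THIS_MACHINE=dogwood

if grep -qi "${THIS_MACHINE}" "${PROD_MACHINE_FILE}" 2>/dev/null; then
  echo "Unplanned production switch detected: ${THIS_MACHINE} is production."
  ecflow_client --suspend /rt1-v16-ecf    # suite path is a placeholder
fi
```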

@RussTreadon-NOAA commented:

Question for @lgannoaa and @emilyhcliu

Should we find gdas_atmos_enkf_chgres job log files in /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}?

We have gdas_atmos_enkf_chgres_${cyc}.o* in /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today. It seems these log files should be copied to the appropriate ${PDY}${cyc} directory in /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/logs.

@lgannoaa commented:

An online change to the ecflow workflow is in place to copy the gdas_atmos_enkf_chgres_${cyc}.o$$ log file to /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}, effective CDATE=2022092406.
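
The change amounts to copying the PBS job output into the cycle's logs directory; a minimal sketch following the paths quoted above (exact variable names inside the suite may differ):

```bash
# Hedged sketch of the log-copy step added to the ecflow job wrapper.
# Directory paths follow those quoted in this thread; variable names inside
# the actual suite scripts may differ.
PDY=20220924; cyc=06
OUTDIR=/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/output/prod/today
LOGDIR=/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/gfs/v16.3/logs/${PDY}${cyc}

mkdir -p "${LOGDIR}"
cp -p "${OUTDIR}/gdas_atmos_enkf_chgres_${cyc}.o"* "${LOGDIR}/"
```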

@RussTreadon-NOAA commented:

Thank you, @lgannoaa. It will be good to have the gdas_atmos_enkf_chgres job log file with the other log files in /logs.

@RussTreadon-NOAA commented:

A check of rt1-v16-ecf gfs_atmos_analysis_calc and gdas_atmos_analysis_calc log files and output for 2022092806 confirms that the correct analysis date is written to atmanl.nc. Cycles for 2022092800 and before write the wrong analysis date to atmanl.nc.

@lgannoaa commented Sep 30, 2022

Effective CDATE=2022092806
This parallel has been updated with:

  1. GSI issue 748 - Incorrect analysis date in the calc_analysis.x atmanl.nc file
    The GSI package has been checked out at e05d6923
  2. FV3 - Updated to support upp_v8.2.0 file generation
    The fv3gfs.fd package has been checked out at ec31f35
  3. gfs_post.fd - Updated to upp_v8.2.0 with crtm 2.4.0
    The gfs_post.fd package has been checked out at cc4d3c2f

@lgannoaa commented Oct 3, 2022

As indicated in an email exchange between obsproc (@ilianagenkova) and NCO, this parallel will be updated to use the new obsproc packages:
obsproc v1.1.0, bufr-dump v1.1.0, and prepobs v1.0.1
/lfs/h2/emc/global/save/emc.global/git/obsproc/v1.1.0 (package location for obsproc v1.1.0)
/lfs/h2/emc/global/save/emc.global/git/prepobs/v1.0.1 (package location for prepobs v1.0.1)
Effective CDATE=2022100312
The ecflow workflow obsproc COMOUT will be located in /lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.1

A group decision has been made on the following changes to the ecflow workflow obsproc prep jobs:

Continue to point to the EMC dump archive /lfs/h2/emc/global/noscrub/emc.global/dump gdas and gfs (not the gdasx/gfsx) locations for this parallel.
Copy the {CYCLE}.nsstbufr from the EMC dump archive gdasx/gfsx location into the obsproc COMOUT to replace the output from the prep jobs. This results in bit-identical nsstbufr files:
lin.gan@ddecflow02:/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.1/gfs.20221003/12/atmos> cmp gfs.t12z.nsstbufr /lfs/h2/emc/global/noscrub/emc.global/dump/gfsx.20221003/12/atmos/gfs.t12z.nsstbufr
lin.gan@ddecflow02:/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.1/gdas.20221003/12/atmos> cmp gdas.t12z.nsstbufr /lfs/h2/emc/global/noscrub/emc.global/dump/gdasx.20221003/12/atmos/gdas.t12z.nsstbufr

Parallel is resumed.
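
A sketch of the copy-and-verify step described above; CDATE, RUN, and the directory layout follow the examples shown, while the variable names are illustrative:

```bash
# Hedged sketch: overwrite the prep-job nsstbufr with the one from the EMC
# dump archive gdasx/gfsx area, then confirm they are bit identical.
CDATE=2022100312
PDY=${CDATE:0:8}; cyc=${CDATE:8:2}
RUN=gdas                                   # or gfs
DUMPX=/lfs/h2/emc/global/noscrub/emc.global/dump/${RUN}x.${PDY}/${cyc}/atmos
COMOUT=/lfs/h2/emc/ptmp/lin.gan/rt1-v16-ecf/para/com/obsproc/v1.1/${RUN}.${PDY}/${cyc}/atmos

cp -p "${DUMPX}/${RUN}.t${cyc}z.nsstbufr" "${COMOUT}/${RUN}.t${cyc}z.nsstbufr"
cmp "${COMOUT}/${RUN}.t${cyc}z.nsstbufr" "${DUMPX}/${RUN}.t${cyc}z.nsstbufr" \
  && echo "nsstbufr is bit identical"
```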

@lgannoaa commented Oct 6, 2022

It was discovered that the nsstbufr file from the GDA gdasx/gfsx location was not present for CDATE=2022100600. This cycle will be rerun.

@emilyhcliu commented:

VIIRS radiances are missing from the real-time run because NESDIS discontinued the VIIRS brightness temperatures (BTs) without prior notification to users. NESDIS resumed providing the VIIRS BT product on October 5, 2022. This leaves us very little time to test and ensure the product's quality. Therefore, we decided to switch the VIIRS BT data to monitoring mode.

Based on the decision made above, we are working on the following three things:
(1) To turn VIIRS BTs off (monitoring mode), we modified global_satinfo.txt to set the use flags for VIIRS from 1 to -1 (a hedged sketch of this edit follows the notes below).
We will update the gfsda.v16.3.0 tag.
(2) For the EMC real-time parallel, VIIRS data were turned off from the 20221006 00z cycle.
(3) The EMC DA team, with help from @lgannoaa, will run a real-time parallel with VIIRS data turned on to assess the impact.

Notes:
(1) The NCO 30-day stability test should also run with VIIRS data off.
(2) Once we confirm that the NCO 30-day stability test is identical to the EMC real-time parallel, the EMC real-time parallel will turn VIIRS back on.
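
A hedged sketch of the satinfo edit in item (1); it assumes the usual global_satinfo.txt layout (sensor/sat in column 1, channel in column 2, iuse flag in column 3) and does not preserve the file's fixed-width alignment, so the real change should be made with the DA team's tooling or by hand:

```bash
# Hedged sketch: set the iuse flag to -1 (monitor) for all VIIRS entries in
# global_satinfo.txt. Assumes column 1 = sensor_sat and column 3 = iuse;
# awk will collapse the fixed-width spacing on the lines it changes.
awk '$1 ~ /^viirs/ { $3 = -1 } { print }' global_satinfo.txt > global_satinfo.txt.viirs_monitor
```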

@lgannoaa commented:

Effective CDATE=2022101112, this parallel points its obsproc directory to the NCO /lfs/h1/ops/para/com/obsproc location instead of the EMC dump archive.

@lgannoaa commented:

This parallel was paused at CDATE=2022101112 to help NCO warm start the 30-day parallel. It resumed and caught up to realtime after that. The wafs, bufr.t00z, gempak, and wmo output in COM was mirrored to Cactus for the code managers to review, for CDATE 2022101112 to 2022101406.

@lgannoaa commented Oct 21, 2022

Due to the NCO prepbufr file being missing, this parallel has been paused since CDATE=2022102100.
The file has been moved to prod space. The ecflow workflow will be modified to use this new location and the parallel will be resumed.

@lgannoaa commented Nov 1, 2022

Dogwood dbqs is currently suffering a system issue (errno=111). The realtime parallel is halted at CDATE=2022103118.
Update: resumed and caught up to realtime.

@lgannoaa commented Nov 2, 2022

The METplus gplot will be a bit late because the control stat file for 20221101 is missing. This delay is due to the parallel production test.

@lgannoaa commented Nov 28, 2022

The implementation is delayed until 11/30. The Dogwood white-space jobs will be stopped after the implementation is successful.

@lgannoaa commented:

This parallel is done, as the GFS implementation is taking place. The last full cycle completed is CDATE=2022113000.
