failure to restart-reproduce if using a restart from 15th of month #2588

Open
DeniseWorthen opened this issue Feb 3, 2025 · 24 comments · May be fixed by #2625
Labels
bug Something isn't working

Comments

@DeniseWorthen
Collaborator

DeniseWorthen commented Feb 3, 2025

Description

As part of debugging Issue #2562, I was passed a run directory for the SFS C192mx025 by @ShanSunNOAA. While working on that issue, I found I was not able to restart-reproduce if I used a restart file from the middle of the month (specifically on 2005-11-15-00).

I then set up a test case using a modified cpld_control_sfs test and the HR4 tag (fcc9f84). The modifications were to align w/ the run-directory I was debugging for C192-mx025 (no waves, atm-thread=2).

I ran that test case out long enough to capture restarts every 24h through to 2021-04-26-06. I found that I was able to reproduce using the restart at 04-14-06, but not at 04-15-06.

To enable easier debugging, I set up cpld_control_sfs cases using artificially advanced start times; that is, I set the start year/date to 04-13-06 and wrote restarts every 6 hours. I found I was able to restart-reproduce using the restart at 04-14-18 but not at 04-15-00.

I repeated the test using an executable built without -D32BIT=ON -DHYDRO=ON, and the run again failed to reproduce using a restart from 04-15-00.

Using mediator history files, I find that none of the fields imported from the ATM on restart are B4B (bit-for-bit) when using a restart from 04-15-00.

To Reproduce:

Currently all test cases reside in my own sandboxes on hera /scratch1/NCEPDEV/stmp2/Denise.Worthen/sfs.restart

Additional context

I am currently testing the develop branch using the control_c48, control_p8 and the cpld_control_p8 tests.

@DeniseWorthen DeniseWorthen added the bug Something isn't working label Feb 3, 2025
@DeniseWorthen
Collaborator Author

DeniseWorthen commented Feb 4, 2025

I've created a reproducer branch which reproduces this error in the control_p8 test.

https://github.com/DeniseWorthen/ufs-weather-model/tree/bugfix/d15restart

It can be run using ./rt.sh -ek -l rt.rst15 -a nems >output 2>&1 &. This will run a control and then three restart tests, one using the 041418 restarts, one using the 041500 and one using the 041506. It doesn't depend on creating a baseline first, but that means that the files need to be manually compared afterwards. For example:

nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format control_p8_intel/RESTART/20210416.000000.sfc_data.tile1.nc control_restart_p8_1418_intel/RESTART/20210416.000000.sfc_data.tile1.nc

will compare restarts from the control vs the 1418 runs.

And

nccmp -d -S -q -f -g -B --Attribute=checksum --warn=format control_p8_intel/RESTART/20210416.000000.sfc_data.tile1.nc control_restart_p8_1500_intel/RESTART/20210416.000000.sfc_data.tile1.nc

will compare the control and the 1500 runs. In this case, the nccmp result shows

Variable      Group Count          Sum      AbsSum          Min          Max       Range         Mean      StdDev
tsea          /      9193     -21.9127      358.67     -3.74133      3.96841     7.70974  -0.00238363    0.174327
sheleg        /        65    -0.188072    0.959371    -0.191116     0.183119    0.374235  -0.00289341   0.0386586
zorl          /      7314      3.22824      35.123    -0.509531      1.02241     1.53194  0.000441378   0.0325821
canopy        /       929     0.776428     35.4243    -0.525819      1.34609     1.87191  0.000835767   0.0960558
f10m          /      9216   -0.0735146     1.88722   -0.0115725    0.0101339   0.0217064 -7.97684e-06 0.000564475
t2m           /      9216     -28.2394     872.518     -2.74664      2.91128     5.65791  -0.00306417    0.199814
....
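
As an aid to reading the nccmp statistics above, here is a minimal Python sketch of how such columns can be computed. It assumes the statistics are taken over only the elements that differ, which is consistent with the table (Mean equals Sum divided by Count, e.g. -21.9127 / 9193 is about -0.00238363 for tsea). The function name `diff_stats` and the choice of a population standard deviation are assumptions for illustration, not nccmp's documented behavior, and the sample values are made up.

```python
import math

def diff_stats(a, b):
    """Summarize element-wise differences a[i] - b[i], counting only entries that differ."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    count = len(diffs)
    if count == 0:
        return {"Count": 0}
    total = sum(diffs)
    mean = total / count
    # Population variance (assumption; nccmp may use a different convention).
    var = sum((d - mean) ** 2 for d in diffs) / count
    return {
        "Count": count,
        "Sum": total,
        "AbsSum": sum(abs(d) for d in diffs),
        "Min": min(diffs),
        "Max": max(diffs),
        "Range": max(diffs) - min(diffs),
        "Mean": mean,
        "StdDev": math.sqrt(var),
    }

# Made-up values standing in for one variable from the control and restart runs.
control = [280.0, 281.5, 279.25, 283.0]
restart = [280.0, 281.0, 279.75, 283.0]
stats = diff_stats(control, restart)
print(stats)
```

A variable that restart-reproduces would not appear in the table at all, since its Count of differing elements is zero.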

@LarissaReames-NOAA
Collaborator

@yangfanglin Since @DeniseWorthen's tests suggest that this only happens with restarts on the 15th, which is the date climatology fields are read in, do you think this might be some bug related to climo file read logic?

@HelinWei-NOAA
Collaborator

Good catch. In the middle of the month the GVF (green vegetation fraction) is updated with a new value based on the monthly climatology. It is very likely that when you restart from the 15th of the month, the model bypasses that step.
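
To make the mid-month update concrete, here is a minimal sketch of time interpolation between monthly climatology records. It assumes each monthly record is valid on the 15th at 00Z and that values are linearly interpolated between the two bracketing months; both are assumptions for illustration, and `mid_month` and `bracket_months` are hypothetical names, not routines from sfcsub.

```python
from datetime import datetime

def mid_month(year, month):
    """Anchor time assumed for a monthly climatology record (15th, 00Z)."""
    return datetime(year, month, 15)

def bracket_months(t):
    """Return (month1, month2, w1, w2): bracketing records and linear weights at time t."""
    if t >= mid_month(t.year, t.month):
        y1, m1 = t.year, t.month
    else:
        y1, m1 = (t.year, t.month - 1) if t.month > 1 else (t.year - 1, 12)
    y2, m2 = (y1, m1 + 1) if m1 < 12 else (y1 + 1, 1)
    t1, t2 = mid_month(y1, m1), mid_month(y2, m2)
    w2 = (t - t1) / (t2 - t1)  # timedelta / timedelta gives a float
    return m1, m2, 1.0 - w2, w2

# The three restart times used in the reproducer.
for stamp in (datetime(2021, 4, 14, 18), datetime(2021, 4, 15, 0), datetime(2021, 4, 15, 6)):
    print(stamp, bracket_months(stamp))
```

Under these assumptions the bracketing pair changes from (March, April) to (April, May) exactly at 04-15-00, which is the first restart time that fails to reproduce.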


@HelinWei-NOAA
Collaborator

Just as a test, they should reproduce if you set wei1m to 1 in sfcsub.f (so that no fixed fields change on the 15th of the month).

@DeniseWorthen
Collaborator Author

Would you be able to do any debugging on this issue? The reproducer branch is all set up to use control_p8 and then run 3 different FHROT values.

@HelinWei-NOAA
Collaborator

For a warm start, the model assumes everything it needs is in the restart files and won't go through sfcsub.f.

@DeniseWorthen
Collaborator Author

These are all restarting using the checkpoint restarts written at a particular time. They are using 'warm_start=true'.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Feb 20, 2025

Update: I manually set weim1 in sfcsub.F90 to 1 and obtained the same result---the model restart reproduces using a restart from 04-14-180000 but does not restart reproduce using a restart from 04-15-000000. The idea here was to check if (contrary to expectations), the model does go into sfcsub for some field on the 15th.

@HelinWei-NOAA
Collaborator

HelinWei-NOAA commented Feb 20, 2025

@DeniseWorthen Thanks for testing. Can you try another test: turn off the call to sfcsub.f and see whether the restart from 4-15-000000 reproduces?

@DeniseWorthen
Collaborator Author

@HelinWei-NOAA Where exactly would I need to turn off the call to sfcsub?

@HelinWei-NOAA
Collaborator

@DeniseWorthen comment out "CALL SFCCYCLE" in gcycle.F90 (ufs-weather-model/FV3/ccpp/physics/physics/Interstitials/UFS_SCM_NEPTUNE)

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Feb 20, 2025

@HelinWei-NOAA I commented out sfccycle and I get the same results (using the 04-14-180000 restart reproduces but 04-15-000000 does not).

I'm currently writing ATM restarts every timestep (restart_interval: 0.2 -1). I can compare the first timestep restart (eg, 20210415.001200.sfc_data.tile2.nc ) from the control and restart runs. These are the 62 fields which are different:

2:73: RMS tsea                             4.3024E-03            NORMALIZED  1.4495E-05
4:82: RMS sheleg                           1.8904E-05            NORMALIZED  4.0170E-05
6:98: RMS zorl                             1.0658E-06            NORMALIZED  1.4759E-07
8:156: RMS canopy                           6.6674E-06            NORMALIZED  5.1257E-24
10:165: RMS f10m                             1.0055E-05            NORMALIZED  1.0129E-05
12:174: RMS t2m                              1.3629E-03            NORMALIZED  4.6528E-06
14:183: RMS q2m                              2.7574E-07            NORMALIZED  2.3419E-05
16:206: RMS uustar                           1.0346E-04            NORMALIZED  4.6622E-04
18:215: RMS ffmm                             2.8022E-03            NORMALIZED  3.0231E-04
20:224: RMS ffhh                             2.9986E-03            NORMALIZED  2.6153E-04
22:247: RMS tisfc                            4.3024E-03            NORMALIZED  1.4495E-05
24:256: RMS tprcp                            4.0255E-09            NORMALIZED  1.4058E-04
26:265: RMS srflag                           2.5196E-06            NORMALIZED  2.0670E-04
28:274: RMS snwdph                           2.8506E-04            NORMALIZED  7.3824E-05
30:318: RMS sncovr                           2.1527E-06            NORMALIZED  8.1830E-05
32:327: RMS snodl                            2.8510E-04            NORMALIZED  7.3247E-05
34:336: RMS weasdl                           1.8909E-05            NORMALIZED  3.9866E-05
36:345: RMS tsfc                             6.1090E-03            NORMALIZED  2.0815E-05
38:354: RMS tsfcl                            6.1710E-03            NORMALIZED  2.1036E-05
40:363: RMS zorlw                            5.5101E-09            NORMALIZED  3.3380E-45
42:372: RMS zorll                            6.3398E-07            NORMALIZED  1.0091E-43
44:388: RMS albdirvis_lnd                    5.0315E-07            NORMALIZED  2.4398E-06
46:397: RMS albdirnir_lnd                    3.1518E-07            NORMALIZED  1.1648E-06
48:406: RMS albdifvis_lnd                    6.4593E-07            NORMALIZED  3.1296E-06
50:415: RMS albdifnir_lnd                    3.4762E-07            NORMALIZED  1.3184E-06
52:424: RMS emis_lnd                         1.4664E-08            NORMALIZED  1.5270E-08
54:468: RMS z_c                              2.9596E-08            NORMALIZED  3.3165E-05
56:477: RMS c_0                              1.0355E-06            NORMALIZED  2.0057E-05
58:486: RMS c_d                              5.7352E-05            NORMALIZED  1.0066E-06
60:495: RMS w_0                              5.3371E-05            NORMALIZED  2.8772E-02
62:504: RMS w_d                              1.6080E-04            NORMALIZED  2.5746E-02
64:513: RMS xt                               2.8358E-06            NORMALIZED  3.5464E-03
66:522: RMS xs                               2.1546E-10            NORMALIZED  5.3957E-06
68:531: RMS xu                               1.3901E-12            NORMALIZED  2.5423E-09
70:540: RMS xv                               1.3175E-12            NORMALIZED  3.9614E-09
72:549: RMS xz                               2.5758E-04            NORMALIZED  1.3058E-05
74:565: RMS xtts                             7.5468E-06            NORMALIZED  5.1105E-03
76:574: RMS xzts                             1.6297E-02            NORMALIZED  4.9958E-02
78:597: RMS dt_cool                          8.2583E-05            NORMALIZED  1.9509E-04
80:606: RMS qrain                            3.3114E-06            NORMALIZED  5.2082E-06
82:622: RMS tvxy                             5.9277E-03            NORMALIZED  9.4057E-24
84:631: RMS tgxy                             5.8983E-03            NORMALIZED  9.3687E-24
86:640: RMS canicexy                         4.7976E-06            NORMALIZED  7.6125E-27
88:649: RMS canliqxy                         7.2036E-06            NORMALIZED  1.1430E-26
90:658: RMS eahxy                            9.9980E-02            NORMALIZED  1.5864E-22
92:667: RMS tahxy                            4.6986E-03            NORMALIZED  7.4553E-24
94:676: RMS cmxy                             1.6188E-05            NORMALIZED  2.5712E-26
96:685: RMS chxy                             1.0444E-05            NORMALIZED  1.6588E-26
98:694: RMS fwetxy                           1.9677E-05            NORMALIZED  3.1222E-26
100:703: RMS sneqvoxy                         1.8875E-05            NORMALIZED  2.9980E-26
102:719: RMS qsnowxy                          8.1609E-10            NORMALIZED  1.2962E-30
104:735: RMS zwtxy                            1.2505E-11            NORMALIZED  1.9841E-32
106:744: RMS waxy                             2.5009E-09            NORMALIZED  3.9683E-30
108:753: RMS wtxy                             2.5009E-09            NORMALIZED  3.9683E-30
110:818: RMS taussxy                          7.7821E-08            NORMALIZED  1.2361E-28
112:855: RMS stc                              1.6835E-04            NORMALIZED  5.6971E-07
114:864: RMS smc                              9.7652E-08            NORMALIZED  1.3723E-07
116:873: RMS slc                              5.1218E-07            NORMALIZED  7.2423E-07
118:882: RMS snicexy                          5.0871E-06            NORMALIZED  8.0801E-27
120:891: RMS snliqxy                          4.8214E-06            NORMALIZED  7.6581E-27
122:900: RMS tsnoxy                           5.2374E-04            NORMALIZED  8.3189E-25
124:916: RMS zsnsoxy                          2.3362E-07            NORMALIZED  3.7108E-28
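
The first-timestep file name mentioned above follows from the restart interval: 0.2 h after 2021-04-15 00Z is 00:12:00. A small sketch, assuming the `YYYYMMDD.HHMMSS.sfc_data.tileN.nc` pattern seen in the file names in this thread; the helper name `restart_name` is hypothetical.

```python
from datetime import datetime, timedelta

def restart_name(start, hours, tile, kind="sfc_data"):
    """Build a restart file name for the time `hours` after `start`."""
    valid = start + timedelta(hours=hours)
    return f"{valid:%Y%m%d.%H%M%S}.{kind}.tile{tile}.nc"

# First restart written after a restart from 04-15-00 with a 0.2 h interval.
print(restart_name(datetime(2021, 4, 15, 0), 0.2, tile=2))
```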

Can I run w/o any LSM ?

@HelinWei-NOAA
Collaborator

@DeniseWorthen I compared input.nml (the namelist file) between the control and restart runs. There are some differences that are likely not related to cold/warm start. Do you know why?
< make_nh = .true.
---
> make_nh = .false.
59c59
< na_init = 1
---
> na_init = 0
71c71
< external_ic = .true.
---
> external_ic = .false.
74,75c74,75
< nggps_ic = .true.
< mountain = .false.
---
> nggps_ic = .false.
> mountain = .true.
91c91
< warm_start = .false.
---
> warm_start = .true.
205c205
< nstf_name = 2,1,0,0,0
---
> nstf_name = 2,0,0,0,0
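
A comparison like this can also be done key by key. The following is a minimal sketch that assumes simple `key = value` lines; it is not a full Fortran namelist parser. The function names `parse_settings` and `changed_keys` and the `example_unchanged` key are hypothetical; the other values are taken from the diff above.

```python
def parse_settings(text):
    """Parse simple 'key = value' lines; other lines (headers, blanks) are skipped."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def changed_keys(control_text, restart_text):
    """Map each key whose value differs to its (control, restart) pair."""
    a, b = parse_settings(control_text), parse_settings(restart_text)
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

control = """
warm_start = .false.
na_init = 1
nstf_name = 2,1,0,0,0
example_unchanged = 1
"""
restart = """
warm_start = .true.
na_init = 0
nstf_name = 2,0,0,0,0
example_unchanged = 1
"""
for key, (old, new) in changed_keys(control, restart).items():
    print(f"{key}: {old} -> {new}")
```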

@DeniseWorthen
Collaborator Author

@HelinWei-NOAA These are the settings required for ATM to be a restart vs a 'cold start'. They are used in both coupled and standalone configurations.

export WARM_START=.true.
export NGGPS_IC=.false.
export EXTERNAL_IC=.false.
export MAKE_NH=.false.
export MOUNTAIN=.true.
export NA_INIT=0

@DeniseWorthen
Collaborator Author

@HelinWei-NOAA In case you don't know...the restart tests in the RT system utilize checkpoint restarts. This means that the control run will write restarts at a specified interval and then continue to the end of the fhmax. The restart test uses those checkpoint restarts to restart and also run forward to fhmax.

The results of the control and restart tests are compared; they must be identical if the model "restart reproduced". That is, all the restarts produced by the control run must be B4B w/ the restarts produced by the restart run.
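
The checkpoint-restart comparison described above can be sketched with a toy model. This is only an illustration of the test logic, not the RT system: `step`, `run`, and `digest` are hypothetical names, the "model" is a one-line recurrence, and hashing the bit pattern of a float stands in for comparing restart files.

```python
import hashlib
import struct

def step(state, n):
    """One deterministic toy timestep; stands in for the model."""
    return 0.5 * state + 0.1 * n

def run(state, first, last, checkpoint_every):
    """Advance from step `first` to `last`, saving checkpoints keyed by step number."""
    checkpoints = {}
    for n in range(first + 1, last + 1):
        state = step(state, n)
        if n % checkpoint_every == 0:
            checkpoints[n] = state
    return state, checkpoints

def digest(value):
    """Hash the exact bit pattern of a float, so only B4B-identical values match."""
    return hashlib.sha256(struct.pack("<d", value)).hexdigest()

# Control: run straight through, writing checkpoints along the way.
final_control, ckpts = run(1.0, 0, 12, checkpoint_every=4)
# Restart test: start from the step-4 checkpoint and run to the same end.
final_restart, ckpts_restart = run(ckpts[4], 4, 12, checkpoint_every=4)

reproduces = digest(final_control) == digest(final_restart) and all(
    digest(ckpts[n]) == digest(ckpts_restart[n]) for n in ckpts_restart
)
print("restart reproduces:", reproduces)
```

The toy reproduces because the checkpoint holds all of the state. A real model fails this check whenever something that affects the result is not restored from the checkpoint in the same way the continuous run holds it.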

@HelinWei-NOAA
Collaborator

@DeniseWorthen Thanks for the explanation. It is weird to me that both restarts after 1500 can't reproduce either.

@DeniseWorthen
Collaborator Author

DeniseWorthen commented Feb 25, 2025

@HelinWei-NOAA I've been able to get restart repro at day 15, hour 00 by switching to iaer = 5111.

@HelinWei-NOAA
Collaborator

@DeniseWorthen Great work! It looks like the model reads some aerosol data in the middle of the month if we turn on MERRA2 (iaer=1011).

@yangfanglin
Collaborator

@AnningCheng-NOAA Anning, could you please help check why the model is not reproducing if iaer=1011 (merra2 clima) ?

@AnningCheng-NOAA
Contributor

Hi, Fanglin, I will take a look.

@AnningCheng-NOAA
Contributor

when I issue
git clone --recursive https://github.com/DeniseWorthen/ufs-weather-model/tree/bugfix/d15restart
Cloning into 'd15restart'...
fatal: repository 'https://github.com/DeniseWorthen/ufs-weather-model/tree/bugfix/d15restart/' not found

which branch or tag is used to repeat the restart issue now? I need to repeat the error for debugging
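
The clone above fails because the address is a GitHub web URL for a branch view (`.../tree/<branch>`), not a repository URL. As a small sketch, the helper below (a hypothetical name, `split_tree_url`) splits such a URL into the repository URL and the branch, and prints a `git clone` command that selects the branch with `--branch`.

```python
def split_tree_url(url):
    """Split 'https://github.com/OWNER/REPO/tree/BRANCH' into (repo_url, branch)."""
    repo_url, sep, branch = url.rstrip("/").partition("/tree/")
    if not sep or not branch:
        raise ValueError("not a GitHub branch (tree) URL: " + url)
    return repo_url, branch

repo, branch = split_tree_url(
    "https://github.com/DeniseWorthen/ufs-weather-model/tree/bugfix/d15restart"
)
print(f"git clone --recursive --branch {branch} {repo}")
```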

@AnningCheng-NOAA
Contributor

It looks like I do not have the permission to access the branch in hera

@AnningCheng-NOAA
Contributor

never mind, I have just cloned the branch and will post what I have found soon.

@AnningCheng-NOAA
Contributor

The issue has been fixed by changing a few lines in the aerosol interpolation code so that the continuous run reads the forcing file the same way as the restart run. Formerly, the continuous run did not read the previous record.
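
For readers unfamiliar with this class of bug, here is a toy sketch of one hypothetical way a continuous run and a restart run can end up holding different record pairs at a mid-month switch. It is not the actual aerosol interpolation code (the real change is in the PRs listed in this comment); all names and values are made up for illustration.

```python
FORCING = {3: 10.0, 4: 20.0, 5: 40.0}  # made-up monthly records keyed by month

def read_record(month):
    """Stand-in for reading one monthly record from the forcing file."""
    return FORCING[month]

def restart_pair(month):
    """Restart path: read both bracketing records from the file."""
    return read_record(month), read_record(month + 1)

def continuous_pair(old_pair, month, reread_previous):
    """Continuous path at a mid-month switch into `month`.

    reread_previous=False mimics the hypothetical bug: only the new second
    record is read, and the first slot keeps whatever was already in memory.
    """
    first = read_record(month) if reread_previous else old_pair[0]
    return first, read_record(month + 1)

def interpolate(pair, w2):
    """Linear interpolation between the two records."""
    return (1.0 - w2) * pair[0] + w2 * pair[1]

before_switch = restart_pair(3)  # (March, April) held in memory before the 15th
buggy = continuous_pair(before_switch, 4, reread_previous=False)
fixed = continuous_pair(before_switch, 4, reread_previous=True)
print(interpolate(buggy, 0.0), interpolate(fixed, 0.0), interpolate(restart_pair(4), 0.0))
```

In this toy, only the variant that re-reads the previous record matches what a restart from the 15th would load, so only that variant can restart-reproduce.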

My test code is located at /scratch1/NCEPDEV/global/Anning.Cheng/tmp/ufs-weather-model
RT results: /scratch1/NCEPDEV/stmp2/Anning.Cheng/FV3_RT/rt_3648516

PRs have been created to merge my changes to trunk:
#2625
NOAA-EMC/fv3atm#933
NCAR/ccpp-physics#1115

5 participants