Much slower timings in model init with the latest ufs-weather-model #801
Today Ben Blake checked out the latest develop branch and ran a short test on WCOSS Dell Phase 3.5 with the 3 km RRFS LAM domain over North America, cold-started from the GFS analysis. He saw an increase in the model initialization time of almost 8 minutes compared to the current parallel LAM run:
LAM parallel: in fcst,init total time: 74.2139439582825 (#ddcd809, checked out 7/30/21)
My test: in fcst,init total time: 524.866119146347 (#e198256, checked out today)
Ben's test run did not run to completion, so no termination times are available. Bin Liu noted similar behavior in HAFS.
Comments
Hi Eric, a lot of development happened between the two hashes you posted. One way to narrow this down is to use the good old bisect mode. Do you think you have time to do that? I went to https://github.com/ufs-community/ufs-weather-model/commits/develop and searched for ddcd809; everything that was merged since then is above it. |
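A manual bisect over that range might look like the following shell sketch; the good/bad hashes are the ones from the issue description, and the short-test command stands in for whatever LAM init test you already run:

    # mark the endpoints: ddcd809 was fast, e198256 is slow
    git bisect start
    git bisect bad e198256
    git bisect good ddcd809
    # at each step git checks out a midpoint commit; submodules must be
    # re-synced to that commit before building and running the short test
    git submodule update --init --recursive
    # ...build, run the init test, then tell git the result:
    git bisect good    # or: git bisect bad
    # once the first bad commit is identified:
    git bisect reset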
I take it you mean check out a version, run a short test, and see when the slowdown starts? I don't have a lot of time lately because I'm working on the WCOSS2 conversion effort, but I'll try to clear some time for this. |
Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain
Why am I getting these compile aborts? The error points to /gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/ufs-weather-model_aug23/FV3/atmos_cubed_sphere/tools/fv_eta.F90(49). I'm doing this: git clone --recursive https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir then to compile: set -x |
@EricRogers-NOAA After checking out the exact commit (hash) you want to build, you must update the submodules. Otherwise you'll be using the submodules from the initial clone, i.e. the submodules used by the current develop branch, not the ones belonging to the hash you want. So clone without --recursive: git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir then check out the commit, update the submodules, and build. |
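A minimal sketch of that sequence (the commit hash placeholder stands for whichever commit you want to test, e.g. the Aug 23 one):

    git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
    cd ufs-weather-model_mydir
    git checkout <commit-hash>                 # the commit you want to build
    git submodule update --init --recursive    # sync submodules to that commit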
@DusanJovic-NOAA thank you very much. I always forget that submodule step. I was able to check out the Aug 23 commit now. I'll be sending out an updated list of init timings later. |
New timing tests, with the correct checkout of earlier commits: Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain, all warm starts from LAMDA IC
The 8/25/2021 commit #762 is the cause of the slower init time. |
Thanks, Dusan. @ericaligo-NOAA Thanks for identifying the PR that causes the slowness of the model initialization step. @bensonr @mlee03 #762 is the FMS lib update to 2021.03. Would you please take a look at what code updates in FMS might cause the slowness? Thanks. |
@junwang-noaa - a similar issue has been brought directly to my attention by the HAFS regional team. I know the reason and am trying to verify the resolution will alleviate the situation. |
Latest UFS code put into LAM parallels (#805421d) on 11/30/2021. Init time for the RRFS domain went from ~140 sec to almost 30 minutes: err: in fcst,init total time: 1970.07599091530 |
@junwang-noaa @arunchawla-NOAA Do we have any updates on this? When testing the model on Cray TO4 (Luna/Surge), the model takes over 3000 s to initialize. |
@junwang-noaa @arunchawla-NOAA @JacobCarley-NOAA - please try your tests with this version of fv3atm. Make sure to use io_layout=1,1 to test the initialization performance. This branch also contains a fix for the restart checksum issue you've been wanting removed. The option is controlled separately for the dycore and the physics: in the dycore one adds fv_core_nml::ignore_rst_cksum=.true., and for the physics, atmos_model_nml::ignore_rst_cksum=.true. for use within FV3GFS_io.F90. If you don't want the option to live in atmos_model.F90 but rather in FV3GFS_io.F90 itself, feel free to reimplement as you see fit. Once you are satisfied with the results of your testing, please merge the changes into your own branches and add the appropriate PRs. |
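For reference, a sketch of what those two switches look like in input.nml; the namelist and variable names are the ones named in the comment above, and in a real file the settings go inside the existing records rather than appended duplicates:

    # illustrative only -- edit the existing &fv_core_nml and &atmos_model_nml
    # records in input.nml rather than appending new ones
    cat << 'EOF'
    &fv_core_nml
      ignore_rst_cksum = .true.   ! dycore restarts
    /
    &atmos_model_nml
      ignore_rst_cksum = .true.   ! physics restarts read via FV3GFS_io.F90
    /
    EOF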
@bensonr Thank you very much for making the code changes. |
How would I check this out and compile it w.r.t. the full model (https://github.com/ufs-community/ufs-weather-model)? I've always just cloned https://github.com/ufs-community/ufs-weather-model (and maybe checked out a feature branch) and have no experience dealing with a different version of fv3atm or another submodule. Thanks for your assistance. |
@EricRogers-NOAA I am creating a ufs-weather-model branch from the latest develop branch using Rusty's fv3atm; we can use it for testing. I will let you know when I am done. |
Or simply … |
Thanks, Rusty, that will work too. Anyway, @EricRogers-NOAA @BinLiu-NOAA Here is the branch for testing: https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io |
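To pick up that branch for testing, something like the following should work (branch name and URL from the comment above; the target directory name is arbitrary, and the submodule step from earlier in the thread still applies):

    git clone -b checksum_io https://github.com/junwang-noaa/ufs-weather-model ufs-wm-checksum_io
    cd ufs-wm-checksum_io
    git submodule update --init --recursive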
Thanks a lot, @bensonr @junwang-noaa! We will test from the HAFS side and report back on how this new branch performs for the model forecast init phase when using an io_layout of 1,1 for both cold-start and warm-start scenarios. We will also test the capability of skipping the checksum step. Thanks! |
@junwang-noaa my compile failed on WCOSS Dell: CMake Error at FV3/CMakeLists.txt:21 (message): I had been using this to compile the code; I take it there have been changes: #!/bin/bash |
I ran the RT tests on Orion and they work. Let me check Dell. |
@EricRogers-NOAA The code compiled on Dell. Here is what I did: [Jun.Wang@v71a1 ufs-weather-model]$ pwd |
@junwang-noaa I tried compiling your branch using the commands you listed above, but I got the same error as Eric. I looked in my build/FV3/ccpp/ccpp_prebuild.err file and saw the following message at the end: KeyError: 'rrtmg_sw_pre'. The FV3_GFS_v15_thompson_mynn_lam3km suite file we were using did contain rrtmg_sw_pre, but I see it was replaced by rad_sw_pre in the repository. The XML file we are using is slightly different because it uses the unified GWD scheme. After making that change, the code compiled for me. @EricRogers-NOAA give that a try and see if it works for you (I used your original compile.sh). |
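A hypothetical one-liner for that rename, assuming a local suite file named after the suite in the comment (keep the backup, since scheme lists differ between suite files):

    # swap the renamed radiation-prep scheme in a local suite definition file
    sed -i.bak 's/rrtmg_sw_pre/rad_sw_pre/g' suite_FV3_GFS_v15_thompson_mynn_lam3km.xml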
I've got the new code running on WCOSS Dell P3; I saw this print: Computing rain collecting graupel table took 226.539 seconds. Are there new tables we need to read in that would eliminate the above computations and reduce run time? The run subsequently aborted a few minutes after the above print. |
One of the ESMF debug prints had this in the failed run of the new code: 20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:2101 Invalid argument - inst_tracer_diag_aod is not a StandardName in the NUOPC_FieldDictionary! |
@EricRogers-NOAA Please update the fd_nems.yaml file in the run directory from the latest develop branch: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/fd_nems.yaml |
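A sketch of that update, assuming $RUNDIR points at your model run directory and ufs-weather-model is a local clone of the develop branch:

    # refresh the NUOPC field dictionary in the run directory
    cp ufs-weather-model/tests/parm/fd_nems.yaml "$RUNDIR/"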
I'm not using the latest code, so I don't know if new tables have been added. Ruiyu, do you know if the latest ufs-weather-model code requires new tables to be read in by the Thompson scheme? |
Ruiyu, where are these precomputed tables? I also saw them being created while running. |
@ericaligo-NOAA I am not aware of any new tables being required. If the tables are being created at the initial time, that probably means the existing table files were not copied to the run directory. |
You can use the tables created in your current/previous experiment for your future experiments. I didn't find them in the current ufs-weather-model. They need to be added @yangfanglin. |
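A sketch of that reuse, under the assumption the comments above imply, namely that the scheme writes its lookup tables as files in the run directory. The .dat glob and the $OLD_RUNDIR / $RUNDIR variables are illustrative, not confirmed names:

    # copy precomputed Thompson microphysics lookup tables from a previous run
    # so the model reads them instead of recomputing them at init
    cp "$OLD_RUNDIR"/*.dat "$RUNDIR/"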
@EricRogers-NOAA So the only change you made is the io_layout change? Also, I am curious: did you specify the chunksizes for the restart files? I think you can still use io_layout=1,15 if that helps with the total run time. My understanding is that using more I/O tasks in io_layout will speed up reading the restart files. |
@junwang-noaa In the current code, setting io_layout=1,X (I used 15) dramatically sped up reading the restart files; with io_layout=1,1 the read times were much larger (up to one hour). But with Rusty's I/O changes these times are now very fast even with io_layout=1,1: regional_forecast_tm03_12.log: fcst_initialize total time: 42.0637090206146. I did not specify chunksizes for the restart files; is this doable now with the parameter ncchksz in fms2_io_nml? I was unaware of this option until recently, when it came up in another email thread. |
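For context, the layout switch under discussion lives in the dycore namelist; a minimal illustration using the values from the runs described above:

    # illustrative input.nml fragment: io_layout controls how many pieces each
    # restart/history file is split into on a tile
    cat << 'EOF'
    &fv_core_nml
      io_layout = 1,15    ! 15 files in y; 1,1 writes a single file per tile
    /
    EOF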
Is there an example handy of how to set chunksizes for restart files? Is this done in the fms2_io_nml part of the input.nml file, or in the code itself? |
@EricRogers-NOAA I checked with Rusty. You can set ncchksz in fms2_io_nml for the "64BIT" format. I was using the default fms2_io_nml settings in my global C384 test (/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_199573/control_c384gdas_chksum/RESTART), and I do not see a chunksize specified as a netCDF attribute in the restart files. Are the files with chunksizes created from the model run in your testing? Are you writing out the "NC-4" format in fms2_io_nml? Can I take a look at your run directory? Thanks |
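Answering the "example handy" question above, a sketch of the namelist block in question. The ncchksz value is an illustrative example, not a recommendation; per Rusty's note below it is passed to the netCDF open call and applies to the "64bit" format:

    # illustrative fms2_io_nml fragment for input.nml
    cat << 'EOF'
    &fms2_io_nml
      netcdf_default_format = "64bit"
      ncchksz = 4194304    ! example I/O buffer-size hint in bytes
    /
    EOF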
@junwang-noaa I have netcdf_default_format = "netcdf4" in fms2_io_nml. I have two working directories for you to look at; the ncdump "// global attributes:" listings are omitted here. We combine the restart pieces with $EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc; the "-64" writes the combined files in 64-bit-offset format, and you see no chunksizes when you run "ncdump -s -h fv_core.res.tile1.nc". With the netCDF-4 default chunksize settings, the GSI code is about 30% slower because the parallel reads are not efficient. We found that if these chunksizes were reset to the grid dimensions using nccopy: nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc then the GSI run times were faster, on par with what we see now with 64-bit-offset restart files. Of course, running the above nccopy command adds extra run time/overhead, so it would be helpful if these restart-file chunk sizes could be set in the model itself. |
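Taken together, the combine/inspect/rechunk sequence Eric describes could be scripted roughly like this; the commands, the $EXECfv3 variable, and the dimension list come from the comments above, so treat it as a sketch rather than a tested recipe:

    # combine io_layout=1,15 restart pieces into one 64-bit-offset file
    $EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc
    # inspect the storage layout; 64-bit-offset files show no _ChunkSizes attributes
    ncdump -s -h fv_core.res.tile1.nc
    # for netCDF-4 files, rechunk to the full grid dimensions before the GSI reads them
    nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 \
      fv_core.res.tile1.nc fv_core.res.tile1_new.nc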
@EricRogers-NOAA @junwang-noaa First, the axis values are not chunksizes; they are the variable dimensions, as you can see in the variable definition float delp(Time, zaxis_1, yaxis_2, xaxis_1). I am very surprised the io_layout=1,15 restart files have chunked data in them. According to the NetCDF documentation, the ncchksz parameter, which is used in the nf90_open function call, is ignored when creating NC-4/HDF5 files. Instead, the chunking is defined at the variable-definition layer using nf90_def_var. According to the NetCDF documentation, the default data layout is contiguous unless certain arguments are provided, and they are not provided in fms2_io (see here and here). Don't be confused by the presence of a checksum attribute, as that is created and written by FMS and is not related to the fletcher32 argument to nf90_def_var. I suggest consulting with Ed Hartnett, if he is still available to EMC, as to why a variable definition is being chunked when the documentation indicates it should not be. |
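One way to verify what Rusty describes, assuming ncdump is available, is to inspect the special storage attributes that ncdump -s reports for NC-4/HDF5 files:

    # _Storage reports "contiguous" or "chunked"; _ChunkSizes lists the chunk shape
    ncdump -s -h fv_tracer.res.tile1.nc | grep -E '_Storage|_ChunkSizes'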
@bensonr Thanks, I will reach out to Ed. I wasn't quite clear above about that nccopy command: when I ran the command as-is, the chunksizes in fv_tracer.res.tile1.nc changed from the defaults to the grid-dimension values (listings elided), and the GSI code reading this file was about 30% faster. We did not test whether these new chunksizes were the most optimal, however. |
@EricRogers-NOAA May I ask if there is a specific reason to use the "netcdf4" netcdf_default_format? What about using the default "64bit" option for the restart files, as you do in the mppnccombine command? I don't see the chunksizes in the restart files when using the default "64bit" format; maybe it will speed up the process (no need for nccopy, and fast reading in the GSI)? |
@junwang-noaa I will revisit this with Ting Lei, but as best as I can gather from going through old emails, the parallel I/O in the GSI supports netcdf4, not netcdf3, which is why I went down the netcdf4 rabbit hole. But that begs the question: is the GSI faster when reading default 64bit restart files or netcdf4 restart files with inefficient chunksizes? I'll run some tests to answer that question. |
@junwang-noaa When I set netcdf_default_format = "64bit" in the 3 km N. American RRFS domain forecast run, I got this error: FATAL from PE 0: NetCDF: One or more variable sizes violate format constraints: set_netcdf_mode (presumably because the classic 64-bit-offset format caps each fixed-size variable at roughly 4 GiB, which a 3951x2701x65 field can exceed). This brings back memories of what happened last year when I first started running this huge domain (3951x2701) in a DA run with the model writing out restart files. Someone (I believe it was Rusty) recommended adding the fms2_io_nml block with netcdf_default_format = "netcdf4". |
@EricRogers-NOAA Thanks for the explanation. So for RRFS, we have to use the "netcdf4" format. In that case, can we still use io_layout=1,15 as before, or will that slow down RRFS runs? My understanding is that until the netcdf chunksize issue is fixed, RRFS will not get a speedup from the fixes Rusty provided, but other applications such as GFS/HAFS/UFSAQM will speed up by using io_layout=1,1 with the default "64bit" netcdf format. I think we need to open a separate issue on the chunksizes used in the netcdf4 format and ask Ed to take a look. Meanwhile, we can work on committing Rusty's fixes back to the commit queue as long as they do not slow down RRFS. @JianpingHuang-NOAA is waiting for the code changes to run UFSAQM on wcoss2. Please let me know if this is working for you. Thanks |
@junwang-noaa Yes, RRFS will need to use the netcdf4 format for restart files; we can just continue to set io_layout=1,15 and then recombine the restart-file pieces using the mppnccombine utility with the -64 option, so the input restart files to the GSI are in 64bit format. Earlier I stated: "We found that if these chunksizes were reset to the grid dimensions using nccopy: nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc The GSI run times were faster, and on par with what we see now with 64-bit offset restart files." What I discovered is that this nccopy command from netcdf 4.5.0 (on WCOSS1 Dell) worked as described, but with netcdf/4.7.4 loaded from hpc-stack, the same nccopy command did not change the chunksizes of the 2-d and 3-d variables in the restart files. |
@EricRogers-NOAA Thanks for the information. I think this is a netcdf issue we may need to ask a netcdf expert to take a look at. Would you please create an issue about the default chunksizes with contiguous fields? Once that issue is resolved, maybe we won't need the nccopy step any more. |
Should UFS-AQM use the "netcdf64" format and set io_layout=1,15 to speed up the restart-file writing? @junwang-noaa @EricRogers-NOAA |
@JianpingHuang-NOAA Yes, you can try it to see if it speeds up your run. |
@bensonr Would you please sync your fv3atm branch https://github.com/bensonr/fv3atm/tree/emc_io_fixes and create a PR at the fv3atm repo? If possible, please also create a ufs-weather-model PR. Please let me know if you want me to create the ufs-weather-model PR. Thanks. |
@junwang-noaa - you may recall there was an update needed to properly ignore the checksum. I believe you want to use the fork/branch from @BinLiu-NOAA. |
@junwang-noaa I created a new issue : https://github.com/ufs-community/ufs-weather-model/issues/1270 |
@bensonr The updates from @BinLiu-NOAA are in driver/fvGFS/atmosphere.F90. I added the following changes on top of your dycore branch: https://github.com/bensonr/GFDL_atmos_cubed_sphere/tree/emc_io_fixes
Would you please add those changes to your dycore branch emc_io_fixes? Thank you! |
@EricRogers-NOAA Thanks for creating the issue. Let's see if Ed can help take a look. |
@bensonr I pulled your fv3atm changes into a branch checksum_io and synced it with the latest fv3atm update. Would you please create a PR to the GFDL dycore dev/emc branch? I will create corresponding PRs on fv3atm and ufs-weather-model after that. My understanding is that we will have one dycore dev/emc PR (NOAA-GFDL/GFDL_atmos_cubed_sphere#194) ahead of this one, so we need you to sync with dev/emc after PR 194 is committed. I will keep syncing the fv3/ufs wm branches until the commit time. @BinLiu-NOAA @JianpingHuang-NOAA @EricRogers-NOAA The ufs-weather-model branch https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io now points to Rusty's updated dycore branch and is synced to the top of the ufs wm develop branch; please let me know if you have any issues. Thanks |
@junwang-noaa - FV3 PR #197 was created yesterday and you were assigned as a reviewer. |
PR #1275 was committed. Issue #1270 was created for further discussion of the default chunksizes in NetCDF4 files in the RRFS application. This issue will be closed. |
The RTs were also run on Gaea before committing things and asking for functional tests. |
GFDL plans to implement a fix in FV3 to resolve the issue. I will let you know when the timeline is available. |
* "point to the MYNN PBL update for RRFS.v1" * "point to the updates of smoke and fv3atm PR #801" * "point to the updates to allocate rho_dry to zero size when not in use" * "point to hash of NOAA-EMC/fv3atm@f9a1759" --------- Co-authored-by: JONG KIM <jong.kim@noaa.gov> Co-authored-by: matthew pyle <Matthew.Pyle@noaa.gov>