
Much slower timings in model init with the latest ufs-weather-model #801

Closed
EricRogers-NOAA opened this issue Sep 14, 2021 · 75 comments

@EricRogers-NOAA

Today Ben Blake checked out the latest develop branch and ran a short test on WCOSS Dell Phase 3.5 with the 3 km RRFS LAM domain over North America, cold-started from the GFS analysis. He saw an increase in the model initialization time of almost 8 minutes compared to the current parallel LAM run:

LAM parallel: in fcst,init total time: 74.2139439582825 (#ddcd809, checked out 7/30/21)
My test: in fcst,init total time: 524.866119146347 (#e198256, checked out today)

Ben's test run did not run to completion, so no termination times are available. Bin Liu noted similar behavior in HAFS.

@EricRogers-NOAA EricRogers-NOAA added the bug Something isn't working label Sep 14, 2021
@climbfuji
Collaborator

Hi Eric, a lot of development happened between the two hashes you posted. One way to narrow this down is to use the good old bisect mode. Do you think you have time to do that? I went to https://github.com/ufs-community/ufs-weather-model/commits/develop and searched for ddcd809; everything that got merged since then is above it.

@EricRogers-NOAA
Author

I take it you mean check out a version, run a short test and see when the slowdown starts? I don't have a lot of time lately because I'm working on the WCOSS2 conversion effort, but I'll try to clear some time for this.

@EricRogers-NOAA
Author

EricRogers-NOAA commented Sep 16, 2021

Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain

  1. Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
  2. #4a2a127 (9/7/2021): in fv3_cap, init time= 410.151222944260
  3. #01d70f4 (8/30/2021): in fv3_cap, init time= 422.296640157700
  4. #3f3c253 (8/25/2021): in fv3_cap, init time= 420.641896009445
  5. #b26a896 (8/23/2021): compile aborted:

/gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/ufs-weather-model_aug23/FV3/atmos_cubed_sphere/tools/fv_eta.F90(49)
: error #6580: Name in only-list does not exist or is not accessible. [ASCII_READ]
use fms2_io_mod, only: ascii_read
---------------------------^

  6. #f7cfebf (8/18/2021): compile aborted, same as above
  7. #2258171 (8/13/2021): compile aborted, same as above

Why am I getting these compile aborts? I'm doing this:

git clone --recursive https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
cd ufs-weather-model_mydir
git checkout (hash)

then to compile:

set -x
. /usrx/local/prod/lmod/lmod/init/sh
module purge
module use modulefiles
module load ufs_wcoss_dell_p3
export CMAKE_PLATFORM=wcoss_dell_p3
export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
export BUILD_VERBOSE=1
./build.sh

@DusanJovic-NOAA
Collaborator

@EricRogers-NOAA After checking out the exact commit (hash) you want to build, you must update the submodules. Otherwise you'll be using the submodules from the initial clone, which are the submodules used by the current develop branch, not the submodules from the hash you want. So clone like this (note: no --recursive in git clone):

git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
cd ufs-weather-model_mydir
git checkout (hash)
git submodule update --init --recursive

and then build.

@EricRogers-NOAA
Author

@DusanJovic-NOAA thank you very much. I always forget that submodule step. I was able to check out the Aug 23 commit now. I'll be sending out an updated list of init timings later.

@EricRogers-NOAA
Author

New timing tests, with the correct checkout of earlier commits:

Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain, all warm starts from LAMDA IC

  1. Control run, #ddcd809 (7/30/21): in fv3_cap, init time= 37.0698819160461
  2. #b26a896 (8/23/2021): in fv3_cap, init time= 39.86402297019963
  3. #3f3c253 (8/25/2021): in fv3_cap, init time= 231.806740045547

The 8/25/2021 commit, from PR #762, is the cause of the slower init time.

@junwang-noaa
Collaborator

Thanks, Dusan. @ericaligo-NOAA Thanks for identifying the PR that causes the slowness of the model initialization step.

@bensonr @mlee03 PR #762 is the FMS lib update to 2021.03. Would you please take a look at what code updates in FMS might cause the slowness? Thanks.

@bensonr
Contributor

bensonr commented Sep 17, 2021

@junwang-noaa - a similar issue has been brought directly to my attention by the HAFS regional team. I know the reason and am trying to verify the resolution will alleviate the situation.

@EricRogers-NOAA
Author

Latest UFS code put into LAM parallels (#805421d) on 11/30/2021. Init time for the RRFS domain went from ~140 sec to almost 30 minutes:

err: in fcst,init total time: 1970.07599091530

@JacobCarley-NOAA

@junwang-noaa @arunchawla-NOAA
Do we have any updates on this? When testing the model on Cray TO4 (Luna/Surge) the model takes over 3000s to initialize.

@arunchawla-NOAA arunchawla-NOAA removed the bug Something isn't working label Mar 10, 2022
@bensonr
Contributor

bensonr commented May 3, 2022

@junwang-noaa @arunchawla-NOAA @JacobCarley-NOAA - please try your tests with this version of fv3atm. Make sure to use io_layout=1,1 to test the initialization performance.

This branch also contains a fix to the restart checksum issue you've been wanting removed. The option is controlled separately for the dycore and the physics. In the dycore, one needs to add fv_core_nml::ignore_rst_cksum=.true. and for the physics, atmos_model_nml::ignore_rst_cksum=.true. for use within FV3GFS_io.F90. If you don't want the option to be in atmos_model.F90 but exist in FV3GFS_io.F90 itself, feel free to reimplement as you see fit.
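A minimal sketch of those two namelist entries as they would appear in input.nml (option names taken from the comment above; placement assumes the standard input.nml layout):

```fortran
&fv_core_nml
  ignore_rst_cksum = .true.    ! dycore: skip restart checksum verification
/

&atmos_model_nml
  ignore_rst_cksum = .true.    ! physics: skip restart checksum in FV3GFS_io.F90
/
```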

Once you are satisfied with the results of your testing, please merge the changes into your own branches and add the appropriate PRs.

@junwang-noaa
Collaborator

@bensonr Thank you very much for making the code changes.
@BinLiu-NOAA @EricRogers-NOAA FYI.

@EricRogers-NOAA
Author

How would I check this out and compile with respect to the full model (https://github.com/ufs-community/ufs-weather-model)? I've always just cloned https://github.com/ufs-community/ufs-weather-model (and maybe checked out a feature branch) and have not had experience dealing with a different version of fv3atm or some other submodule. Thanks for your assistance.

@junwang-noaa
Collaborator

@EricRogers-NOAA I am creating a ufs-weather-model branch from the latest develop branch using Rusty's fv3atm; we can use it for testing. I will let you know when I am done.

@bensonr
Contributor

bensonr commented May 3, 2022

Or simply

@junwang-noaa
Collaborator

Thanks, Rusty, that will work too. Anyway, @EricRogers-NOAA @BinLiu-NOAA Here is the branch for testing:

https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io

@BinLiu-NOAA
Contributor

Thanks a lot, @bensonr @junwang-noaa! We will test from the HAFS side and report back on how this new branch performs for the model forecast init phase when using an io_layout of (1x1) for both cold-start and warm-start scenarios. We will also test the capability of skipping the checksum step. Thanks!

@EricRogers-NOAA
Author

EricRogers-NOAA commented May 3, 2022

@junwang-noaa my compile failed on WCOSS Dell:

CMake Error at FV3/CMakeLists.txt:21 (message):
An error occured while running ccpp_prebuild.py, check
/gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/emc_io_fixes/build/FV3/ccpp_prebuild.{out,err}

I had been using this to compile the code; I take it there have been changes:

#!/bin/bash
set -x
. /usrx/local/prod/lmod/lmod/init/sh
module purge
module use modulefiles
module load ufs_wcoss_dell_p3
export CMAKE_PLATFORM=wcoss_dell_p3
#export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=ON -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
export CMAKE_FLAGS="-DAPP=ATM -D32BIT=ON -DDEBUG=OFF -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km"
export BUILD_VERBOSE=1
./build.sh

@junwang-noaa
Collaborator

I ran the RT tests on Orion; it works. Let me check Dell.

@junwang-noaa
Collaborator

@EricRogers-NOAA The code compiled on Dell. Here is what I did:

[Jun.Wang@v71a1 ufs-weather-model]$ pwd
/gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model
module purge
module use -a /gpfs/dell1/ptmp/Jun.Wang/ufs-weather-model/modulefiles
module load ufs_wcoss_dell_p3
module list
export CMAKE_FLAGS="-DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn_lam3km -D32BIT=ON"
nohup ./build.sh >xxxcmpl 2>&1 &

@BenjaminBlake-NOAA

@junwang-noaa I tried compiling your branch using your commands listed above, but I got the same error as Eric. I looked in my build/FV3/ccpp/ccpp_prebuild.err file and I saw the following message at the end:

KeyError: 'rrtmg_sw_pre'

The FV3_GFS_v15_thompson_mynn_lam3km suite file we were using did contain rrtmg_sw_pre, but I see it was replaced by rad_sw_pre in the repository. The xml file we are using is slightly different because it uses the unified GWD scheme. After making that change, the code compiled for me. @EricRogers-NOAA, give that a try and see if it works for you (I used your original compile.sh).

@EricRogers-NOAA
Author

EricRogers-NOAA commented May 5, 2022

I've got the new code running on WCOSS Dell P3; I saw this print:

Computing rain collecting graupel table took 226.539 seconds.
creating rain collecting snow table
Computing rain collecting snow table took 31.857 seconds.

Are there new tables we need to read in that will eliminate the above computations and reduce run time?

The run subsequently aborted a few minutes after the above print.

@EricRogers-NOAA
Author

One of the ESMF debug prints had this in the failed run of the new code:

20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:2101 Invalid argument - inst_tracer_diag_aod is not a StandardName in the NUOPC_FieldDictionary!
20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:480 Invalid argument - Passing error in return code
20220505 184831.351 ERROR PET1875 module_fcst_grid_comp.F90:410 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 module_fcst_grid_comp.F90:1079 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 fv3_cap.F90:888 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 ATM:src/addon/NUOPC/src/NUOPC_ModelBase.F90:700 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:2577 Invalid argument - Phase 'IPDvXp01' Initialize for modelComp 1: ATM did not return ESMF_SUCCESS
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:1286 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 EARTH Grid Comp:src/addon/NUOPC/src/NUOPC_Driver.F90:457 Invalid argument - Passing error in return code
20220505 184831.354 ERROR PET1875 UFS.F90:386 Invalid argument - Aborting UFS

@junwang-noaa
Collaborator

@EricRogers-NOAA Please update the fd_nems.yaml file in the run directory from the latest develop branch:

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/fd_nems.yaml

@ericaligo-NOAA
Collaborator

ericaligo-NOAA commented May 5, 2022 via email

@RuiyuSun
Contributor

RuiyuSun commented May 6, 2022

Quoting @ericaligo-NOAA (via email): "I'm not using the latest code, so I don't know if new tables have been added. Ruiyu, do you know if the latest ufs-weather-model code requires new tables to be read in by the Thompson scheme?"

@ericaligo-NOAA I am not aware of any new tables required. If the tables are being created at the initial time, that probably means the existing table files are not copied to the run directory.

@SMoorthi-emc
Contributor

SMoorthi-emc commented May 6, 2022 via email

@RuiyuSun
Contributor

RuiyuSun commented May 6, 2022

You can use the tables created in your current/previous experiments for your future experiments. I didn't find them in the current ufs_weather_model. They need to be added, @yangfanglin:
qr_acr_qgV2.dat, qr_acr_qsV2.dat, freezeH2O.dat

@junwang-noaa
Collaborator

junwang-noaa commented May 24, 2022

@EricRogers-NOAA So the only change you made is the io_layout change? Also, I am curious: did you specify the chunksizes for the restart files? I think you can still use io_layout=1,15 if that helps with the total run time.

My understanding is that using more io tasks in io_layout will speed up reading the restart files.

@EricRogers-NOAA
Author

@junwang-noaa In the current code, setting io_layout=1,X (I used 15) dramatically speeds up reading the restart files, for example:
regional_forecast_tm03_12.log: fcst_initialize total time: 192.971059083939
regional_forecast_tm04_12.log: fcst_initialize total time: 206.560415983200
regional_forecast_tm05_12.log: fcst_initialize total time: 153.661432027817

With io_layout=1,1 these were much larger (up to one hour). But with Rusty's I/O changes these times are very fast now with io_layout=1,1:

regional_forecast_tm03_12.log: fcst_initialize total time: 42.0637090206146
regional_forecast_tm04_12.log: fcst_initialize total time: 43.1961140632629
regional_forecast_tm05_12.log: fcst_initialize total time: 45.6291441917419

I did not specify chunksizes for the restart files; is this doable now with the parameter ncchksz in fms2_io_nml? I was unaware of this option until it came up recently in another email thread.

@EricRogers-NOAA
Author

Is there an example handy of how to set chunksizes for restart files? Is this done in the fms2_io_nml part of the input.nml file, or in the code itself?

@junwang-noaa
Collaborator

@EricRogers-NOAA I checked with Rusty. You can set ncchksz in fms2_io_nml for the "64BIT" format. I was using the default fms2_io_nml settings in my global C384 test (/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_199573/control_c384gdas_chksum/RESTART); I do not see the chunksize specified as netcdf attributes in the restart files:

    float ua(Time, zaxis_1, yaxis_2, xaxis_1) ;
            ua:checksum = "  41A8342EDD4099" ;
    float va(Time, zaxis_1, yaxis_2, xaxis_1) ;
            va:checksum = "   3CF60F6445013" ;

Are the files with chunksizes created from the model run in your testing? Are you writing out the "NC-4" format in fms2_io_nml? Can I take a look at your run directory? Thanks.
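For reference, the fms2_io_nml settings under discussion would look roughly like this (values are illustrative assumptions, not recommendations):

```fortran
&fms2_io_nml
  netcdf_default_format = "64bit"   ! or "netcdf4"
  ncchksz = 67108864                ! chunk size in bytes (illustrative value)
/
```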

@EricRogers-NOAA
Author

@junwang-noaa I have netcdf_default_format = "netcdf4" in fms2_io_nml. I have two working directories for you to look at:

  1. /gpfs/dell2/ptmp/emc.campara/regional_forecast_tm04_2022060912 is from a 1-h forecast from the real-time LAMDAX parallel using code #a46c7ed (3/24/22). The forecast is run with io_layout=1,15, so you see 15 fv_core*.00* and fv_tracer*.00* files in the ./RESTART directory. If you run “ncdump -s -h fv_core.res.tile1.nc.0000”, you see the chunksize settings as expected, for example:

    float u(Time, zaxis_1, yaxis_1, xaxis_1) ;
             u:checksum = "FF0DCF1C16B1B8C5" ;
             u:_Storage = "chunked" ;
             u:_ChunkSizes = 1, 17, 46, 988 ;
             u:_Endianness = "little" ;
     float v(Time, zaxis_1, yaxis_2, xaxis_2) ;
             v:checksum = "  322FDF40C4A28E" ;
             v:_Storage = "chunked" ;
             v:_ChunkSizes = 1, 17, 45, 988 ;
             v:_Endianness = "little" ;
    

// global attributes:
:NumFilesInSet = 15 ;
:_NCProperties = "version=2|netcdf" ;
:_SuperblockVersion = 0 ;
:_IsNetcdf4 = 1 ;
:_Format = "netCDF-4" ;

Since the GSI analysis can only read the full fv_core and fv_tracer files, I need to run the mppnccombine utility to combine the 15 restart file pieces into one file. I have put the combined files in the RESTART directory (fv_core.res.tile1.nc, fv_tracer.res.tile1.nc), and they are created by this command:

$EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc
$EXECfv3/mppnccombine -v -64 fv_tracer.res.tile1.nc

The “-64” makes the combined files 64-bit offset, and you see no chunksizes when you run “ncdump -s -h fv_core.res.tile1.nc”:

   float ua(Time, zaxis_1, yaxis_2, xaxis_1) ;
            ua:checksum = "FFF424CE79DE9282" ;
    float va(Time, zaxis_1, yaxis_2, xaxis_1) ;
            va:checksum = "FF3C70DE1E0697AB" ;

// global attributes:
:_Format = "64-bit offset" ;
}
So with no chunksizes set for 64-bit offset restart files, the parallel reads of the restart files in the GSI code are efficient.

  2. /gpfs/dell2/ptmp/emc.campara/regional_forecast_tm06_2022060912_testcode is a 1-h LAMDAX test forecast with the code with Rusty's I/O changes. With his changes we can run with io_layout=1,1 with no slowdown at the model startup, so in the RESTART directory the model writes out the full-size restart files in netcdf-4 format with these chunksizes:

     xaxis_1 = 3950 ;
     xaxis_2 = 3951 ;
     yaxis_1 = 2701 ;
     yaxis_2 = 2700 ;
     zaxis_1 = 65 ;
    

...
float delp(Time, zaxis_1, yaxis_2, xaxis_1) ;
delp:checksum = " B0FF4737CF56015" ;
delp:_Storage = "chunked" ;
delp:_ChunkSizes = 1, 6, 300, 439 ;
delp:_Endianness = "little" ;

With these default chunksize settings the GSI code is about 30% slower because the parallel reads are not efficient. We found that these chunksizes could be reset to the grid dimensions using nccopy:

nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc
nccopy -c xaxis_1/3950,yaxis_1/2700,zaxis_1/65 fv_tracer.res.tile1.nc fv_tracer.res.tile1_new.nc

The GSI run times were then faster, on par with what we see now with 64-bit offset restart files. Of course, running the above nccopy command adds extra run time/overhead, so it would be helpful if these restart file chunk sizes could be set in the model itself.

@bensonr
Contributor

bensonr commented Jun 9, 2022

@EricRogers-NOAA @junwang-noaa

First, the axis values are not chunksizes; they are the variable dimensions, as you can see in the variable definition float delp(Time, zaxis_1, yaxis_2, xaxis_1) ;

I am very surprised the io_layout=1,15 restart files have chunked data in them. According to the NetCDF documentation, the ncchksz parameter, which is used in the nf90_open function call, is ignored when creating NC-4/HDF5 files. Instead, the chunking is defined at the variable definition layer using nf90_def_var. According to the NetCDF documentation, the default data layout is contiguous unless certain arguments are provided - which they are not in fms2_io (see here and here). Don't be confused by the presence of a checksum attribute, as that is created and written by FMS and not related to the fletcher32 argument to nf90_def_var. I suggest consulting with Ed Hartnett, if he is still available to EMC, as to why a variable definition is being chunked when the documentation indicates it should not be.
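For context, NetCDF-4 chunking is requested through optional arguments to nf90_def_var at definition time; a minimal Fortran sketch (not fms2_io code — the variable name and sizes loosely follow the delp example above and are illustrative):

```fortran
use netcdf
integer :: ncid, varid, ierr
integer :: dimids(4)   ! e.g. xaxis_1, yaxis_2, zaxis_1, Time, defined via nf90_def_dim
! With neither 'contiguous' nor 'chunksizes' supplied (as in fms2_io), the
! storage layout is left to the library; supplying chunksizes forces chunking.
ierr = nf90_def_var(ncid, "delp", NF90_FLOAT, dimids, varid, &
                    chunksizes=(/3950, 2700, 1, 1/))
```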

@EricRogers-NOAA
Author

@bensonr Thanks, I will reach out to Ed. I wasn't quite clear above about that nccopy command: when I ran it as-is, the chunksizes in fv_tracer.res.tile1.nc went from this:

    float sphum(Time, zaxis_1, yaxis_1, xaxis_1) ;
            sphum:checksum = " 93FD4CEFDD0C45B" ;
            sphum:_Storage = "chunked" ;
            sphum:_ChunkSizes = 1, 6, 300, 439 ;
            sphum:_Endianness = "little" ;

to this:

    float sphum(Time, zaxis_1, yaxis_1, xaxis_1) ;
            sphum:checksum = " 93FD4CEFDD0C45B" ;
            sphum:_Storage = "chunked" ;
            sphum:_ChunkSizes = 1, 65, 2700, 3950 ;
            sphum:_Endianness = "little" ;
            sphum:_NoFill = "true" ;

And the GSI code reading this file was about 30% faster. We did not test whether these new chunksizes were optimal, however.

@junwang-noaa
Collaborator

@EricRogers-NOAA May I ask if there is a specific reason to use the "netcdf4" netcdf_default_format? What about using the default "64bit" option for the restart files, as you do in the mppnccombine command? I don't see chunksizes in the restart files when using the default "64bit" format; maybe that would speed up the process (no need for nccopy, and fast reading in GSI)?

@EricRogers-NOAA
Author

@junwang-noaa I will revisit this with Ting Lei, but as best as I can gather from going through old emails, the parallel I/O in the GSI supports netcdf4, not netcdf3, which is why I went down the netcdf4 rabbit hole. But that raises the question: is the GSI faster when reading default 64bit restart files, or netcdf4 restart files with inefficient chunksizes? I'll run some tests to answer that question.

@EricRogers-NOAA
Author

@junwang-noaa When I set netcdf_default_format = "64bit" in the 3km N. American RRFS domain forecast run, I got this error:

FATAL from PE 0: NetCDF: One or more variable sizes violate format constraints: set_netcdf_mode

which is bringing back memories of what happened last year when I first started running this huge domain (3951x2701) in a DA run with the model writing out restart files. Someone (I believe it was Rusty) recommended adding the fms2_io_nml block with netcdf_default_format = "netcdf4".

@junwang-noaa
Collaborator

@EricRogers-NOAA Thanks for the explanation. So for RRFS, we have to use the "netcdf4" format. In that case, can we still use io_layout=1,15 as before? Will that slow down RRFS runs? My understanding is that until the netcdf chunksize issue is fixed, RRFS will not get a speedup from the fixes Rusty provided, but other applications such as GFS/HAFS/UFSAQM will speed up by using io_layout=1,1 with the default "64bit" netcdf format. I think we need to open a separate issue on the chunksizes used in the netcdf4 format and ask Ed to take a look. Meanwhile, we can work on committing Rusty's fixes back to the commit queue if they do not slow down RRFS. @JianpingHuang-NOAA is waiting for the code changes to run UFSAQM on wcoss2. Please let me know if this is working for you. Thanks.

@EricRogers-NOAA
Author

@junwang-noaa Yes, RRFS will need to use the netcdf4 format for restart files; we can just continue to set io_layout=1,15 and then recombine the restart file pieces using the mppnccombine utility with the -64 option, so the input restart files to the GSI are in 64bit format.

Earlier I stated this: "We found that if these chunksizes were reset to the grid dimensions using nccopy:

nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc
nccopy -c xaxis_1/3950,yaxis_1/2700,zaxis_1/65 fv_tracer.res.tile1.nc fv_tracer.res.tile1_new.nc

The GSI run times were faster, and on par with what we see now with 64-bit offset restart files." What I discovered is that the above nccopy command from netcdf 4.5.0 (on WCOSS1 Dell) worked as I described, but when one loaded netcdf/4.7.4 from hpc-stack, the same nccopy command did not change the chunksizes for the 2-d and 3-d variables in the restart files.

@junwang-noaa
Collaborator

@EricRogers-NOAA Thanks for the information. I think this is a netcdf issue; we may need to ask a netcdf expert to take a look. Would you please create an issue about the default chunksizes with contiguous fields? Once that issue is resolved, maybe we won't need nccopy anymore.

@JianpingHuang-NOAA
Contributor

JianpingHuang-NOAA commented Jun 13, 2022 via email

@junwang-noaa
Collaborator

@JianpingHuang-NOAA Yes, you can try it to see if it speeds up your run.

@junwang-noaa
Collaborator

@bensonr Would you please sync your fv3atm branch https://github.com/bensonr/fv3atm/tree/emc_io_fixes and create a PR at the fv3atm repo? If possible, please also create a ufs-weather-model PR. Please let me know if you want me to create the ufs-weather-model PR. Thanks.

@bensonr
Contributor

bensonr commented Jun 13, 2022

@junwang-noaa - you may recall there was an update needed to properly ignore the checksum. I believe you want to use the fork/branch from @BinLiu-NOAA.

@EricRogers-NOAA
Author

@junwang-noaa
Collaborator

@bensonr The updates from @BinLiu-NOAA are in driver/fvGFS/atmosphere.F90; I added the following changes based on your dycore branch:

https://github.com/bensonr/GFDL_atmos_cubed_sphere/tree/emc_io_fixes

@@ -324,10 +324,11 @@ contains
    logical :: dycore_only  = .false.
    logical :: debug        = .false.
    logical :: sync         = .false.
+   logical :: ignore_rst_cksum = .false.
    integer, parameter     :: maxhr = 4096
    real, dimension(maxhr) :: fdiag = 0.
    real                   :: fhmax=384.0, fhmaxhf=120.0, fhout=3.0, fhouthf=1.0,avg_max_length=3600.
-   namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite, avg_max_length
+   namelist /atmos_model_nml/ blocksize, chksum_debug, dycore_only, debug, sync, fdiag, fhmax, fhmaxhf, fhout, fhouthf, ccpp_suite, avg_max_length, ignore_rst_cksum

@@ -449,6 +450,15 @@ contains
 !--- allocate pref
    allocate(pref(npz+1,2), dum1d(npz+1))

+   ! DH* 20210326
+   ! First, read atmos_model_nml namelist section - this is a workaround to avoid
+   ! unnecessary additional changes to the input namelists, in anticipation of the
+   ! implementation of a generic interface for GFDL and CCPP fast physics soon
+   read(input_nml_file, nml=atmos_model_nml, iostat=io)
+   ierr = check_nml_error(io, 'atmos_model_nml')
+   ! *DH 20210326
+
@@ -490,15 +500,6 @@ contains

    ! Do CCPP fast physics initialization before call to adiabatic_init (since this calls fv_dynamics)

-   ! DH* 20210326
-   ! First, read atmos_model_nml namelist section - this is a workaround to avoid
-   ! unnecessary additional changes to the input namelists, in anticipation of the
-   ! implementation of a generic interface for GFDL and CCPP fast physics soon
-   read(input_nml_file, nml=atmos_model_nml, iostat=io)
-   ierr = check_nml_error(io, 'atmos_model_nml')
-   !write(0,'(a)') "It's me, and my physics suite is '" // trim(ccpp_suite) // "'"
-   ! *DH 20210326
-

Would you please add those changes to your dycore branch emc_io_fixes? Thank you!

@junwang-noaa
Collaborator

@EricRogers-NOAA Thanks for creating the issue. Let's see if Ed can help take a look.

@junwang-noaa
Collaborator

junwang-noaa commented Jun 15, 2022

@bensonr I pulled your fv3atm changes into a branch checksum_io and synced with the latest fv3atm update. Would you please create a PR to the GFDL dycore dev/emc branch? I will create corresponding PRs on fv3atm and ufs-weather-model after that. My understanding is that we will have one dycore dev/emc PR (NOAA-GFDL/GFDL_atmos_cubed_sphere#194) before this one, so we need you to sync with dev/emc after PR 194 is committed. I will keep syncing the fv3/ufs wm branches until the commit time.

@BinLiu-NOAA @JianpingHuang-NOAA @EricRogers-NOAA The ufs-weather-model branch https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io is now pointing to Rusty's updated dycore branch and synced to the top of the ufs wm develop branch; please let me know if you have any issues. Thanks.

@bensonr
Contributor

bensonr commented Jun 15, 2022

@junwang-noaa - FV3 PR #197 was created yesterday and you were assigned as a reviewer.

@junwang-noaa
Collaborator

@bensonr Thanks, Rusty. The fv3 PR #556 and ufs-weather-model PR #1275 were created.

@junwang-noaa
Collaborator

PR #1275 was committed. Issue #1270 was created about the default chunksizes in NetCDF4 files in the RRFS application for further discussion.

This issue will be closed.

@bensonr
Contributor

bensonr commented Oct 11, 2022 via email

@junwang-noaa
Collaborator

junwang-noaa commented Oct 11, 2022 via email

haiqinli added a commit to haiqinli/ufs-weather-model that referenced this issue Mar 14, 2024
MatthewPyle-NOAA pushed a commit that referenced this issue Mar 19, 2024
* "point to the MYNN PBL update for RRFS.v1"
* "point to the updates of smoke and fv3atm PR #801"
* "point to the updates to allocate rho_dry to zero size when not in use"
* "point to hash of NOAA-EMC/fv3atm@f9a1759"
---------
Co-authored-by: JONG KIM <jong.kim@noaa.gov>
Co-authored-by: matthew pyle <Matthew.Pyle@noaa.gov>