Much slower timings in model init with the latest ufs-weather-model #801
Today Ben Blake checked out the latest develop branch and ran a short test on WCOSS Dell Phase 3.5 with the 3 km RRFS LAM domain over North America, cold-started from the GFS analysis. He saw an increase in the model initialization time of almost 8 minutes compared to the current parallel LAM run:
LAM parallel: in fcst,init total time: 74.2139439582825 (#ddcd809, checked out 7/30/21)
My test: in fcst,init total time: 524.866119146347 (#e198256, checked out today)
Ben's test run did not run to completion, so no termination times are available. Bin Liu noted similar behavior in HAFS.
Comments
Hi Eric, a lot of development happened between the two hashes you posted. One way to narrow this down is to use the good old bisect mode. Do you think you have time to do that? I went to https://github.com/ufs-community/ufs-weather-model/commits/develop and searched for ddcd809; everything that was merged since then is above it. |
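A manual bisect over that range might look like the following shell sketch; the good/bad hashes are the ones from the issue description, and the short-test command stands in for whatever LAM init test you already run:

    # mark the endpoints: ddcd809 was fast, e198256 is slow
    git bisect start
    git bisect bad e198256
    git bisect good ddcd809
    # at each step git checks out a midpoint commit; submodules must be
    # re-synced to that commit before building and running the short test
    git submodule update --init --recursive
    # ...build, run the init test, then tell git the result:
    git bisect good    # or: git bisect bad
    # once the first bad commit is identified:
    git bisect reset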
I take it you mean check out a version, run a short test, and see when the slowdown starts? I don't have a lot of time lately because I'm working on the WCOSS2 conversion effort, but I'll try to clear some time for this. |
Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain
Why am I getting these compile aborts? The error points to /gpfs/dell6/emc/modeling/noscrub/Eric.Rogers/ufs-weather-model_aug23/FV3/atmos_cubed_sphere/tools/fv_eta.F90(49). I'm doing this: git clone --recursive https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir then to compile: set -x |
@EricRogers-NOAA After checking out the exact commit (hash) you want to build, you must update the submodules. Otherwise you'll be using the submodules from the initial clone, i.e. the submodules used by the current develop branch, not the ones belonging to the hash you want. So clone without --recursive: git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir then check out the commit, update the submodules, and build. |
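A minimal sketch of that sequence (the commit hash placeholder stands for whichever commit you want to test, e.g. the Aug 23 one):

    git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model_mydir
    cd ufs-weather-model_mydir
    git checkout <commit-hash>                 # the commit you want to build
    git submodule update --init --recursive    # sync submodules to that commit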
@DusanJovic-NOAA thank you very much. I always forget that submodule step. I was able to check out the Aug 23 commit now. I'll be sending out an updated list of init timings later. |
New timing tests, with the correct checkout of earlier commits: Timing tests: IC=00z 9/16/21 3 km CONUS LAM domain, all warm starts from LAMDA IC
The 8/25/2021 commit #762 is the cause of the slower init time. |
Thanks, Dusan. @ericaligo-NOAA Thanks for identifying the PR that causes the slowness of the model initialization step. @bensonr @mlee03 #762 is the FMS lib update to 2021.03. Would you please take a look at what code updates in FMS might cause the slowness? Thanks. |
@junwang-noaa - a similar issue has been brought directly to my attention by the HAFS regional team. I know the reason and am trying to verify the resolution will alleviate the situation. |
Latest UFS code put into LAM parallels (#805421d) on 11/30/2021. Init time for the RRFS domain went from ~140 sec to almost 30 minutes: err: in fcst,init total time: 1970.07599091530 |
@junwang-noaa @arunchawla-NOAA Do we have any updates on this? When testing the model on Cray TO4 (Luna/Surge), the model takes over 3000 s to initialize. |
@junwang-noaa @arunchawla-NOAA @JacobCarley-NOAA - please try your tests with this version of fv3atm. Make sure to use io_layout=1,1 to test the initialization performance. This branch also contains a fix for the restart checksum issue you've been wanting removed. The option is controlled separately for the dycore and the physics: in the dycore one adds fv_core_nml::ignore_rst_cksum=.true., and for the physics, atmos_model_nml::ignore_rst_cksum=.true. for use within FV3GFS_io.F90. If you don't want the option to live in atmos_model.F90 but rather in FV3GFS_io.F90 itself, feel free to reimplement as you see fit. Once you are satisfied with the results of your testing, please merge the changes into your own branches and add the appropriate PRs. |
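For reference, a sketch of what those two switches look like in input.nml; the namelist and variable names are the ones named in the comment above, and in a real file the settings go inside the existing records rather than appended duplicates:

    # illustrative only -- edit the existing &fv_core_nml and &atmos_model_nml
    # records in input.nml rather than appending new ones
    cat << 'EOF'
    &fv_core_nml
      ignore_rst_cksum = .true.   ! dycore restarts
    /
    &atmos_model_nml
      ignore_rst_cksum = .true.   ! physics restarts read via FV3GFS_io.F90
    /
    EOF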
@bensonr Thank you very much for making the code changes. |
How would I check this out and compile it w.r.t. the full model (https://github.com/ufs-community/ufs-weather-model)? I've always just cloned https://github.com/ufs-community/ufs-weather-model (and maybe checked out a feature branch) and have no experience dealing with a different version of fv3atm or another submodule. Thanks for your assistance. |
@EricRogers-NOAA I am creating a ufs-weather-model branch from the latest develop branch using Rusty's fv3atm; we can use it for testing. I will let you know when I am done. |
Or simply … |
Thanks, Rusty, that will work too. Anyway, @EricRogers-NOAA @BinLiu-NOAA Here is the branch for testing: https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io |
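To pick up that branch for testing, something like the following should work (branch name and URL from the comment above; the target directory name is arbitrary, and the submodule step from earlier in the thread still applies):

    git clone -b checksum_io https://github.com/junwang-noaa/ufs-weather-model ufs-wm-checksum_io
    cd ufs-wm-checksum_io
    git submodule update --init --recursive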
Thanks a lot, @bensonr @junwang-noaa! We will test from the HAFS side and report back on how this new branch performs for the model forecast init phase when using an io_layout of 1,1 for both cold-start and warm-start scenarios. We will also test the capability of skipping the checksum step. Thanks! |
@junwang-noaa my compile failed on WCOSS Dell: CMake Error at FV3/CMakeLists.txt:21 (message): I had been using this to compile the code; I take it there have been changes: #!/bin/bash |
I ran the RT tests on Orion and they work. Let me check Dell. |
@EricRogers-NOAA The code compiled on Dell. Here is what I did: [Jun.Wang@v71a1 ufs-weather-model]$ pwd |
@junwang-noaa I tried compiling your branch using the commands you listed above, but I got the same error as Eric. I looked in my build/FV3/ccpp/ccpp_prebuild.err file and saw the following message at the end: KeyError: 'rrtmg_sw_pre'. The FV3_GFS_v15_thompson_mynn_lam3km suite file we were using did contain rrtmg_sw_pre, but I see it was replaced by rad_sw_pre in the repository. The XML file we are using is slightly different because it uses the unified GWD scheme. After making that change, the code compiled for me. @EricRogers-NOAA give that a try and see if it works for you (I used your original compile.sh). |
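A hypothetical one-liner for that rename, assuming a local suite file named after the suite in the comment (keep the backup, since scheme lists differ between suite files):

    # swap the renamed radiation-prep scheme in a local suite definition file
    sed -i.bak 's/rrtmg_sw_pre/rad_sw_pre/g' suite_FV3_GFS_v15_thompson_mynn_lam3km.xml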
I've got the new code running on WCOSS Dell P3; I saw this print: Computing rain collecting graupel table took 226.539 seconds. Are there new tables we need to read in that would eliminate the above computations and reduce run time? The run subsequently aborted a few minutes after the above print. |
One of the ESMF debug prints had this in the failed run of the new code: 20220505 184831.351 ERROR PET1875 src/addon/NUOPC/src/NUOPC_Base.F90:2101 Invalid argument - inst_tracer_diag_aod is not a StandardName in the NUOPC_FieldDictionary! |
@EricRogers-NOAA Please update the fd_nems.yaml file in the run directory from the latest develop branch: https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/parm/fd_nems.yaml |
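A sketch of that update, assuming $RUNDIR points at your model run directory and ufs-weather-model is a local clone of the develop branch:

    # refresh the NUOPC field dictionary in the run directory
    cp ufs-weather-model/tests/parm/fd_nems.yaml "$RUNDIR/"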
I'm not using the latest code, so I don't know if new tables have been added. Ruiyu, do you know if the latest ufs-weather-model code requires new tables to be read in by the Thompson scheme? |
Ruiyu, where are these precomputed tables? I also saw them being created while running. |
@ericaligo-NOAA I am not aware of any new tables being required. If the tables are being created at the initial time, that probably means the existing table files were not copied to the run directory. |
You can use the tables created in your current/previous experiment for your future experiments. I didn't find them in the current ufs-weather-model. They need to be added @yangfanglin. |
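A sketch of that reuse, under the assumption the comments above imply, namely that the scheme writes its lookup tables as files in the run directory. The .dat glob and the $OLD_RUNDIR / $RUNDIR variables are illustrative, not confirmed names:

    # copy precomputed Thompson microphysics lookup tables from a previous run
    # so the model reads them instead of recomputing them at init
    cp "$OLD_RUNDIR"/*.dat "$RUNDIR/"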
@EricRogers-NOAA So the only change you made is the io_layout change? Also, I am curious: did you specify the chunksizes for the restart files? I think you can still use io_layout=1,15 if that helps with the total run time. My understanding is that using more I/O tasks in io_layout will speed up reading the restart files. |
@junwang-noaa In the current code, setting io_layout=1,X (I used 15) dramatically sped up reading the restart files; with io_layout=1,1 the read times were much larger (up to one hour). But with Rusty's I/O changes these times are now very fast even with io_layout=1,1: regional_forecast_tm03_12.log: fcst_initialize total time: 42.0637090206146. I did not specify chunksizes for the restart files; is this doable now with the parameter ncchksz in fms2_io_nml? I was unaware of this option until recently, when it came up in another email thread. |
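For context, the layout switch under discussion lives in the dycore namelist; a minimal illustration using the values from the runs described above:

    # illustrative input.nml fragment: io_layout controls how many pieces each
    # restart/history file is split into on a tile
    cat << 'EOF'
    &fv_core_nml
      io_layout = 1,15    ! 15 files in y; 1,1 writes a single file per tile
    /
    EOF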
Is there an example handy of how to set chunksizes for restart files? Is this done in the fms2_io_nml part of the input.nml file, or in the code itself? |
@EricRogers-NOAA I checked with Rusty. You can set ncchksz in fms2_io_nml for the "64BIT" format. I was using the default fms2_io_nml settings in my global C384 test (/scratch1/NCEPDEV/stmp2/Jun.Wang/FV3_RT/rt_199573/control_c384gdas_chksum/RESTART), and I do not see a chunksize specified as a netCDF attribute in the restart files. Are the files with chunksizes created from the model run in your testing? Are you writing out the "NC-4" format in fms2_io_nml? Can I take a look at your run directory? Thanks |
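Answering the "example handy" question above, a sketch of the namelist block in question. The ncchksz value is an illustrative example, not a recommendation; per Rusty's note below it is passed to the netCDF open call and applies to the "64bit" format:

    # illustrative fms2_io_nml fragment for input.nml
    cat << 'EOF'
    &fms2_io_nml
      netcdf_default_format = "64bit"
      ncchksz = 4194304    ! example I/O buffer-size hint in bytes
    /
    EOF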
@junwang-noaa I have netcdf_default_format = "netcdf4" in fms2_io_nml. I have two working directories for you to look at; the ncdump "// global attributes:" listings are omitted here. We combine the restart pieces with $EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc; the "-64" writes the combined files in 64-bit-offset format, and you see no chunksizes when you run "ncdump -s -h fv_core.res.tile1.nc". With the netCDF-4 default chunksize settings, the GSI code is about 30% slower because the parallel reads are not efficient. We found that if these chunksizes were reset to the grid dimensions using nccopy: nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc then the GSI run times were faster, on par with what we see now with 64-bit-offset restart files. Of course, running the above nccopy command adds extra run time/overhead, so it would be helpful if these restart-file chunk sizes could be set in the model itself. |
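Taken together, the combine/inspect/rechunk sequence Eric describes could be scripted roughly like this; the commands, the $EXECfv3 variable, and the dimension list come from the comments above, so treat it as a sketch rather than a tested recipe:

    # combine io_layout=1,15 restart pieces into one 64-bit-offset file
    $EXECfv3/mppnccombine -v -64 fv_core.res.tile1.nc
    # inspect the storage layout; 64-bit-offset files show no _ChunkSizes attributes
    ncdump -s -h fv_core.res.tile1.nc
    # for netCDF-4 files, rechunk to the full grid dimensions before the GSI reads them
    nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 \
      fv_core.res.tile1.nc fv_core.res.tile1_new.nc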
@EricRogers-NOAA @junwang-noaa First, the axis values are not chunksizes; they are the variable dimensions, as you can see in the variable definition float delp(Time, zaxis_1, yaxis_2, xaxis_1). I am very surprised the io_layout=1,15 restart files have chunked data in them. According to the NetCDF documentation, the ncchksz parameter, which is used in the nf90_open function call, is ignored when creating NC-4/HDF5 files. Instead, the chunking is defined at the variable-definition layer using nf90_def_var. According to the NetCDF documentation, the default data layout is contiguous unless certain arguments are provided, and they are not provided in fms2_io (see here and here). Don't be confused by the presence of a checksum attribute, as that is created and written by FMS and is not related to the fletcher32 argument to nf90_def_var. I suggest consulting with Ed Hartnett, if he is still available to EMC, as to why a variable definition is being chunked when the documentation indicates it should not be. |
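One way to verify what Rusty describes, assuming ncdump is available, is to inspect the special storage attributes that ncdump -s reports for NC-4/HDF5 files:

    # _Storage reports "contiguous" or "chunked"; _ChunkSizes lists the chunk shape
    ncdump -s -h fv_tracer.res.tile1.nc | grep -E '_Storage|_ChunkSizes'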
@bensonr Thanks, I will reach out to Ed. I wasn't quite clear above about that nccopy command: when I ran the command as-is, the chunksizes in fv_tracer.res.tile1.nc changed from the defaults to the grid-dimension values (listings elided), and the GSI code reading this file was about 30% faster. We did not test whether these new chunksizes were the most optimal, however. |
@EricRogers-NOAA May I ask if there is a specific reason to use the "netcdf4" netcdf_default_format? What about using the default "64bit" option for the restart files, as you do in the mppnccombine command? I don't see the chunksizes in the restart files when using the default "64bit" format; maybe it will speed up the process (no need for nccopy, and fast reading in the GSI)? |
@junwang-noaa I will revisit this with Ting Lei, but as best as I can gather from going through old emails, the parallel I/O in the GSI supports netcdf4, not netcdf3, which is why I went down the netcdf4 rabbit hole. But that begs the question: is the GSI faster when reading default 64bit restart files or netcdf4 restart files with inefficient chunksizes? I'll run some tests to answer that question. |
@junwang-noaa When I set netcdf_default_format = "64bit" in the 3 km N. American RRFS domain forecast run, I got this error: FATAL from PE 0: NetCDF: One or more variable sizes violate format constraints: set_netcdf_mode (presumably because the classic 64-bit-offset format caps each fixed-size variable at roughly 4 GiB, which a 3951x2701x65 field can exceed). This brings back memories of what happened last year when I first started running this huge domain (3951x2701) in a DA run with the model writing out restart files. Someone (I believe it was Rusty) recommended adding the fms2_io_nml block with netcdf_default_format = "netcdf4". |
@EricRogers-NOAA Thanks for the explanation. So for RRFS, we have to use the "netcdf4" format. In that case, can we still use io_layout=1,15 as before, or will that slow down RRFS runs? My understanding is that until the netcdf chunksize issue is fixed, RRFS will not get a speedup from the fixes Rusty provided, but other applications such as GFS/HAFS/UFSAQM will speed up by using io_layout=1,1 with the default "64bit" netcdf format. I think we need to open a separate issue on the chunksizes used in the netcdf4 format and ask Ed to take a look. Meanwhile, we can work on committing Rusty's fixes back to the commit queue as long as they do not slow down RRFS. @JianpingHuang-NOAA is waiting for the code changes to run UFSAQM on wcoss2. Please let me know if this is working for you. Thanks |
@junwang-noaa Yes, RRFS will need to use the netcdf4 format for restart files; we can just continue to set io_layout=1,15 and then recombine the restart-file pieces using the mppnccombine utility with the -64 option, so the input restart files to the GSI are in 64bit format. Earlier I stated: "We found that if these chunksizes were reset to the grid dimensions using nccopy: nccopy -c xaxis_1/3950,xaxis_2/3951,yaxis_1/2701,yaxis_2/2700,zaxis_1/65 fv_core.res.tile1.nc fv_core.res.tile1_new.nc The GSI run times were faster, and on par with what we see now with 64-bit offset restart files." What I discovered is that this nccopy command from netcdf 4.5.0 (on WCOSS1 Dell) worked as described, but with netcdf/4.7.4 loaded from hpc-stack, the same nccopy command did not change the chunksizes of the 2-d and 3-d variables in the restart files. |
@EricRogers-NOAA Thanks for the information. I think this is a netcdf issue we may need to ask a netcdf expert to take a look at. Would you please create an issue about the default chunksizes with contiguous fields? Once that issue is resolved, maybe we won't need the nccopy step any more. |
Should UFS-AQM use the "netcdf64" format and set io_layout=1,15 to speed up the restart-file writing? @junwang-noaa @EricRogers-NOAA |
@JianpingHuang-NOAA Yes, you can try it to see if it speeds up your run. |
@bensonr Would you please sync your fv3atm branch https://github.com/bensonr/fv3atm/tree/emc_io_fixes and create a PR at the fv3atm repo? If possible, please also create a ufs-weather-model PR. Please let me know if you want me to create the ufs-weather-model PR. Thanks. |
@junwang-noaa - you may recall there was an update needed to properly ignore the checksum. I believe you want to use the fork/branch from @BinLiu-NOAA. |
@junwang-noaa I created a new issue : https://github.com/ufs-community/ufs-weather-model/issues/1270 |
@bensonr The updates from @BinLiu-NOAA are in driver/fvGFS/atmosphere.F90. I added the following changes on top of your dycore branch: https://github.com/bensonr/GFDL_atmos_cubed_sphere/tree/emc_io_fixes
Would you please add those changes to your dycore branch emc_io_fixes? Thank you! |
@EricRogers-NOAA Thanks for creating the issue. Let's see if Ed can help take a look. |
@bensonr I pulled your fv3atm changes into a branch checksum_io and synced it with the latest fv3atm update. Would you please create a PR to the GFDL dycore dev/emc branch? I will create corresponding PRs on fv3atm and ufs-weather-model after that. My understanding is that we will have one dycore dev/emc PR (NOAA-GFDL/GFDL_atmos_cubed_sphere#194) ahead of this one, so we need you to sync with dev/emc after PR 194 is committed. I will keep syncing the fv3/ufs wm branches until the commit time. @BinLiu-NOAA @JianpingHuang-NOAA @EricRogers-NOAA The ufs-weather-model branch https://github.com/junwang-noaa/ufs-weather-model/tree/checksum_io now points to Rusty's updated dycore branch and is synced to the top of the ufs wm develop branch; please let me know if you have any issues. Thanks |
@junwang-noaa - FV3 PR #197 was created yesterday and you were assigned as a reviewer. |
PR #1275 was committed. Issue #1270 was created for further discussion of the default chunksizes in NetCDF4 files in the RRFS application. This issue will be closed. |
The RTs were also run on Gaea before committing things and asking for functional tests. |
GFDL plans to implement a fix in FV3 to resolve the issue. I will let you know when the timeline is available. |
* "point to the MYNN PBL update for RRFS.v1" * "point to the updates of smoke and fv3atm PR #801" * "point to the updates to allocate rho_dry to zero size when not in use" * "point to hash of NOAA-EMC/fv3atm@f9a1759" --------- Co-authored-by: JONG KIM <jong.kim@noaa.gov> Co-authored-by: matthew pyle <Matthew.Pyle@noaa.gov>