
C768 S2SW does not run out of the box on hera #1793

Closed
JessicaMeixner-NOAA opened this issue Aug 11, 2023 · 14 comments

@JessicaMeixner-NOAA

Expected behavior
I should be able to run a C768 S2SW forecast for dates with staged ICs (for example, 2020062500).

Current behavior
Get a segfault.

Machines affected
Hera

To Reproduce
My test case:
./setup_expt.py gfs forecast-only --app S2SW --pslot c768t04 --configdir /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/global-workflow/parm/config/gfs --idate 2020062500 --edate 2020062500 --res 768 --gfs_cyc 1 --comrot /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/COMROOT --expdir /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/EXPDIR
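
For completeness, a sketch of the usual follow-up steps to generate and launch this experiment is below; the setup_xml.py invocation, the Rocoto file names, and the paths are assumptions based on typical global-workflow usage, not details stated in this issue.

```shell
# Sketch only: typical follow-up steps after setup_expt.py; exact paths and
# Rocoto XML/database names are assumptions, not taken from this issue.
EXPDIR=/scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/EXPDIR
PSLOT=c768t04

# Generate the Rocoto XML for the experiment created above.
./setup_xml.py ${EXPDIR}/${PSLOT}

# Advance the workflow (normally driven from cron).
cd ${EXPDIR}/${PSLOT}
rocotorun -d ${PSLOT}.db -w ${PSLOT}.xml
```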

Context
We're getting ready to run HR2, and several people are running into issues trying to run C768 experiments on Hera.

Detailed Description
Example error:

srun: error: h24c31: task 4680: Killed
srun: launch/slurm: _step_signal: Terminating StepId=48081037.0
   0: slurmstepd: error: *** STEP 48081037.0 ON h1c04 CANCELLED AT 2023-08-10T14:33:31 ***
4825: forrtl: error (78): process killed (SIGTERM)
4825: Image              PC                Routine            Line        Source
4825: ufs_model.x        00000000064B639B  Unknown               Unknown  Unknown
4825: libpthread-2.17.s  00002ABFD69D9630  Unknown               Unknown  Unknown
4825: libpthread-2.17.s  00002ABFD69D5A35  pthread_cond_wait     Unknown  Unknown
4825: ufs_model.x        0000000001134589  _Z10vmkt_catchP6v         231  ESMCI_VMKernel.C
4825: ufs_model.x        00000000011353FA  _ZN5ESMCI3VMK4exi        2486  ESMCI_VMKernel.C
4825: ufs_model.x        00000000009B9D2F  c_esmc_compwait_         1095  ESMCI_FTable.C
4825: ufs_model.x        00000000008CBCA6  esmf_compmod_mp_e        1248  ESMF_Comp.F90
4825: ufs_model.x        0000000000BB5A86  esmf_gridcompmod_        1891  ESMF_GridComp.F90
4825: ufs_model.x        00000000029AA6CB  fv3gfs_cap_mod_mp        1166  fv3_cap.F90
4825: ufs_model.x        00000000029A8FD5  fv3gfs_cap_mod_mp        1024  fv3_cap.F90

Additional Information

Possible Implementation

Several combinations were tried. Increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 10 to 40 works. Currently trying 20 to see if a smaller increase will also work.
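
As a rough illustration of what that change looks like in practice, one might edit the experiment's config files as sketched below; the file name (config.ufs) and the default value of 10 are assumptions that may vary by global-workflow version.

```shell
# Hypothetical sketch: bump the GFS write-component task count in the experiment
# directory. The file holding this variable (config.ufs here) and the default of 10
# may differ between global-workflow versions; adjust accordingly.
EXPDIR=/scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/EXPDIR/c768t04

sed -i 's/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=40/' \
    "${EXPDIR}/config.ufs"
```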

JessicaMeixner-NOAA added the bug label Aug 11, 2023
@JessicaMeixner-NOAA

FYI @HelinWei-NOAA @barlage @wzzheng90

@HenryRWinterbottom

@JessicaMeixner-NOAA Can you copy the directory of the failed job somewhere so I can take a closer look? I have seen similar errors/exceptions in the past, and they were related to the configurations (e.g., namelists, tables, etc.).

If you stage it somewhere I can run the forecast stand-alone and try to get to the bottom of it. Thank you.

@JessicaMeixner-NOAA

@HenryWinterbottom-NOAA I've already found a successful configuration by changing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 10 to 40; that run just completed.

I'm now testing whether WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20 will also succeed, since I know there's a hesitation to increase resources beyond what is needed. At this point the remaining work is to determine whether 20 is sufficient and whether it works on the other machines as well. I have a job in the queue on WCOSS2 right now with the original setting to see if that works.
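
For anyone reproducing this, one way to re-test after changing the value is sketched below; the Rocoto task name (gfsfcst) and cycle string are illustrative assumptions about the generated XML.

```shell
# Sketch: after editing the config, rewind the failed forecast task and let Rocoto
# resubmit it. The task name "gfsfcst" and the cycle string are illustrative
# assumptions about the generated XML.
cd ${EXPDIR}/${PSLOT}
rocotorewind -d ${PSLOT}.db -w ${PSLOT}.xml -c 202006250000 -t gfsfcst
rocotorun    -d ${PSLOT}.db -w ${PSLOT}.xml
```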

@HenryRWinterbottom

@JessicaMeixner-NOAA OK, great. Keep me posted if you encounter any more issues.

If a resource allocation is the root issue, please open another issue.

Can this be closed?

@JessicaMeixner-NOAA

@HenryWinterbottom-NOAA Personally, I would not consider this issue closed until this configuration can be run out of the box from develop. Or is the expectation that the out-of-the-box resources are not sufficient and that we should all be modifying them on our own? If so, should there be documentation somewhere where we can share which resource settings work?

@HenryRWinterbottom

@JessicaMeixner-NOAA Check your email. I sent you a link to a spreadsheet that, if populated, might answer some of these questions in the future.

@JessicaMeixner-NOAA

@HenryWinterbottom-NOAA Thanks for the email with the resources document. There are many resource-related issues open on the global-workflow right now, and I know this is a more widespread problem than this specific issue. A quick clarification, though: does that mean it's not expected that someone can run a C768 S2SW free forecast (or insert another well-used configuration) out of the box on Hera (or Orion/WCOSS2) without modifying the resources?

@HenryRWinterbottom

No. The user can use whatever they want; the end result would be suggested and/or tested resource allocations.

We don't have the bandwidth to debug them all, but if we have a baseline estimate of what works, it gives the user a starting point. I also realize that not all of these configurations will be tested, but if we have at least some numbers we may be able to extrapolate.

@JessicaMeixner-NOAA

@HenryWinterbottom-NOAA Thanks for getting this conversation and spreadsheet started. I'm in a meeting and then will be on leave, so I won't have time to respond properly until Monday.

I did want to write a quick note that WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20 works for C768 on Hera.

@HenryRWinterbottom

@JessicaMeixner-NOAA No problem, and thank you for following up.

Enjoy your leave and we can chat more next week.

WalterKolczynski-NOAA self-assigned this Aug 14, 2023
@WalterKolczynski-NOAA

I'll update the write tasks this sprint.

@aerorahul

@junwang-noaa
In your opinion, what is the appropriate load-balanced combination of layout, write tasks, etc. for the atmosphere, ocean, ice, and other components at this resolution?

@jiandewang

@aerorahul @junwang-noaa FYI:
For HR1 on Hera we used layout=16x12, OCN=220 PEs, ICE=120 PEs, WAV=80 PEs, MED=300 PEs, write tasks=24, threads=3,
and got 3.6 hours for 7 days of forecast.
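
As a rough, non-authoritative sketch, those HR1 numbers could be restated in config-style variables as below; the variable names are assumptions and may not match the workflow's actual config.resources/config.ufs entries.

```shell
# Purely illustrative sketch: the variable names below are assumptions and may not
# match the workflow's config.ufs/config.resources; they only restate the HR1 Hera
# resources quoted above in a config-style form.
export layout_x_gfs=16        # FV3 layout 16x12 per tile
export layout_y_gfs=12
export nthreads_fv3_gfs=3     # 3 threads
export npe_ocn=220            # MOM6 PEs
export npe_ice=120            # CICE6 PEs
export npe_wav=80             # WW3 PEs
export npe_med=300            # CMEPS mediator PEs
# 24 write tasks per group; how this maps onto
# WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS depends on the thread and tile counts.
```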

WalterKolczynski-NOAA pushed a commit that referenced this issue Aug 21, 2023
We're getting ready to run HR2 and in the process have found a few minor bugs.
While these shouldn't affect others running low-resolution cases, I wanted to push
these bug fixes for anyone trying to run high resolution. These bugs address:
* Issue #1793: adding extra tasks to the write component on Hera for C768 (otherwise
it crashes due to memory)
* Avoiding requesting two wave restarts at the same time (this is a known bug that
the workflow usually has a workaround for. A fix for the underlying WW3 bug should
be coming within the next month, but this will get us through that waiting period)
* Adding a setting that was missed when updating the ufs-weather-model that ensures
CMEPS restarts are written in a reasonable time (see
ufs-community/ufs-weather-model#1873 for more details)
@WalterKolczynski-NOAA

Fixed by #1805
