C768 S2SW does not run out of the box on hera #1793
Comments
@JessicaMeixner-NOAA Can you copy the directory of the failed job somewhere so I can take a closer look? I have seen similar errors/exceptions in the past, and they were related to the configurations (e.g., namelists, tables, etc.). If you stage it somewhere, I can run the forecast stand-alone and try to get to the bottom of it. Thank you.
@HenryWinterbottom-NOAA I've already found a successful configuration by changing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 -> 40; that run just completed. I'm trying to see if WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20 will also succeed, as I know there's a hesitation to increase resources beyond what is needed. At this point the only remaining work is to determine whether 20 is sufficient and whether that works on the other machines as well. I have a job in the queue on WCOSS2 right now with the original setting to see if that works.
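For anyone hitting the same crash before a fix lands, a minimal sketch of the workaround (assuming the variable is set in one of the config files under the experiment directory, e.g. config.ufs; the exact file and default may differ by workflow version):

```bash
# Hypothetical workaround: raise the GFS write-component task count in the
# experiment's config before rerunning the forecast job. The file holding
# this variable (config.ufs here) is an assumption and may differ by version.
EXPDIR=/path/to/your/EXPDIR   # placeholder path

sed -i \
  's/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10/WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=40/' \
  "${EXPDIR}/config.ufs"
```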
@JessicaMeixner-NOAA OK, great. Keep me posted if you encounter any more issues. If a resource allocation is the root issue, please open another issue. Can this be closed?
@HenryWinterbottom-NOAA Personally, I would not consider this issue closed until this configuration runs out of the box from develop. Or is the expectation that resources are not sufficient out of the box, that we should all be modifying them on our own, and that there should be documentation somewhere where we can share what resources work?
@JessicaMeixner-NOAA Check your email. I sent you a link to a spreadsheet that, if populated, might answer some of these questions in the future.
@HenryWinterbottom-NOAA Thanks for the email with the resources document. There are many resource-related issues open on the global-workflow right now, and I know this is a more widespread problem than this specific issue. A quick clarification, though: does that mean it's not expected for someone to be able to run C768 S2SW free-forecast (or insert another well-used configuration) out of the box on hera (or orion/wcoss2) without modifying the resources?
No. The user can use whatever they want. The end result would be suggested and/or tested amounts of allocated resources. We don't have the bandwidth to debug them all, but if we have a base estimate of "what works", it will give the user a starting point. I also realize that not all of these configurations will be tested. But if we have at least some numbers, we may be able to extrapolate.
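To make "what works" concrete, a back-of-the-envelope sketch of how the per-thread-per-tile knob translates into write-component tasks; the six-tile multiplier and the single write group/thread here are illustrative assumptions, not values read from the workflow configs:

```bash
# Illustrative write-component task arithmetic (assumptions: 6 cubed-sphere
# tiles, 1 write group, 1 thread; the actual workflow math may differ).
TILES=6
PER_THREAD_PER_TILE=10   # out-of-the-box value that crashed at C768 on hera
echo "write tasks per group: $(( PER_THREAD_PER_TILE * TILES ))"   # -> 60
PER_THREAD_PER_TILE=40   # increased value that completed successfully
echo "write tasks per group: $(( PER_THREAD_PER_TILE * TILES ))"   # -> 240
```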
@HenryWinterbottom-NOAA Thanks for getting this conversation and spreadsheet started. I'm in a meeting and then will be on leave, so I won't have time to respond properly until Monday. I did want to write a quick note that WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20 works for C768 on hera.
@JessicaMeixner-NOAA No problem, and thank you for following up. Enjoy your leave and we can chat more next week.
I'll update the write tasks this sprint.
@junwang-noaa
@aerorahul @junwang-noaa FYI:
We're getting ready to run HR2 and in the process have found a few minor bugs. While these shouldn't affect others running low-resolution cases, I wanted to push these bug fixes for anyone trying to run at high resolution. These bugs address:

* Issue #1793: adding extra tasks to the write component for hera at C768 (otherwise it crashes due to memory)
* Avoiding requesting two wave restarts at the same time (this is a known bug that the workflow usually has a workaround for; a fix for the underlying WW3 bug should be coming within the next month, but this will get us through the waiting period)
* Adding a setting that was missed when updating the ufs-weather-model, which ensures that CMEPS restarts are written in a reasonable time (see ufs-community/ufs-weather-model#1873 for more details)
Fixed by #1805
Expected behavior
I should be able to run the C768 S2SW forecast for dates with staged ICs (e.g., 2020062500).
Current behavior
The forecast job fails with a segmentation fault.
Machines affected
Hera
To Reproduce
My test case:
```bash
./setup_expt.py gfs forecast-only --app S2SW --pslot c768t04 \
    --configdir /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/global-workflow/parm/config/gfs \
    --idate 2020062500 --edate 2020062500 --res 768 --gfs_cyc 1 \
    --comrot /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/COMROOT \
    --expdir /scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/EXPDIR
```
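After setup_expt.py completes, the usual next steps are to generate the rocoto XML and start cycling; a sketch assuming a recent global-workflow layout (the script name, EXPDIR/PSLOT nesting, and file names are assumptions and may differ by version):

```bash
# Hypothetical follow-on steps; script names and paths are assumptions
# based on recent global-workflow layouts, not quoted from this experiment.
PSLOT=c768t04
EXPDIR=/scratch1/NCEPDEV/climate/Jessica.Meixner/HR2/oceanout/c768t04/EXPDIR/${PSLOT}

./setup_xml.py "${EXPDIR}"                                         # generate ${PSLOT}.xml
rocotorun -w "${EXPDIR}/${PSLOT}.xml" -d "${EXPDIR}/${PSLOT}.db"   # submit/advance tasks
```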
Context
We're getting ready to run HR2 and several people are running into issues trying to run C768 experiments on hera.
Detailed Description
Example error:
Additional Information
Possible Implementation
Several combinations were tried. Increasing WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS from 10 to 40 works. Currently trying 20 to see if a smaller value will also work.
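If the fix goes into the workflow itself rather than each user's EXPDIR, one possible shape for it is a machine-conditional default; the surrounding config structure below is assumed for illustration, not quoted from the repo:

```bash
# Hypothetical machine-conditional default for the C768 write tasks;
# the real config layout in global-workflow may differ.
case "${machine}" in
  "HERA")
    # 10 runs out of memory at C768 on hera; 20 was confirmed to work
    export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
    ;;
  *)
    export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10
    ;;
esac
```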