Tasks running out of memory when they shouldn't #1664

Closed
guillaumevernieres opened this issue Jun 6, 2023 · 15 comments
@guillaumevernieres (Contributor)

Description

I'm running a lower-res prototype GFSv17 with the gdas cycle only and the model at C384/0.25. The following tasks have been failing due to lack of memory:

It looks like all of these tasks have threads-per-core limited to 1, which isn't right. Omitting this option for the ocnanalrun task allows me to complete the analysis on 30 nodes. I wonder if that is also the issue for the other two tasks; I'll test tomorrow and update this issue.
It could also just be bad luck, since none of these tasks use the --exclusive option.

Pinging @aerorahul and @WalterKolczynski-NOAA .
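For context, a minimal sketch of how such a task's resource block might look in the rocoto XML on a SLURM machine; the task name, script path, and resource values below are illustrative assumptions, not the actual workflow settings.

```xml
<!-- Hypothetical rocoto task block (illustrative names and values only).
     Anything inside the native tag is handed to SLURM unchanged; restricting
     the job to one hardware thread per core is the option in question. -->
<task name="gdasocnanalrun" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/ocnanalrun.sh</command>
  <nodes>30:ppn=40</nodes>
  <walltime>00:30:00</walltime>
  <native>--threads-per-core=1</native>
</task>
```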

@guillaumevernieres added the bug label on Jun 6, 2023
@HenryRWinterbottom (Contributor)

@guillaumevernieres I recall a similar issue with the ocean analysis in UFS-RNR. I checked our configuration for C96, and this is what I have:

https://github.com/NOAA-PSL/UFS-RNR/blob/develop/cylc/tasks/tasks.UFS-RNR.1p0.coupled.RDHPCS-Hera.SLURM.yaml#L439

The exclusive directive is in there, which implies that it is needed for the task to complete successfully.

I haven't tested this on Orion, but it was true on Hera. This will, however, not work on the cloud (at least on AWS), since SLURM does not support that directive in their distro/build.
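The linked UFS-RNR file is a cylc-side YAML; in this workflow the closest equivalent would presumably be passing the flag through the rocoto XML as a native SLURM directive, roughly as sketched below (task name and values are again illustrative assumptions).

```xml
<!-- Hypothetical: reserve whole nodes for the task so it gets all of their
     memory, at the cost of blocking other jobs from sharing those nodes. -->
<task name="gdasocnanalrun" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/ocnanalrun.sh</command>
  <nodes>30:ppn=40</nodes>
  <walltime>00:30:00</walltime>
  <native>--exclusive</native>
</task>
```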

@guillaumevernieres (Contributor, Author)

Thanks for checking, @HenryWinterbottom-NOAA. I don't know what the optimal configuration would look like, but I don't think "exclusive" is it, not for tasks for which we have a rough idea of the memory footprint. I'll try to come up with memory estimates for these 3 tasks.

@CoryMartin-NOAA (Contributor)

I noticed this recently too. I think we need to add --mem=100G-type directives to the rocoto XML. I'm not sure why this changed all of a sudden, but it even happens at C96 resolution for various tasks.

@guillaumevernieres (Contributor, Author)

> I noticed this recently too. I think we need to add --mem=100G-type directives to the rocoto XML. I'm not sure why this changed all of a sudden, but it even happens at C96 resolution for various tasks.

Haha ... 100 GB, people are going to scream when we put these numbers down for the DA :). But yes, that's exactly what I'm doing right now, @CoryMartin-NOAA.
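A sketch of that explicit-memory approach, assuming rocoto's memory tag (which rocoto turns into a SLURM memory request such as --mem); the task name, core count, and the 100G figure are illustrative, taken only from the comment above rather than from any vetted estimate.

```xml
<!-- Hypothetical sketch: request memory explicitly instead of relying on
     whole-node exclusivity. Rocoto forwards the memory tag to the scheduler
     as a memory request. -->
<task name="gdasanalcalc" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/analcalc.sh</command>
  <cores>128</cores>
  <walltime>00:15:00</walltime>
  <memory>100G</memory>
</task>
```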

@HenryRWinterbottom (Contributor)

@guillaumevernieres I think exclusive tries to guarantee that you are the only user on a node. So yes, maybe not what you need.

@WalterKolczynski-NOAA (Contributor) commented Aug 22, 2023

I need some C384/0.25° ICs on Hera so I can test. Can someone point me to them?

@WalterKolczynski-NOAA (Contributor)

Or maybe I can just use P8?

@JessicaMeixner-NOAA (Contributor)

I think @guillaumevernieres should have a C384/0.25 setup for cycling. For forecast-only you could use P8 ICs, but avoid 2013010100 (as there seems to be an instability).

@WalterKolczynski-NOAA (Contributor)

These are all DA jobs (although gdasfcst seems to have already been fixed), so I'll need the cycling set. Thanks for confirming, @JessicaMeixner-NOAA

Also @guillaumevernieres, how many threads should the ocnanal job use?

@guillaumevernieres (Contributor, Author)

> These are all DA jobs (although gdasfcst seems to have already been fixed), so I'll need the cycling set. Thanks for confirming, @JessicaMeixner-NOAA
>
> Also @guillaumevernieres, how many threads should the ocnanal job use?

@WalterKolczynski-NOAA I don't really need help with the marine DA, but we can discuss the merits of my changes in my next PR.

I have made changes following @NeilBarton-NOAA's and @JessicaMeixner-NOAA's suggestions for the forecast step, but I don't see similar changes in develop. When you say that the forecast was fixed, when did that happen?

The analcalc job is failing for everybody, I think; the quick fix is to increase the requested memory, but I don't know if there is an underlying bug that needs to be addressed.

@WalterKolczynski-NOAA (Contributor)

@guillaumevernieres For the forecast, #1763 updated the number of threads for C384 and @CoryMartin-NOAA reported it now runs to completion.

I was trying to fix the others, but I can't test solutions unless I have ICs. If you are already working on it, please assign yourself to this issue and I will move on to something else.

@guillaumevernieres (Contributor, Author)

> @guillaumevernieres For the forecast, #1763 updated the number of threads for C384 and @CoryMartin-NOAA reported it now runs to completion.
>
> I was trying to fix the others, but I can't test solutions unless I have ICs. If you are already working on it, please assign yourself to this issue and I will move on to something else.

@WalterKolczynski-NOAA, the PR you point to was for fixing the analcalc job ... which I suppose @CoryMartin-NOAA told me about. I forgot, sorry!

Give me a few minutes to dig out warm coupled ICs so you can test the S2S forecast in cycling mode.

@WalterKolczynski-NOAA (Contributor)

It fixed the fcst as well AFAIK, by updating config.ufs.

@guillaumevernieres (Contributor, Author)

> It fixed the fcst as well AFAIK, by updating config.ufs.

OK.

Warm S2S ICs:
/scratch2/NCEPDEV/ocean/Guillaume.Vernieres/data/ICSDIR/C384O025/gdas.20210701/

@WalterKolczynski-NOAA (Contributor)

Thanks, will check things out.
