Tasks running out of memory when they shouldn't #1664

Closed
guillaumevernieres opened this issue Jun 6, 2023 · 15 comments
@guillaumevernieres (Contributor)

Description

I'm running a lower-res prototype GFSv17 with the gdas cycle only and the model at C384/0.25. The following tasks have been failing due to lack of memory:

It looks like all of these tasks have threads-per-core limited to 1, which isn't right. Omitting this option for the ocnanalrun task allows me to complete the analysis on 30 nodes. I wonder if that is also the issue for the other two tasks; I'll test tomorrow and update this issue.
It could also just be bad luck, since none of these tasks use the --exclusive option.

Pinging @aerorahul and @WalterKolczynski-NOAA .
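For context, a minimal sketch of how such a task's resource block might look in the rocoto XML on a SLURM machine; the task name, script path, and resource values below are illustrative assumptions, not the actual workflow settings.

```xml
<!-- Hypothetical rocoto task block (illustrative names and values only).
     Anything inside the native tag is handed to SLURM unchanged; restricting
     the job to one hardware thread per core is the option in question. -->
<task name="gdasocnanalrun" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/ocnanalrun.sh</command>
  <nodes>30:ppn=40</nodes>
  <walltime>00:30:00</walltime>
  <native>--threads-per-core=1</native>
</task>
```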

@guillaumevernieres added the bug label on Jun 6, 2023
@HenryRWinterbottom (Contributor)

@guillaumevernieres I recall a similar issue with the ocean analysis in UFS-RNR. I checked our configuration for C96, and this is what I have:

https://github.com/NOAA-PSL/UFS-RNR/blob/develop/cylc/tasks/tasks.UFS-RNR.1p0.coupled.RDHPCS-Hera.SLURM.yaml#L439

The exclusive directive is in there, which implies that it is needed for the task to complete successfully.

I haven't tested this on Orion, but it was true on Hera. This will, however, not work on the cloud (at least on AWS), since SLURM does not support that directive in their distro/build.
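The linked UFS-RNR file is a cylc-side YAML; in this workflow the closest equivalent would presumably be passing the flag through the rocoto XML as a native SLURM directive, roughly as sketched below (task name and values are again illustrative assumptions).

```xml
<!-- Hypothetical: reserve whole nodes for the task so it gets all of their
     memory, at the cost of blocking other jobs from sharing those nodes. -->
<task name="gdasocnanalrun" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/ocnanalrun.sh</command>
  <nodes>30:ppn=40</nodes>
  <walltime>00:30:00</walltime>
  <native>--exclusive</native>
</task>
```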

@guillaumevernieres (Contributor, Author)

Thanks for checking, @HenryWinterbottom-NOAA. I don't know what the optimal configuration would look like, but I don't think "exclusive" is it, not for tasks for which we have a rough idea of the memory footprint. I'll try to come up with memory estimates for these 3 tasks.

@CoryMartin-NOAA (Contributor)

I noticed this recently too. I think we need to add --mem=100G-type directives to the rocoto XML. I'm not sure why this changed all of a sudden, but it even happens at C96 resolution for various tasks.

@guillaumevernieres (Contributor, Author)

> I noticed this recently too. I think we need to add --mem=100G-type directives to the rocoto XML. I'm not sure why this changed all of a sudden, but it even happens at C96 resolution for various tasks.

Haha ... 100 GB, people are going to scream when we put these numbers down for the DA :). But yes, that's exactly what I'm doing right now, @CoryMartin-NOAA.
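A sketch of that explicit-memory approach, assuming rocoto's memory tag (which rocoto turns into a SLURM memory request such as --mem); the task name, core count, and the 100G figure are illustrative, taken only from the comment above rather than from any vetted estimate.

```xml
<!-- Hypothetical sketch: request memory explicitly instead of relying on
     whole-node exclusivity. Rocoto forwards the memory tag to the scheduler
     as a memory request. -->
<task name="gdasanalcalc" cycledefs="gdas" maxtries="2">
  <command>/path/to/jobs/rocoto/analcalc.sh</command>
  <cores>128</cores>
  <walltime>00:15:00</walltime>
  <memory>100G</memory>
</task>
```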

@HenryRWinterbottom (Contributor)

@guillaumevernieres I think exclusive tries to guarantee that you are the only user on a node. So yes, maybe not what you need.

@WalterKolczynski-NOAA (Contributor) commented Aug 22, 2023

I need some C384/0.25° ICs on Hera so I can test. Can someone point me to them?

@WalterKolczynski-NOAA (Contributor)

Or maybe I can just use P8?

@JessicaMeixner-NOAA (Contributor)

I think @guillaumevernieres should have a C384/0.25 setup for cycling. For forecast-only you could use P8 ICs, but avoid 2013010100 (as there seems to be an instability).

@WalterKolczynski-NOAA (Contributor)

These are all DA jobs (although gdasfcst seems to have already been fixed), so I'll need the cycling set. Thanks for confirming, @JessicaMeixner-NOAA

Also @guillaumevernieres, how many threads should the ocnanal job use?

@guillaumevernieres (Contributor, Author)

> These are all DA jobs (although gdasfcst seems to have already been fixed), so I'll need the cycling set. Thanks for confirming, @JessicaMeixner-NOAA
>
> Also @guillaumevernieres, how many threads should the ocnanal job use?

@WalterKolczynski-NOAA I don't really need help with the marine DA, but we can discuss the merits of my changes in my next PR.

I have made changes following @NeilBarton-NOAA's and @JessicaMeixner-NOAA's suggestions for the forecast step, but I don't see similar changes in develop. When you say that the forecast was fixed, when did that happen?

The analcalc job is failing for everybody, I think; the quick fix is to increase the requested memory, but I don't know if there is an underlying bug that needs to be addressed.

@WalterKolczynski-NOAA (Contributor)

@guillaumevernieres For the forecast, #1763 updated the number of threads for C384 and @CoryMartin-NOAA reported it now runs to completion.

I was trying to fix the others, but I can't test solutions unless I have ICs. If you are already working on it, please assign yourself to this issue and I will move on to something else.

@guillaumevernieres (Contributor, Author)

> @guillaumevernieres For the forecast, #1763 updated the number of threads for C384 and @CoryMartin-NOAA reported it now runs to completion.
>
> I was trying to fix the others, but I can't test solutions unless I have ICs. If you are already working on it, please assign yourself to this issue and I will move on to something else.

@WalterKolczynski-NOAA, the PR you point to was for fixing the analcalc job ... which I suppose @CoryMartin-NOAA told me about. I forgot, sorry!

Give me a few minutes to dig out warm coupled ICs so you can test the S2S forecast in cycling mode.

@WalterKolczynski-NOAA (Contributor)

It fixed the fcst as well AFAIK, by updating config.ufs.

@guillaumevernieres (Contributor, Author)

> It fixed the fcst as well AFAIK, by updating config.ufs.

OK.

Warm S2S ICs:
/scratch2/NCEPDEV/ocean/Guillaume.Vernieres/data/ICSDIR/C384O025/gdas.20210701/

@WalterKolczynski-NOAA (Contributor)

Thanks, will check things out.
