
config.resources specifies 40 threads for enkf.x on Hera #1084

Closed
RussTreadon-NOAA opened this issue Oct 21, 2022 · 3 comments
Labels
bug Something isn't working

Comments

@RussTreadon-NOAA
Contributor

Expected behavior
eupd successfully runs enkf.x on Hera with less than 40 nodes.

Current behavior
config.resources includes ${machine} = "HERA" blocks which set the number of threads to 40 (nth_eupd=40). As a result, the eupd job runs enkf.x with many more nodes than are necessary on Hera.

Machines affected
Hera

To Reproduce
To see this behavior

  1. install g-w develop on Hera
  2. set up EXPDIR for $PSLOT at CASE=C96
  3. execute ./setup_xml.py to generate $PSLOT.xml
  4. open $PSLOT.xml in an editor
  5. scroll down to the gdaseupd section. You will see
        <queue>debug</queue>
        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>40:ppn=1:tpp=40</nodes>
        <native>--export=NONE</native>

eupd will be run on 40 nodes, 1 task per node, 40 threads per task.
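The 40-node layout follows directly from how config.resources derives tasks per node from the thread count. A minimal sketch of that arithmetic, assuming Hera's 40 cores per node (npe_node_max=40) and a ceiling division for the node count:

```shell
#!/bin/bash
# Sketch (not the workflow's actual code) of how nth_eupd=40 forces
# a 40-node request on Hera. Assumes npe_node_max=40 cores per node.
npe_node_max=40
npe_eupd=40    # MPI tasks for C96 eupd on Hera
nth_eupd=40    # threads per task from config.resources

# tasks per node = cores per node / threads per task
npe_node_eupd=$(( npe_node_max / nth_eupd ))                  # 40/40 = 1

# node count = ceiling(tasks / tasks per node)
nodes=$(( (npe_eupd + npe_node_eupd - 1) / npe_node_eupd ))   # 40

echo "${nodes}:ppn=${npe_node_eupd}:tpp=${nth_eupd}"          # 40:ppn=1:tpp=40
```

With only one task fitting per node, the 40-task job spreads across 40 nodes even though each node runs a single MPI rank.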

Context
C96L127 eupd does not require 40 nodes to run enkf.x on Hera. enkf.x can be run on two nodes at this resolution.

Detailed Description
The eupd section of config.resources contains ${machine} = "HERA" blocks which specify that enkf.x be run with 40 threads, nth_eupd=40.

elif [ ${step} = "eupd" ]; then

    export wtime_eupd="00:30:00"
    if [ ${CASE} = "C768" ]; then
      export npe_eupd=480
      export nth_eupd=6
      if [[ ${machine} = "HERA" ]]; then
        export npe_eupd=150
        export nth_eupd=40
      fi
    elif [ ${CASE} = "C384" ]; then
      export npe_eupd=270
      export nth_eupd=2
      if [[ ${machine} = "HERA" ]]; then
        export npe_eupd=100
        export nth_eupd=40
      fi
      if [[ ${machine} = "S4" ]]; then
         export npe_eupd=160
         export nth_eupd=2
      fi
    elif [[ ${CASE} = "C192" || ${CASE} = "C96" || ${CASE} = "C48" ]]; then
      export npe_eupd=42
      export nth_eupd=2
      if [[ ${machine} = "HERA" ]]; then
        export npe_eupd=40
        export nth_eupd=40
      fi
    fi
    export npe_node_eupd=$(echo "${npe_node_max} / ${nth_eupd}" | bc)

It is not clear why nth_eupd=40 threads are specified for enkf.x on Hera. This results in eupd requesting many more nodes than are necessary to run enkf.x on Hera.

Possible Implementation
We should consider reducing the Hera thread count for eupd to be consistent with other machines. Doing so, of course, requires testing at each $CASE on Hera to ensure there are no adverse impacts.

For my C96L127 parallel on Hera, config.resources has

        export npe_eupd=40
        export nth_eupd=2

This results in $PSLOT.xml requesting 2 nodes to run eupd:

        <partition>hera</partition>
        <walltime>00:30:00</walltime>
        <nodes>2:ppn=20:tpp=2</nodes>
        <native>--export=NONE</native>
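The same arithmetic as above explains the 2-node request; a sketch, again assuming npe_node_max=40 on Hera:

```shell
#!/bin/bash
# Sketch of the reduced-thread layout: with 2 threads per task,
# 20 tasks fit on each 40-core Hera node, so 40 tasks need 2 nodes.
npe_node_max=40
npe_eupd=40
nth_eupd=2

npe_node_eupd=$(( npe_node_max / nth_eupd ))                  # 40/2 = 20

nodes=$(( (npe_eupd + npe_node_eupd - 1) / npe_node_eupd ))   # 2

echo "${nodes}:ppn=${npe_node_eupd}:tpp=${nth_eupd}"          # 2:ppn=20:tpp=2
```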
@RussTreadon-NOAA RussTreadon-NOAA added the bug Something isn't working label Oct 21, 2022
@KateFriedman-NOAA
Member

KateFriedman-NOAA commented Oct 21, 2022

@RussTreadon-NOAA I have eupd resource updates coming in via PR #1070. The PR is in final review. This PR is mainly for WCOSS2 resources but touches some of the R&D ones as well. Please see these resources:

https://github.com/KateFriedman-NOAA/global-workflow/blob/feature/dev-wcoss2-resources/parm/config/config.resources#L630:L661

These ^ resources result in the following xml settings on Hera:

C384C192L127: <nodes>54:ppn=5:tpp=8</nodes>
C192C96L127: <nodes>5:ppn=10:tpp=4</nodes>
C96C48L127: <nodes>5:ppn=10:tpp=4</nodes> (I did not test this resolution)
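To compare these layouts at a glance, a small hypothetical helper (not part of the workflow) can decode a Rocoto <nodes> spec into total MPI tasks and total cores:

```shell
#!/bin/bash
# Hypothetical helper: decode "N:ppn=P:tpp=T" into total tasks (N*P)
# and total cores (N*P*T) for comparing the eupd layouts above.
decode_nodes() {
  local spec=$1 n p t
  n=${spec%%:*}                  # node count
  p=${spec#*ppn=}; p=${p%%:*}    # tasks per node
  t=${spec#*tpp=}                # threads per task
  echo "$spec -> $((n * p)) tasks, $((n * p * t)) cores"
}

decode_nodes "54:ppn=5:tpp=8"    # 270 tasks, 2160 cores
decode_nodes "5:ppn=10:tpp=4"    # 50 tasks, 200 cores
decode_nodes "2:ppn=20:tpp=2"    # 40 tasks, 80 cores
```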

The C384C192L127 value still seems somewhat high, although it's down from the 100 nodes that were set previously.

On Orion the C384C192L127 eupd job runs with fewer nodes (<nodes>14:ppn=20:tpp=2</nodes>) although Orion has twice the memory per node.

What are your thoughts on the eupd values coming in via PR #1070?

@RussTreadon-NOAA
Contributor Author

@KateFriedman-NOAA , thanks for the update. I cannot comment on recommended resource settings on Hera without running test cases at various resolutions. As noted above, I am currently running eupd for

export LEVS=128
export CASE="C96"
export CASE_ENKF="C48"

on Hera with <nodes>2:ppn=20:tpp=2</nodes>. 5 nodes for C96C48L127 is more than is necessary on Hera.

Since PR #1070, in part, addresses concerns of this issue, I am closing this issue.

@KateFriedman-NOAA
Member

> on Hera with 2:ppn=20:tpp=2. 5 nodes for C96C48L127 is more than is necessary on Hera.

Noted, thanks @RussTreadon-NOAA ! There is definitely more refinement/optimization to come for resources on all machines, so I will keep these C96C48L127 values in mind. We currently group C192, C96, and C48 together with the same resources, but given your information we can break them apart and reduce the values for C96 and C48 in future PRs.
