
Fixes requested by NCO on some of the resources used in operations. #603

Merged
merged 16 commits into feature/ops-wcoss2 from feature/nco-fixes-red-numbers on Jan 26, 2022

Conversation

@aerorahul (Contributor) commented Jan 20, 2022

Description
NCO requested clarification and updates to the resources for some of the jobs run in operations.
This PR addresses those comments and concerns.
Some of the changes are still being tested and may need further refinement.

Tagging @WeiWei-NCO

This work is part of #398

Type of change

  • Bug fix (non-breaking change)

How Has This Been Tested?
The only way to test this is for NCO to run the jobs in this PR and validate that they meet their acceptance criteria.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes need updates to the documentation, and I have made the corresponding changes
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@WalterKolczynski-NOAA (Contributor) left a comment


Mostly looks good as long as they meet NCO timing requirements and don't crash due to OOM. There are a couple that were reduced to less than a full node but still are requesting exclusive mode. Seems like they either shouldn't use exclusive, or you might as well throw all the CPUs at it since they are reserved anyway as long as the overhead doesn't actually make the job slower.

@aerorahul (Contributor, Author)

> Mostly looks good as long as they meet NCO timing requirements and don't crash due to OOM. There are a couple that were reduced to less than a full node but still are requesting exclusive mode. Seems like they either shouldn't use exclusive, or you might as well throw all the CPUs at it since they are reserved anyway as long as the overhead doesn't actually make the job slower.

I haven't gone through all jobs, but we shouldn't request more CPUs than required or actually used.
This came about from conversations w/ @WeiWei-NCO.

@WalterKolczynski-NOAA (Contributor) commented Jan 21, 2022

>> Mostly looks good as long as they meet NCO timing requirements and don't crash due to OOM. There are a couple that were reduced to less than a full node but still are requesting exclusive mode. Seems like they either shouldn't use exclusive, or you might as well throw all the CPUs at it since they are reserved anyway as long as the overhead doesn't actually make the job slower.

> I haven't gone through all jobs, but we shouldn't request more CPUs than required or actually used. This came about from conversations w/ @WeiWei-NCO.

I agree. My point was just that if we are requesting exclusive mode, we are "using" the other CPUs on that node, even if we aren't requesting them explicitly.

@aerorahul (Contributor, Author)

>>> Mostly looks good as long as they meet NCO timing requirements and don't crash due to OOM. There are a couple that were reduced to less than a full node but still are requesting exclusive mode. Seems like they either shouldn't use exclusive, or you might as well throw all the CPUs at it since they are reserved anyway as long as the overhead doesn't actually make the job slower.

>> I haven't gone through all jobs, but we shouldn't request more CPUs than required or actually used. This came about from conversations w/ @WeiWei-NCO.

> I agree. My point was just that if we are requesting exclusive mode, we are "using" the other CPUs on that node, even if we aren't requesting them explicitly.

Actually, you are not "using" them. If one tries to do mpiexec -n number_exceeding_requested_cores, the job will fail.
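To make the distinction concrete, here is a minimal sketch of a PBS Pro job card of the kind used on WCOSS2; the job name, queue, core count, and memory values are hypothetical placeholders, not taken from this PR:

  #!/bin/bash
  #PBS -N hypothetical_job                         # hypothetical name, for illustration only
  #PBS -q dev                                      # hypothetical queue
  #PBS -l walltime=00:30:00
  #PBS -l select=1:ncpus=28:mpiprocs=28:mem=40GB   # only 28 cores requested on the node
  #PBS -l place=excl                               # node is still placed exclusively

  # Launching up to the requested core count works:
  mpiexec -n 28 ./hypothetical_exec

  # Launching more ranks than the requested ncpus (e.g. mpiexec -n 128 on a
  # 128-core node) fails, even though place=excl keeps other jobs off the node.

In other words, exclusive placement keeps other work off the node, but the number of ranks that can actually be launched is still bounded by the select/ncpus request, which is the point being made above.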

@KateFriedman-NOAA (Member) left a comment


Changes look good; I'll need to incorporate them into the rocoto workflow and test. I left questions on the waveinit and wavepostsbs resource changes: they need either excl or a memory setting per Wei's request. A rough sketch of the two options follows.
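For context, the two alternatives being asked about for a below-node-size job like waveinit or wavepostsbs would look roughly like the following; the core counts and memory value are hypothetical placeholders, not the actual settings in this PR:

  # Option A: keep exclusive placement, so the whole node (and all of its memory) is reserved
  #PBS -l select=1:ncpus=11:mpiprocs=11
  #PBS -l place=excl

  # Option B: drop exclusive placement but request memory explicitly,
  # so the job is protected against OOM when the node is shared
  #PBS -l select=1:ncpus=11:mpiprocs=11:mem=20GB
  #PBS -l place=shared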

@aerorahul (Contributor, Author)

@NOAA-EMC/global-workflow-cms
I have added a few more updates since the last review following changes from NCO.

KateFriedman-NOAA and others added 3 commits January 25, 2022 19:54

  • Update the gdasesfc, gdasanal, and gfsanal job resources in ecf scripts, based on optimization work between NCO/EMC (Refs: #399)
  • Based on optimization work between NCO/EMC, update the echgres job resources to be more appropriate for the larger-node machine (Refs: #399)
@aerorahul (Contributor, Author)

@KateFriedman-NOAA @WeiWei-NCO
I think this is current as of this evening's testing and can be merged.
There may be a few more changes later, but I think we are on solid ground now.

KateFriedman-NOAA added a commit to KateFriedman-NOAA/global-workflow that referenced this pull request Jan 26, 2022

  • Fold in further resource optimization values from testing done by NCO/Wei and committed to ecf scripts in PR NOAA-EMC#603 (Refs: NOAA-EMC#399)
  • Adjust the analysis job walltimes to be 50 mins for both the gdas and gfs suite jobs (Refs: #399)
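For reference, the walltime change described above would amount to a directive along these lines in the gdas and gfs analysis job cards (illustrative only; the rest of the card is not shown):

  #PBS -l walltime=00:50:00   # 50-minute limit for both the gdas and gfs analysis jobs
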
@KateFriedman-NOAA (Member) left a comment


Changes look good and conform to expected optimization outcomes between NCO/EMC.

@KateFriedman-NOAA merged commit 5e54f2f into feature/ops-wcoss2 on Jan 26, 2022
@KateFriedman-NOAA deleted the feature/nco-fixes-red-numbers branch on January 26, 2022 14:17
kayeekayee pushed a commit to kayeekayee/global-workflow that referenced this pull request May 30, 2024
* Switch ccpp-physics submodule.

* ccpp NOAA-GSL#17: git action ci yaml update.