CCPP updates: UGWPv1 decomp bug fixes, remove Julie from CODEOWNERS; bug fix in tests/run_compile.sh #835

Merged

Conversation

Collaborator

@climbfuji climbfuji commented Sep 28, 2021

PR Checklist

  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR. Please consult the ufs-weather-model wiki if you are unsure how to do this.

  • This PR has been tested using a branch which is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR

  • An Issue describing the work contained in this PR has been created either in the subcomponent(s) or in the ufs-weather-model. The Issue should be created in the repository that is most relevant to the changes contained in the PR. The Issue and the dependent sub-component PR
    are specified below.

  • If new or updated input data is required by this PR, it is clearly stated in the text of the PR.

Description

Update submodule pointer for fv3atm for the changes described in the associated PRs below:

  • Substantial changes in UGWPv1 to fix the lack of reproducibility when changing the domain decomposition layout.
  • Remove Julie from CODEOWNERS files in both ccpp-framework and ccpp-physics.

See NCAR/ccpp-physics#728 for a detailed description of the UGWPv1 updates.
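
A minimal sketch of how decomposition reproducibility could be checked in general, assuming the same test has been run twice with different layouts and that its output files should then match byte for byte. The function name `compare_runs` and the two run directories are hypothetical; this is not the comparison used by the regression test scripts.

```shell
# Hypothetical helper: compare every regular file in dir_a against the
# file of the same name in dir_b. Returns 1 if any file differs or is missing.
compare_runs() {
  local dir_a=$1 dir_b=$2 status=0 file name
  for file in "${dir_a}"/*; do
    [ -f "${file}" ] || continue
    name=$(basename "${file}")
    # cmp -s is silent and exits non-zero on any difference or missing file.
    if cmp -s "${file}" "${dir_b}/${name}"; then
      echo "OK ${name}"
    else
      echo "DIFF ${name}"
      status=1
    fi
  done
  return "${status}"
}
```

A call such as `compare_runs run_layout_a run_layout_b` (directory names are placeholders) would list each file as `OK` or `DIFF` and return non-zero if the two layouts disagree.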

Also: a bug fix from @DusanJovic-NOAA in tests/run_compile.sh to correctly catch failed compile jobs (e.g. when the disk is full).
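
For illustration only, a sketch of the general pattern for catching a failed job in a shell script. It is not the actual change made in tests/run_compile.sh; the function name `run_step` and its log-file argument are hypothetical. The idea is to test the exit status of the command itself, so that a failure such as a full disk is reported instead of being silently ignored.

```shell
# Hypothetical helper: run a command, send its output to a log file,
# and propagate the command's exit status to the caller.
run_step() {
  local log_file=$1
  shift
  if "$@" >"${log_file}" 2>&1; then
    echo "PASS: $*"
    return 0
  else
    # In the else branch, $? still holds the exit status of the command.
    local rc=$?
    echo "FAIL (exit ${rc}): $*"
    return "${rc}"
  fi
}
```

A caller could then write `run_step compile.log make || exit 1` (the log name and `make` are placeholders) so that a failed build stops the script.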

No new input data required, but the results for the following regression tests will change:

  • For GNU:
      • control_ugwpv1
      • control_ugwpv1_debug
  • For Intel:
      • control_ugwpv1
      • control_ugwpv1_debug
      • cpld_bmark_wave_v16_p7b
      • fv3_hrrr
      • fv3_rap

These baseline changes are expected, because all of these tests utilize UGWPv1.

Issue(s) addressed

Fixes #742

Testing

How were these changes tested? What compilers / HPCs was it tested with? Are the changes covered by regression tests? (If not, why? Do new tests need to be added?) Have regression tests and unit tests (utests) been run? On which platforms and with which compilers? (Note that unit tests can only be run on tier-1 platforms)

  • hera.intel
  • hera.gnu
  • orion.intel
  • cheyenne.intel
  • cheyenne.gnu
  • gaea.intel
  • jet.intel
  • wcoss_cray
  • wcoss_dell_p3
  • CI - 52ddbce

Dependencies

@climbfuji climbfuji changed the title from "CCPP updates: UGWPv1 decomp bug fixes, remove Julie from CODEOWNERS" to "CCPP updates: UGWPv1 decomp bug fixes, remove Julie from CODEOWNERS; bug fix in tests/run_compile.sh" Sep 28, 2021
@climbfuji climbfuji marked this pull request as ready for review September 28, 2021 17:23
@climbfuji climbfuji added the Baseline Updates (current baselines will be updated), cheyenne-gnu-BL, and Waiting for Reviews (the PR is waiting for reviews from associated component PRs) labels Sep 28, 2021
@github-actions github-actions bot removed the run-ci label Sep 28, 2021
@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/745008573/20210928191514/ufs-weather-model
Please manually delete: /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_16035
Test control_thompson_extdiag_debug 076 failed failed
Test control_thompson_extdiag_debug 076 failed in run_test failed
Test cpld_bmark_v16 010 failed failed
Test cpld_bmark_v16 010 failed in run_test failed
Please make changes and add the following label back:
hera-intel-RT

@climbfuji
Collaborator Author

Automated RT Failure Notification
Machine: hera
Compiler: intel
Job: RT
Repo location: /scratch1/NCEPDEV/nems/emc.nemspara/autort/pr/745008573/20210928191514/ufs-weather-model
Please manually delete: /scratch1/NCEPDEV/stmp2/emc.nemspara/FV3_RT/rt_16035
Test control_thompson_extdiag_debug 076 failed failed
Test control_thompson_extdiag_debug 076 failed in run_test failed
Test cpld_bmark_v16 010 failed failed
Test cpld_bmark_v16 010 failed in run_test failed
Please make changes and add the following label back:
hera-intel-RT

Both jobs hung right at the beginning; possibly a Slurm issue? I will recover the regression test log file, rerun those two jobs manually, append the results to the log file, and commit.

@BrianCurtis-NOAA
Collaborator

Automated RT Failure Notification
Machine: orion
Compiler: intel
Job: BL
Repo location: /work/noaa/nems/emc.nemspara/autort/pr/745008573/20210928140009/ufs-weather-model
Please manually delete: /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/rt_304452
Test control_c384gdas_wav 093 failed failed
Test control_c384gdas_wav 093 failed in run_test failed
Please make changes and add the following label back:
orion-intel-BL

@climbfuji
Collaborator Author

climbfuji commented Sep 29, 2021

Automated RT Failure Notification
Machine: orion
Compiler: intel
Job: BL
Repo location: /work/noaa/nems/emc.nemspara/autort/pr/745008573/20210928140009/ufs-weather-model
Please manually delete: /work/noaa/stmp/bcurtis/stmp/bcurtis/FV3_RT/rt_304452
Test control_c384gdas_wav 093 failed failed
Test control_c384gdas_wav 093 failed in run_test failed
Please make changes and add the following label back:
orion-intel-BL

Timeout ... and somehow a weird message about out of memory. Something to keep in mind, just in case this happens again:

...
33 min. TEST 093 control_c384gdas_wav is running,  status: R jobid 3214774
Slurm unknown status -. Check sacct ...
3214774                   TIMEOUT        rt_304452_093
3214774.bat+            CANCELLED                batch
3214774.ext+        OUT_OF_MEMORY               extern
3214774.0                  FAILED              fv3.exe
34 min. TEST 093 control_c384gdas_wav is TIMEOUT,  status: - jobid 3214774
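
Should this happen again, a small filter like the following sketch could pull the failing job steps out of accounting output in the same three-column layout as the lines above (job step, state, name). The function name `failed_steps` is hypothetical, and the list of states it matches is an assumption based on the states seen in this log.

```shell
# Hypothetical helper: read "step state name" rows on stdin and print
# "step state" for every row whose state indicates a failure.
failed_steps() {
  awk '$2 ~ /^(TIMEOUT|CANCELLED|OUT_OF_MEMORY|FAILED|NODE_FAIL)$/ { print $1, $2 }'
}
```

Assuming the rows come from something like `sacct -j <jobid> --format=JobID,State,JobName`, piping that output into `failed_steps` would list only the steps that did not finish cleanly.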

EDIT: This was the baseline creation step, and everything is under @BrianCurtis-NOAA's user/permissions ... will create baseline again, manually ...

@DeniseWorthen
Collaborator

So auto-RT works on orion because nothing needs to be moved, but auto-BL does not work on orion, correct?

@climbfuji
Collaborator Author

So auto-RT works on orion because nothing needs to be moved, but auto-BL does not work on orion, correct?

Well, the one test failed because of a timeout, but I couldn't access the rest of the baseline that was created under Brian's stmp directory. Otherwise I would have copied that, rerun only the one job to create the missing baseline, and then verified against it. Instead, I had to create the entire baseline again.

@climbfuji
Collaborator Author

fv3atm hash correct, ready to merge

@climbfuji climbfuji added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Sep 29, 2021
@junwang-noaa junwang-noaa merged commit 3e5cac8 into ufs-community:develop Sep 29, 2021
Development

Successfully merging this pull request may close these issues.

control_ugwpv1 fails when changing the decomp
5 participants