Bug fix for izumi nag tests to pass (PR into tmp-241219) #2925

Merged
slevis-lmwg merged 6 commits into ESCOMP:tmp-241219 from fix_izumi_nag_tests
Jan 9, 2025

Conversation

slevis-lmwg
Contributor

Description of changes

Allocation statements should have been (0:mxpft) instead of (mxpft).
I introduced the bug in a small refactor requested in #2917.
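
For readers less familiar with Fortran array bounds, here is a minimal Python sketch of the difference between the two allocation forms. In Fortran, a dimension declared without an explicit lower bound starts at 1, so (mxpft) covers 1..mxpft while (0:mxpft) covers 0..mxpft. The helper name fortran_index_range and the value of MXPFT are hypothetical and chosen only for illustration. That the affected code reads or writes element 0 is inferred from the fix and is not stated in this PR.

```python
def fortran_index_range(lower, upper):
    """Return the valid indices of a Fortran array dimension declared (lower:upper)."""
    return range(lower, upper + 1)


MXPFT = 5  # hypothetical small value, for illustration only

# allocate(x(mxpft)): the lower bound defaults to 1, so index 0 is out of bounds
buggy = fortran_index_range(1, MXPFT)

# allocate(x(0:mxpft)): index 0 is part of the array
fixed = fortran_index_range(0, MXPFT)

print("index 0 valid with (mxpft):  ", 0 in buggy)
print("index 0 valid with (0:mxpft):", 0 in fixed)
print("extra elements in the fix:   ", len(fixed) - len(buggy))
```

A compiler with runtime bounds checking enabled can flag an access to index 0 in the first case, which would be consistent with the failures showing up in the izumi_nag debug tests.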

Specific notes

CTSM Issues Fixed (include github issue #):
Fixes #2924

Are answers expected to change (and if so in what way)?
Not relative to ctsm5.3.016.

Testing performed, if any:

PASS SMS_D_Ld65.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-FireLi2024GSWP
PASS SMS_D.f10_f10_mg37.I2000Clm60BgcCrop.izumi_nag.clm-crop
PASS ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso

I will submit aux_clm next.

@slevis-lmwg slevis-lmwg self-assigned this Jan 7, 2025
@slevis-lmwg slevis-lmwg added the bug (something is working incorrectly) and bfb (bit-for-bit) labels Jan 7, 2025
@slevis-lmwg
Contributor Author

slevis-lmwg commented Jan 7, 2025

./run_sys_tests -s aux_clm -c ctsm5.3.016 --skip-generate
derecho OK
izumi IN PROGRESS, but early results show one of the problems that I saw a few weeks ago and that led me to ignore izumi:

    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm50Bgc.izumi_nag.clm-ciso SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I1850Clm60Bgc.izumi_nag.clm-ciso--clm-matrixcnOn SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-flexCN_FUN SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-flexCN_FUN--clm-matrixcnOn SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-luna SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-noFUN_flexCN SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-noFUN_flexCN--clm-matrixcnOn SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-reduceOutput SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-o3lombardozzi2015 SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.izumi_nag.clm-clm60cam7LndTuningModeLDust SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_P48x1.f10_f10_mg37.IHistClm60Bgc.izumi_nag.clm-decStart SHAREDLIB_BUILD failed to initialize
    FAIL ERP_D_P48x1.f10_f10_mg37.IHistClm60Bgc.izumi_nag.clm-decStart--clm-matrixcnOn_ignore_warnings SHAREDLIB_BUILD failed to initialize

Collaborator

@ekluzek ekluzek left a comment

@slevis-lmwg awesome that you found this. This is another example of Nag finding a legit problem for us.

This is great. Can you also create a branch on b4b-dev where you merge 1c81c98? We can get that into b4b-dev immediately that way.

I still propose we have @glemieux go first with his simple FATES-Hydro tag. And we should have him go now.

@ekluzek
Collaborator

ekluzek commented Jan 7, 2025

@slevis-lmwg I have been having to redo the build and resubmit after it does the initial run through. So that looks the same as what I've been seeing, and also with the ctsm5.3.016 tag on master.

@slevis-lmwg slevis-lmwg changed the title from "Bug fix for izumi nag tests to pass" to "Bug fix for izumi nag tests to pass (PR into tmp-241219 or master)" Jan 7, 2025
Collaborator

@ekluzek ekluzek left a comment

This is great, and I'm glad you found this.

The one change we need is to add a test where Tom's scheme is turned on. We'll have this when we make it the default, but until then there should be at least one test for it. And actually I suggest we have both a derecho_intel and an izumi_nag test for it now (because of the problem nag found).

@ekluzek
Collaborator

ekluzek commented Jan 8, 2025

Oh, and I suspect that if we had had a test for it in the first tag, we would've seen this problem sooner, and with DEBUG for derecho_intel.

@slevis-lmwg
Contributor Author

slevis-lmwg commented Jan 9, 2025

aux_clm results

derecho OK
izumi OK
On izumi, every case that 'failed to initialize' and that I then went back to build and run, whether with ./case.build or ./create_test, shows the following failure. For example:
FAIL ERP_D_Ld5_P48x1.f10_f10_mg37.I2000Clm50BgcCru.izumi_nag.clm-noFUN_flexCN BASELINE ctsm5.3.016: ERROR CPRNC failed to open files
I am aware that others have seen this behavior, too.

@ekluzek ekluzek changed the title from "Bug fix for izumi nag tests to pass (PR into tmp-241219 or master)" to "Bug fix for izumi nag tests to pass (PR into tmp-241219)" Jan 9, 2025
@slevis-lmwg
Contributor Author

@glemieux regarding the diffs that I see between this PR (to be ...n03) and ctsm5.3.016:

  1. They are all Fates tests.
  2. I do not see these diffs when I compare to n02, so now I suspect that these are the same diffs that you would have expected all along (just as you confirmed for me yesterday on izumi).
  3. Could you quickly confirm your results against my list or tell me where to look, so that I may confirm beyond doubt:
    FAIL ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdDryDepSatPhen BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdMeganSatPhen BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL BASELINE ctsm5.3.016: DIFF
    FAIL SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL ERP_Ld9.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdAllVars BASELINE ctsm5.3.016: DIFF
    FAIL ERP_P128x2_Ld30.f45_f45_mg37.I2000Clm60FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen BASELINE ctsm5.3.016: DIFF
    FAIL ERS_D_Ld20.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdTwoStream BASELINE ctsm5.3.016: DIFF
    FAIL ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesRs.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdFixedBiogeo BASELINE ctsm5.3.016: DIFF
    FAIL ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdSizeAgeMort BASELINE ctsm5.3.016: DIFF
    FAIL ERS_Ld9.f10_f10_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdCH4Off BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm45Fates.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_D_Lm6_P256x1.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO BASELINE ctsm5.3.016: DIFF
    FAIL SMS_Ld5.f10_f10_mg37.I2000Clm45Fates.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE ctsm5.3.016: DIFF
    FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE ctsm5.3.016: DIFF

@slevis-lmwg
Contributor Author

@glemieux my test directory (regarding the above) is
/glade/derecho/scratch/slevis/tests_0108-115648de

@slevis-lmwg
Contributor Author

From my own digging in /glade/campaign/cgd/tss/ctsm_baselines/tmp-241219.n02.ctsm5.3.016,
typing grep '6: DIF' */TestStatus | grep Fates returns

FAIL ERP_Ld9.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdAllVars BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERP_P128x2_Ld30.f45_f45_mg37.I2000Clm60FatesSpCruRsGs.derecho_intel.clm-FatesColdSatPhen BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Ld20.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdTwoStream BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Ld3_PS.f09_g17.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Ld5.f10_f10_mg37.I2000Clm50Fates.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_D_Mmpi-serial_Ld5.5x5_amazon.I2000Clm60FatesRs.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdFixedBiogeo BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_Ld30.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdSizeAgeMort BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL ERS_Ld9.f10_f10_mg37.I2000Clm50FatesCruRsGs.derecho_intel.clm-FatesColdCH4Off BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdDryDepSatPhen BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D.1x1_brazil.I2000Clm60FatesSpCruRsGs.derecho_gnu.clm-FatesColdMeganSatPhen BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm45Fates.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_D_Lm6_P256x1.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS.f45_f45_mg37.I2000Clm60FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_gnu.clm-FatesPRISM--clm-NEON-FATES-YELL BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_Ld10_D_Mmpi-serial.CLM_USRDAT.I1PtClm60Fates.derecho_intel.clm-FatesFireLightningPopDens--clm-NEON-FATES-NIWO BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_Ld5.f10_f10_mg37.I2000Clm45Fates.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_Ld5.f10_f10_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF
FAIL SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF

@slevis-lmwg
Contributor Author

The lists that I posted appear in different order, but I checked them and confirmed that they are identical.
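
For anyone who wants to repeat that check without reading the lists by eye, here is a minimal Python sketch of an order-independent comparison. It assumes each line has the form "FAIL <test name> BASELINE <baseline tag>: DIFF", as in the lists above, and compares only the test names, since the baseline tags differ between the two lists. The function names extract_test_names and compare_lists are hypothetical, and the sample lines are three entries copied from each list, not the full lists.

```python
def extract_test_names(status_lines):
    """Collect the test name (second field) from lines that start with FAIL."""
    names = set()
    for line in status_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "FAIL":
            names.add(fields[1])
    return names


def compare_lists(first, second):
    """Return (names only in first, names only in second), each sorted."""
    a = extract_test_names(first)
    b = extract_test_names(second)
    return sorted(a - b), sorted(b - a)


# Three entries from the list compared against ctsm5.3.016
vs_016 = [
    "FAIL ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF",
    "FAIL ERP_Ld9.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdAllVars BASELINE ctsm5.3.016: DIFF",
    "FAIL SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE ctsm5.3.016: DIFF",
]

# The same three tests as they appear, in a different order, in the n02 baseline directory
vs_n01 = [
    "FAIL ERP_Ld9.f45_f45_mg37.I2000Clm50FatesRs.derecho_intel.clm-FatesColdAllVars BASELINE tmp-241219.n01.ctsm5.3.016: DIFF",
    "FAIL ERS_D_Mmpi-serial_Ld5.1x1_brazil.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF",
    "FAIL SMS_Ld5_PS.f19_g17.I2000Clm50FatesRs.derecho_gnu.clm-FatesCold BASELINE tmp-241219.n01.ctsm5.3.016: DIFF",
]

only_first, only_second = compare_lists(vs_016, vs_n01)
print("only in first list: ", only_first)
print("only in second list:", only_second)
```

Two empty results mean the lists name the same tests. Using sets also hides duplicate entries, so a count check on the raw lines would be a sensible addition if duplicates matter.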

@glemieux
Collaborator

glemieux commented Jan 9, 2025

@slevis-lmwg yep, you're correct: these are expected diffs against ctsm5.3.016, since the n02 temp branch commit updates the fates tag to sci.1.80.0_api.37.0.0, which includes a change to history outputs. This PR comment associated with that tag is relevant to checking the diffs: NGEET/fates#1197 (comment).

Reviewing the original diffs (/glade/u/home/glemieux/scratch/ctsm-tests/tests_pr1197), they match my expectations per that update.

@slevis-lmwg
Contributor Author

Thank you @glemieux

@ekluzek and everyone:
This completes the aux_clm testing for this PR, so I will proceed with the n03 merge to tmp, followed by the 017 merge to master.

@slevis-lmwg slevis-lmwg merged commit caf8af9 into ESCOMP:tmp-241219 Jan 9, 2025
2 checks passed
@slevis-lmwg slevis-lmwg deleted the fix_izumi_nag_tests branch January 9, 2025 18:47