Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ERS and ERP tests failing with updated cprnc #3007

Closed
fischer-ncar opened this issue Feb 20, 2019 · 17 comments
Closed

ERS and ERP tests failing with updated cprnc #3007

fischer-ncar opened this issue Feb 20, 2019 · 17 comments

Comments

@fischer-ncar
Copy link
Contributor

The ERS and ERP tests are failing the restart compare for compsets with active clm and cam. This
happens with PR #2988.

FAIL ERP_Ln9.f09_f09_mg17.F1850.cheyenne_intel.cam-outfrq9s COMPARE_base_rest

Tail of output for clm history files.

SUMMARY of cprnc:
A total number of 238 fields were compared
of which 0 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
A total number of 2 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 8 fields on file 2 were not found on file 1.
diff_test: the two files DIFFER only in their field lists

@jedwards4b
Copy link
Contributor

This would suggest that the restart causes fields to be written to the history that were not written from a cold start - does this make sense? What fields are they? @billsacks ?

@billsacks
Copy link
Member

Shoot, sorry that I didn't think of this, or do the active-component testing that would have uncovered it.

CLM outputs a handful of "time-constant" fields to only its first history file, hence this issue. I can think of a few solutions:

  1. Back out much of my cprnc PR, at least for now

  2. Ignore field list differences for within-test comparisons, only considering them for baseline comparisons. (If I remember the cime logic correctly, this might be a bit tricky.)

  3. Change CLM to put these fields on every file. This would increase our disk usage, and would need to be done for E3SM, too.

  4. Change CLM to put these time-constant fields on separate history files rather than on the first history file. Then ignore this "time-constant" file in doing comparisons (similarly to how we ignore the "initial_hist" file for CISM). This is probably the best solution, but I'm not sure what it will take to implement it, and I'd want to run it by some of the CLM scientists. It would also need to be done for E3SM (but my guess is that the relevant code is still very similar for the two, so hopefully the changes from one could be relatively easily ported into the other).

I'm not sure what's best here... or maybe there are other solutions I didn't think of.

Also cc @ekluzek

@billsacks
Copy link
Member

How time-critical is it to get this fix in? One short-term solution would be to back up to the old builds of cprnc while we come up with a long-term solution.

@jedwards4b
Copy link
Contributor

Would it be possible to ignore field list differences for time constant fields? I've already updated cprnc on a number of platforms.

@fischer-ncar
Copy link
Contributor Author

It'll only be time-critical when I need to start the next round of tests.

@jedwards4b
Copy link
Contributor

I think we can note these as expected failures until we arrive at a solution.

@fischer-ncar
Copy link
Contributor Author

It's easy enough to manually check the cprnc output to verify if the tests truly failed.

@rljacob
Copy link
Member

rljacob commented Feb 20, 2019

We don't have an "expected failure" mechanism. @jgfouca has this new cprnc made it back to E3SM yet?

@jedwards4b
Copy link
Contributor

No, I'm pretty sure it hasn't and even if it has you also have to build and install it on the system.

@billsacks
Copy link
Member

What do you think about this fix:

  1. Allow a field to have a piece of metadata: cprnc_attributes. If this is present and set to a string like "allow_on_only_one_file", then the given variable would not contribute to the FIELDLIST DIFF.

This would require changes in the cprnc fortran (so would require cprnc to be rebuilt) and in CLM and E3SM, but I think the changes would be fairly straightforward.

I like this better than my solutions (3) or (4) because it would be easier for other components to implement this if necessary.

@jedwards4b
Copy link
Contributor

I guess I don't really like the idea of modifying history files to accommodate cprnc, what about incorporating some logic regarding variables without an unlimited dim - compare them if they exist in both files, ignore them if they are only present in one?

@ekluzek
Copy link
Contributor

ekluzek commented Feb 20, 2019

I agree with @jedwards4b, that seems like a more robust solution, and would only require messing with cprnc.

@billsacks
Copy link
Member

I'm hesitant to introduce general logic that ignores differences in field lists for time-constant variables, because cprnc is a general-purpose tool and it could be really confusing / misleading if it avoids checking the presence/absence of time-constant variables in general.

However, I could imagine introducing a flag to cprnc like '--allow-fieldlist-diffs-for-vars-with-no-time' (that's a mouthful; any suggestions for a better name?), which, if set, would do what @jedwards4b suggests. We would set that flag in cprnc calls from hist_utils.py. Does that sound reasonable to you?

@jedwards4b
Copy link
Contributor

yes --allow-missing-constant-fields?

@billsacks
Copy link
Member

Thanks, @jedwards4b . "constant" could mean a number of things, but maybe "--allow-missing-timeconst-fields"?

@billsacks
Copy link
Member

This is also causing a failure for

/glade/scratch/negins/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int/run/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152
937ch_int.cpl.hl2x1yr_glc.0003-01-01-00000.nc.base did NOT match /glade/scratch/negins/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int/run/ERS_Ly3.f10_f10_musg
s.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int.cpl.hl2x1yr_glc.0003-01-01-00000.nc.rest

due to these fields:

 Could not find match for file2 variable doml_lat in file1
 Could not find match for file2 variable doml_lon in file1
 Could not find match for file2 variable doml_area in file1
 Could not find match for file2 variable doml_aream in file1
 Could not find match for file2 variable doml_mask in file1
 Could not find match for file2 variable doml_frac in file1

I think @jedwards4b 's suggested fix will fix this one, too.

cc @negin513

@billsacks
Copy link
Member

I realized a problem with my suggestion of adding a flag like --allow-missing-timeconst-fields: this won't fix earlier versions of cime. So I think I'll go with @jedwards4b 's suggestion of changing the default behavior so that, by default, it doesn't count absent time-constant fields as a difference if the file has an unlimited/time dimension. (But I'll still plan to report those counts.)

I may also add a flag that turns on the counting of these time-constant fields, like --diff-if-missing-timeconst-fields. I think it will be straightforward to adjust hist_utils.py so that that flag is used for baseline comparisons but not in-test comparisons, and I also think it will be possible to do this in a way that still supports old cprnc executables that don't have this flag.

I need to make progress on some other time-critical things this week, so probably won't get to this until next week.

billsacks added a commit to billsacks/cime that referenced this issue Mar 22, 2019
For files with an unlimited (time) dimension, provide two separate
counts:

(1) The number of missing fields with an unlimited (time) dimension

(2) The number of missing fields without an unlimited (time) dimension

Only (1) is considered in determining whether to report a final
difference in the field lists.

Resolves ESMCI#3007
jedwards4b added a commit that referenced this issue Mar 28, 2019
cprnc: allow differences in field lists for time-constant fields
In cprnc: For files with an unlimited (time) dimension: Separately count
(1) missing time-varying fields and (2) missing time-constant
fields. Only (1) is considered in determining whether to report a final
difference in the field lists.

Before this, no distinction was made between time-varying
vs. time-constant fields in counting the number of missing
variables. (These counts were added in #2988). However, that led to
failures in some exact restart tests, because some time-constant fields
were on output files from one case but not the other (see #3007).

Here is sample output from cprnc for a few cases:

(1) Difference in the presence / absence of time-varying variables:

SUMMARY of cprnc:
 A total number of      6 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
 A total number of      0 fields could not be analyzed
 A total number of      2 time-varying fields on file 1 were not found on file 2.
 A total number of      0 time-constant fields on file 1 were not found on file 2.
 A total number of      0 time-varying fields on file 2 were not found on file 1.
 A total number of      0 time-constant fields on file 2 were not found on file 1.
  diff_test: the two files DIFFER only in their field lists
(2) Difference in the presence / absence of time-constant variables, for
files that have a time dimension:

SUMMARY of cprnc:
 A total number of     13 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
 A total number of      0 fields could not be analyzed
 A total number of      0 time-varying fields on file 1 were not found on file 2.
 A total number of      2 time-constant fields on file 1 were not found on file 2.
 A total number of      0 time-varying fields on file 2 were not found on file 1.
 A total number of      1 time-constant fields on file 2 were not found on file 1.
  diff_test: the two files seem to be IDENTICAL
      (But note that there were differences in field lists just for time-constant fields.)
(3) Difference in the presence / absence of time-constant variables, for
files that do NOT have a time dimension (note that this still results in
a DIFFER result; it seemed to me that that's what a user would want in
this case):

SUMMARY of cprnc:
 A total number of      5 fields were compared
          of which      0 had non-zero differences
               and      0 had differences in fill patterns
               and      0 had different dimension sizes
 A total number of      0 fields could not be analyzed
 A total number of      2 fields on file 1 were not found on file 2.
 A total number of      1 fields on file 2 were not found on file 1.
  diff_test: the two files DIFFER only in their field lists
The downside of this solution is that we wouldn't catch removals of
time-constant fields when doing baseline comparisons, and might overlook
this for interactive uses of cprnc. To address the issue of baseline
comparisons, I thought about adding a flag that treats time-constant
fieldlist diffs as differences
(#3007 (comment)), but
I have not done this here because this would have taken more time to
implement and added significantly more complexity (in both the cprnc
Fortran and in hist_utils.py) for uncertain benefit. (See the discussion in #3007
for more thoughts on this.)

Once this PR is approved and merged, I will update the cprnc builds on
hobart and cheyenne (currently they are using builds prior to #2988).

Test suite:

Ran the two tests noted as failing in #3007, pointing to a version of
cprnc built from this branch; these tests now pass.
scripts_regression_tests on cheyenne
Test baseline: n/a
Test namelist changes: none
Test status: bit for bit

Fixes #3007

User interface changes?: Changes cprnc behavior slightly, as noted above

Update gh-pages html (Y/N)?: N

Code review: @jedwards4b @jgfouca @fischer-ncar

Also cc @rljacob @gold2718 @mnlevy1981 @ekluzek
billsacks added a commit to billsacks/cprnc that referenced this issue Dec 2, 2023
For files with an unlimited (time) dimension, provide two separate
counts:

(1) The number of missing fields with an unlimited (time) dimension

(2) The number of missing fields without an unlimited (time) dimension

Only (1) is considered in determining whether to report a final
difference in the field lists.

Resolves ESMCI/cime#3007
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants