ERS and ERP tests failing with updated cprnc #3007

fischer-ncar · 2019-02-20T16:23:07Z

The ERS and ERP tests are failing the restart compare for compsets with active clm and cam. This
happens with PR #2988.

FAIL ERP_Ln9.f09_f09_mg17.F1850.cheyenne_intel.cam-outfrq9s COMPARE_base_rest

Tail of output for clm history files.

SUMMARY of cprnc:
A total number of 238 fields were compared
of which 0 had non-zero differences
and 0 had differences in fill patterns
and 0 had different dimension sizes
A total number of 2 fields could not be analyzed
A total number of 0 fields on file 1 were not found on file 2.
A total number of 8 fields on file 2 were not found on file 1.
diff_test: the two files DIFFER only in their field lists

jedwards4b · 2019-02-20T16:27:42Z

This would suggest that the restart causes fields to be written to the history that were not written from a cold start - does this make sense? What fields are they? @billsacks ?

billsacks · 2019-02-20T17:08:14Z

Shoot, sorry that I didn't think of this, or do the active-component testing that would have uncovered it.

CLM outputs a handful of "time-constant" fields to only its first history file, hence this issue. I can think of a few solutions:

Back out much of my cprnc PR, at least for now
Ignore field list differences for within-test comparisons, only considering them for baseline comparisons. (If I remember the cime logic correctly, this might be a bit tricky.)
Change CLM to put these fields on every file. This would increase our disk usage, and would need to be done for E3SM, too.
Change CLM to put these time-constant fields on separate history files rather than on the first history file. Then ignore this "time-constant" file in doing comparisons (similarly to how we ignore the "initial_hist" file for CISM). This is probably the best solution, but I'm not sure what it will take to implement it, and I'd want to run it by some of the CLM scientists. It would also need to be done for E3SM (but my guess is that the relevant code is still very similar for the two, so hopefully the changes from one could be relatively easily ported into the other).

I'm not sure what's best here... or maybe there are other solutions I didn't think of.

Also cc @ekluzek

billsacks · 2019-02-20T17:10:03Z

How time-critical is it to get this fix in? One short-term solution would be to back up to the old builds of cprnc while we come up with a long-term solution.

jedwards4b · 2019-02-20T17:12:05Z

Would it be possible to ignore field list differences for time constant fields? I've already updated cprnc on a number of platforms.

fischer-ncar · 2019-02-20T17:12:11Z

It'll only be time-critical when I need to start the next round of tests.

jedwards4b · 2019-02-20T17:13:05Z

I think we can note these as expected failures until we arrive at a solution.

fischer-ncar · 2019-02-20T17:14:42Z

It's easy enough to manually check the cprnc output to verify if the tests truly failed.

rljacob · 2019-02-20T17:17:24Z

We don't have an "expected failure" mechanism. @jgfouca has this new cprnc made it back to E3SM yet?

jedwards4b · 2019-02-20T17:18:33Z

No, I'm pretty sure it hasn't and even if it has you also have to build and install it on the system.

billsacks · 2019-02-20T17:28:04Z

What do you think about this fix:

Allow a field to have a piece of metadata: cprnc_attributes. If this is present and set to a string like "allow_on_only_one_file", then the given variable would not contribute to the FIELDLIST DIFF.

This would require changes in the cprnc fortran (so would require cprnc to be rebuilt) and in CLM and E3SM, but I think the changes would be fairly straightforward.

I like this better than my solutions (3) or (4) because it would be easier for other components to implement this if necessary.

jedwards4b · 2019-02-20T17:38:42Z

I guess I don't really like the idea of modifying history files to accommodate cprnc, what about incorporating some logic regarding variables without an unlimited dim - compare them if they exist in both files, ignore them if they are only present in one?

ekluzek · 2019-02-20T17:55:24Z

I agree with @jedwards4b, that seems like a more robust solution, and would only require messing with cprnc.

billsacks · 2019-02-20T18:24:34Z

I'm hesitant to introduce general logic that ignores differences in field lists for time-constant variables, because cprnc is a general-purpose tool and it could be really confusing / misleading if it avoids checking the presence/absence of time-constant variables in general.

However, I could imagine introducing a flag to cprnc like '--allow-fieldlist-diffs-for-vars-with-no-time' (that's a mouthful; any suggestions for a better name?), which, if set, would do what @jedwards4b suggests. We would set that flag in cprnc calls from hist_utils.py. Does that sound reasonable to you?

jedwards4b · 2019-02-20T18:28:07Z

yes --allow-missing-constant-fields?

billsacks · 2019-02-20T18:30:00Z

Thanks, @jedwards4b . "constant" could mean a number of things, but maybe "--allow-missing-timeconst-fields"?

billsacks · 2019-02-20T20:04:44Z

This is also causing a failure for

/glade/scratch/negins/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int/run/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152
937ch_int.cpl.hl2x1yr_glc.0003-01-01-00000.nc.base did NOT match /glade/scratch/negins/ERS_Ly3.f10_f10_musgs.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int/run/ERS_Ly3.f10_f10_musg
s.I1850Clm50BgcCropCmip6.cheyenne_intel.clm-basic.GC.0219-152937ch_int.cpl.hl2x1yr_glc.0003-01-01-00000.nc.rest

due to these fields:

 Could not find match for file2 variable doml_lat in file1
 Could not find match for file2 variable doml_lon in file1
 Could not find match for file2 variable doml_area in file1
 Could not find match for file2 variable doml_aream in file1
 Could not find match for file2 variable doml_mask in file1
 Could not find match for file2 variable doml_frac in file1

I think @jedwards4b 's suggested fix will fix this one, too.

cc @negin513

billsacks · 2019-02-21T17:31:21Z

I realized a problem with my suggestion of adding a flag like --allow-missing-timeconst-fields: this won't fix earlier versions of cime. So I think I'll go with @jedwards4b 's suggestion of changing the default behavior so that, by default, it doesn't count absent time-constant fields as a difference if the file has an unlimited/time dimension. (But I'll still plan to report those counts.)

I may also add a flag that turns on the counting of these time-constant fields, like --diff-if-missing-timeconst-fields. I think it will be straightforward to adjust hist_utils.py so that that flag is used for baseline comparisons but not in-test comparisons, and I also think it will be possible to do this in a way that still supports old cprnc executables that don't have this flag.

I need to make progress on some other time-critical things this week, so probably won't get to this until next week.

For files with an unlimited (time) dimension, provide two separate counts: (1) The number of missing fields with an unlimited (time) dimension (2) The number of missing fields without an unlimited (time) dimension Only (1) is considered in determining whether to report a final difference in the field lists. Resolves ESMCI#3007

@jedwards4b

cprnc: allow differences in field lists for time-constant fields In cprnc: For files with an unlimited (time) dimension: Separately count (1) missing time-varying fields and (2) missing time-constant fields. Only (1) is considered in determining whether to report a final difference in the field lists. Before this, no distinction was made between time-varying vs. time-constant fields in counting the number of missing variables. (These counts were added in #2988). However, that led to failures in some exact restart tests, because some time-constant fields were on output files from one case but not the other (see #3007). Here is sample output from cprnc for a few cases: (1) Difference in the presence / absence of time-varying variables: SUMMARY of cprnc: A total number of 6 fields were compared of which 0 had non-zero differences and 0 had differences in fill patterns and 0 had different dimension sizes A total number of 0 fields could not be analyzed A total number of 2 time-varying fields on file 1 were not found on file 2. A total number of 0 time-constant fields on file 1 were not found on file 2. A total number of 0 time-varying fields on file 2 were not found on file 1. A total number of 0 time-constant fields on file 2 were not found on file 1. diff_test: the two files DIFFER only in their field lists (2) Difference in the presence / absence of time-constant variables, for files that have a time dimension: SUMMARY of cprnc: A total number of 13 fields were compared of which 0 had non-zero differences and 0 had differences in fill patterns and 0 had different dimension sizes A total number of 0 fields could not be analyzed A total number of 0 time-varying fields on file 1 were not found on file 2. A total number of 2 time-constant fields on file 1 were not found on file 2. A total number of 0 time-varying fields on file 2 were not found on file 1. A total number of 1 time-constant fields on file 2 were not found on file 1. diff_test: the two files seem to be IDENTICAL (But note that there were differences in field lists just for time-constant fields.) (3) Difference in the presence / absence of time-constant variables, for files that do NOT have a time dimension (note that this still results in a DIFFER result; it seemed to me that that's what a user would want in this case): SUMMARY of cprnc: A total number of 5 fields were compared of which 0 had non-zero differences and 0 had differences in fill patterns and 0 had different dimension sizes A total number of 0 fields could not be analyzed A total number of 2 fields on file 1 were not found on file 2. A total number of 1 fields on file 2 were not found on file 1. diff_test: the two files DIFFER only in their field lists The downside of this solution is that we wouldn't catch removals of time-constant fields when doing baseline comparisons, and might overlook this for interactive uses of cprnc. To address the issue of baseline comparisons, I thought about adding a flag that treats time-constant fieldlist diffs as differences (#3007 (comment)), but I have not done this here because this would have taken more time to implement and added significantly more complexity (in both the cprnc Fortran and in hist_utils.py) for uncertain benefit. (See the discussion in #3007 for more thoughts on this.) Once this PR is approved and merged, I will update the cprnc builds on hobart and cheyenne (currently they are using builds prior to #2988). Test suite: Ran the two tests noted as failing in #3007, pointing to a version of cprnc built from this branch; these tests now pass. scripts_regression_tests on cheyenne Test baseline: n/a Test namelist changes: none Test status: bit for bit Fixes #3007 User interface changes?: Changes cprnc behavior slightly, as noted above Update gh-pages html (Y/N)?: N Code review: @jedwards4b @jgfouca @fischer-ncar Also cc @rljacob @gold2718 @mnlevy1981 @ekluzek

For files with an unlimited (time) dimension, provide two separate counts: (1) The number of missing fields with an unlimited (time) dimension (2) The number of missing fields without an unlimited (time) dimension Only (1) is considered in determining whether to report a final difference in the field lists. Resolves ESMCI/cime#3007

fischer-ncar added the ty: Bug label Feb 20, 2019

fischer-ncar assigned billsacks Feb 20, 2019

rljacob added Assigned in progress and removed Assigned labels Feb 20, 2019

billsacks mentioned this issue Mar 22, 2019

cprnc: allow differences in field lists for time-constant fields #3051

Merged

jedwards4b closed this as completed in #3051 Mar 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ERS and ERP tests failing with updated cprnc #3007

ERS and ERP tests failing with updated cprnc #3007

fischer-ncar commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

fischer-ncar commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

fischer-ncar commented Feb 20, 2019

rljacob commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

ekluzek commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 21, 2019

ERS and ERP tests failing with updated cprnc #3007

ERS and ERP tests failing with updated cprnc #3007

Comments

fischer-ncar commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

fischer-ncar commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

fischer-ncar commented Feb 20, 2019

rljacob commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

ekluzek commented Feb 20, 2019

billsacks commented Feb 20, 2019

jedwards4b commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 20, 2019

billsacks commented Feb 21, 2019