FATES_ERRH2O_SCPF fails COMPARE_base_rest #701

Open
glemieux opened this issue Sep 24, 2020 · 7 comments · Fixed by #854

Comments

@glemieux
Contributor

glemieux commented Sep 24, 2020

This error was discovered after correcting a separate error in the FatesHydro test mod in which no history files were being generated for the ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro test case (see ESCOMP/CTSM#1160). The test case now runs, but the restart comparison fails on FATES_ERRH2O_SCPF only:

FATES_ERRH2O_SCPF   (lndgrid,fates_levscpf,time)  t_index =      5     5
         12      156  (     0,    41,     1) (     1,     1,     1) (     0,    67,     1) (     0,   145,     1)
                 156   3.761678083316676E-16  -6.113989692399947E-16 1.0E-16  6.578333154084754E-17 6.4E-02 -2.485285346918768E-19
                 156   4.620813013804872E-16  -5.128924256061086E-16         -3.811208864302963E-17          3.013937768871115E-19
                 156  (     0,    41,     1) (     1,     1,     1)
          avg abs field values:    8.833762855196752E-18    rms diff: 1.6E-17   avg rel diff(npos):  6.4E-02
                                   7.583820984915835E-18                        avg decimal digits(ndif):  0.2 worst: -0.3
 RMS FATES_ERRH2O_SCPF                1.5775E-17            NORMALIZED  1.9217E+00
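
For reference, the NORMALIZED value in the summary above appears consistent with dividing the RMS difference by the mean of the two "avg abs field values". A small sketch under that assumption (an inference from the printed numbers, not a statement about cprnc internals):

```python
# Hypothetical reconstruction of the NORMALIZED value from the cprnc summary above.
# Assumption: normalized RMS = RMS diff / mean of the two "avg abs field values".
rms_diff = 1.5775e-17
avg_abs_case = 8.833762855196752e-18
avg_abs_base = 7.583820984915835e-18

normalized = rms_diff / ((avg_abs_case + avg_abs_base) / 2.0)
print(f"{normalized:.4e}")  # ~1.9217e+00, matching the reported NORMALIZED value
```
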
@rgknox
Contributor

rgknox commented Jan 15, 2021

I'm wondering whether some of these error checks may not be reasonable, particularly for fields whose absolute values are close to machine precision, where the relative differences can consequently be large.

@billsacks
Member

@rgknox I'm not sure if I understand your thinking, but it shouldn't matter if the absolute value is close to machine precision. Machine precision of ~ 1e-16 means that values of 1 and (1+1e-16) are indistinguishable. But double precision values can represent values down to about 1e-308. If you have a value like 1e-17, then that's indistinguishable from (1e-17 + 1e-33), but it should be very distinguishable from, say, 1.1e-17.
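
A quick way to see that distinction (a minimal Python sketch of double-precision behavior; nothing here is specific to cprnc or FATES):

```python
import math
import sys

print(sys.float_info.epsilon)   # ~2.22e-16: relative spacing of doubles near 1.0
print(1.0 + 1e-16 == 1.0)       # True: the increment is below half the spacing near 1.0
print(math.ulp(1e-17))          # ~1.5e-33: absolute spacing of doubles near 1e-17
print(1e-17 == 1.1e-17)         # False: a 10% difference is far above that spacing
print(sys.float_info.min)       # ~2.2e-308: smallest normal double, far below 1e-17
```
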

@ekluzek
Collaborator

ekluzek commented Jan 15, 2021

Yes, this looks like a legit issue to me. We do expect exact results between the baseline and restart cases for all history fields. And actually the error is large compared to the average values of the data. Does FATES_ERRH2O_SCPF need to be added to the restart file?

@rgknox
Contributor

rgknox commented Jan 15, 2021

Thanks for your responses! I think my terminology was poor in my previous comment: I don't mean machine precision, I mean floating-point math error. In this case it is the normalized RMS that is tripping the failure, right? My worry is that if we can expect math errors of ~1e-17 or greater (which seems reasonable, but maybe I'm wrong), then the normalized difference, when that amount of error is compared against such a small number, is going to be large. For instance, the normalized RMS is 1 when two numbers whose mean is 1e-17 differ by 1e-17 due to math errors.
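
To put numbers on that worry, here is a hypothetical single-value sketch of the normalization (illustrative values chosen near the ~1e-17 magnitudes in the log above, not cprnc's actual code):

```python
import math

# Hypothetical illustration: a rounding-level difference normalized by a
# field whose typical magnitude is itself tiny.
diff_values = [1.0e-17]   # assumed base-vs-restart difference from accumulated rounding
field_mean = 1.0e-17      # assumed typical magnitude of FATES_ERRH2O_SCPF

rms = math.sqrt(sum(d * d for d in diff_values) / len(diff_values))
print(rms / field_mean)   # 1.0: an order-one normalized error from an
                          # absolute difference that is only rounding-level
```
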

There is also something else I'm wondering: let's assume that the "state" of the model, i.e. the values of all variables that have memory, is exactly the same at the beginning of the restart initialization as in the original run (i.e. to the precision of the floating-point values it holds). If the same sequence of math operations is then performed on these numbers in both the restart and the base, should we expect the exact same rounding errors in both cases? That is, should both base and restart have the same error bias and therefore match, giving a more identical result than one where we merely expect math error?

@billsacks
Member

@rgknox okay, I think I understand what you're getting at here, but may still be misunderstanding. I think you're saying that this ERR field is the small difference between two relatively larger fields. So we have something like ERR = (A - B). Let's say A and B are both order 1, and the difference between the two is itself roundoff-level, so ERR is order 1e-16. If you then introduce a roundoff-level change in A, then ERR will change by order 1e-16, so normalized differences in ERR will be order 1. If that's what you're saying, then I completely agree. Indeed, we often see big relative differences in various ERR terms when we make roundoff-level changes in the model.
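
A minimal numeric sketch of that point (hypothetical values, not FATES code): ERR is the small residual of two O(1) quantities, so a roughly one-ulp change in A moves ERR by a relative amount of order 1.

```python
A = 1.0
B = 0.9999999999999999        # differs from A only at roundoff level
err = A - B                   # ~1.1e-16: roundoff-level residual

A2 = A + 2e-16                # roughly a one-ulp perturbation of A
err2 = A2 - B                 # ~3.3e-16

print(abs(err2 - err) / abs(err))   # ~2.0: an order-one relative change in ERR
```
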

However, that all applies to comparisons between a case and a baseline. For restart comparisons, @ekluzek's point holds: we expect the restart to be bit-for-bit in all fields. As for your question:

If the same sequence of math operations is then performed on these numbers in both the restart and the base, should we expect the exact same rounding errors in both cases? That is, should both base and restart have the same error bias and therefore match, giving a more identical result than one where we merely expect math error?

As far as I know, the answer is yes, we should expect the exact same rounding errors in both cases. I believe this is specified by the IEEE floating point standard. It's possible that some compiler flags would cause this not to be the case (and not follow the IEEE standard) for performance reasons, but we typically avoid those compiler flags, if they exist. So I wonder if there may be a slightly different code path that's followed in the base vs. restart case here that causes the roundoff behavior to differ.
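
A simple demonstration of both halves of that expectation (plain Python, but the same property holds for IEEE-754 arithmetic in Fortran): repeating an identical sequence of operations reproduces identical rounding, while a reordered (mathematically equivalent) sequence need not.

```python
a, b, c = 0.1, 0.2, 0.3

run1 = (a + b) + c        # one sequence of operations...
run2 = (a + b) + c        # ...repeated exactly
reordered = a + (b + c)   # mathematically equal, but a different operation order

print(run1 == run2)       # True: identical operations give bit-for-bit identical results
print(run1 == reordered)  # False: 0.6000000000000001 vs 0.6
```
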

@glemieux
Contributor Author

glemieux commented Jun 2, 2022

Note that this issue appears to have been fixed for the original test on cheyenne with the intel compiler, but is now showing up for the same test on izumi with the nag compiler: #854 (comment)

@glemieux
Contributor Author

Reopening due to the reappearance of this issue with the testing for ESCOMP/CTSM#1849.
