FATES_ERRH2O_SCPF fails COMPARE_base_rest #701

Open
glemieux opened this issue Sep 24, 2020 · 7 comments · Fixed by #854

Comments

@glemieux
Contributor

glemieux commented Sep 24, 2020

This error was discovered after correcting a separate error in the FatesHydro test mod in which no history files were being generated for the ERS_D_Ld5.1x1_brazil.I2000Clm50FatesCruGs.cheyenne_intel.clm-FatesHydro test case (see ESCOMP/CTSM#1160). The test case now runs, but the restart comparison fails on FATES_ERRH2O_SCPF only:

FATES_ERRH2O_SCPF   (lndgrid,fates_levscpf,time)  t_index =      5     5
         12      156  (     0,    41,     1) (     1,     1,     1) (     0,    67,     1) (     0,   145,     1)
                 156   3.761678083316676E-16  -6.113989692399947E-16 1.0E-16  6.578333154084754E-17 6.4E-02 -2.485285346918768E-19
                 156   4.620813013804872E-16  -5.128924256061086E-16         -3.811208864302963E-17          3.013937768871115E-19
                 156  (     0,    41,     1) (     1,     1,     1)
          avg abs field values:    8.833762855196752E-18    rms diff: 1.6E-17   avg rel diff(npos):  6.4E-02
                                   7.583820984915835E-18                        avg decimal digits(ndif):  0.2 worst: -0.3
 RMS FATES_ERRH2O_SCPF                1.5775E-17            NORMALIZED  1.9217E+00
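
For reference, the NORMALIZED value in the summary above appears consistent with dividing the RMS difference by the mean of the two "avg abs field values". A small sketch under that assumption (an inference from the printed numbers, not a statement about cprnc internals):

```python
# Hypothetical reconstruction of the NORMALIZED value from the cprnc summary above.
# Assumption: normalized RMS = RMS diff / mean of the two "avg abs field values".
rms_diff = 1.5775e-17
avg_abs_case = 8.833762855196752e-18
avg_abs_base = 7.583820984915835e-18

normalized = rms_diff / ((avg_abs_case + avg_abs_base) / 2.0)
print(f"{normalized:.4e}")  # ~1.9217e+00, matching the reported NORMALIZED value
```
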
@rgknox
Contributor

rgknox commented Jan 15, 2021

I'm wondering whether some of these error checks may not be reasonable, particularly for fields whose absolute values are close to machine precision, where the relative differences can consequently be large.

@billsacks
Member

@rgknox I'm not sure if I understand your thinking, but it shouldn't matter if the absolute value is close to machine precision. Machine precision of ~ 1e-16 means that values of 1 and (1+1e-16) are indistinguishable. But double precision values can represent values down to about 1e-308. If you have a value like 1e-17, then that's indistinguishable from (1e-17 + 1e-33), but it should be very distinguishable from, say, 1.1e-17.
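
A quick way to see that distinction (a minimal Python sketch of double-precision behavior; nothing here is specific to cprnc or FATES):

```python
import math
import sys

print(sys.float_info.epsilon)   # ~2.22e-16: relative spacing of doubles near 1.0
print(1.0 + 1e-16 == 1.0)       # True: the increment is below half the spacing near 1.0
print(math.ulp(1e-17))          # ~1.5e-33: absolute spacing of doubles near 1e-17
print(1e-17 == 1.1e-17)         # False: a 10% difference is far above that spacing
print(sys.float_info.min)       # ~2.2e-308: smallest normal double, far below 1e-17
```
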

@ekluzek
Collaborator

ekluzek commented Jan 15, 2021

Yes, this looks like a legit issue to me. We do expect exact results between the baseline and restart cases for all history fields. And actually the error is large compared to the average values of the data. Does FATES_ERRH2O_SCPF need to be added to the restart file?

@rgknox
Contributor

rgknox commented Jan 15, 2021

Thanks for your responses! I think my terminology was poor in my previous comment: I don't mean machine precision, I mean floating-point math error. In this case it is the normalized RMS that is tripping the failure, right? My worry is that if we can expect math errors of ~1e-17 or greater (which seems reasonable, but maybe I'm wrong), then the normalized difference, when that amount of error is compared against such a small number, is going to be large. For instance, the normalized RMS is 1 when two numbers whose mean is 1e-17 differ by 1e-17 due to math errors.
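
To put numbers on that worry, here is a hypothetical single-value sketch of the normalization (illustrative values chosen near the ~1e-17 magnitudes in the log above, not cprnc's actual code):

```python
import math

# Hypothetical illustration: a rounding-level difference normalized by a
# field whose typical magnitude is itself tiny.
diff_values = [1.0e-17]   # assumed base-vs-restart difference from accumulated rounding
field_mean = 1.0e-17      # assumed typical magnitude of FATES_ERRH2O_SCPF

rms = math.sqrt(sum(d * d for d in diff_values) / len(diff_values))
print(rms / field_mean)   # 1.0: an order-one normalized error from an
                          # absolute difference that is only rounding-level
```
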

There is also something else I'm wondering: let's assume that the "state" of the model, i.e. the values of all variables that have memory, is exactly the same at the beginning of the restart initialization as in the original run (i.e. to the precision of the floating-point values it holds). If the same sequence of math operations is then performed on these numbers in both the restart and the base, should we expect the exact same rounding errors in both cases? That is, should both base and restart have the same error bias and therefore match, giving a more identical result than one where we merely expect math error?

@billsacks
Member

@rgknox okay, I think I understand what you're getting at here, but may still be misunderstanding. I think you're saying that this ERR field is the small difference between two relatively larger fields. So we have something like ERR = (A - B). Let's say A and B are both order 1, and the difference between the two is itself roundoff-level, so ERR is order 1e-16. If you then introduce a roundoff-level change in A, then ERR will change by order 1e-16, so normalized differences in ERR will be order 1. If that's what you're saying, then I completely agree. Indeed, we often see big relative differences in various ERR terms when we make roundoff-level changes in the model.
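
A minimal numeric sketch of that point (hypothetical values, not FATES code): ERR is the small residual of two O(1) quantities, so a roughly one-ulp change in A moves ERR by a relative amount of order 1.

```python
A = 1.0
B = 0.9999999999999999        # differs from A only at roundoff level
err = A - B                   # ~1.1e-16: roundoff-level residual

A2 = A + 2e-16                # roughly a one-ulp perturbation of A
err2 = A2 - B                 # ~3.3e-16

print(abs(err2 - err) / abs(err))   # ~2.0: an order-one relative change in ERR
```
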

However, that all applies to comparisons between a case and a baseline. For restart comparisons, @ekluzek's point holds: we expect the restart to be bit-for-bit in all fields. As for your question:

If the same sequence of math operations is then performed on these numbers in both the restart and the base, should we expect the exact same rounding errors in both cases? That is, should both base and restart have the same error bias and therefore match, giving a more identical result than one where we merely expect math error?

As far as I know, the answer is yes, we should expect the exact same rounding errors in both cases. I believe this is specified by the IEEE floating point standard. It's possible that some compiler flags would cause this not to be the case (and not follow the IEEE standard) for performance reasons, but we typically avoid those compiler flags, if they exist. So I wonder if there may be a slightly different code path that's followed in the base vs. restart case here that causes the roundoff behavior to differ.
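
A simple demonstration of both halves of that expectation (plain Python, but the same property holds for IEEE-754 arithmetic in Fortran): repeating an identical sequence of operations reproduces identical rounding, while a reordered (mathematically equivalent) sequence need not.

```python
a, b, c = 0.1, 0.2, 0.3

run1 = (a + b) + c        # one sequence of operations...
run2 = (a + b) + c        # ...repeated exactly
reordered = a + (b + c)   # mathematically equal, but a different operation order

print(run1 == run2)       # True: identical operations give bit-for-bit identical results
print(run1 == reordered)  # False: 0.6000000000000001 vs 0.6
```
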

@glemieux
Contributor Author

glemieux commented Jun 2, 2022

Note that this issue appears to have been fixed for the original test on cheyenne with the intel compiler, but is now showing up for the same test on izumi with the nag compiler: #854 (comment)

@glemieux
Contributor Author

Reopening due to the reappearance of this issue with the testing for ESCOMP/CTSM#1849.
