
Eliminate potential nonreproducibility in final local sum #5560

Merged: 2 commits into master on Apr 6, 2023

Conversation

@worleyph (Contributor) commented Mar 27, 2023

The default reprosum algorithm converts each floating point summand
into a vector of integers representation. These vectors are summed
locally first, and then globally (using an mpi_allreduce). All of this
is exact and reproducible. The final step is to convert the final sum,
in the vector of integers representation, back into a floating point
value.

This final sum begins by converting each element of the integer vector
into a floating point value, and then adding these, smallest to
largest. Round-off can occur in the least significant digit of the
final result, and whether round-off occurs is sensitive to the
particular floating point summands. The numbers of MPI tasks and
OpenMP threads can affect the length of the integer vector, and thus
the floating point summands.
In consequence, reproducibility with respect to degree of parallelism
is not guaranteed. (The situation in which reproducibility is lost
has never been observed in E3SM simulations, but it cannot be
ruled out.)

This PR modifies the final summation so that the same set of floating
point summands is generated independent of the number of MPI tasks
and number of OpenMP threads, establishing reproducibility.

Fixes #5276

[non-BFB]


(Use above for merge commit message. Additional detail below.)

The reprosum algorithm converts each floating point summand into a vector of integers representation. These vectors are summed locally first, and then globally (using an mpi_allreduce). All of this is exact and reproducible. The final step is to convert the final sum, in the vector of integers representation, back into a floating point value.
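To make the discussion concrete, the following is a minimal sketch of this decomposition. The names (split_to_ivec, chunk_bits, emax) and the chunk width are illustrative assumptions, not the module's actual code:

 module reprosum_sketch
    use iso_fortran_env, only: r8 => real64, i8 => int64
    implicit none
    integer, parameter :: chunk_bits = 46   ! binary digits per level (hypothetical)
 contains
    ! Split one r8 summand into integer "levels" relative to a shared
    ! maximum exponent emax (taken over all summands on all tasks).
    subroutine split_to_ivec(x, emax, ivec)
       real(r8),    intent(in)  :: x
       integer,     intent(in)  :: emax
       integer(i8), intent(out) :: ivec(:)
       real(r8) :: rem
       integer  :: lvl
       rem = scale(x, -emax)                 ! now abs(rem) < 1
       do lvl = 1, size(ivec)
          rem = scale(rem, chunk_bits)       ! expose the next chunk_bits digits
          ivec(lvl) = int(rem, i8)           ! capture them as this level's integer
          rem = rem - real(ivec(lvl), r8)    ! exact remainder for the next level
       end do
    end subroutine split_to_ivec
 end module reprosum_sketch

Summing these vectors elementwise, first over the local summands and then across tasks, is pure integer arithmetic and hence exact, provided the per-level accumulators stay within the integer*8 range (the real module guards against overflow; this sketch does not).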

For this last step, each component integer in the vector representation is converted back into a floating point value (call it a_{i} for the ith element of the vector), and then these values are summed locally, smallest to largest, to get the final value. The integer representation is first massaged so that all a_{i} have the same sign and their ranges usually do not overlap (the least significant binary digit of abs(a_{i}) represents a value larger than abs(a_{i+1}) in most cases, and equal to it in the rest).
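Continuing the sketch above (inside the same module), this original final step looks roughly like the following:

 ! Convert each level of the globally summed vector back to r8 and add,
 ! smallest magnitude first; a_{lvl} = ivec(lvl) * 2**(emax - lvl*chunk_bits).
 function ivec_to_r8(ivec, emax) result(gsum)
    integer(i8), intent(in) :: ivec(:)
    integer,     intent(in) :: emax
    real(r8) :: gsum
    integer  :: lvl
    gsum = 0.0_r8
    do lvl = size(ivec), 1, -1
       gsum = gsum + scale(real(ivec(lvl), r8), emax - lvl*chunk_bits)
    end do
 end function ivec_to_r8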

When constructing the final floating point value, round-off error can be introduced in the least significant digit of the final sum. This level of inaccuracy has no practical impact on overall model accuracy, but it can potentially lead to loss of reproducibility with respect to the numbers of MPI tasks and OpenMP threads. The issue is that the set of floating point values generated from the vector of integers will vary with the number of components in the integer vector and the implicit exponents associated with each component, which are functions of, among other things, the number of MPI tasks and number of OpenMP threads. Any round-off error will be sensitive to the specific floating point values being summed, even though the sum would be invariant with respect to exact arithmetic. This nonreproducibility has been observed when changing an 8-byte integer vector representation to a 4-byte integer vector representation (which also changes the vector length and associated implicit exponents), but only in stand-alone contrived examples and never in E3SM. However, we cannot rule out nonreproducibility occurring in practice (without a more detailed analysis of the range of sums likely to occur in E3SM simulations).
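A two-line example shows why the particular summands matter even when the exact sum does not: under either grouping below, the three values have the same exact sum, but the rounded results differ in the last bit.

 program grouping_demo
    use iso_fortran_env, only: r8 => real64
    implicit none
    real(r8) :: e, s1, s2
    e  = 2.0_r8**(-53)       ! half an ulp of 1.0
    s1 = (1.0_r8 + e) + e    ! each addition rounds to 1.0 (ties to even)
    s2 = 1.0_r8 + (e + e)    ! e + e = 2**(-52) is exact, so s2 = 1.0 + 2**(-52)
    print *, s1 == s2        ! prints F
 end program grouping_demo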

The solution implemented here is to make the set of floating point values generated from the integer vector independent of the size of the vector and of the definition of the implicit exponents. Effectively, a new integer vector is generated from the original in which each component integer has a fixed number of significant digits (e.g., digits(1.0_r8)), and the floating point values are generated using this vector. As the number and values of the significant digits are unchanged between the two representations, the implied value is unchanged, but any rounding error in the summation using the new representation will be reproducible.
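The primitive that makes this possible is that cutting an r8 value at a power-of-two boundary is exact, so digits can be regrouped without changing the implied value. A sketch, again with hypothetical names and continuing the module above:

 ! Split v at the boundary 2**g: hi receives v's digits at or above 2**g,
 ! lo receives the rest, and v == hi + lo with no rounding anywhere.
 pure subroutine exact_split(v, g, hi, lo)
    real(r8), intent(in)  :: v
    integer,  intent(in)  :: g
    real(r8), intent(out) :: hi, lo
    hi = scale(aint(scale(v, -g)), g)   ! truncate the digits below 2**g; exact
    lo = v - hi                         ! exact: lo's digits are a subset of v's
 end subroutine exact_split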

Since digits(1.0_r8) would be too many digits for a 4-byte integer representation of the vector, and since we want to continue to support this option, each new floating point value is actually constructed from some number of floating point values with possibly fewer significant digits, but in such a way that their summation is exact (no rounding error possible) and is identical to the floating point value that would be generated if the integer vector had been modified as indicated in the previous paragraph.
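Putting the pieces together, here is a sketch of the regrouped final sum. It assumes, per the massaging described above, that the per-level values a(:) share one sign, each carries at most digits(1.0_r8) significant digits, and the per-cell additions never carry out of their digit windows; the names and grid bookkeeping are a reconstruction, not the PR's code:

 function regrouped_sum(a, etop) result(gsum)
    real(r8), intent(in) :: a(:)   ! per-level values, largest magnitude first
    integer,  intent(in) :: etop   ! exponent of the leading digit of the total
    real(r8) :: gsum
    integer, parameter :: nbits = digits(1.0_r8)   ! 53 for IEEE double
    real(r8) :: summand(size(a) + 2), hi, lo
    integer  :: lvl, cell, g
    summand = 0.0_r8
    do lvl = 1, size(a)
       if (a(lvl) == 0.0_r8) cycle
       cell = (etop - exponent(a(lvl)))/nbits + 1  ! first fixed-width cell touched
       g    = etop - cell*nbits                    ! that cell's lower boundary
       call exact_split(a(lvl), g, hi, lo)
       summand(cell)   = summand(cell)   + hi      ! exact: digits share one cell
       summand(cell+1) = summand(cell+1) + lo
    end do
    gsum = 0.0_r8
    do cell = size(summand), 1, -1                 ! smallest to largest, as before
       gsum = gsum + summand(cell)
    end do
 end function regrouped_sum

Because the cell grid is anchored at etop, a property of the value itself, the resulting summand set no longer depends on the vector length or the implicit exponents.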

Note that another way to address this issue is to further decrease the likelihood of rounding errors by calculating this final sum of floating point values using the DDPDD algorithm locally. (DDPDD is an alternative algorithm already available in the reprosum code.) DDPDD uses a two-floating-point-value representation of the intermediate values in the summation to approximately double the precision, i.e., for double precision values, summing in quad precision. In this case, round-off error is isolated to the least significant digit of the quad precision value, and it is even less likely (though not impossible) that this will impact the least significant digit of the resulting double precision value.
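For reference, the heart of a double-double accumulation step in the spirit of DDPDD is the standard Knuth two-sum followed by a renormalization; a sketch, not E3SM's implementation:

 ! Accumulate x into the running double-double sum (hi, lo).
 pure subroutine dd_add(x, hi, lo)
    real(r8), intent(in)    :: x
    real(r8), intent(inout) :: hi, lo
    real(r8) :: s, v, e
    s  = hi + x
    v  = s - hi
    e  = (hi - (s - v)) + (x - v)   ! exact rounding error of the addition hi + x
    lo = lo + e                     ! fold the error into the low word
    hi = s + lo                     ! renormalize so hi again holds the leading digits
    lo = lo - (hi - s)
 end subroutine dd_add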

This 'local' DDPDD approach was implemented, though it is not included in this commit: the proposed algorithm is preferred because it is always reproducible, not just almost always. However, comparing the two approaches in E3SM experiments, they are BFB, implying that the proposed algorithm improves accuracy as well. In similar experiments, the proposed algorithm is also BFB with the existing alternative 'distributed' DDPDD algorithm.

As implied, neither the proposed algorithm nor the two DDPDD approaches are BFB with the current default integer vector algorithm in E3SM simulations. The differences arise from differing round-off in the least significant digit, but this does accumulate over time. For example, for a 5 day run on Chrysalis using

 --compset WCYCL1850 --res ne30pg2_EC30to60E2r2 --pecount XS

(675 ATM processes) there are 3669 reprosum calls, all in ATM, and using the proposed algorithm (or 'local' DDPDD or 'distributed' DDPDD) for the final sum changes the result 200 times (so 5.5% of the time).

Looking at the actual difference (printing out the mantissas of the final sums in octal), it is always +/- 1 in the least significant binary digit. But this is still enough to perturb the model output, e.g. (from the atm.log):

 nstep, te 241   0.26199582846629281E+10   0.26199596541672482E+10   0.75730462729090879E-04   0.98528834650276956E+05

vs.

 nstep, te 241   0.26199694926722941E+10   0.26199707204570022E+10   0.67893797509715502E-04   0.98528681087863835E+05

Similarly for a 5 day high resolution F case

  --compset F2010 --res ne120pg2_r0125_oRRS18to6v3

(2560 ATM processes) there are 9609 reprosum calls, and using the proposed algorithm for the final sum changes the result 104 times (so 1.1% of the time). The difference in the atm.log output is

 nstep, te      481   0.26056729910814352E+10   0.26056734037400241E+10   0.45650283260804208E-04   0.98502662499849699E+05

vs.

 nstep, te      481   0.26056829597470064E+10   0.26056833721086311E+10   0.45617405260189644E-04   0.98502719483102352E+05
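The mantissa-in-octal diagnostic mentioned above can be reproduced by reinterpreting the r8 bit pattern as an integer; a sketch, not the instrumentation actually used:

 program octal_bits
    use iso_fortran_env, only: r8 => real64, i8 => int64
    implicit none
    real(r8) :: te
    te = 0.26199582846629281E+10_r8
    write(*,'(O22)') transfer(te, 0_i8)   ! the full 64-bit pattern in octal
 end program octal_bits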

Note that the proposed algorithm's conversion of the integer vector into a floating point value is a little more expensive than the original algorithm's, but the performance difference is in the noise: most of the cost of shr_reprosum_calc is in aspects common to the current and proposed algorithms, i.e., determination of the global max/min and number of summands, conversion of each floating point summand into the vector of integers representation, and the local and global sums of the integer vectors. For example, performance comparisons between the new and old approaches for calls to shr_reprosum_calc are inconclusive as to which is faster. Also note that the 'local' DDPDD algorithm is slightly more expensive than the proposed algorithm, but the difference is again in the noise.

@worleyph added the bug fix PR, Utils, and non-BFB (PR makes roundoff changes to answers) labels on Mar 27, 2023
@worleyph (Contributor, Author):

@rljacob, please add reviewers and integrator. What tests should I run for a non-BFB PR? Note that I am feeling very comfortable with this because it is BFB with the existing alternative DDPDD in my E3SM test simulations, as well as being BFB when changing from integer8 to integer4 in my standalone tests. The agreement with the DDPDD algorithm implies improved accuracy (getting the last significant bit correct more often) compared to the original algorithm.

Note that comments were updated to reflect the code changes. A new timer was also added to capture the cost of the final summation, to address @ambrad's question in the github issue discussion. I did not put any timing data in the PR description, but I can provide them if there is interest.

@amametjanov (Member):

e3sm_integration results from the current HEAD of this branch (e11ce2b) are at https://my.cdash.org/viewTest.php?buildid=2311359 .
Pat, please take a look at the errors/fails.

@worleyph (Contributor, Author) commented Mar 28, 2023

@amametjanov , thanks. There are 66 passes, 37 diffs, and 7 fails. I'm looking at the fails now (all seg. faults, apparently in reprosum, though I will need to chase this down). Please remind me - what are the different test types? Is there a query function to describe them? For example, all of the seg. fault fails are SMS_D tests (and all but two SMS_D tests have seg. faults).

@worleyph (Contributor, Author):

@amametjanov , correction: 8 seg. faults and 36 diffs. The other seg. fault is an ERS_D test (so the '_D' again).

@worleyph (Contributor, Author):

@amametjanov : okay. '_D' is debug, and bounds checking caught an error. I'll look into diagnosing, and then push a fix.

@rljacob (Member) commented Mar 28, 2023

Interesting that the FIDEAL case had no diffs. That's EAM with "ideal" physics.

The message of the subsequent fix commit reads:

The r8 values generated from the integer vector representation, to be
summed locally to generate the global sum, are held in the vector
summand_vector. The integer*8 vector, i8_gsum_level, is dimensioned as
(-(extra_levels-1):max_level), so is of length
(max_level+extra_levels). Exactly digits(1.0_r8) digits are extracted
from the integer vector representation for each r8 value, so at most

 ((max_level+extra_levels)*(digits(1_i8)))/digits(1.0_r8) + 1

r8 values are generated, which is bounded from above by

 (max_level+extra_levels)*(1 + (digits(1_i8)/digits(1.0_r8)))

and this is how summand_vector should be dimensioned.

[BFB]
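A declaration satisfying this bound might look as follows, using the names from the commit message (a sketch, not necessarily the PR's exact code):

 ! digits(1_i8) = 63 and digits(1.0_r8) = 53, so the integer division
 ! gives 1 and the bound is 2*(max_level + extra_levels) r8 summands.
 real(r8) :: summand_vector((max_level + extra_levels) &
                            *(1 + digits(1_i8)/digits(1.0_r8)))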
@worleyph (Contributor, Author):

I fixed a misdimensioned vector and reran one of the failed tests (ERS_D.ne4_oQU240.F2010.chrysalis_intel.eam-hommexx), and it passed this time. Should I try to rerun the e3sm_integration myself this time, or should @amametjanov run the tests?

@worleyph (Contributor, Author):

@amametjanov , I ran the e3sm integration test on Chrysalis. Now there are 44 DIFFs and 66 PASSes (and no FAILs). Is there a way that I could upload this to my.cdash.org, to make it easier to examine? Also, is the version of master that I checked out for this PR still BFB with the MASTER baselines used in the compares? Thanks.

@amametjanov (Member):

<test-root-dir> $ cime/CIME/Tools/wait_for_tests  ./*<test-id>*/TestStatus --no-wait -b e3sm_integration_intel_pr5560_geab5db2

There should only be NML-diffs (i.e. still BFB) between current master baselines and the fork of this branch from Mar 27.

@worleyph (Contributor, Author):

@amametjanov , thank you. The upload appears to have worked:

 ...
 Upload file: /lcrc/group/e3sm/ac.worleyph/scratch/chrys/Testing/20230329-0037/Done.xml to https://my.cdash.org/submit.php?project=E3SM&FileName=chrysalis___e3sm_integration_intel_pr5560_geab5db2___20230329-0037-ACME_Latest___XML___Done.xml&build=e3sm_integration_intel_pr5560_geab5db2&site=chrysalis&stamp=20230329-0037-ACME_Latest&MD5=2f097ea26dae1e859e717b8adb765c8f Size: 107
 Uploaded: /lcrc/group/e3sm/ac.worleyph/scratch/chrys/Testing/20230329-0037/Done.xml
 Submission successful

but I don't know where to go to view this. I see your entry on the dashboard at

 https://my.cdash.org/index.php?project=E3SM&date=2023-03-28

under Experimental. How can I view what I just uploaded?

Thanks.

@rljacob (Member) commented Mar 29, 2023

@worleyph click on the number "44" under "Fail" and you'll see the list of tests. Click on each name to see the TestStatus.log output. The "DIFF" string under "Completed" for each test tells us the only issue was baseline diffs. If the test failed in any other way, it would say "FAIL" instead of "DIFF".

@worleyph (Contributor, Author):

@rljacob , sorry - guess I was not clear. I can't find where the uploaded data is on the dashboard. All I can find is the results of the e3sm integration suite that Az ran this morning.

@worleyph (Contributor, Author):

Looking at the data manually, all of the PASSes are marked as "RUN" experiments, e.g.

 PASS ERIO.ne30_g16_rx1.A.chrysalis_intel RUN

while all of the DIFFs are marked as "(phase BASELINE)" experiments, e.g.

 DIFF SMS_Ln9.ne4_oQU240.F2010.chrysalis_intel.eam-outfrq9s (phase BASELINE)

Guess that makes sense. Only baseline comparisons will note the non-BFB behavior compared to current master. I can check the magnitudes of the diffs - how many do I need to check to be reassured that nothing unexpected is going on?

@amametjanov (Member):

Here is the new upload: https://my.cdash.org/viewTest.php?buildid=2311696 .

@worleyph (Contributor, Author):

@amametjanov , thank you.

@rljacob , how do I evaluate the output of the e3sm integration experiments to determine that the results are consistent with round-off level changes? Are there any tools to postprocess the xxx.cprnc.out files?

@rljacob (Member) commented Mar 29, 2023

No, we don't have any tools for those. The tests just look for the word DIFFERENT at the end. You can search for the string "RMS", which is only printed for variables that are different. But I can't remember how to interpret the actual values.

@worleyph (Contributor, Author):

So, what do you want me to do? There are 44 tests with differences (all between baseline and PR; all internal consistency tests pass), and each test has between ~50 and ~300 fields that differ, so say 9000 diffs to evaluate, however one does that. The format of the cprnc.out file is "familiar", but I can't remember how to read it either.

@ambrad (Member) commented Mar 30, 2023

One thing you might consider, based on your observation above re: DDPDD, is creating baselines using DDPDD and then seeing whether there are still 44 DIFFs against these.

@worleyph (Contributor, Author):

@ambrad , thank you for the suggestion. I have just now tested this for a single test that is a DIFF when compared to master (just to make sure that I have the process correct), and it was a PASS. Since the DDPDD and the integer vector algorithm share essentially zero code (and use completely different approaches), hopefully this will be sufficient if they are all BFB. I'll start this process tomorrow.

@worleyph (Contributor, Author):

I checked out the latest version of master (last night), modified shr_reprosum_mod to always use the DDPDD option, then created baselines for the E3SM integration test suite. I then ran the e3sm integration test suite with the PR, comparing with these new baselines:

 -b generate_ddpdd_baselines

When compared against unmodified master, the PR had 44 tests with DIFFs. When compared against 'generate_ddpdd_baselines', there were 16 DIFFs (and all others were PASS, except for 4 NLFAILs due to using the most recent version of master, and these all had BASELINE PASS results).

Note that there is no expectation that the new algorithm will be BFB with the DDPDD algorithm; I have just observed that BFB agreement seems to be the common case. That appears to hold true in these experiments as well.

I'll look at a few of these DIFFs, comparing the cprnc.out output for the unmodified master baselines with that for the ddpdd baselines, to see if anything looks peculiar.

Anything else I should do? What do others do when submitting a non-BFB PR?

@rljacob (Member) commented Mar 30, 2023

Usually we assert from the specifics of the code changes that a non-BFB change won't alter the climate statistics. If there's any doubt, there is a "non-BFB" test suite that will explicitly test that assumption, but it's undergoing some maintenance right now. After that, the only thing left is to do a 10-year fully coupled run and have experts look at the diagnostics compared to a baseline. I think you're fine and this can proceed.

@worleyph (Contributor, Author):

Just to be safe, I reran the comparison of the new shr_reprosum_mod default algorithm against the DDPDD algorithm using the same checkout of master (the version I checked out last night and used to generate the DDPDD baselines). Glad I did so, as this time there were 111 PASSes and only 1 DIFF (so the other 15 DIFFs in the previous comparison were due to non-BFB differences between the two versions of master).

I consider this to be a reasonable validation that the new algorithm is a round-off level change and not a bug.

The test with the DIFF was

 SMS.ne4_oQU240.F2010.chrysalis_intel.eam-thetahy_sl

in

 SMS.ne4_oQU240.F2010.chrysalis_intel.eam-thetahy_sl.C.20230330_193732_sr89md/

in

 /lcrc/group/e3sm/ac.worleyph/scratch/chrys

@philipwjones (Contributor) left a review:

Approve based on Pat's testing and quick visual pass

@rljacob (Member) commented Mar 31, 2023

@worleyph I appreciate the detailed explanation in the PR description but do you have a short description of these changes we can use in the merge commit message?

@worleyph (Contributor, Author):

@rljacob, how about the three-paragraph summary now at the top of this PR description (or do you need something even shorter)?

@worleyph (Contributor, Author) commented Apr 2, 2023

@rljacob, @ambrad, I poked at the one test that DIFFed:

 SMS.ne4_oQU240.F2010.chrysalis_intel.eam-thetahy_sl

repeating the test without comparing against the precomputed baselines, twice: once with the updated reprosum algorithm and once with DDPDD, then comparing these directly. This direct comparison was BFB (looking at cpl.hi.0001-01-06-00000.nc and atm.log, and comparing the output of each shr_reprosum_calc call using an instrumented version of shr_reprosum_mod.F90).

I then compared the cpl.hi.0001-01-06-00000.nc from the new DDPDD run with the cpl.hi.0001-01-06-00000.nc in

 /lcrc/group/e3sm/baselines/chrys/intel/generate_ddpdd_baselines/SMS.ne4_oQU240.F2010.chrysalis_intel.eam-thetahy_sl/

and there were DIFFs, even though this was for the same checkout of master. So, there appears to be some nondeterminism in this test on Chrysalis?

@ambrad (Member) commented Apr 2, 2023

 there appears to be some nondeterminism in this test on Chrysalis

For context, this test has been running in e3sm_developer on various platforms since approximately March 2020, so approximately three years. Of course, it's possible your code changes have revealed some nondeterminism, but I suggest that you analyze this a bit more before concluding that. In any case, if there is nondeterminism, a clear reproducer will be helpful; simply running it a few times in a row should have a nearly zero probability of showing different answers, given that it has run for three years without a problem.

@worleyph (Contributor, Author) commented Apr 2, 2023

@ambrad,

 For context, this test has been running in e3sm_developer on various platforms since approximately March 2020, so approx. three years.

Good to hear. Perhaps something weird occurred using the DDPDD algorithm on Chrysalis. (For example, I saw a hang in the MPI_Allreduce in DDPDD in an experiment two weeks ago on Chrysalis that I was not able to reproduce the next day.) Since the tests are not normally run this way, this is probably not worth pursuing, as the default algorithm is tested regularly. Since I got a BFB result for this remaining test when I reran, I'm comfortable with the results relevant to this PR. Thanks.

@rljacob (Member) commented Apr 3, 2023

Is this still a non-BFB PR ?

@worleyph (Contributor, Author) commented Apr 3, 2023

@rljacob, it is still a non-BFB PR. (It is just BFB with the alternative DDPDD algorithm on Chrysalis for the E3SM integration test suite.)

amametjanov added a commit that referenced this pull request Apr 6, 2023
(PR #5560)

Eliminate potential nonreproducibility in final local sum

(The commit message is the three-paragraph summary from the top of this PR description, with "Fixes #5276" and the [non-BFB] tag.)
@amametjanov amametjanov merged commit db598dc into master Apr 6, 2023
@amametjanov amametjanov deleted the worleyph/share/util/fix_potential_nonrepro branch April 6, 2023 16:45
Closes issue #5276: improving accuracy of shr_reprosum_calc