Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[OTEL-2348] Improve DDSketch to Sketch conversion #464

Closed
wants to merge 3 commits into from

Conversation

jade-guiton-dd
Copy link

@jade-guiton-dd jade-guiton-dd commented Jan 16, 2025

What does this PR do?

This PR:

  1. Modifies the algorithm used to round bin counts to integers as part of the conversion from DDSketch to Sketch (pkg/quantile/ddsketch.go)
  2. Updated the test files in pkg/otlp/metrics/testdata to reflect the new algorithm's output
  3. Fixes a bug in the linear interpolation in Sketch.Quantile (pkg/quantile/sparse.go)
  4. Removes systematic bias introduced in the conversion (pkg/quantile/ddsketch.go)
  5. Changes the calculation of the bound on relative error in the tests (pkg/quantile/ddsketch_test.go)

The changes themselves are pretty small, but somewhat technical, so if this is too much to review, I can split this PR into smaller ones.

Motivation for changes 1 and 2

I recently noticed that some tests were failing on main, but only when running locally on my M3 Macbook. Here are the test logs (after renaming the test cases to more accurate names):

=== RUN   TestConvertDDSketchIntoSketch/U-quadratic_distribution_(a=-N,b=0)
    ddsketch_test.go:345:
                Error Trace:    [...]/opentelemetry-mapping-go/pkg/quantile/ddsketch_test.go:345
                Error:          Relative error is too high: 0.023252087955899094 (expected)
                                        < 1 (actual)
                Test:           TestConvertDDSketchIntoSketch/U-quadratic_distribution_(a=-N,b=N)#01
                Messages:       error too high for p99
=== RUN   TestConvertDDSketchIntoSketch/Truncated_Normal_distribution_(a=0,b=8,mu=0,_sigma=1e-3)
    ddsketch_test.go:345:
                Error Trace:    [...]/opentelemetry-mapping-go/pkg/quantile/ddsketch_test.go:345
                Error:          Relative error is too high: 0.023252087955899094 (expected)
                                        < 3027.111649598558 (actual)
                Test:           TestConvertDDSketchIntoSketch/Truncated_Normal_distribution_(a=0,b=8,mu=0,_sigma=1e-3)
                Messages:       error too high for p99

These tests compute the percentiles of a distribution before and after the DDSketch to Sketch conversion, and check that the relative error introduced is below some bound.

After investigating, this turned out to be because the rounding algorithm currently used to turn float-valued DDSketches into integer-valued Sketches is somewhat numerically unstable. The algorithm rounds bin counts down to an integer, but keeps track of the accumulated rounding error, and adds 1 extra unit to the current bin once it reaches 1.0.

Unfortunately, because of floating point errors, this accumulated error often ends up reaching 0.9999... in cases where it should be 1.0, which delays the addition of the extra unit until the next bin with a non-integer count, which can end up being far away from the logical range the extra unit should have been assigned to. Depending on the distribution, this transfer of a single unit between bins can end up changing the result of the quantile calculations.

The test already skipped the comparison for certain quantiles, presumably because of this issue. However, because of differences in floating-point calculations between x86 and ARM CPUs, this issue is triggered in different cases on the different architectures, hence the platform-specific failures.

In the first case, P99 shifted from a very small negative value to exactly zero, which is a small absolute error, but the relative error is mathematically undefined and defaults to 1. In the second case, P99 shifted from an accurate ~0.0025 to ~8.0, which is close to the P100 (ie. max) of the (very long-tailed) distribution, inducing a 300 000% relative error.

To solve this issue at the source, I modified the rounding algorithm so that it inserts the extra units determined from accumulated rounding error when said error reaches 0.5, instead of 1.0. This is done differently from the original algorithm, by tracking the sum of the input float counts and the output integer counts, and using math.Round to compute the difference. This adds the extra unit somewhere in the middle of its logical range of bins, instead of at/past the end, and should not present the same numerical instability on the computed quantiles, at least in the common case where the input DDSketch has integer counts. (When the DDSketch has fractional counts, we can still potentially see large shifts, please see the long comment in the code for an example.)

Switching to this algorithm fixes the failing tests, and also allows us to remove the previous exceptions that were made in the test. However, this changes the exact output of the algorithm in a lot of cases, so I had to update the test files in pkg/otlp/metrics/testdata for the TestExponentialHistogramTranslatorOptions test, which expect the output bin counts to have very precise values.

Note that the test failure is very brittle, such that even just slightly changing the number of points added to the DDSketch under test would have actually fixed it, so the passing tests are not a good indicator of the new algorithm being more accurate or stable. But I hope my theoretical explanation above is enough to convince that it is at least somewhat more stable.

Motivation for changes 3 and 4

As part of my investigation, I noticed a few parts of the calculation that seemed incorrect.

One of them is in Sketch.Quantile(), which, after finding the bin containing the wanted quantile, performs linear interpolation between its bounds to further improve the estimation. However, the bounds are flipped for negative bins, giving results that are incorrect, although still in the correct bin, so not very far off.

The second one is in createDDSketchWithSketchMapping, which remaps the original DDSketch to one with the same bins as the target Sketch. Despite the bin boundary calculation being the same between the two data structures, this code introduces a + 0.5 bias in the calculation, resulting in all bin boundaries being off by about half a bin.

I confirmed experimentally that both of these issues introduce unnecessary bias to the results.

There is a comment in the original code justifying the + 0.5 modification, but I do not understand its reasoning. I am interested if someone has an explanation.

Motivation for change 5

Another piece of code that seemed incorrect is the calculation of the tolerated bound on the relative error on the quantiles introduced by the conversion, in the testing code. (Link to the current code)

A comment provides the link to the other codebase this calculation was taken from, but that code seems to be calculating a bound on the error introduced by the process of merging two DDSketches, which has little relation to the error introduced by the rounding algorithm. (By the way, the Javadoc referenced in that code is available at this link.)

I replaced this calculation with a new one, with a long comment explaining the reasoning behind it. Ironically, the result is not very different (old bound: ~0.02325 / new bound: ~0.02356), which is why this change does not really affect the test results.

@jade-guiton-dd
Copy link
Author

jade-guiton-dd commented Jan 17, 2025

After trying to fix the failing tests, I think the code (both the processing code and the testing code) is in need of a much bigger fixup, or even potentially a full rewrite, which would probably require changes in the backend as well. I can't handle that myself right now, so I will close this PR and make a new one with a smaller set of changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant