Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add T-digest type and functions #5158

Merged
merged 11 commits into from
Oct 4, 2020
Merged

Add T-digest type and functions #5158

merged 11 commits into from
Oct 4, 2020

Conversation

kasiafi
Copy link
Member

@kasiafi kasiafi commented Sep 14, 2020

Also, use T-digest as internal structure for approx_percentile() functions

fixes #4975

@cla-bot cla-bot bot added the cla-signed label Sep 14, 2020
@martint martint self-requested a review September 14, 2020 16:26
@kasiafi kasiafi force-pushed the 124Tdigest branch 7 times, most recently from 7a3e140 to 4782cbf Compare September 16, 2020 15:34
@JsonCreator
public TDigestType()
{
super(new TypeSignature(StandardTypes.TDIGEST), Slice.class);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use TDigest as the native type instead of serializing and deserializing. See, for instance, LongTimestampType

@kasiafi
Copy link
Member Author

kasiafi commented Sep 22, 2020

Applied comments.

In the last commit, TDigest.valuesAt() method is used whenever values are requested for array of percentiles.

This PR depends on airlift/airlift#873 for the correctness of T-Digest implementation.

Comment on lines 117 to 126
// TDigest requires that percentiles list be ordered. Sort the percentiles list and rearrange the output according to original percentile order.
List<Double> sortedPercentiles = Ordering.natural().sortedCopy(percentiles);
List<Double> valuesAtPercentiles = digest.valuesAt(sortedPercentiles);
Map<Double, Double> percentilesToValues = new HashMap<>();
for (int i = 0; i < sortedPercentiles.size(); i++) {
percentilesToValues.put(sortedPercentiles.get(i), valuesAtPercentiles.get(i));
}
return percentiles.stream()
.map(percentilesToValues::get)
.collect(toImmutableList());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be more efficient with fastutil's indirect sorting (you should give it a try). Something like:

int[] indexes = new int[percentiles.size()];
double[] sortedPercentiles = new double[percentiles.size()];
for (int i = 0; i < indexes.length; i++) {
    indexes[i] = i;
    sortedPercentiles[i] = percentiles.get(i);
}

it.unimi.dsi.fastutil.Arrays.quickSort(0, percentiles.size(), (a, b) -> Doubles.compare(sortedPercentiles[a], sortedPercentiles[b]), (a, b) -> {
    double tempPercentile = sortedPercentiles[a];
    sortedPercentiles[a] = sortedPercentiles[b];
    sortedPercentiles[b] = tempPercentile;

    int tempIndex = indexes[a];
    indexes[a] = indexes[b];
    indexes[b] = tempIndex;
});

List<Double> valuesAtPercentiles = digest.valuesAt(Doubles.asList(sortedPercentiles));
List<Double> result = new ArrayList<>(valuesAtPercentiles.size());
for (int i = 0; i < valuesAtPercentiles.size(); i++) {
    result.add(valuesAtPercentiles.get(indexes[i]));
}

return result;

@kasiafi
Copy link
Member Author

kasiafi commented Sep 29, 2020

Rebased and removed final commit.

@kasiafi
Copy link
Member Author

kasiafi commented Sep 29, 2020

@electrum @mosabua could you please review the following commits? They include documentation for T-digest and related fixes.

  • Fix parameter names in docs for approx_percentile functions d33ada7
  • Add documentation for T-digest type and functions 32cb2a2
  • Fix return type in qdigest function documentation a2d143b

presto-docs/src/main/sphinx/functions/tdigest.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/functions/tdigest.rst Outdated Show resolved Hide resolved
---------

.. function:: merge(tdigest) -> tdigest
:noindex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why :noindex: for all of these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird. Apparently, I can remove one arbitrary :noindex: but it I remove two, it fails.

presto-docs/src/main/sphinx/language/types.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/language/types.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/language/types.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/functions/tdigest.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/functions/qdigest.rst Outdated Show resolved Hide resolved
@kasiafi
Copy link
Member Author

kasiafi commented Sep 30, 2020

Applied comments. Docs fixes are in separate commit for easier review.

=========================
T-Digest Functions
=========================

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a introductory sentence

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help me with that? Isn't the data structure description enough for an introduction?


.. function:: tdigest_agg(x) -> tdigest

Returns the ``tdigest`` which is composed of all input values of ``x``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Composes all input values of x into a T-digest.

And need to add what x is...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And need to add what x is...

Should I say that x are numeric values? I think it's rather obvious. What else could I add here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well... what kind of numeric types work? I assume all .. but then we should say that


.. function:: tdigest_agg(x, w) -> tdigest

Returns the ``tdigest`` which is composed of all input values of ``x`` using
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above and explain x and w

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says per-item weight w, so I consider w explained. I added a note that w must be >= 1.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any < restriction for w?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No upper bound restriction.
I added information that x and w can be of any numeric type.

@kasiafi
Copy link
Member Author

kasiafi commented Oct 1, 2020

Applied most comments, need help with some.
Also, in section Aggregate Functions added links to tdigest_agg() functions, which I found missing.

@kasiafi kasiafi force-pushed the 124Tdigest branch 2 times, most recently from eaaeea3 to 25e18db Compare October 2, 2020 12:01
@kasiafi kasiafi force-pushed the 124Tdigest branch 2 times, most recently from 6f5e2d4 to f79a904 Compare October 2, 2020 16:14
Copy link
Member

@mosabua mosabua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc updates look good now.

kasiafi added 10 commits October 3, 2020 22:49
Use T-digest instead of QuantileDigest in the implementation of
- `approx_percentile(x, percentage)`,
- `approx_percentile(x, weight, percentage)`.
This change doesn't apply to
approx_percentile(x, weight, percentage, accuracy).
Use T-digest instead of QuantileDigest in the implementation of
- `approx_percentile(x, percentages)`,
- `approx_percentile(x, weight, percentages)`.
@kasiafi
Copy link
Member Author

kasiafi commented Oct 3, 2020

Squashed fixup and rebased.

@martint martint merged commit 68ec923 into trinodb:master Oct 4, 2020
@martint martint added this to the 344 milestone Oct 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Support for tdigest_agg?
3 participants