Add T-digest type and functions #5158
@JsonCreator
public TDigestType()
{
    super(new TypeSignature(StandardTypes.TDIGEST), Slice.class);
We can use TDigest as the native type instead of serializing and deserializing. See, for instance, LongTimestampType.
Applied comments. As of the last commit, this PR depends on airlift/airlift#873 for the correctness of the T-Digest implementation.
// TDigest requires that percentiles list be ordered. Sort the percentiles list and rearrange the output according to original percentile order.
List<Double> sortedPercentiles = Ordering.natural().sortedCopy(percentiles);
List<Double> valuesAtPercentiles = digest.valuesAt(sortedPercentiles);
Map<Double, Double> percentilesToValues = new HashMap<>();
for (int i = 0; i < sortedPercentiles.size(); i++) {
    percentilesToValues.put(sortedPercentiles.get(i), valuesAtPercentiles.get(i));
}
return percentiles.stream()
        .map(percentilesToValues::get)
        .collect(toImmutableList());
This may be more efficient with fastutil's indirect sorting (you should give it a try). Something like:
int[] indexes = new int[percentiles.size()];
double[] sortedPercentiles = new double[percentiles.size()];
for (int i = 0; i < indexes.length; i++) {
    indexes[i] = i;
    sortedPercentiles[i] = percentiles.get(i);
}
// Sort the percentiles and apply the same swaps to indexes, so that
// indexes[i] ends up holding the original position of sortedPercentiles[i].
it.unimi.dsi.fastutil.Arrays.quickSort(0, percentiles.size(), (a, b) -> Doubles.compare(sortedPercentiles[a], sortedPercentiles[b]), (a, b) -> {
    double tempPercentile = sortedPercentiles[a];
    sortedPercentiles[a] = sortedPercentiles[b];
    sortedPercentiles[b] = tempPercentile;
    int tempIndex = indexes[a];
    indexes[a] = indexes[b];
    indexes[b] = tempIndex;
});
List<Double> valuesAtPercentiles = digest.valuesAt(Doubles.asList(sortedPercentiles));
// Write each value back to the position its percentile came from.
// (Reading valuesAtPercentiles.get(indexes[i]) would apply the inverse permutation.)
double[] result = new double[valuesAtPercentiles.size()];
for (int i = 0; i < valuesAtPercentiles.size(); i++) {
    result[indexes[i]] = valuesAtPercentiles.get(i);
}
return Doubles.asList(result);
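For reference, the sort-then-restore step discussed in this thread can be sketched with the JDK alone. This is a minimal illustration, not the code from the PR: the class `PercentileOrder`, the method `valuesInOriginalOrder`, and the `sortedLookup` parameter (a stand-in for `digest.valuesAt`, assumed to require ascending input) are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

final class PercentileOrder
{
    private PercentileOrder() {}

    // Hypothetical helper: 'sortedLookup' stands in for digest.valuesAt and is
    // assumed to accept only an ascending list of percentiles.
    static List<Double> valuesInOriginalOrder(
            List<Double> percentiles,
            Function<List<Double>, List<Double>> sortedLookup)
    {
        int n = percentiles.size();
        Integer[] indexes = new Integer[n];
        for (int i = 0; i < n; i++) {
            indexes[i] = i;
        }
        // After sorting, indexes[k] is the original position of the k-th smallest percentile.
        Arrays.sort(indexes, Comparator.comparingDouble((Integer i) -> percentiles.get(i)));

        List<Double> sorted = new ArrayList<>(n);
        for (int k = 0; k < n; k++) {
            sorted.add(percentiles.get(indexes[k]));
        }
        List<Double> sortedValues = sortedLookup.apply(sorted);

        // Scatter: the k-th sorted value belongs at the position its percentile came from.
        Double[] result = new Double[n];
        for (int k = 0; k < n; k++) {
            result[indexes[k]] = sortedValues.get(k);
        }
        return Arrays.asList(result);
    }
}
```

Called with percentiles `[0.9, 0.1, 0.5]`, the lookup receives `[0.1, 0.5, 0.9]`, and the returned list holds the values for 0.9, 0.1 and 0.5, in that order. Unlike the map-based version, this does not hash boxed doubles, and duplicate percentiles need no special handling.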
Rebased and removed final commit.
---------

.. function:: merge(tdigest) -> tdigest
    :noindex:
Why :noindex: for all of these?
This is weird. Apparently, I can remove one arbitrary :noindex: but if I remove two, it fails.
Applied comments. Docs fixes are in a separate commit for easier review.
=========================
T-Digest Functions
=========================

Add an introductory sentence.
Could you please help me with that? Isn't the data structure description enough for an introduction?

.. function:: tdigest_agg(x) -> tdigest

    Returns the ``tdigest`` which is composed of all input values of ``x``.
"Composes all input values of x into a T-digest."
And we need to say what x is...
> And need to add what x is...

Should I say that x are numeric values? I think it's rather obvious. What else could I add here?
Well, what kind of numeric types work? I assume all of them, but then we should say so.

.. function:: tdigest_agg(x, w) -> tdigest

    Returns the ``tdigest`` which is composed of all input values of ``x`` using
Similar to above and explain x and w
It says per-item weight w, so I consider w explained. I added a note that w must be >= 1.
Any upper-bound (<) restriction for w?
No upper bound restriction.
I added information that x and w can be of any numeric type.
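To make the role of the per-item weight concrete, here is a small exact reference computation. It is an assumption-laden sketch, not the T-digest algorithm and not code from this PR: it treats `w` as a replication factor for `x` and uses a simple "first value whose cumulative weight reaches the target" rule, whereas a T-digest is approximate and may interpolate. The class `WeightedPercentileSketch` and the method `weightedPercentile` are hypothetical names; only the `w >= 1` check mirrors the restriction mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class WeightedPercentileSketch
{
    private WeightedPercentileSketch() {}

    // Exact, non-streaming reference: NOT how a T-digest computes its answer.
    static double weightedPercentile(double[] values, double[] weights, double percentile)
    {
        if (values.length == 0 || values.length != weights.length) {
            throw new IllegalArgumentException("values and weights must be non-empty and of equal length");
        }
        if (!(percentile >= 0 && percentile <= 1)) {
            throw new IllegalArgumentException("percentile must be in [0, 1]");
        }
        List<double[]> pairs = new ArrayList<>();
        double totalWeight = 0;
        for (int i = 0; i < values.length; i++) {
            // Mirrors the documented restriction that w must be >= 1 (also rejects NaN).
            if (!(weights[i] >= 1)) {
                throw new IllegalArgumentException("weight must be >= 1");
            }
            pairs.add(new double[] {values[i], weights[i]});
            totalWeight += weights[i];
        }
        pairs.sort(Comparator.comparingDouble((double[] pair) -> pair[0]));

        // Walk the values in ascending order until the cumulative weight reaches the target.
        double target = percentile * totalWeight;
        double cumulative = 0;
        for (double[] pair : pairs) {
            cumulative += pair[1];
            if (cumulative >= target) {
                return pair[0];
            }
        }
        return pairs.get(pairs.size() - 1)[0];
    }
}
```

Under these assumptions, giving the value 3 a weight of 8 next to two values of weight 1 moves the median to 3, just as if 3 had appeared eight times in the input.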
Applied most comments, need help with some.
Doc updates look good now.
Use T-digest instead of QuantileDigest in the implementation of
- `approx_percentile(x, percentage)`,
- `approx_percentile(x, weight, percentage)`.
This change doesn't apply to `approx_percentile(x, weight, percentage, accuracy)`.

Use T-digest instead of QuantileDigest in the implementation of
- `approx_percentile(x, percentages)`,
- `approx_percentile(x, weight, percentages)`.
Squashed fixup and rebased.
Also, use T-digest as internal structure for approx_percentile() functions
fixes #4975