Create warning for approx_distinct and approx_set with low maxStandardError #17433

Merged: 1 commit, Mar 16, 2022

@stevechuck stevechuck commented Mar 8, 2022

approx_distinct and approx_set can produce imprecise results when the max standard error is low. This commit issues a performance warning when approx_distinct or approx_set is invoked with a maxStandardError less than or equal to a threshold (currently set to 0.004).
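The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `StandardErrorWarningSketch` and its `check` method are hypothetical names, and only the message format is taken from the review thread.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for illustration; not the class changed by this PR.
final class StandardErrorWarningSketch
{
    private StandardErrorWarningSketch() {}

    // Returns the warning text when maxStandardError is at or below the threshold,
    // and an empty Optional otherwise.
    static Optional<String> check(String functionName, double maxStandardError, double threshold)
    {
        if (maxStandardError <= threshold) {
            return Optional.of(String.format(
                    Locale.ROOT,
                    "%s can produce low-precision results with the current standard error: %.4f (<=%.4f)",
                    functionName,
                    maxStandardError,
                    threshold));
        }
        return Optional.empty();
    }

    public static void main(String[] args)
    {
        // Above the threshold: no warning.
        System.out.println(check("approx_distinct", 0.0041, 0.004));
        // At or below the threshold: a warning message is produced.
        System.out.println(check("approx_set", 0.0035, 0.004));
    }
}
```

Returning an `Optional` keeps the sketch free of any collector type; in the PR itself the message is handed to the warning collector.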

Test plan

  • Unit tests
  • Testing on local client (presto-cli/target/presto-cli-*-executable.jar --catalog tpch --schema sf1 --debug)
SELECT approx_distinct(nationkey) FROM customer GROUP BY mktsegment;
_col0 
-------
    25 
    25 
    25 
    25 
    25 
(5 rows)

WARNING: approx_distinct can produce low-precision results with the current standard error: 0.0230 (<=0.0230)


SELECT approx_distinct(nationkey, 0.0229E0) FROM customer GROUP BY mktsegment;
_col0 
-------
    25 
    25 
    25 
    25 
    25 
(5 rows)

WARNING: approx_distinct can produce low-precision results with the current standard error: 0.0229 (<=0.0230)


SELECT approx_distinct(nationkey, 0.0245E0) FROM customer GROUP BY mktsegment;
_col0 
-------
    25 
    25 
    25 
    25 
    25 
(5 rows)

SELECT approx_set(nationkey) FROM customer GROUP BY mktsegment;
_col0                      
-------------------------------------------------
 02 0c 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0c 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0c 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+

WARNING: approx_set can produce low-precision results with the current standard error: 0.0163 (<=0.0163)


SELECT approx_set(nationkey, 0.01550E0) FROM customer GROUP BY mktsegment;
_col0                      
-------------------------------------------------
 02 0d 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0d 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0d 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+

WARNING: approx_set can produce low-precision results with the current standard error: 0.0155 (<=0.0163)

SELECT approx_set(nationkey, 0.02530E0) FROM customer GROUP BY mktsegment;
_col0                      
-------------------------------------------------
 02 0b 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0b 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
 c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
 40 0e c3 ec 81 e2 49 f2                         
 02 0b 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+

SELECT approx_distinct(nationkey, 0.0041E0), approx_set(nationkey, 0.0051E0) FROM customer GROUP BY mktsegment;

 _col0 |                      _col1                      
-------+-------------------------------------------------
    25 | 02 10 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
       | 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
       | 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
       | 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
       | c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
       | 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
       | 40 0e c3 ec 81 e2 49 f2                         
    25 | 02 10 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
       | 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+
       | 81 29 e1 30 00 63 ae 38 82 d2 87 39 00 fd db 53+
       | 00 58 3d 5b 41 3e 9a 5d 82 a9 3b 61 02 62 14 7c+
       | c0 88 eb 91 c2 f7 e1 95 80 bb 20 9d 42 ba 62 a7+
       | 81 9e 0c a9 01 d2 19 b4 00 e5 ec bb 80 20 08 de+
       | 40 0e c3 ec 81 e2 49 f2                         
    25 | 02 10 19 00 80 03 44 00 40 ec c9 06 c0 c5 d4 0f+
       | 00 34 2f 12 86 1d 34 1b 80 ae 15 28 80 63 df 28+

WARNING: approx_distinct can produce low-precision results with the current standard error: 0.0041 (<=0.0080)
WARNING: approx_set can produce low-precision results with the current standard error: 0.0051 (<=0.0080)
== RELEASE NOTES ==

General Changes

* Raise warnings on functions ``approx_distinct`` and ``approx_set`` producing low-precision results when the input standard error is too low. The threshold can be set through session property ``hyperloglog_standard_error_warning_threshold`` or config ``hyperloglog-standard-error-warning-threshold``, with a default value of ``0.4%``.

linux-foundation-easycla bot commented Mar 8, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: stevechuck / name: Steve Chuck (65ea71ae4c67b06ca5d73f7cd1661ab7a2678bc7)

@stevechuck
Contributor Author

@highker I couldn't request a review for some reason, could you have a look at this? thanks!

@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context)
PERFORMANCE_WARNING,
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough"));
}
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) {
double maxStandardError = LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR;
Contributor

This should be in the config/session properties.

Contributor Author

Could you provide more context on this? (classes, modules etc.)

Contributor

FeaturesConfig and SystemSessionProperties. There are a lot of examples in those two classes. Feel free to add one more flag.

Contributor

FeaturesConfig/SystemSessionProperties classes. See the pattern we use there

Contributor Author

Thanks for the pointers!

Contributor

@kaikalur @highker What configuration do we need here? We can rely on DefaultApproximateCountDistinctAggregation.DEFAULT_STANDARD_ERROR, right?

Contributor

Generally we want these things to be configurable so deployments (or individual queries) can use different values
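As a rough sketch of that idea, a deployment-wide default can be combined with an optional per-query override. The class below is a hypothetical stand-in and does not use Presto's FeaturesConfig or SystemSessionProperties; the property name and the 0.004 default are the ones mentioned in this PR.

```java
import java.util.Map;

// Hypothetical stand-in for a deployment config plus a per-query session;
// not Presto's FeaturesConfig or SystemSessionProperties.
final class ThresholdLookupSketch
{
    // Property name as it appears in this PR.
    static final String PROPERTY = "hyperloglog_standard_error_warning_threshold";

    private final double configDefault;

    ThresholdLookupSketch(double configDefault)
    {
        this.configDefault = configDefault;
    }

    // A session-level value, when present, takes precedence over the deployment default.
    double threshold(Map<String, String> sessionProperties)
    {
        String override = sessionProperties.get(PROPERTY);
        return override == null ? configDefault : Double.parseDouble(override);
    }

    public static void main(String[] args)
    {
        ThresholdLookupSketch lookup = new ThresholdLookupSketch(0.004);
        // No session override: the deployment default applies.
        System.out.println(lookup.threshold(Map.of()));
        // A single query overrides the threshold.
        System.out.println(lookup.threshold(Map.of(PROPERTY, "0.01")));
    }
}
```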

@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context)
PERFORMANCE_WARNING,
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough"));
}
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) {
Contributor

We should do this for approx_set as well.

Contributor

Let's handle approx_set as well in this PR?

Contributor

@highker highker left a comment

nits only

Comment on lines 75 to 77
boolean isApproxDistinctFunction(FunctionHandle functionHandle);

FunctionHandle approxDistinctFunction(Type valueType);
Contributor

use the whole word like "approximateCountDistinctFunction"

@@ -116,6 +117,8 @@
private final WarningCollector warningCollector;
private final FunctionResolution functionResolution;

private static final double LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR = 0.023;
Contributor

remove this one; the default value can be directly obtained in class DefaultApproximateCountDistinctAggregation. Make DEFAULT_STANDARD_ERROR public in that class

@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context)
PERFORMANCE_WARNING,
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough"));
}
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) {
double maxStandardError = LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR;
if (maxStandardError <= LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR) {
warningCollector.add(new PrestoWarning(
PERFORMANCE_WARNING,
String.format("approx_distinct can be a very expensive operation when the max standard error is too low (<=%f)", LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR)));
Contributor

The warning message is more like "approx_distinct can produce low-precision result with the existing standard error %s". Something like that.

@stevechuck stevechuck force-pushed the master branch 2 times, most recently from 74414f1 to c7d9d09, March 11, 2022 05:18
@stevechuck stevechuck changed the title Create warning for approx_distinct with low maxStandardError Create warning for approx_distinct and approx_set with low maxStandardError Mar 11, 2022
@stevechuck stevechuck requested review from kaikalur and highker March 11, 2022 06:38
Contributor

@highker highker left a comment

nits only; otherwise LGTM

Comment on lines 222 to 223
public static final String LOWEST_APPROXIMATE_COUNT_DISTINCT_MAX_STANDARD_ERROR = "lowest_approximate_count_distinct_max_standard_error";
public static final String LOWEST_APPROXIMATE_SET_MAX_STANDARD_ERROR = "lowest_approximate_set_max_standard_error";
Contributor

we can combine these two together into "hyperloglog_standard_error_warning_threshold"

Contributor Author

the DEFAULT_MAX_STANDARD_ERROR for approx_distinct and approx_set seems to have different values, which one should hyperloglog_standard_error_warning_threshold be defaulted to?

Contributor

That is probably fine. Because the underlying data structures are the same. We can use the same small number. The default as suggested by the user is 0.004.

@@ -110,6 +115,7 @@

private final Metadata metadata;
private final Analysis analysis;
private final Session session;
Contributor

move session right after warningCollector to match the order of the constructor parameters and assignments

Comment on lines 348 to 350
boolean nodeIsApproximateCountDistinctFunction = functionResolution.isApproximateCountDistinctFunction(analysis.getFunctionHandle(node));
boolean nodeIsApproximateSetFunction = functionResolution.isApproximateSetFunction(analysis.getFunctionHandle(node));
if (nodeIsApproximateCountDistinctFunction || nodeIsApproximateSetFunction) {
Contributor

we can drop the "node" prefix from these names: isApprox...

Comment on lines 365 to 374
if (maxStandardError <= lowestMaxStandardError) {
warningCollector.add(new PrestoWarning(PERFORMANCE_WARNING, String.format("%s can produce low-precision results with the current standard error: %.4f (<=%.4f)", functionName, maxStandardError, lowestMaxStandardError)));
}
Contributor

It would be good to separate into two functions. Because if a query contains both approx_distinct and approx_set, we should emit two warnings.

@Test
public void testApproxDistinctPerformanceWarning()
{
WarningCollector warningCollector = analyzeWithWarnings("SELECT approx_distinct(a) FROM t1 GROUP BY b");
Contributor

let's have a test to cover two functions in one query and assert we can get two warnings

Contributor Author

is there a way to add multiple warnings of the same type to warningCollector? it seems that WarningCollector stores warnings in a Map<WarningCode, PrestoWarning> so only one PERFORMANCE_WARNING would get added

Contributor

lol, shall we change it to a multi map?

Contributor Author

alright, I can work on that
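To illustrate the storage problem discussed in this thread, here is a simplified sketch contrasting a map keyed by warning code with a multimap-style structure. The class, enum, and method names are hypothetical and this is not Presto's WarningCollector; in particular, the single-slot version overwrites the earlier warning, which is an assumption about behavior made only for the demonstration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified collectors for illustration; not Presto's WarningCollector.
final class WarningStorageSketch
{
    enum Code { PERFORMANCE_WARNING }

    private WarningStorageSketch() {}

    // One slot per code: a later warning with the same code replaces the earlier one
    // (assumed behavior for this sketch).
    static List<String> collectWithMap(List<String> messages)
    {
        Map<Code, String> warnings = new LinkedHashMap<>();
        for (String message : messages) {
            warnings.put(Code.PERFORMANCE_WARNING, message);
        }
        return new ArrayList<>(warnings.values());
    }

    // A list per code: every warning is retained, in insertion order.
    static List<String> collectWithMultimap(List<String> messages)
    {
        Map<Code, List<String>> warnings = new LinkedHashMap<>();
        for (String message : messages) {
            warnings.computeIfAbsent(Code.PERFORMANCE_WARNING, code -> new ArrayList<>()).add(message);
        }
        List<String> all = new ArrayList<>();
        warnings.values().forEach(all::addAll);
        return all;
    }

    public static void main(String[] args)
    {
        List<String> messages = List.of("approx_distinct warning", "approx_set warning");
        System.out.println(collectWithMap(messages).size());
        System.out.println(collectWithMultimap(messages).size());
    }
}
```

With the list-per-code layout, a query that calls both approx_distinct and approx_set can surface both warnings even though they share the same warning code.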

@highker highker merged commit 65d08a1 into prestodb:master Mar 16, 2022
@mshang816 mshang816 mentioned this pull request May 17, 2022