-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create warning for approx_distinct and approx_set with low maxStandardError #17433
Conversation
@highker I couldn't request a review for some reason, could you have a look at this? thanks! |
@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context) | |||
PERFORMANCE_WARNING, | |||
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough")); | |||
} | |||
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) { | |||
double maxStandardError = LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be in the config/session properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you provide more context on this? (classes, modules etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureConfigs
and SystemSessionProperties
. There are a lot of examples in those two classes. Feel free to add one more flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeaturesConfig/SystemSessionProperties classes. See the pattern we use there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointers!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally we want these things to be configurable so deployments (or individual queries) can use different values
@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context) | |||
PERFORMANCE_WARNING, | |||
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough")); | |||
} | |||
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WE should do this for approx_set as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets handle approx_set as well in this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits only
boolean isApproxDistinctFunction(FunctionHandle functionHandle); | ||
|
||
FunctionHandle approxDistinctFunction(Type valueType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the whole word like "approximateCountDistinctFunction"
@@ -116,6 +117,8 @@ | |||
private final WarningCollector warningCollector; | |||
private final FunctionResolution functionResolution; | |||
|
|||
private static final double LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR = 0.023; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this one; the default value can be directly obtained in class DefaultApproximateCountDistinctAggregation
. Make DEFAULT_STANDARD_ERROR
public in that class
@@ -335,6 +338,21 @@ protected Boolean visitFunctionCall(FunctionCall node, Void context) | |||
PERFORMANCE_WARNING, | |||
"COUNT(DISTINCT xxx) can be a very expensive operation when the cardinality is high for xxx. In most scenarios, using approx_distinct instead would be enough")); | |||
} | |||
if (functionResolution.isApproxDistinctFunction(analysis.getFunctionHandle(node))) { | |||
double maxStandardError = LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureConfigs
and SystemSessionProperties
. There are a lot of examples in those two classes. Feel free to add one more flag
if (maxStandardError <= LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR) { | ||
warningCollector.add(new PrestoWarning( | ||
PERFORMANCE_WARNING, | ||
String.format("approx_distinct can be a very expensive operation when the max standard error is too low (<=%f)", LOWEST_APPROX_DISTINCT_MAX_STANDARD_ERROR))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning message is more like "approx_distinct can produce low-precision result with the existing standard error %s". Something like that.
74414f1
to
c7d9d09
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nits only; otherwise LGTM
public static final String LOWEST_APPROXIMATE_COUNT_DISTINCT_MAX_STANDARD_ERROR = "lowest_approximate_count_distinct_max_standard_error"; | ||
public static final String LOWEST_APPROXIMATE_SET_MAX_STANDARD_ERROR = "lowest_approximate_set_max_standard_error"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can combine these two together into "hyperloglog_standard_error_warning_threshold"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the DEFAULT_MAX_STANDARD_ERROR for approx_distinct and approx_set seems to have different values, which one should hyperloglog_standard_error_warning_threshold be defaulted to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is probably fine. Because the underlying data structures are the same. We can use the same small number. The default as suggested by the user is 0.004
.
@@ -110,6 +115,7 @@ | |||
|
|||
private final Metadata metadata; | |||
private final Analysis analysis; | |||
private final Session session; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move session right after warningCollector to match the order to constructor parameters and assignment
boolean nodeIsApproximateCountDistinctFunction = functionResolution.isApproximateCountDistinctFunction(analysis.getFunctionHandle(node)); | ||
boolean nodeIsApproximateSetFunction = functionResolution.isApproximateSetFunction(analysis.getFunctionHandle(node)); | ||
if (nodeIsApproximateCountDistinctFunction || nodeIsApproximateSetFunction) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove node
: isApprox...
if (maxStandardError <= lowestMaxStandardError) { | ||
warningCollector.add(new PrestoWarning(PERFORMANCE_WARNING, String.format("%s can produce low-precision results with the current standard error: %.4f (<=%.4f)", functionName, maxStandardError, lowestMaxStandardError))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to separate into two functions. Because if a query contains both approx_distinct and approx_set, we should emit two warnings.
@Test | ||
public void testApproxDistinctPerformanceWarning() | ||
{ | ||
WarningCollector warningCollector = analyzeWithWarnings("SELECT approx_distinct(a) FROM t1 GROUP BY b"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's have a test to cover two functions in one query and assert we can get two warnings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to add multiple warnings of the same type to warningCollector? it seems that WarningCollector stores warnings in a Map<WarningCode, PrestoWarning> so only one PERFORMANCE_WARNING would get added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, shall we change it to a multi map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright, I can work on that
approx_distinct and approx_set can produce imprecise results when the max standard error is low, this commit issues a performance warning when approx_distinct or approx_set is invoked with a maxStandardError less than or equal to the threshold (currently set to 0.004).
Test plan
presto-cli/target/presto-cli-*-executable.jar --catalog tpch --schema sf1 --debug
)SELECT approx_distinct(nationkey, 0.0041E0), approx_set(nationkey, 0.0051E0) FROM customer GROUP BY mktsegment;