-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove redundant distinct over group by #18512
Remove redundant distinct over group by #18512
Conversation
ab20709
to
e20e813
Compare
e20e813
to
bcb24f9
Compare
@@ -238,6 +238,7 @@ | |||
private String nativeExecutionExecutablePath = "./presto_server"; | |||
private boolean randomizeOuterJoinNullKey; | |||
private boolean isOptimizeConditionalAggregationEnabled; | |||
private boolean isRemoveRedundantDistinctAggregationEnabled; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sest default to true
Let's make the default to true as this is a safe and general optimization. |
Is this functionally different from the existing rule RemoveRedundantDistinct? |
Interesting - I thought that's coming from uniqueness constraints. But this is using the planproperties. If/when we unify these two concepts we can get rid of one of them. |
bcb24f9
to
bec5eea
Compare
bec5eea
to
a080ac3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please rebase as well
new PruneRedundantProjectionAssignments(), | ||
new InlineProjections(metadata.getFunctionAndTypeManager()), | ||
new RemoveRedundantIdentityProjections())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same again... do we really need to run this again? Or we can just merge RemoveRedundantDistinctAggregation
into the above rule?
Always keep in mind that running optimizer is costly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this optimizer rule and get rid of these additional projection rules.
@@ -238,6 +238,7 @@ | |||
private String nativeExecutionExecutablePath = "./presto_server"; | |||
private boolean randomizeOuterJoinNullKey; | |||
private boolean isOptimizeConditionalAggregationEnabled; | |||
private boolean isRemoveRedundantDistinctAggregationEnabled = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure we want to enable it by default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure we want to enable it by default?
Yes this is a very general optimization that should always help and we are being very conservative and also adding more tests for making sure it works for different patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main worry is that if there is a bug in the code and it would big pain to do fixes in prod. If we have high confidence with full correctness in verifiers, I'm also ok either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main worry is that if there is a bug in the code and it would big pain to do fixes in prod. If we have high confidence with full correctness in verifiers, I'm also ok either way.
Yes - we will do some targeted verifier runs as the pattern is relatively easy to look for in the logs.
Main issue is if we add something turned off we rarely turn it on - e.g optimize_nulls_in_join has been there for ~2 years but we never turned it on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes - we will do some targeted verifier runs as the pattern is relatively easy to look for in the logs.
Yeah, I will run verifier test and report here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get about 40 queries which trigger this optimization, with most queries showing about 20% reduction in cpu time.
} | ||
} | ||
|
||
private class Rewriter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@feilong-liu like we discussed yesterday, add more tests to the harness: select distinct x, random() from (.. group by x) |
a080ac3
to
463c909
Compare
Can we rebase and push? |
4fbf590
to
2cc71b9
Compare
Excuse me but this problem is already handled in a very general way by #16416. If you there are specific use cases that are not covered by that implementation or if you have difficulty understanding it please shoot me an email at dave@ahana.io and I will be happy to discuss. Please close this PR in the meantime. Thanks. |
Quick addendum to the last comment. The work of #16416 is disabled by default (it shouldn't be but that is the only way that could get reviewers to approve). If you enable it you will likely see that the existing rule RemoveRedundantDistinct will have already done the job. In fact there is another rule RemoveRedundantAggregateDistinct that will remove distinct specification from aggregate functions if based on preexisting keys or provable max cardinality of 1. I would be thrilled if you would extend this implementation to cover any additional use cases. Again, available to discuss dave@ahana.io |
Remove distinct if the corresponding output is already distinct after a group by operation.
2cc71b9
to
bd1aec4
Compare
Yes we should definitely look into integrating these properties but we currently don't have bandwidth to test it as the constraints PR is a rather big one that touches a lot of parts (also the reason to merge it disabled by default). So for now, we will get this PR in and look into unifying these things later - tracking issue: #18547 |
What's the change?
Add an optimization which remove distinct if the corresponding output is already distinct after a group by operation.
An example is query "SELECT DISTINCT orderpriority, SUM(totalprice) FROM orders GROUP BY orderpriority", where the distinct operation is redundant.
Test plan - (Please fill in how you tested your changes)
Add unit test.
Benchmark results
Sql query:
select distinct orderkey, partkey, suppkey, avg(extendedprice) from lineitem group by orderkey, partkey, suppkey
.Control: 100.324 cpu ms
Test: 71.382 cpu ms