Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set filter_and_project_min_output_page_row_count to 1 for final DistinctLimit #17640

Merged
merged 1 commit into from
Apr 25, 2022

Conversation

kaikalur
Copy link
Contributor

@kaikalur kaikalur commented Apr 13, 2022

Addressing issue:

#17631

For specific case of DistinctLimit

Test plan - tests already exist

== RELEASE NOTES ==
General Changes
* Added a new optimization for showing results for (interactive) distinct limit N as they become available with no buffering.  This can be enabled using the session param: quick_distinct_limit_enabled to true.

@rschlussel
Copy link
Contributor

what's the effect on queries that have lots of distinct values (didn't end up waiting a long time for results before)? Also, if the distinct limit isn't at the end of the query or it writes to a table, so the project isn't flushing straight to output.

@kaikalur
Copy link
Contributor Author

kaikalur commented Apr 14, 2022

Should be fine still because it's final step and the limit is generally low. Those are corner cases and they have other issues.

As for lot of distinct values, our RR shuffle fixes that part. Note that we are doing this for the FINAL distinct limit.

@kaikalur
Copy link
Contributor Author

what's the effect on queries that have lots of distinct values (didn't end up waiting a long time for results before)? Also, if the distinct limit isn't at the end of the query or it writes to a table, so the project isn't flushing straight to output.

Also the issue we had with was not so much about having too many distinct values - it's about the limit - when the limit is greater than the number of distinct values, it was getting stuck. When the limit is lower than the distinct values, it's fast and that behavior hasn't changed as we simply short circuit the distinct limit.

@kaikalur
Copy link
Contributor Author

Ping - anything else needed?

@rschlussel
Copy link
Contributor

what's the effect on queries that have lots of distinct values (didn't end up waiting a long time for results before)? Also, if the distinct limit isn't at the end of the query or it writes to a table, so the project isn't flushing straight to output.

Also the issue we had with was not so much about having too many distinct values - it's about the limit - when the limit is greater than the number of distinct values, it was getting stuck. When the limit is lower than the distinct values, it's fast and that behavior hasn't changed as we simply short circuit the distinct limit.

Yeah, i understand how this fixes the limit issue. My concern is when there are a lot of distinct values (where it would have worked fine before) you can now end up creating lots of pages. Can we gate this with the distinct limit session property in case things go wrong?

@kaikalur
Copy link
Contributor Author

what's the effect on queries that have lots of distinct values (didn't end up waiting a long time for results before)? Also, if the distinct limit isn't at the end of the query or it writes to a table, so the project isn't flushing straight to output.

Also the issue we had with was not so much about having too many distinct values - it's about the limit - when the limit is greater than the number of distinct values, it was getting stuck. When the limit is lower than the distinct values, it's fast and that behavior hasn't changed as we simply short circuit the distinct limit.

Yeah, i understand how this fixes the limit issue. My concern is when there are a lot of distinct values (where it would have worked fine before) you can now end up creating lots of pages. Can we gate this with the distinct limit session property in case things go wrong?

Note that this is completely indepedent of the total number of distinct values. It will have at most N pages if the LIMIT is N as we do this for only the FINAL distinctlimit. But sure I can rename and use the previous flag.

@kaikalur kaikalur force-pushed the fix_distinct_limit branch from 2f04de2 to 166f3e3 Compare April 21, 2022 17:06
@kaikalur
Copy link
Contributor Author

what's the effect on queries that have lots of distinct values (didn't end up waiting a long time for results before)? Also, if the distinct limit isn't at the end of the query or it writes to a table, so the project isn't flushing straight to output.

Also the issue we had with was not so much about having too many distinct values - it's about the limit - when the limit is greater than the number of distinct values, it was getting stuck. When the limit is lower than the distinct values, it's fast and that behavior hasn't changed as we simply short circuit the distinct limit.

Yeah, i understand how this fixes the limit issue. My concern is when there are a lot of distinct values (where it would have worked fine before) you can now end up creating lots of pages. Can we gate this with the distinct limit session property in case things go wrong?

Note that this is completely indepedent of the total number of distinct values. It will have at most N pages if the LIMIT is N as we do this for only the FINAL distinctlimit. But sure I can rename and use the previous flag.

Done.

I renamed the previous flag we were using for the RR shuffle as quick_distinct_limit_enabled so we can do these two together. Please take a look.

@kaikalur kaikalur force-pushed the fix_distinct_limit branch from 166f3e3 to fa56678 Compare April 21, 2022 17:35
@kaikalur kaikalur force-pushed the fix_distinct_limit branch from fa56678 to 2fe7ca1 Compare April 21, 2022 17:43
@rschlussel rschlussel merged commit d5290b5 into prestodb:master Apr 25, 2022
@mshang816 mshang816 mentioned this pull request May 17, 2022
14 tasks
rmarduga pushed a commit to rmarduga/presto that referenced this pull request Jul 19, 2022
Often times, while specifying the joining condition in JOIN ON clause,
users make mistakes and not referencing the joining table along side
with other table in join condition that result in performance issue due
to conditional join resulting in CROSS JOIN.

This change will reduce the number of cases when user is shown the
PERFORMANCE_WARNING when JOIN ON clause is actually correct.

See also: prestodb#17333, prestodb#17640
highker pushed a commit that referenced this pull request Jul 19, 2022
Often times, while specifying the joining condition in JOIN ON clause,
users make mistakes and not referencing the joining table along side
with other table in join condition that result in performance issue due
to conditional join resulting in CROSS JOIN.

This change will reduce the number of cases when user is shown the
PERFORMANCE_WARNING when JOIN ON clause is actually correct.

See also: #17333, #17640
shrinidhijoshi pushed a commit to shrinidhijoshi/presto that referenced this pull request Sep 7, 2022
Often times, while specifying the joining condition in JOIN ON clause,
users make mistakes and not referencing the joining table along side
with other table in join condition that result in performance issue due
to conditional join resulting in CROSS JOIN.

This change will reduce the number of cases when user is shown the
PERFORMANCE_WARNING when JOIN ON clause is actually correct.

See also: prestodb#17333, prestodb#17640
@kaikalur kaikalur mentioned this pull request Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants