Add a local round robin shuffle before DistinctLimitPartial #17536
Conversation
Force-pushed from 9356c91 to 54512ea
Force-pushed from 8a4588b to 72dc7d0
I don't understand how this works. Why does adding a local round robin exchange remove the need for a remote exchange before the final distinct limit? Won't it just depend on whether the partial distinct limit is distributed or on a single node? Also, local exchanges get added after remote exchanges, so the remote exchange would already be there. Maybe some plan examples would help.
Also, make sure the commit message follows the commit message guidelines: https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#commit-formatting-and-pull-requests
@@ -207,7 +209,7 @@ public PlanWithProperties visitGroupId(GroupIdNode node, HashComputationSet pare
     public PlanWithProperties visitDistinctLimit(DistinctLimitNode node, HashComputationSet parentPreference)
     {
         // skip hash variable generation for single bigint
-        if (canSkipHashGeneration(node.getDistinctVariables())) {
+        if (canSkipHashGeneration(node.getDistinctVariables()) || SystemSessionProperties.isRoundRobinShuffleBeforePartialAgg(session)) {
Can you explain what's going on with skipping hash generation?
(Also should match on the plan here rather than session properties - but if this code stays, should be static import)
Done - now removed this change
@@ -264,8 +265,17 @@ public PlanWithProperties visitDistinctLimit(DistinctLimitNode node, StreamPrefe
         StreamPreferredProperties requiredProperties;
         StreamPreferredProperties preferredProperties;
         if (node.isPartial()) {
             requiredProperties = parentPreferences.withoutPreference().withDefaultParallelism(session);
             preferredProperties = parentPreferences.withDefaultParallelism(session);
             if (SystemSessionProperties.isRoundRobinShuffleBeforePartialAgg(session)) {
static import isRoundRobinShuffle...
done
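For readers following along, the shape of the change being reviewed in this visitDistinctLimit diff is roughly the sketch below. This is a hedged sketch, not the merged code: in particular, fixedParallelism() is an assumption about how the local round robin exchange would be requested below the partial distinct limit.

```java
// Hedged sketch of the visitDistinctLimit change (not the exact PR code).
// Assumption: fixedParallelism() is how a local round robin (arbitrary) exchange
// gets requested below the partial distinct limit when the session property is on.
if (node.isPartial()) {
    if (isRoundRobinShuffleBeforePartialAgg(session)) {
        // force a multi-driver local exchange below the partial distinct limit
        requiredProperties = fixedParallelism();
        preferredProperties = fixedParallelism();
    }
    else {
        requiredProperties = parentPreferences.withoutPreference().withDefaultParallelism(session);
        preferredProperties = parentPreferences.withDefaultParallelism(session);
    }
}
```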
No, it doesn't, but removing the precomputed hash does. That's what we are using to indicate the shuffle. Very odd. But the local round robin seems to reduce the size of the partial distinct limit output dramatically (to 1/10-1/50).

Before:

After:
Also, this should have plan tests in addition to the correctness test you added to make sure the plan is as expected for all cases.
I'm generally quite skeptical about the usefulness of plan tests but I will add them :(
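For reference, such a plan test could look roughly like the sketch below. This is a hedged sketch only: the session property name, the test class name, and the exact expected tree shape (including any intervening project nodes) are assumptions, and the BasePlanTest / PlanMatchPattern helper names are written from memory and may need adjusting to the project's test utilities.

```java
import com.facebook.presto.Session;
import com.facebook.presto.spi.plan.DistinctLimitNode;
import com.facebook.presto.sql.planner.assertions.BasePlanTest;
import com.facebook.presto.sql.planner.plan.ExchangeNode;
import org.testng.annotations.Test;

import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.anyTree;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.node;
import static com.facebook.presto.sql.planner.assertions.PlanMatchPattern.tableScan;

public class TestShuffleBeforePartialDistinctLimit
        extends BasePlanTest
{
    @Test
    public void testLocalExchangeAddedBelowPartialDistinctLimit()
    {
        // Hypothetical session property name; use whatever the final PR exposes.
        Session session = Session.builder(getQueryRunner().getDefaultSession())
                .setSystemProperty("shuffle_before_partial_distinct_limit", "true")
                .build();

        // Expected shape (roughly): final DistinctLimit -> remote exchange ->
        // partial DistinctLimit -> local exchange -> table scan.
        assertDistributedPlan(
                "SELECT DISTINCT orderkey FROM orders LIMIT 10",
                session,
                anyTree(
                        node(DistinctLimitNode.class,      // partial distinct limit
                                node(ExchangeNode.class,   // newly added local exchange
                                        anyTree(tableScan("orders"))))));
    }
}
```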
I still don't understand. They both have the same number of remote exchanges. I do see how a round robin exchange might be helpful to improve the effectiveness of the partial distinct limit because it randomizes the rows. I think maybe the hash thing is something to do with how the DistinctLimitOperator itself works with/without the precomputed hash?
Basically, the thinking is that it's not very useful to shuffle a small amount of data (LIMIT is generally a small number), so we can just let it go to the single node which will do the final distinct anyway.
But both examples go to a single node. The hash generation optimizer runs after all exchanges are added, so it doesn't change anything about the data distribution. It just means a precomputed hash gets passed around instead of the distinct limit operator having to calculate the hash for the distinct itself. Also, just want to confirm that the hash generation thing is unrelated to the added round robin exchange.
Yeah, we already skip hash generation for the case of distinct on a single bigint, which is what made me realize that works a lot better. But the issue with that is if you have a lot of data, all of the <= N distinct values will show up, but it will keep going forever doing the distinct on the single node for millions of rows, which is not good. The round robin seems to magically solve the issue. Hope someone here has a better explanation as to why.
The operator does not use the precomputed hash at all! It uses something like a group-by for doing distinct, keeping at most N elements. The precomputed hash seems to result in a remote shuffle after the partial DistinctLimit. So the issue is that if N > the max distinct count, it will never finish, as it will keep looking for more and shuffling tons of data. I have seen situations where someone is scanning 3 trillion rows but has only 20 distinct values; due to LIMIT 10k, it keeps going for an hour and times out because it will never find 10k values.
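To make the "group-by keeping at most N elements" behavior concrete, here is a minimal illustrative sketch (not Presto's actual DistinctLimitOperator; the class and method names are made up). It shows why a partial distinct limit's output is bounded by min(limit, number of distinct keys it actually sees): duplicates are dropped locally.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only, not Presto's DistinctLimitOperator.
// A partial distinct limit keeps at most `limit` distinct keys and forwards
// only the first occurrence of each key, so its output is bounded by
// min(limit, distinct keys seen by this driver).
class PartialDistinctLimitSketch
{
    private final long limit;
    private final Set<Object> seen = new HashSet<>();

    PartialDistinctLimitSketch(long limit)
    {
        this.limit = limit;
    }

    // returns true if the row should be forwarded downstream
    boolean process(Object distinctKey)
    {
        if (seen.size() >= limit) {
            return false; // already produced `limit` distinct values
        }
        return seen.add(distinctKey); // forward only unseen keys
    }
}
```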
Force-pushed from 72dc7d0 to 91098ee
Talked to Rebecca offline and based on real queries here, we see the RR shuffle is reducing the intermediate data dramatically. So I will make the PR just doing that part (and not the hash computation part)
Force-pushed from 91098ee to 163aa72
Force-pushed from 163aa72 to ce209f4
@@ -230,6 +230,7 @@
     public static final String KEY_BASED_SAMPLING_FUNCTION = "key_based_sampling_function";
     public static final String HASH_BASED_DISTINCT_LIMIT_ENABLED = "hash_based_distinct_limit_enabled";
     public static final String HASH_BASED_DISTINCT_LIMIT_THRESHOLD = "hash_based_distinct_limit_threshold";
+    public static final String ROUND_ROBIN_SHUFFLE_BEFORE_PARTIAL_AGG = "round_robin_shuffle_before_partial_agg";
I would call this shuffle_before_partial_distinct_limit, since it's specific to distinct limit, not all partial aggs.
I was thinking this could be useful for others as well, so we can use it whenever applicable, like Distinct (and maybe other group-bys as well).
But on second thoughts, changed it.
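(For context, the SystemSessionProperties accessor called in the earlier diffs typically follows the standard pattern below. This is a hedged sketch with the pre-rename name, not necessarily the merged code.)

```java
// Hedged sketch of the session-property accessor referenced in the diffs above;
// the name changes once the property is renamed to shuffle_before_partial_distinct_limit.
public static boolean isRoundRobinShuffleBeforePartialAgg(Session session)
{
    return session.getSystemProperty(ROUND_ROBIN_SHUFFLE_BEFORE_PARTIAL_AGG, Boolean.class);
}
```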
            false),
            booleanProperty(
                    ROUND_ROBIN_SHUFFLE_BEFORE_PARTIAL_AGG,
                    "Add a local roundrobin shuffle before partial agg",
agg -> distinct limit
see previous
Done
description still needs to be updated
@kaikalur this still says partial agg instead of partial distinct limit.
        return roundRobinShuffleBeforePartialAgg;
    }

    @Config("round-robin-shuffle-before-partial-agg")
Optimizer properties should have the "optimizer." prefix. So this plus the name change would be optimizer.shuffle-before-partial-distinct-limit
Done
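For reference, the FeaturesConfig side of this after the rename could look roughly like the following. This is a hedged sketch: the field and method names are assumptions, not necessarily the merged code.

```java
import com.facebook.airlift.configuration.Config;

public class FeaturesConfig
{
    private boolean shuffleBeforePartialDistinctLimit;

    public boolean isShuffleBeforePartialDistinctLimit()
    {
        return shuffleBeforePartialDistinctLimit;
    }

    // Config key as suggested in the review above; the setter returns `this`
    // per the usual airlift configuration convention.
    @Config("optimizer.shuffle-before-partial-distinct-limit")
    public FeaturesConfig setShuffleBeforePartialDistinctLimit(boolean shuffleBeforePartialDistinctLimit)
    {
        this.shuffleBeforePartialDistinctLimit = shuffleBeforePartialDistinctLimit;
        return this;
    }
}
```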
@@ -112,7 +112,7 @@ public PlanNode optimize(PlanNode plan, Session session, TypeProvider types, Pla
         requireNonNull(variableAllocator, "variableAllocator is null");
         requireNonNull(idAllocator, "idAllocator is null");
         if (SystemSessionProperties.isOptimizeHashGenerationEnabled(session)) {
-            PlanWithProperties result = plan.accept(new Rewriter(idAllocator, variableAllocator, functionAndTypeManager), new HashComputationSet());
+            PlanWithProperties result = plan.accept(new Rewriter(session, idAllocator, variableAllocator, functionAndTypeManager), new HashComputationSet());
revert the changes to HashGenerationOptimizer
Done
Force-pushed from ce209f4 to 3161b87
Addressed all comments
@@ -338,6 +339,7 @@ public void testExplicitPropertyMappings()
             .put("execution-policy.max-stage-count-for-eager-scheduling", "123")
             .put("hyperloglog-standard-error-warning-threshold", "0.02")
             .put("optimizer.prefer-merge-join", "true")
+            .put("round-robin-shuffle-before-partial-agg", "true")
needs to use the updated name
Force-pushed from 3161b87 to 391d07e
Does it make sense to only apply this optimization on top of
But it won't hurt, right? A local RR shuffle looks really cheap (almost free)?
For intermediate stages that are already partitioned on the distinct keys, this won’t help and will just add extra work. For intermediate stages that aren’t already partitioned appropriately, there’s a chance that a local partitioned exchange would be better (in some cases, maybe not others depending on the overall plan shape).
More interestingly, I'm wondering if this will help other kinds of aggregations as well to reduce the intermediate data size.
Yeah, that's what I'm saying: when it helps, it helps greatly, but there's not much overhead otherwise.
It could, yeah. There's ultimately a trade-off though, because if the subsequent partial aggregation / distinct operation is expensive enough then you might end up harming the input processing throughput enough to actually regress performance. Actually, now that I think about it: forcing a local exchange of any kind on top of a source stage might cause problems in relation to the
Well, that's why it's an optional optimization :) We are planning to run it on some of our interactive workloads where it is common for users to look for distinct values, with a small number (several dozen) of distincts but a much larger limit (like 10k) in XL-sized datasets. There this helped quite a bit.
Addressed all the comments
@pettyjamesm Can you explain this more? I'm not so familiar with how split scheduling works nor how a local exchange changes things. How is this different from what happens without the local exchange, or similarly if there were an expensive function as part of the processing? Also, since there is also a targetConcurrency, there is still an upper limit on how many tasks will get scheduled.
For input stages, the number of target "leaf" splits that should be running at any time is controlled by

For every leaf split that completes, it will also either increment or decrement the target concurrency depending on whether the current utilization is greater or less than 0.5. Greater than 0.5: decrement the target value; less than 0.5: increment it. The "utilization" value that it receives to drive this decision is the "percent full" of the task-wide output buffer compared to the max output buffer size.

Basically, there's a hidden assumption that the logic is making: that so long as the output buffer is < 50% full, it's safe to increase the concurrency of leaf split processing. This approximation is OK if you assume that leaf split pipelines are always essentially linear paths straight to the output buffer that can be arbitrarily thread-parallelized to increase throughput.

If you break a source stage into two pipelines, where the input pipeline feeds a local exchange sink instead of an output buffer and some other pipeline reading from a local exchange source is what sends data to the output buffer, the concurrency controller will make bad decisions about when to adjust the target concurrency for leaf splits for two reasons:
The risk is that you'll have a "slow" leg on the pipeline after the local exchange that bottlenecks data to the output buffer, and the number of leaf splits that get started will just climb and climb. Even in the case of

This should be fixable if the split concurrency controller received a utilization value that somehow estimated the percentage of time that the input pipeline's local exchange sink (and/or associated source) was blocked, instead of just looking at the output buffer utilization. But without extra logic to handle that, creating plans that contain leaf split pipelines terminating in a local exchange (of any kind) instead of an output buffer is dangerous.
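To make the heuristic described above concrete, here is a minimal sketch of the increment/decrement logic. It is illustrative only: this is not Presto's actual split concurrency controller, and the class and method names are made up.

```java
// Illustrative sketch of the utilization-driven adjustment described above:
// after each leaf split finishes, compare the task output buffer utilization
// against 0.5 and nudge the target number of concurrently running leaf splits.
class LeafSplitConcurrencySketch
{
    private int targetConcurrency;

    LeafSplitConcurrencySketch(int initialConcurrency)
    {
        this.targetConcurrency = initialConcurrency;
    }

    // utilization = output buffer bytes used / max output buffer size, in [0, 1].
    // Note: if the leaf pipeline ends in a local exchange sink rather than the
    // output buffer, this signal no longer reflects the leaf pipeline's backpressure.
    void onLeafSplitFinished(double outputBufferUtilization)
    {
        if (outputBufferUtilization < 0.5) {
            // buffer mostly empty: assume more leaf parallelism increases throughput
            targetConcurrency++;
        }
        else {
            // buffer filling up: back off
            targetConcurrency = Math.max(1, targetConcurrency - 1);
        }
    }

    int getTargetConcurrency()
    {
        return targetConcurrency;
    }
}
```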
Another problem: it’s possible that pages with not-yet-loaded LazyBlocks still present could transit through the local exchange, which would be a thread safety issue. You would need to force those to be loaded before passing the pages into the local exchange sink.
I have seen this (local RR exchange) in the plans for INNER JOIN as well, so I thought this should be safe. If not, we should add a check to the plan validator to disallow it.
Broadcast inner joins would have a leaf pipeline and a pipeline for broadcast input, but those would be linked to each other via the join operators, and the leaf split pipeline would still process data directly through to the task output buffer. I’m not aware of any plan shapes that would terminate a leaf pipeline with a local exchange instead of an output operator, but I might have missed some relevant changes and maybe my understanding is out of date. Can you provide an example of a query / plan that already does that?
One query shape I have seen is a partitioned join of: (A UNION ALL B) JOIN C. In that case we read the two remote sources for A and B, then add a local RR exchange (ostensibly to mitigate skew?), and then join. I will try and craft a TPC-H query that shows it.

Plan:
Thanks for the detailed explanation! To make sure I'm understanding correctly: is the problem that leaf stages have a variable task concurrency, but the operators after the local exchange will have a fixed concurrency, so you'll keep adding more splits to the leaf stage without increasing the concurrency of the post-exchange operations?

I wonder how that compares to other existing situations. It sounds like the algorithm basically assumes that the leaf processing is always going to be cheap, and the bottleneck will be in the output buffer. There are already plenty of cases where the leaf stage could be expensive and not produce a lot of output because it is busy (expensive filter/partial aggregation, extreme regex, even a hardware issue that just makes processing super slow). Wouldn't those cases also keep getting splits piled on beyond what they can handle?
@pettyjamesm Here is an example where a leaf stage has a local exchange:
Yep, that's the essence of where the potential problems could begin.
This is true, but the potential problems are different here. If you have an expensive filter or regex operation, you can still fundamentally hope to increase throughput by increasing the parallelism factor, because splits are processed independently of one another. Bottlenecks in one driver don't directly create pressure on another, except in competing for scheduled time on a CPU (so long as they aren't competing for output buffer space, which is what

What can happen here is that post-exchange drivers can create a bottleneck where increasing the concurrency of the leaf splits will just create more competition for the exchange source / sink, but splits will presumably be finishing. As @kaikalur pointed out though, apparently these plans already do occur, which was news to me. I think in the example given with

Fundamentally, I'm not opposed to this change, so don't consider this a blocking review, but I think it's risky and wanted to call that out.
Yes, I have it disabled by default. My plan is to shadow real workloads with this feature enabled and see how it does, and enable it for some specific use cases.
Thanks @pettyjamesm for the super clear explanation. That makes a lot of sense.
Force-pushed from 391d07e to 3ecbf9e
Done
Local round robin makes the result of the partial agg much smaller than before, resulting in quicker results. For some of our internal tests, we have seen it go down from 81M to 48K for one example, with similarly dramatic improvements in other cases as well.
Test plan - Added a new test