Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413 #16416

Merged
merged 3 commits into from
Jun 13, 2022

Conversation

simmend
Copy link
Contributor

@simmend simmend commented Jul 13, 2021

This feature adds an initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413. In particular, it adds support for constraint optimizations involving primary and unique key constraints. This support manifests as a general-purpose optimization framework for associating logical properties with the result table produced by a plan node that optimization rules then exploit to generate more efficient plans. These logical properties initially derive from unique key constraints defined on database tables. They are further augmented and refined by the grouping, limiting, predicate application, and other operations performed by plan nodes. The feature adds several new optimization rules that exploit logical properties to discover and remove redundant query operations. Future work will extend this framework with constraint optimizations involving logical properties derived from referential integrity constraints, functional dependencies, order dependencies, and other types of constraints. This work exploits Hive 3.1.2 catalog capabilities that allow for the definition of informational constraints on Hive tables. However, this feature can extend to any data source that provides enforced or informational table constraints. Please see the design and test strategy documents attached issue #16413 for additional details.

Please find the detailed design document here:
https://docs.google.com/document/d/1h9C7dJck2PFPtvhksUCB74082zIn9sAZG1jNlQ7kqaU/edit

Test plan - (Please fill in how you tested your changes)
Please see the test plan here: https://docs.google.com/document/d/19SpdkE6z4Q_hT6BIox9zR5DvBbwUW_RlSZty862emn4/edit

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* Add support for constraint optimizations in Hive MetaStore. This feature can be turned on by setting the session property "exploit_constraints" and config property "optimizer.exploit-constraints" to true.

@yingsu00 yingsu00 changed the title This feature adds an initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413 WIP - initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413 Jul 13, 2021
@simmend simmend force-pushed the constraintOptimization branch from a0a6c8f to 299c4ae Compare July 14, 2021 01:43
@simmend simmend requested review from kaikalur and rongrong July 14, 2021 02:50
@simmend simmend marked this pull request as ready for review July 14, 2021 02:50
@simmend simmend requested a review from rschlussel July 14, 2021 02:53
@simmend simmend changed the title WIP - initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413 Initial phase of support for constraint optimization to the PrestoDB optimizer in response to issue #16413 Jul 14, 2021
@yingsu00 yingsu00 linked an issue Jul 14, 2021 that may be closed by this pull request
@yingsu00 yingsu00 self-requested a review July 15, 2021 01:53
@yingsu00
Copy link
Contributor

@simmend Just browsing the PR and noticed two small things before reading other parts:

  1. It seems the documentation change is missing. This feature introduces new language constructs and users need to know about it from Presto documentation. For example, you may want to add an example how to create table with primary key in https://prestodb.io/docs/current/sql/create-table.html

  2. Release notes section also needs to be updated. The guidelines can be found https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines

@nmahadevuni
Copy link
Member

@simmend Just browsing the PR and noticed two small things before reading other parts:

  1. It seems the documentation change is missing. This feature introduces new language constructs and users need to know about it from Presto documentation. For example, you may want to add an example how to create table with primary key in https://prestodb.io/docs/current/sql/create-table.html
  2. Release notes section also needs to be updated. The guidelines can be found https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines

@yingsu00 : This change does't introduce any syntax changes to create table. It uses the already existing constraints defined for the Hive tables from Hive client interfaces.

@rschlussel
Copy link
Contributor

This change touches 92 files. I know it's a bit annoying to change after the fact, but it would make it much easier to review if you split it into multiple commits (can all still be in this PR). Especially splitting out parts that are refactoring/threading some additional properties through vs. core logic. If there are logically distinct pieces of the core changes, that can be helpful to split out too.

Haven't gone through the design doc yet.

@simmend simmend force-pushed the constraintOptimization branch 2 times, most recently from ddffe44 to 738e2f5 Compare July 26, 2021 22:28
@simmend simmend force-pushed the constraintOptimization branch from 738e2f5 to 4e0ea30 Compare August 13, 2021 20:00
Copy link
Contributor

@kaikalur kaikalur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial high level comments

@simmend simmend force-pushed the constraintOptimization branch from 4e0ea30 to 4b09d29 Compare October 6, 2021 16:44
@rubenssoto
Copy link

@simmend Gently reminder, this is a nice feature!

@simmend
Copy link
Contributor Author

simmend commented Nov 22, 2021

@simmend Just browsing the PR and noticed two small things before reading other parts:

  1. It seems the documentation change is missing. This feature introduces new language constructs and users need to know about it from Presto documentation. For example, you may want to add an example how to create table with primary key in https://prestodb.io/docs/current/sql/create-table.html
  2. Release notes section also needs to be updated. The guidelines can be found https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines

Tables with constraints must be defined directly via Hive 3.0. Please see section 2 of the design doc. That is, this feature does not introduce CREATE/ALTER DDL changes to Presto. Those changes will require thinking about how to abstract constraint definitions across various federated sources and are deferred to future work. Please see section 5.4 of the design document. Will look into the release notes guidelines. Thank you for pointing that out.

@simmend
Copy link
Contributor Author

simmend commented Nov 22, 2021

@simmend Gently reminder, this is a nice feature!

Yes it is. I cannot wait until it merges. There are some powerful extensions and optimization that can build off of this work.

Copy link
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments: Please read the https://chris.beams.io/posts/git-commit/ for commit title and commit message guidelines.

@simmend simmend force-pushed the constraintOptimization branch 5 times, most recently from 6e3311b to 7443301 Compare December 8, 2021 23:22
@simmend simmend force-pushed the constraintOptimization branch 10 times, most recently from 9a1112b to 2e613d0 Compare May 20, 2022 16:28
simmend added 2 commits June 6, 2022 17:34
Retrieved table constraints are associated with TableConnectorMetadata
and subsequently made available to the optimizer via a TableScanNode
argument. Subsequent commits will take advantage of these constraints
by mapping them into logical properties that can be exploited by
optimization rules. Note that if the session variable
exploit_constraints=false (the default now), no attempt is even made
made to read constraints from HMS.
Logical properties are initially derived from constraints defined for
base tables and from properties of values nodes.These logical
properties hold for the result table produced by a plan node. These
base logical properties are then propagated through various query
operations including filters, projects, joins, and aggregations.
Logical properties are only computed by iterative planners that pass
a logical property provider as input. See the design doc linked from
issue 16413 for futher details. Such optimizers will be introduced by
next commit; however, there are test cases in this commit that trigger
logical property propgation. Note that if the session variable
exploit_constraints=false (the default now) no attempt is made to
compute logical properties and hence optimization rules that seek them
out will simply fail to fire.
@simmend simmend force-pushed the constraintOptimization branch 2 times, most recently from d5b85bf to 081ab07 Compare June 6, 2022 23:33
Implements iterative optimizers that look to exploit logical properties
propagated as per the previous commit. Note that if the session
variable exploit_constraints=false (the default now) no attempt is made
to compute logical properties and the optimization rules commited here
will not fire.
@simmend simmend force-pushed the constraintOptimization branch from 081ab07 to 7bce0d4 Compare June 7, 2022 12:11
Copy link
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 2nd commit.

return otherEquivalenceClassProperty.equivalenceClasses.entrySet()
.stream()
.allMatch(e -> {
final Set<RowExpression> otherEqClass = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: final keyword is not necessary. Also, why not using immutable set here as well?

extractConjuncts(predicate).stream()
.filter(CallExpression.class::isInstance)
.map(CallExpression.class::cast)
.filter(e -> isVariableEqualVariableOrConstant(e))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: filter(EquivalenceClassProperty::isVariableEqualVariableOrConstant)

if (head1 instanceof ConstantExpression) {
return head1;
}
else if (head2 instanceof ConstantExpression) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: else is not needed.

{
requireNonNull(equivalenceClassProperty, "Equivalence class property must be provided.");
Set<VariableReferenceExpression> unBoundVariables = new HashSet<>();
variables.stream().forEach(v -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: variables.forEach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it's not clear to me why you use forEach at some places but for(.. : ..) at others. I don't have strong opinions. Generally speaking, if the logic is simple, use forEeach, otherwise use for loop. So here you might want to use for loop.

*/
private void addNonRedundantKey(Key newKey)
{
requireNonNull(newKey, "newKey is null");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this null check is unnecessary. it's a private method.

Comment on lines +310 to +311
private KeyProperty keyProperty = new KeyProperty();
private EquivalenceClassProperty equivalenceClassProperty;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equivalenceClassProperty is updated in class methods and cannot be final.

* This logical properties builder should be used by PlanNode's that propagate their
* source properties and add a limit. For example, TopNNode and LimitNode.
*/
public static class PropagateAndLimitBuilder
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these Builder classes are actual builders. They don't need to exist. Just have static creator methods like public static LogicalPropertiesImpl propagateAndLimitProperties(...), public static tableScanProperties(...), etc.

@Override
public LogicalProperties computeLogicalProperties(LogicalPropertiesProvider logicalPropertiesProvider)
{
requireNonNull(logicalPropertiesProvider, "logicalPropertiesProvider cannot be null.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: we typically don't check nulls in methods.

node.getSource(),
node.getAggregations().entrySet().stream().collect(Collectors.toMap(e -> e.getKey(), e ->
(e.getValue().isDistinct() &&
((GroupReference) node.getSource()).getLogicalProperties().get().isDistinct(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have to run this same logic again in apply, might as well use a simpler pattern, and check for match once here.

@yingsu00
Copy link
Contributor

@rongrong Thank you for reviewing again. I'll address your comments in a later PR. Issue created here:#17869
@kaikalur Sreeni, thank you for reviewing. Did Dave's explanation make sense to you? If yes would you please update your review?

@rongrong rongrong requested a review from kaikalur June 13, 2022 22:25
@rongrong rongrong merged commit 5c3ac4c into prestodb:master Jun 13, 2022
@highker highker mentioned this pull request Jul 6, 2022
7 tasks

//already in same equivalence class, nothing to do
//note that we do not check head1.equal(head2) so that two different variable reference objects
//referencing the same reference are both added to the equivalence class
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simmend can you explain why we want this to be true? (and it seems like when we use it, it's always as part of a set, or inside a map, so the distinction isn't preserved).

@yingsu00
Copy link
Contributor

Pasting for @rschlussel
"I think the code also should be refactored to change all the property classes to be immutable. There isn’t really a good reason for them to be constructed and then updated. Most actually start with an empty property and then update once to some source properties. There are a few that combine property classes, but it would be better to have a way to combine that constructs a new combined instance rather than the mutable objects we currently have. They never need to be mutated after they are fully constructed by the logical properties builders, so we should make them immutable."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Constraint Support and Optimizations
7 participants