Add Heuristic CTE Materialization Strategy #21720

Merged
merged 1 commit into from
Mar 6, 2024

@jaystarshot jaystarshot commented Jan 17, 2024

  1. This PR adds support for heuristic CTE materialization.
    We have rolled this out in one of our production clusters and confirmed that it is stable.
    Our testing so far has shown up to 50% CPU gain and a reduction of more than 50% in HDFS reads for affected queries.
    It is a heuristic, so there may be cases where non-ideal plans are generated.

  2. Changed the cteName member variable to cteId for CTE plan nodes.

  3. Also fixed CteInformation to include cteId so that the correct references are incremented, which is crucial for CTE materialization.

In the heuristic strategy, the relational planner adds cteReferences everywhere and the LogicalCteOptimizer decides which references to implement.
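
To illustrate why the reference counts are keyed by cteId, here is a minimal Java sketch of the counting step. It is not the Presto implementation: `CteReferenceCounter`, its methods, and the `threshold` parameter are hypothetical names used only for illustration, and the `>=` comparison is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not a Presto class: counts CTE references by id and
// picks the ids whose reference count reaches a threshold.
final class CteReferenceCounter
{
    // Keyed by CTE id rather than CTE name, so two CTEs that share a name in
    // different scopes are counted separately.
    private final Map<String, Integer> referencesById = new LinkedHashMap<>();

    // Called once for every place the planner emits a reference to the CTE.
    void recordReference(String cteId)
    {
        referencesById.merge(cteId, 1, Integer::sum);
    }

    int referenceCount(String cteId)
    {
        return referencesById.getOrDefault(cteId, 0);
    }

    // Assumption: a CTE qualifies when it is referenced at least `threshold` times.
    List<String> idsToMaterialize(int threshold)
    {
        return referencesById.entrySet().stream()
                .filter(entry -> entry.getValue() >= threshold)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        CteReferenceCounter counter = new CteReferenceCounter();
        counter.recordReference("cte_1");
        counter.recordReference("cte_1");
        counter.recordReference("cte_2");
        // Only cte_1 is referenced at least twice.
        System.out.println(counter.idsToMaterialize(2));
    }
}
```

If the counts were keyed by name instead, references to two different CTEs with the same name would be summed together, and a CTE could cross the threshold without actually being reused.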

Fixes #21637

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a heuristic CTE materialization strategy that automatically materializes expensive CTEs. It is enabled by setting ``cte_materialization_strategy`` to ``HEURISTIC`` or ``HEURISTIC_COMPLEX_QUERIES_ONLY`` (default ``NONE``).

@jaystarshot jaystarshot marked this pull request as ready for review January 25, 2024 06:00
@jaystarshot jaystarshot requested a review from a team as a code owner January 25, 2024 06:00
@jaystarshot jaystarshot force-pushed the 21637 branch 9 times, most recently from cfa8b25 to b3d82f7 Compare January 31, 2024 06:34
@jaystarshot
Member Author

@feilong-liu @mlyublena Please review when you get time, thanks!

@@ -357,7 +357,9 @@ public boolean isAdoptingMergedPreference()
public enum CteMaterializationStrategy
{
ALL, // Materialize all CTES
-NONE // Materialize no ctes
+NONE, // Materialize no CTES
+HEURISTIC,
Contributor

can you add comments for the new values

-NONE // Materialize no ctes
+NONE, // Materialize no CTES
+HEURISTIC,
+HEURISTIC_STRICT
Contributor

I wonder if there is a better name than strict: perhaps HEURISTIC_COMPLEX_QUERIES_ONLY? we already have "complex query" reference in another session property (RESTRICT_HISTORY_BASED_OPTIMIZATION_TO_COMPLEX_QUERY)

HashMap<String, CTEInformation> cteInformationMap = session.getCteInformationCollector().getCteInformationMap();
CTEInformation cteInfo = cteInformationMap.get(node.getCteId());
switch (getCteMaterializationStrategy(session)) {
case HEURISTIC_STRICT:
Contributor

do we want to also disable materialization of VALUES nodes? so only proceed if the query is a proper join/aggregation on top of table scans

@mlyublena
Contributor

LGTM modulo some nits

@jaystarshot jaystarshot force-pushed the 21637 branch 4 times, most recently from 93ee4fb to 693ad79 Compare February 2, 2024 04:38
@jaystarshot
Member Author

@mlyublena Thanks for your review. I have:

  1. Updated the PR to include the VALUES check.
  2. Added tests asserting that no materialization happens in the relevant heuristic cases.
  3. Backported a small fix from our repo for the case where nothing is materialized under the heuristic strategy (CTE references were not removed).
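
As a rough illustration of the VALUES check, the following Java sketch treats a CTE as complex only when its plan contains a join, semi-join, or aggregation and reads from at least one table scan. The plan model (`Kind`, `Node`) is hypothetical and much simpler than Presto's `PlanNode` hierarchy. The set names are borrowed from the naming suggestion in this review, and the contents of the data source set are an assumption.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Presto classes: a toy plan tree and a
// complexity check that ignores CTEs built only on VALUES.
final class CteComplexityCheck
{
    enum Kind { TABLE_SCAN, VALUES, FILTER, PROJECT, JOIN, SEMI_JOIN, AGGREGATION }

    static final class Node
    {
        final Kind kind;
        final List<Node> sources;

        Node(Kind kind, Node... sources)
        {
            this.kind = kind;
            this.sources = Arrays.asList(sources);
        }
    }

    // Node kinds that make a CTE worth precomputing (mirrors the join, semi-join,
    // and aggregation classes listed in this PR).
    private static final Set<Kind> COMPLEX_PLAN_NODES =
            EnumSet.of(Kind.JOIN, Kind.SEMI_JOIN, Kind.AGGREGATION);

    // Assumption: only table scans count as a "real" data source; VALUES does not.
    private static final Set<Kind> COMPLEX_DATA_SOURCE_NODES = EnumSet.of(Kind.TABLE_SCAN);

    // Depth-first search for any node whose kind is in the given set.
    static boolean containsAny(Node node, Set<Kind> kinds)
    {
        if (kinds.contains(node.kind)) {
            return true;
        }
        for (Node source : node.sources) {
            if (containsAny(source, kinds)) {
                return true;
            }
        }
        return false;
    }

    static boolean isComplex(Node cteRoot)
    {
        return containsAny(cteRoot, COMPLEX_PLAN_NODES)
                && containsAny(cteRoot, COMPLEX_DATA_SOURCE_NODES);
    }

    public static void main(String[] args)
    {
        Node aggregationOverValues = new Node(Kind.AGGREGATION, new Node(Kind.VALUES));
        Node joinOverScans = new Node(Kind.JOIN, new Node(Kind.TABLE_SCAN), new Node(Kind.TABLE_SCAN));
        System.out.println(isComplex(aggregationOverValues));
        System.out.println(isComplex(joinOverScans));
    }
}
```

In this sketch an aggregation over VALUES is not considered complex, while a join over two table scans is.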

@jaystarshot jaystarshot requested a review from mlyublena February 2, 2024 04:40
@jaystarshot jaystarshot force-pushed the 21637 branch 2 times, most recently from 7b1cc96 to db3edae Compare February 6, 2024 21:41
@jaystarshot
Member Author

Rebased

@@ -1082,6 +1083,11 @@ public SystemSessionProperties(
"Enable pushing of filters and projections inside common table expressions.",
featuresConfig.getCteFilterAndProjectionPushdownEnabled(),
false),
+integerProperty(
+CTE_HEURISTIC_REPLICATION_THRESHOLD,
+"Used with CTE Materialization Strategy = Heuristic. CTES are only materialized if they are used greater than or equal to this number",
Contributor

nit: perhaps change comment to "if they are used more than this number of times"

public CteConsumerTransformer(PlanNodeIdAllocator idAllocator, VariableAllocator variableAllocator)
private final Session session;

private static final List<Class<? extends PlanNode>> PRECOMPUTE_PLAN_NODES = ImmutableList.of(JoinNode.class, SemiJoinNode.class, AggregationNode.class);
Contributor

should we call these two COMPLEX_PLAN_NODES and COMPLEX_DATA_SOURCE_NODES or something like that to better correlate with the session property name?

mlyublena previously approved these changes Feb 6, 2024
pranjalssh previously approved these changes Feb 14, 2024
@pranjalssh
Contributor

Please make sure failing tests are not related

@jaystarshot jaystarshot dismissed stale reviews from pranjalssh and mlyublena via f91f0b9 March 4, 2024 20:35
@jaystarshot jaystarshot force-pushed the 21637 branch 2 times, most recently from f91f0b9 to df68261 Compare March 4, 2024 20:36
@jaystarshot
Member Author

jaystarshot commented Mar 4, 2024

@mlyublena @pranjalssh Thank you for the review. However, a test case added here showed that the outer CTE would be materialized even when it was not complex, as long as the inner CTE was complex,
so I changed the flow a bit to handle this. With this change, I think the CBO replacement would also be easier.

  1. Now, we first determine whether a CTE is complex and then perform the rewrite.
  2. Also cleaned up some session property handling, along with some renaming in the classes for clarity.

Please re-review!
cc: @tdcmeehan
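
The two-step flow described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code in this PR: the plan model and all names (`TwoPhaseCteSelection`, `Kind`, `Node`, `selectCtesToMaterialize`) are hypothetical, and treating a nested CTE reference as a leaf is an assumed way of keeping an inner CTE's complexity from leaking into the outer one.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the Presto classes: decide complexity per CTE
// first, then choose what to materialize.
final class TwoPhaseCteSelection
{
    enum Kind { TABLE_SCAN, FILTER, JOIN, AGGREGATION, CTE_REFERENCE }

    static final class Node
    {
        final Kind kind;
        final List<Node> sources;

        Node(Kind kind, Node... sources)
        {
            this.kind = kind;
            this.sources = Arrays.asList(sources);
        }
    }

    // A CTE_REFERENCE node has no sources in this model, so the walk never
    // enters the body of another CTE.
    static boolean hasJoinOrAggregation(Node node)
    {
        if (node.kind == Kind.JOIN || node.kind == Kind.AGGREGATION) {
            return true;
        }
        for (Node source : node.sources) {
            if (hasJoinOrAggregation(source)) {
                return true;
            }
        }
        return false;
    }

    static Set<String> selectCtesToMaterialize(Map<String, Node> cteBodies)
    {
        // Phase 1: determine complexity of every CTE from its own body only.
        Map<String, Boolean> complexById = new LinkedHashMap<>();
        cteBodies.forEach((id, body) -> complexById.put(id, hasJoinOrAggregation(body)));

        // Phase 2: the rewrite step would act only on the CTEs marked complex.
        Set<String> selected = new LinkedHashSet<>();
        complexById.forEach((id, complex) -> {
            if (complex) {
                selected.add(id);
            }
        });
        return selected;
    }

    public static void main(String[] args)
    {
        Map<String, Node> bodies = new LinkedHashMap<>();
        // The inner CTE aggregates a table; the outer CTE only filters the inner one.
        bodies.put("inner", new Node(Kind.AGGREGATION, new Node(Kind.TABLE_SCAN)));
        bodies.put("outer", new Node(Kind.FILTER, new Node(Kind.CTE_REFERENCE)));
        System.out.println(selectCtesToMaterialize(bodies));
    }
}
```

Here only the inner CTE is selected, because the outer CTE's own body contains no join or aggregation.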

pranjalssh previously approved these changes Mar 6, 2024
@@ -364,7 +364,9 @@ public boolean isAdoptingMergedPreference()
public enum CteMaterializationStrategy
{
ALL, // Materialize all CTES
-NONE // Materialize no ctes
+NONE, // Materialize no CTES
+HEURISTIC, // Materialze CTES occuring > CTE_HEURISTIC_REPLICATION_THRESHOLD
Contributor

Suggested change
HEURISTIC, // Materialze CTES occuring > CTE_HEURISTIC_REPLICATION_THRESHOLD
HEURISTIC, // Materialize CTES occuring > CTE_HEURISTIC_REPLICATION_THRESHOLD

Summary:
Add 2 materialization strategies: Heuristic and Heuristic_Strict.
In the Heuristic strategy, CTEs are materialized only if they occur more than x times.
In the Heuristic_Strict strategy, the CTE needs to have a join or aggregate.

@jaystarshot
Member Author

@pranjalssh Need a review again due to the comment change

@jaystarshot jaystarshot merged commit 68c507d into prestodb:master Mar 6, 2024
56 checks passed
@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks