Collect HBO stats from complete stages in failed queries #20947

Merged on Mar 13, 2024 (1 commit)

@feilong-liu feilong-liu commented Sep 23, 2023

Description

Record the operator stats if the stage completes, even if the query failed.

Motivation and Context

Currently, HBO only records stats when the query succeeds. However, even when a query fails, some of its stages may still complete successfully, and we can store the statistics for the operators in those stages.

Impact

More stats are available for HBO. HBO can then also be applied to queries that previously failed, potentially allowing them to succeed.

Test Plan

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add a session property `track_history_based_plan_statistics_from_complete_stages_in_failed_query` to enable tracking HBO statistics from complete stages in failed queries.

@feilong-liu feilong-liu requested a review from a team as a code owner September 23, 2023 00:28
@feilong-liu feilong-liu marked this pull request as draft September 23, 2023 00:30
@feilong-liu feilong-liu changed the title from "Collect HBO stats from complete stages in faile queries" to "Collect HBO stats from complete stages in failed queries" Oct 13, 2023
@feilong-liu feilong-liu force-pushed the record_failed_hbo branch 4 times, most recently from b267529 to be0fb46 Compare March 8, 2024 18:40
@feilong-liu feilong-liu marked this pull request as ready for review March 8, 2024 18:41
kaikalur previously approved these changes Mar 8, 2024
@rschlussel
Contributor

looks good, but the property name is pretty long. i wonder if there's a shorter name we could use that would still be clear.

@feilong-liu
Contributor Author

> looks good, but the property name is pretty long. i wonder if there's a shorter name we could use that would still be clear.

Renamed to track_complete_stages_stats_from_failed_query after consulting metamate lol

jaystarshot previously approved these changes Mar 8, 2024
@@ -1522,6 +1523,11 @@ public SystemSessionProperties(
"Track history based plan statistics service in query optimizer",
featuresConfig.isTrackHistoryBasedPlanStatistics(),
false),
booleanProperty(
TRACK_COMPLETE_STAGES_STATS_FROM_FAILED_QUERY,

Contributor

not sure if we need "COMPLETE_STAGES" as part of the property name - perhaps leave it out and only specify it in the documentation string
another idea would be to turn this property around and make it track_history_from_successful_queries_only

Contributor

> not sure if we need "COMPLETE_STAGES" as part of the property name - perhaps leave it out and only specify it in the documentation string
> another idea would be to turn this property around and make it track_history_from_successful_queries_only

Session params should default to true so that they reflect the preferred default behavior, for clarity. I prefer true as the default value for session params, so I say stick with the failed_queries version.

Contributor

I try to avoid words with negative meaning in flags because they make it hard to reason about enabling/disabling, but it makes sense to leave it as is in this case. We would probably only need to disable it if there is a problem with it.
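
To make the naming point above concrete, here is a minimal sketch of how a positively phrased boolean flag with a `true` default reads at the call site. Everything in it is hypothetical: the `BooleanPropertySketch` registry, the property name, and the default value are illustrative assumptions, not Presto's session property classes or the default chosen in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical mini-registry for illustration only; Presto's real session property classes differ.
final class BooleanPropertySketch
{
    // Hypothetical property name, phrased positively ("track ... from failed queries").
    static final String TRACK_FROM_FAILED = "track_stats_from_failed_queries";

    private final Map<String, Boolean> defaults = new HashMap<>();
    private final Map<String, Boolean> overrides = new HashMap<>();

    // Registers a property together with its default value.
    void register(String name, boolean defaultValue)
    {
        defaults.put(name, defaultValue);
    }

    // Overrides the value for the current "session"; unknown names are rejected.
    void set(String name, boolean value)
    {
        if (!defaults.containsKey(name)) {
            throw new IllegalArgumentException("Unknown property: " + name);
        }
        overrides.put(name, value);
    }

    // Returns the override if one was set, otherwise the registered default.
    boolean get(String name)
    {
        Boolean defaultValue = defaults.get(name);
        if (defaultValue == null) {
            throw new IllegalArgumentException("Unknown property: " + name);
        }
        return overrides.getOrDefault(name, defaultValue);
    }

    public static void main(String[] args)
    {
        BooleanPropertySketch properties = new BooleanPropertySketch();
        // Assumed default of true: the feature is on unless someone turns it off.
        properties.register(TRACK_FROM_FAILED, true);
        System.out.println(properties.get(TRACK_FROM_FAILED)); // true

        // Disabling is a plain "set to false", with no double negative to untangle.
        properties.set(TRACK_FROM_FAILED, false);
        System.out.println(properties.get(TRACK_FROM_FAILED)); // false
    }
}
```

With an inverted name such as `track_history_from_successful_queries_only`, turning the feature on would mean setting the flag to `false`, which is the kind of reasoning the comment above tries to avoid.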

pranjalssh previously approved these changes Mar 11, 2024
@pranjalssh pranjalssh left a comment

+1 for Lyublena's name suggestion. Accepting in advance

@feilong-liu feilong-liu dismissed stale reviews from pranjalssh and jaystarshot via 0db0c32 March 11, 2024 19:31

github-actions bot commented Mar 11, 2024

Codenotify: Notifying subscribers in CODENOTIFY files for diff 58d5767...8db9bad.

Notify          File(s)
@steveburnett   presto-docs/src/main/sphinx/admin/properties.rst
                presto-docs/src/main/sphinx/optimizer/history-based-optimization.rst

@feilong-liu feilong-liu requested a review from rschlussel March 11, 2024 21:25
rschlussel previously approved these changes Mar 12, 2024
@rschlussel rschlussel left a comment

looks good. just recommend updating the property name to say queries rather than query

@@ -812,6 +812,14 @@ Optimizer Properties
Enable analysis and propagation of logical properties (distinct keys, cardinality, etc.) among the nodes of
a query plan. The optimizer may then use these properties to perform various optimizations.

``optimizer.track-history-stats-from-failed-query``

Contributor

update documentation if you change the property name

Suggested change
- ``optimizer.track-history-stats-from-failed-query``
+ ``optimizer.track-history-stats-from-failed-queries``

@@ -124,7 +126,7 @@ public Map<PlanNodeWithHash, PlanStatisticsWithSourceInfo> getQueryStats(QueryIn
}

StageInfo outputStage = queryInfo.getOutputStage().get();
- List<StageInfo> allStages = outputStage.getAllStages();
+ List<StageInfo> allStages = trackStatsForFailedQueries ? outputStage.getAllStages().stream().filter(x -> x.isFinalStageInfo()).collect(toImmutableList()) : outputStage.getAllStages();

Contributor

Did you test that the final stage info is only present for stages that succeeded, and not for stages that finished with a failure? You may also need to add a check that the stage state is "FINISHED".

Contributor Author

Changed condition to check the stage state is FINISHED
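
For readers following this thread, the distinction between "final" and "FINISHED" can be sketched as below. This is a simplified illustration under stated assumptions: `StageState`, `Stage`, and both filter methods are hypothetical stand-ins, not Presto's `StageInfo` API or the code that was merged. The assumption is that several terminal states exist (finished, failed, canceled, aborted), so an "is final" check alone would also admit failed stages.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified stand-ins for illustration; these are NOT Presto's real classes.
final class FinishedStageFilter
{
    // Simplified stage states. FINISHED, FAILED, CANCELED and ABORTED are all terminal ("final").
    enum StageState
    {
        RUNNING(false), FINISHED(true), FAILED(true), CANCELED(true), ABORTED(true);

        private final boolean terminal;

        StageState(boolean terminal)
        {
            this.terminal = terminal;
        }

        boolean isTerminal()
        {
            return terminal;
        }
    }

    static final class Stage
    {
        final String id;
        final StageState state;

        Stage(String id, StageState state)
        {
            this.id = id;
            this.state = state;
        }
    }

    // Filter that only checks "is final": it also lets FAILED and ABORTED stages through.
    static List<String> terminalStageIds(List<Stage> stages)
    {
        return stages.stream()
                .filter(stage -> stage.state.isTerminal())
                .map(stage -> stage.id)
                .collect(Collectors.toList());
    }

    // Stricter filter: keeps only stages that finished successfully.
    static List<String> finishedStageIds(List<Stage> stages)
    {
        return stages.stream()
                .filter(stage -> stage.state == StageState.FINISHED)
                .map(stage -> stage.id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<Stage> stages = Arrays.asList(
                new Stage("join", StageState.FAILED),
                new Stage("build", StageState.FINISHED),
                new Stage("probe", StageState.ABORTED));
        System.out.println(terminalStageIds(stages)); // [join, build, probe]
        System.out.println(finishedStageIds(stages)); // [build]
    }
}
```

Under these assumptions, only the stricter filter guarantees that statistics come from stages that actually ran to completion.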

kaikalur commented Mar 12, 2024

Maybe you can add a complete e2e test: run a query with a join whose build-side stats are missing, have the query fail (say, in the join), and make sure the stats for the build side are still recorded? Also make it a test on an insert query so we get as close to a real-life test as possible.

mlyublena previously approved these changes Mar 12, 2024
@feilong-liu feilong-liu dismissed stale reviews from mlyublena and rschlussel via 8db9bad March 13, 2024 20:55
@Test
public void testFailedQuery()
{
String sql = "select o.orderkey, l.partkey, l.mapcol[o.orderkey] from (select orderkey, partkey, mapcol from (select *, map(array[1], array[2]) mapcol from lineitem)) l " +

Contributor Author

Join will fail, but the build input will succeed.

Contributor

maybe add this comment in the test
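
The scenario described in this thread can be illustrated with a small standalone simulation. It is only a sketch under assumptions: the stage names, the `run` helper, and the error message are hypothetical, and the code does not use Presto's test framework or reproduce the truncated query above. It assumes that subscripting a map with a missing key raises an error, which is why a lookup like `mapcol[orderkey]` over `map(array[1], array[2])` fails for any key other than 1, while the build input, which never evaluates the subscript, completes normally.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation for illustration; it does not use Presto's test framework.
final class FailedJoinScenario
{
    // Simulated build-side stage: it just scans its input and reports the row count.
    static long runBuildStage(List<Long> orderKeys)
    {
        return orderKeys.size();
    }

    // Simulated mapcol[key] subscript: a missing key is treated as an error (assumption).
    static long subscript(Map<Long, Long> mapcol, long key)
    {
        Long value = mapcol.get(key);
        if (value == null) {
            throw new IllegalStateException("Key not present in map: " + key);
        }
        return value;
    }

    // Runs both simulated stages and records the outcome of each one.
    static Map<String, String> run(List<Long> orderKeys)
    {
        Map<String, String> outcome = new LinkedHashMap<>();

        // The build input completes, so its row count is available as a statistic.
        long buildRows = runBuildStage(orderKeys);
        outcome.put("build", "FINISHED rows=" + buildRows);

        // Stand-in for map(array[1], array[2]): a single entry 1 -> 2.
        Map<Long, Long> mapcol = Collections.singletonMap(1L, 2L);
        try {
            for (long key : orderKeys) {
                subscript(mapcol, key);
            }
            outcome.put("join", "FINISHED");
        }
        catch (IllegalStateException e) {
            // The join-side expression failed, so the query as a whole fails.
            outcome.put("join", "FAILED");
        }
        return outcome;
    }

    public static void main(String[] args)
    {
        System.out.println(run(Arrays.asList(1L, 2L))); // {build=FINISHED rows=2, join=FAILED}
    }
}
```

The build-side row count survives even though the overall run fails, which is the kind of statistic the test in this PR is meant to check for.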

@kaikalur kaikalur left a comment

Thanks for writing the e2e test!

@feilong-liu feilong-liu merged commit e088622 into prestodb:master Mar 13, 2024
57 checks passed
@feilong-liu feilong-liu deleted the record_failed_hbo branch March 13, 2024 22:10
elharo commented Mar 14, 2024

This morning I started to see the flake in TestHistoryBasedStatsTracking shown below. Any chance that was caused by this PR? Timing and classes change look suspicious though I don't immediately see the cause.

2024-03-14T11:13:03.2037597Z [ERROR] Tests run: 61, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 203.158 s <<< FAILURE! - in TestSuite
2024-03-14T11:13:03.2039919Z [ERROR] com.facebook.presto.execution.TestHistoryBasedStatsTracking.testHistoryBasedStatsCalculator  Time elapsed: 0.103 s  <<< FAILURE!
2024-03-14T11:13:03.2041698Z java.lang.AssertionError: 
2024-03-14T11:13:03.2042236Z Plan does not match, expected [
2024-03-14T11:13:03.2042619Z 
2024-03-14T11:13:03.2042831Z - anyTree
2024-03-14T11:13:03.2043243Z     - node(FilterNode)
2024-03-14T11:13:03.2043717Z         expectedOutputRowCount(2.0)
2024-03-14T11:13:03.2044279Z         expectedOutputSize(199.0)
2024-03-14T11:13:03.2044826Z         - node
2024-03-14T11:13:03.2045059Z 
2024-03-14T11:13:03.2045211Z ] but found [
2024-03-14T11:13:03.2045439Z 
2024-03-14T11:13:03.2047776Z - Output[PlanNodeId 6][nationkey, name, regionkey, comment] => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152)]
2024-03-14T11:13:03.2053014Z     - ScanFilter[PlanNodeId 0,108][table = TableHandle {connectorId='tpch', connectorHandle='nation:sf0.01', layout='Optional[nation:sf0.01]'}, filterPredicate = (substr(name, BIGINT'1', BIGINT'1')) = (VARCHAR'A')] => [nationkey:bigint, name:varchar(25), regionkey:bigint, comment:varchar(152)]
2024-03-14T11:13:03.2055339Z             regionkey := tpch:regionkey (1:15)
2024-03-14T11:13:03.2055952Z             name := tpch:name (1:15)
2024-03-14T11:13:03.2056545Z             comment := tpch:comment (1:15)
2024-03-14T11:13:03.2057173Z             nationkey := tpch:nationkey (1:15)
2024-03-14T11:13:03.2057603Z 
2024-03-14T11:13:03.2057742Z ]
2024-03-14T11:13:03.2058575Z 	at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:56)
2024-03-14T11:13:03.2060103Z 	at com.facebook.presto.sql.planner.assertions.PlanAssert.assertPlan(PlanAssert.java:40)
2024-03-14T11:13:03.2062121Z 	at 

@feilong-liu
Contributor Author

> This morning I started to see the flake in TestHistoryBasedStatsTracking shown below. Any chance that was caused by this PR? Timing and classes change look suspicious though I don't immediately see the cause.

@elharo This PR shouldn't fail this test. Can you share a link to the failed test?

@wanglinsong wanglinsong mentioned this pull request May 1, 2024
48 tasks

7 participants