Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25784][SQL] Infer filters from constraints after rewriting predicate subquery #22778

Closed
wants to merge 6 commits into from
Closed

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Oct 20, 2018

What changes were proposed in this pull request?

SPARK-22662 fixed failed to prune columns after rewriting predicate subquery. but infer filters is still missing, this pr fix it.

set spark.sql.autoBroadcastJoinThreshold=-1;
create table t1 using parquet as select 1 as col1, 2 as col2;
create table t2 using parquet as select * from t1;

Before this patch:

spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from t2);
== Physical Plan ==
SortMergeJoin [col1#3], [col1#5], LeftSemi
:- *(2) Sort [col1#3 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(col1#3, 200)
:     +- *(1) FileScan parquet default.t1[col1#3,col2#4] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int,col2:int>
+- *(4) Sort [col1#5 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#5, 200)
      +- *(3) Project [col1#5]
         +- *(3) FileScan parquet default.t2[col1#5] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/data/spark/spark-3.0.0-SNAPSHOT-bin-thriftserver/spark-ware..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col1:int>

After this patch:

spark-sql> explain select t1.* from t1 where t1.col1 in (select col1 from t2);
== Physical Plan ==
SortMergeJoin [col1#6], [col1#11], LeftSemi
:- *(2) Sort [col1#6 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(col1#6, 200)
:     +- *(1) Project [col1#6, col2#7]
:        +- *(1) Filter isnotnull(col1#6)
:           +- *(1) FileScan parquet default.t1[col1#6,col2#7] Batched: true, DataFilters: [isnotnull(col1#6)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/t1], PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int,col2:int>
+- *(4) Sort [col1#11 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#11, 200)
      +- *(3) Project [col1#11]
         +- *(3) Filter isnotnull(col1#11)
            +- *(3) FileScan parquet default.t2[col1#11] Batched: true, DataFilters: [isnotnull(col1#11)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/t2], PartitionFilters: [], PushedFilters: [IsNotNull(col1)], ReadSchema: struct<col1:int>

How was this patch tested?

unit tests and benchmark tests

1. Benchmark test 1

withTempView("t1", "t2") {
  withTempDir { dir =>
    spark.range(3000000)
      .selectExpr("cast(null as int) as c1", "if(id % 2 = 0, null, id) as c2", "id as c3")
      .coalesce(1)
      .orderBy("c2")
      .write
      .mode("overwrite")
      .option("parquet.block.size", 10485760)
      .parquet(dir.getCanonicalPath)

    spark.read.parquet(dir.getCanonicalPath).createTempView("t1")
    spark.read.parquet(dir.getCanonicalPath).createTempView("t2")

    Seq("c1", "c2", "c3").foreach { column =>
      val benchmark = new Benchmark(s"join key $column", 10)
      Seq(false, true).foreach { inferFilters =>
        benchmark.addCase(s"Is infer filters $inferFilters", numIters = 5) { _ =>
          withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> inferFilters.toString) {
            sql(s"select t1.* from t1 where t1.$column in (select $column from t2)").count()
          }
        }
      }
      benchmark.run()
    }
  }
}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c1:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Is infer filters false                        2005 / 2163          0.0   200481431.0       1.0X
Is infer filters true                          190 /  207          0.0    18962935.7      10.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c2:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Is infer filters false                        2368 / 2498          0.0   236803743.1       1.0X
Is infer filters true                         1234 / 1268          0.0   123443912.3       1.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
join key c3:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
Is infer filters false                        2754 / 2907          0.0   275376009.7       1.0X
Is infer filters true                         2237 / 2255          0.0   223739457.8       1.2X

2. Benchmark test 2

This change affect TPC-DS: q4, q5, q8, q10, q11, q14a, q14b, q16, q23a, q23b, q33, q35, q56, q58, q60, q69, q70, q74, q83, q93, q94, q95, q5a, q10a, q14, q35, q35a, q70a, q74. and this is benchmark result(note that: you can click SQL name to compare the optimized plan before and after):

SQL Before Best/Avg Time(ms) After Best/Avg Time(ms) Before Avg Time(ms) After Avg Time(ms) Relative
q35a-v2.7 18534 / 21968 17030 / 18824 21968 18824 0.8569
q69 21560 / 26378 21565 / 23928 26378 23928 0.9071
q35 25057 / 27146 18702 / 25185 27146 25185 0.9278
q56 31076 / 34614 25384 / 32604 34614 32604 0.9419
q33 25762 / 29239 25451 / 28012 29239 28012 0.958
q14-v2.7 131016 / 140958 131630 / 137203 140958 137203 0.9734
q4 145648 / 147785 140788 / 143970 147785 143970 0.9742
q10a-v2.7 19750 / 21240 19044 / 20814 21240 20814 0.9799
q70a-v2.7 16184 / 16571 16135 / 16279 16571 16279 0.9824
q11-v2.7 50646 / 54571 46192 / 53667 54571 53667 0.9834
q95 121299 / 129225 121197 / 128010 129225 128010 0.9906
q16 40738 / 42717 40133 / 42402 42717 42402 0.9926
q83 25370 / 26521 24994 / 26405 26521 26405 0.9956
q8 8837 / 11696 8688 / 11670 11696 11670 0.9978
q14b 165137 / 169002 164642 / 168943 169002 168943 0.9997
q23a 143793 / 151616 143302 / 151598 151616 151598 0.9999
q14a-v2.7 192375 / 212209 194224 / 212386 212209 212386 1.0008
q23b 135824 / 145186 137020 / 145420 145186 145420 1.0016
q14a 210252 / 214839 210073 / 215748 214839 215748 1.0042
q10 17649 / 21792 18157 / 21951 21792 21951 1.0073
q58 29045 / 31574 29685 / 31910 31574 31910 1.0106
q74 41370 / 46157 43644 / 46670 46157 46670 1.0111
q5a-v2.7 58671 / 64598 59746 / 65335 64598 65335 1.0114
q60 31167 / 35465 29486 / 35930 35465 35930 1.0131
q74-v2.7 41394 / 44803 42327 / 45914 44803 45914 1.0248
q11 49738 / 54953 52977 / 57312 54953 57312 1.0429
q93 72402 / 74155 73451 / 77526 74155 77526 1.0455
q5 62540 / 65776 66606 / 69293 65776 69293 1.0535
q35-v2.7 17534 / 19987 19913 / 21244 19987 21244 1.0629
q94 20375 / 20631 23385 / 24338 20631 24338 1.1797
q70 17182 / 19610 18252 / 23405 19610 23405 1.1935

@SparkQA
Copy link

SparkQA commented Oct 20, 2018

Test build #97629 has finished for PR 22778 at commit c8d1b91.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 20, 2018

Test build #97655 has finished for PR 22778 at commit 6596327.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented Oct 20, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Oct 21, 2018

Test build #97669 has finished for PR 22778 at commit 6596327.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented Oct 21, 2018

cc @gatorsmile @cloud-fan @maropu

@maropu
Copy link
Member

maropu commented Oct 23, 2018

Can you put the concrete example of the missing case you described in the PR description?

@maropu
Copy link
Member

maropu commented Oct 23, 2018

Also, to make sure no performance regression in the optimizer, can you check optimizer statistics in TPCDS by running TPCDSQuerySuite, too?

CollapseProject,
RemoveRedundantProject) :: Nil
}

test("Column pruning after rewriting predicate subquery") {
val relation = LocalRelation('a.int, 'b.int)
val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)
withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to modify this existing test?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, spark.sql.constraintPropagation.enabled=false to test ColumnPruning.
spark.sql.constraintPropagation.enabled=true to test ColumnPruning, InferFiltersFromConstraints and PushDownPredicate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. Thanks.

}
}

test("Infer filters and push down predicate after rewriting predicate subquery") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need the column pruning in the test title?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making the test title simple, then leaving comments about what's tested clearly here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about refactor these test to:

  val relation = LocalRelation('a.int, 'b.int)
  val relInSubquery = LocalRelation('x.int, 'y.int, 'z.int)

  test("Column pruning") {
    withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "false") {
      val query = relation.where('a.in(ListQuery(relInSubquery.select('x)))).select('a)

      val optimized = Optimize.execute(query.analyze)
      val correctAnswer = relation
        .select('a)
        .join(relInSubquery.select('x), LeftSemi, Some('a === 'x))
        .analyze

      comparePlans(optimized, correctAnswer)
    }
  }

  test("Column pruning, infer filters and push down predicate") {
    withSQLConf(SQLConf.CONSTRAINT_PROPAGATION_ENABLED.key -> "true") {
      val query = relation.where('a.in(ListQuery(relInSubquery.select('x)))).select('a)

      val optimized = Optimize.execute(query.analyze)
      val correctAnswer = relation
        .where(IsNotNull('a)).select('a)
        .join(relInSubquery.where(IsNotNull('x)).select('x), LeftSemi, Some('a === 'x))
        .analyze

      comparePlans(optimized, correctAnswer)
    }
  }

RewritePredicateSubquery,
ColumnPruning,
InferFiltersFromConstraints,
PushDownPredicate,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, cc @gatorsmile @maryannxue

@wangyum
Copy link
Member Author

wangyum commented Oct 25, 2018

@maropu This is optimizer statistics before and after this patch.

=== Metrics of Analyzer/Optimizer Rules before this patch ===                                                                                                                                                 === Metrics of Analyzer/Optimizer Rules after this patch ===
Total number of runs: 220860                                                                                                                                                                                  Total number of runs: 221842
Total time: 92.850628879 seconds                                                                                                                                                                              Total time: 100.810327761 seconds

Rule                                                                                               Effective Time / Total Time                     Effective Runs / Total Runs                                Effective Time / Total Time                     Effective Runs / Total Runs

org.apache.spark.sql.catalyst.optimizer.Optimizer$OptimizeSubqueries                               10132656768 / 10253620575                       47 / 390                                                   12304351112 / 12424750794                       47 / 390
org.apache.spark.sql.catalyst.optimizer.ColumnPruning                                              1880308393 / 8329162616                         391 / 2333                                                 2117896745 / 9071585339                         391 / 2333
org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution                                    5092672075 / 5110064107                         58 / 890                                                   4887656663 / 4903590901                         58 / 890
org.apache.spark.sql.catalyst.optimizer.PruneFilters                                               61752234 / 4105761812                           5 / 1943                                                   70559207 / 4465280263                           5 / 1943
org.apache.spark.sql.catalyst.optimizer.ReorderJoin                                                2721845400 / 3697707596                         181 / 1943                                                 3356876470 / 4365735954                         181 / 1943
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                                  2152781072 / 3114708964                         846 / 2296                                                 2014517480 / 3580105409                         312 / 780
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions                          875336763 / 2522798268                          45 / 2296                                                  2054648919 / 3061650250                         846 / 2296
org.apache.spark.sql.catalyst.optimizer.PushPredicateThroughJoin                                   1301162184 / 1925078331                         811 / 1943                                                 880342476 / 2535797956                          45 / 2296
org.apache.spark.sql.catalyst.optimizer.BooleanSimplification                                      35280989 / 1822909975                           9 / 1943                                                   1415726619 / 2070669333                         811 / 1943
org.apache.spark.sql.catalyst.analysis.DecimalPrecision                                            1486561519 / 1791324588                         288 / 2296                                                 38775682 / 1883707910                           9 / 1943
org.apache.spark.sql.catalyst.optimizer.InferFiltersFromConstraints                                1289610205 / 1384744531                         278 / 390                                                  1444130367 / 1745543467                         288 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability                                     29636254 / 1355280611                           12 / 820                                                   1109816330 / 1619640495                         853 / 2333
org.apache.spark.sql.catalyst.optimizer.ConstantFolding                                            206713838 / 1331971096                          194 / 1943                                                 69725969 / 1417183573                           42 / 1943
org.apache.spark.sql.catalyst.optimizer.PushDownPredicate                                          959279687 / 1331761891                          820 / 1943                                                 0 / 1404733663                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.NullPropagation                                            65910473 / 1324952383                           42 / 1943                                                  195479597 / 1340347970                          194 / 1943
org.apache.spark.sql.catalyst.optimizer.ReorderAssociativeOperator                                 0 / 1268803431                                  0 / 1943                                                   0 / 1331100388                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.OptimizeIn                                                 23201608 / 1258483307                           27 / 1943                                                  1105179405 / 1317882379                         59 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery                                    1040041843 / 1251992635                         59 / 2296                                                  74349060 / 1291276637                           84 / 1943
org.apache.spark.sql.catalyst.optimizer.LikeSimplification                                         1347508 / 1248925860                            1 / 1943                                                   0 / 1289545537                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.EliminateOuterJoin                                         45563225 / 1246306320                           15 / 1943                                                  0 / 1288186620                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.SimplifyConditionals                                       0 / 1234509354                                  0 / 1943                                                   23417314 / 1279842595                           27 / 1943
org.apache.spark.sql.catalyst.optimizer.SimplifyBinaryComparison                                   0 / 1224533697                                  0 / 1943                                                   0 / 1279108156                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.SimplifyCasts                                              75406425 / 1215567094                           84 / 1943                                                  1361911 / 1268795854                            1 / 1943
org.apache.spark.sql.catalyst.optimizer.SimplifyCaseConversionExpressions                          0 / 1196031525                                  0 / 1943                                                   42577656 / 1223661384                           15 / 1943
org.apache.spark.sql.catalyst.optimizer.RemoveDispensableExpressions                               0 / 1184848914                                  0 / 1943                                                   0 / 1200658411                                  0 / 1943
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences                           31233040 / 1184415886                           10 / 2296                                                  29312354 / 1162683260                           10 / 2296
org.apache.spark.sql.catalyst.optimizer.SimplifyExtractValueOps                                    0 / 1159058474                                  0 / 1943                                                   60950940 / 1155752729                           47 / 2333
org.apache.spark.sql.catalyst.optimizer.RemoveRedundantAliases                                     5698821 / 1153787317                            10 / 1943                                                  122109389 / 1152651163                          155 / 2333
org.apache.spark.sql.catalyst.optimizer.RemoveRedundantProject                                     78177701 / 1123087719                           47 / 2333                                                  6731330 / 1135242708                            10 / 1943
org.apache.spark.sql.catalyst.optimizer.CollapseProject                                            123874416 / 1101850924                          155 / 2333                                                 10567406 / 1130613449                           8 / 1943
org.apache.spark.sql.catalyst.optimizer.CombineFilters                                             449980198 / 1051429309                          636 / 1943                                                 30638232 / 1111306240                           12 / 820
org.apache.spark.sql.catalyst.optimizer.RewriteCorrelatedScalarSubquery                            9712531 / 1010330915                            8 / 1943                                                   29494556 / 1058321219                           43 / 2333
org.apache.spark.sql.catalyst.optimizer.CombineUnions                                              31689453 / 991412499                            43 / 2333                                                  0 / 1057978599                                  0 / 1943
org.apache.spark.sql.catalyst.optimizer.CollapseRepartition                                        0 / 985591919                                   0 / 1943                                                   399285885 / 1036327468                          111 / 2296
org.apache.spark.sql.catalyst.optimizer.ConstantPropagation                                        0 / 953077628                                   0 / 1943                                                   476914099 / 1011633729                          636 / 1943
org.apache.spark.sql.catalyst.optimizer.EliminateSorts                                             0 / 909291634                                   0 / 1943                                                   0 / 1001606023                                  0 / 1943
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts                              365802139 / 908733912                           111 / 2296                                                 0 / 997621178                                   0 / 1943
org.apache.spark.sql.catalyst.optimizer.CollapseWindow                                             0 / 897540927                                   0 / 1943                                                   0 / 927316666                                   0 / 1943
org.apache.spark.sql.catalyst.optimizer.CombineLimits                                              0 / 875781956                                   0 / 1943                                                   0 / 922370179                                   0 / 1943
org.apache.spark.sql.catalyst.optimizer.LimitPushDown                                              0 / 868110787                                   0 / 1943                                                   0 / 917919176                                   0 / 1943
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion                                 26326365 / 865776545                            21 / 1943                                                  23992497 / 900311285                            21 / 1943
org.apache.spark.sql.catalyst.optimizer.EliminateSerialization                                     0 / 854607673                                   0 / 1943                                                   0 / 892746271                                   0 / 1943
org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion                     262665875 / 579658305                           62 / 2296                                                  260616344 / 593811716                           62 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions                                   153745818 / 475319649                           408 / 2296                                                 241689927 / 455832126                           33 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$CaseWhenCoercion                               235816769 / 455590494                           33 / 2296                                                  152047701 / 451570922                           408 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings                                 13635691 / 355369662                            11 / 2296                                                  9052332 / 383939387                             4 / 2296
org.apache.spark.sql.catalyst.analysis.ResolveTimeZone                                             146691973 / 323105943                           566 / 2296                                                 15974079 / 369836047                            11 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion                                   6787826 / 321386706                             4 / 2296                                                   0 / 364936587                                   0 / 390
org.apache.spark.sql.catalyst.analysis.TypeCoercion$Division                                       31419357 / 319844433                            10 / 2296                                                  147860974 / 331495653                           566 / 2296
org.apache.spark.sql.catalyst.optimizer.UpdateNullabilityInAttributeReferences                     667532 / 316522643                              1 / 390                                                    32499587 / 329200084                            10 / 2296
org.apache.spark.sql.catalyst.optimizer.PropagateEmptyRelation                                     9098104 / 315856886                             5 / 785                                                    635601 / 324689035                              1 / 390
org.apache.spark.sql.catalyst.analysis.TypeCoercion$DateTimeOperations                             9601865 / 307128084                             43 / 2296                                                  0 / 321984844                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion                                     0 / 304912772                                   0 / 2296                                                   8226740 / 319740326                             43 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality                                0 / 304196921                                   0 / 2296                                                   0 / 315499126                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator                                   0 / 300839906                                   0 / 2296                                                   0 / 313318332                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder                                 48154044 / 291379133                            24 / 2296                                                  0 / 312265841                                   0 / 390
org.apache.spark.sql.catalyst.optimizer.DecimalAggregates                                          108634840 / 289690009                           124 / 514                                                  0 / 308258712                                   0 / 2296
org.apache.spark.sql.execution.python.ExtractPythonUDFs                                            0 / 278550649                                   0 / 390                                                    103129068 / 289052907                           124 / 514
org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct                                    0 / 273938538                                   0 / 2296                                                   0 / 265747445                                   0 / 2296
org.apache.spark.sql.catalyst.optimizer.RewriteIntersectAll                                        0 / 268165845                                   0 / 432                                                    10768959 / 260676307                            24 / 2296
org.apache.spark.sql.catalyst.optimizer.PullupCorrelatedPredicates                                 74395591 / 239152525                            27 / 390                                                   0 / 249739752                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$ConcatCoercion                                 0 / 236351130                                   0 / 2296                                                   13236741 / 244925139                            5 / 785
org.apache.spark.sql.catalyst.analysis.TimeWindowing                                               0 / 235225556                                   0 / 2296                                                   0 / 244460591                                   0 / 390
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions                               0 / 230404877                                   0 / 390                                                    204172100 / 242598141                           222 / 390
org.apache.spark.sql.catalyst.analysis.ResolveHigherOrderFunctions                                 0 / 229768333                                   0 / 2296                                                   12417039 / 235806609                            37 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame                                 10792824 / 228415803                            37 / 2296                                                  0 / 232472018                                   0 / 2296
org.apache.spark.sql.execution.datasources.FindDataSourceTable                                     179627028 / 228013811                           294 / 2296                                                 184333924 / 228692095                           294 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$EltCoercion                                    0 / 225389128                                   0 / 2296                                                   59881230 / 226584657                            27 / 390
org.apache.spark.sql.catalyst.analysis.TypeCoercion$WindowFrameCoercion                            0 / 218608000                                   0 / 2296                                                   0 / 221887078                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$StackCoercion                                  0 / 218194486                                   0 / 2296                                                   0 / 215732842                                   0 / 2296
org.apache.spark.sql.catalyst.optimizer.RewritePredicateSubquery                                   173523514 / 214927844                           222 / 390                                                  0 / 213744878                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$MapZipWithCoercion                             0 / 214340662                                   0 / 2296                                                   0 / 213295968                                   0 / 2296
org.apache.spark.sql.catalyst.optimizer.FoldablePropagation                                        0 / 199432071                                   0 / 1943                                                   0 / 194465130                                   0 / 1943
org.apache.spark.sql.catalyst.optimizer.EliminateMapObjects                                        0 / 185179070                                   0 / 390                                                    0 / 178698382                                   0 / 390
org.apache.spark.sql.execution.OptimizeMetadataOnlyQuery                                           0 / 176454024                                   0 / 390                                                    0 / 173558271                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions                           39309018 / 176186699                            38 / 2296                                                  43715379 / 172272368                            38 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed                                  0 / 170022403                                   0 / 2296                                                   0 / 165447407                                   0 / 390
org.apache.spark.sql.execution.python.ExtractPythonUDFFromAggregate                                0 / 159757503                                   0 / 390                                                    0 / 160391099                                   0 / 390
org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTime                                         0 / 158140055                                   0 / 390                                                    41644289 / 159215748                            12 / 2296
org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases                                    141871895 / 156820337                           296 / 390                                                  0 / 158779735                                   0 / 390
org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables                                      0 / 152376408                                   0 / 2296                                                   0 / 157874334                                   0 / 390
org.apache.spark.sql.catalyst.optimizer.ReplaceExpressions                                         0 / 152316462                                   0 / 390                                                    0 / 157132550                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics                           42422112 / 151049951                            12 / 2296                                                  140810447 / 156338363                           296 / 390
org.apache.spark.sql.catalyst.optimizer.GetCurrentDatabase                                         0 / 137267577                                   0 / 390                                                    24557398 / 151324074                            42 / 432
org.apache.spark.sql.catalyst.optimizer.PullOutPythonUDFInJoinCondition                            0 / 133692760                                   0 / 390                                                    71287846 / 142242298                            507 / 1327
org.apache.spark.sql.catalyst.analysis.CleanupAliases                                              68056569 / 131655681                            507 / 1327                                                 0 / 134581169                                   0 / 390
org.apache.spark.sql.catalyst.optimizer.RemoveRedundantSorts                                       0 / 130701173                                   0 / 390                                                    0 / 133089737                                   0 / 390
org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic                            0 / 127233011                                   0 / 820                                                    0 / 130835073                                   0 / 390
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases                                     17107572 / 125147408                            53 / 2296                                                  0 / 128705527                                   0 / 820
org.apache.spark.sql.catalyst.optimizer.CombineTypedFilters                                        0 / 123463473                                   0 / 390                                                    0 / 125210046                                   0 / 432
org.apache.spark.sql.catalyst.optimizer.RewriteExceptAll                                           0 / 114173237                                   0 / 432                                                    12476277 / 121836057                            24 / 432
org.apache.spark.sql.catalyst.optimizer.ReplaceIntersectWithSemiJoin                               10889547 / 112652867                            24 / 432                                                   21381778 / 118895727                            53 / 2296
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithFilter                                    0 / 110008368                                   0 / 432                                                    0 / 116345587                                   0 / 432
org.apache.spark.sql.catalyst.optimizer.ReplaceDistinctWithAggregate                               21610795 / 109828668                            42 / 432                                                   0 / 113808944                                   0 / 432
org.apache.spark.sql.catalyst.optimizer.ReplaceExceptWithAntiJoin                                  1216445 / 108203801                             1 / 432                                                    1720957 / 112284623                             1 / 432
org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates                                  0 / 107287411                                   0 / 390                                                    59563957 / 106598698                            294 / 2306
org.apache.spark.sql.catalyst.optimizer.RemoveRepetitionFromGroupExpressions                       2159170 / 104265980                             2 / 392                                                    0 / 106179171                                   0 / 390
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer                                0 / 103215140                                   0 / 2296                                                   0 / 103787927                                   0 / 390
org.apache.spark.sql.catalyst.analysis.EliminateView                                               0 / 102296786                                   0 / 390                                                    0 / 103305027                                   0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                                   50917277 / 97159389                             294 / 2306                                                 4385370 / 102288812                             2 / 392
org.apache.spark.sql.catalyst.optimizer.RemoveLiteralFromGroupExpressions                          0 / 92841832                                    0 / 392                                                    0 / 97691358                                    0 / 392
org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates                                   3915170 / 92591154                              62 / 2296                                                  4051673 / 94971539                              62 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$LookupFunctions                                    0 / 89117456                                    0 / 830                                                    0 / 93198423                                    0 / 390
org.apache.spark.sql.catalyst.optimizer.ReplaceDeduplicateWithAggregate                            0 / 88821013                                    0 / 390                                                    0 / 90764806                                    0 / 830
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast                                      0 / 84561708                                    0 / 2296                                                   19857036 / 89051825                             24 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance                                 0 / 84221700                                    0 / 2296                                                   0 / 82792409                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes                         21598496 / 83356934                             24 / 2296                                                  0 / 81606995                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF                             0 / 71959898                                    0 / 820                                                    2297618 / 79871191                              8 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy                  2555449 / 69189122                              8 / 2296                                                   0 / 70548261                                    0 / 820
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot                                       0 / 63485517                                    0 / 2296                                                   45895772 / 59588549                             24 / 820
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy                           0 / 58879583                                    0 / 2296                                                   0 / 57888686                                    0 / 2296
org.apache.spark.sql.catalyst.expressions.codegen.package$ExpressionCanonicalizer$CleanExpressions 80429 / 57856725                                76 / 11912                                                 0 / 54729566                                    0 / 2296
org.apache.spark.sql.execution.datasources.DataSourceAnalysis                                      41813139 / 55552144                             24 / 820                                                   0 / 54556018                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions                                 0 / 53367328                                    0 / 2306                                                   0 / 50850649                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate                                    0 / 51405189                                    0 / 2296                                                   72764 / 50825747                                76 / 12114
org.apache.spark.sql.catalyst.analysis.ResolveInlineTables                                         0 / 51249460                                    0 / 2296                                                   0 / 50057052                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin                         0 / 50353332                                    0 / 2296                                                   0 / 49708235                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases                       0 / 49502239                                    0 / 2296                                                   0 / 48841848                                    0 / 2296
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation                              0 / 49417950                                    0 / 2296                                                   0 / 48215992                                    0 / 2296
org.apache.spark.sql.execution.datasources.ResolveSQLOnFile                                        0 / 49374943                                    0 / 2296                                                   0 / 47911673                                    0 / 2306
org.apache.spark.sql.execution.datasources.PreprocessTableCreation                                 0 / 30942645                                    0 / 820                                                    0 / 33275799                                    0 / 820
org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals                                4537227 / 30392738                              8 / 890                                                    4410251 / 32033067                              8 / 890
org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveBroadcastHints                          0 / 27361418                                    0 / 830                                                    0 / 28707886                                    0 / 830
org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts                                     0 / 22963407                                    0 / 390                                                    0 / 24315925                                    0 / 390
org.apache.spark.sql.catalyst.analysis.EliminateUnions                                             0 / 22593930                                    0 / 890                                                    0 / 20247865                                    0 / 820
org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution                                0 / 22089269                                    0 / 890                                                    0 / 19302887                                    0 / 890
org.apache.spark.sql.catalyst.analysis.UpdateOuterReferences                                       0 / 20769130                                    0 / 820                                                    0 / 19285784                                    0 / 890
org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints                           0 / 20604280                                    0 / 830                                                    0 / 16660442                                    0 / 830
org.apache.spark.sql.catalyst.analysis.AliasViewChild                                              0 / 15787426                                    0 / 820                                                    0 / 15849217                                    0 / 820
org.apache.spark.sql.execution.datasources.PreprocessTableInsertion                                0 / 14039160                                    0 / 820                                                    0 / 13514297                                    0 / 820
org.apache.spark.sql.catalyst.analysis.ResolveHints$RemoveAllHints                                 0 / 12451609                                    0 / 830                                                    0 / 12090842                                    0 / 830
org.apache.spark.sql.catalyst.optimizer.CombineConcats                                             0 / 9915780                                     0 / 1943                                                   0 / 9786602                                     0 / 1943
org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder                                       0 / 4431280                                     0 / 390                                                    0 / 4314429                                     0 / 390
org.apache.spark.sql.catalyst.optimizer.EliminateDistinct                                          0 / 4138833                                     0 / 390                                                    0 / 3707721                                     0 / 390
org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaPruning                            0 / 3070796                                     0 / 390                                                    0 / 3494547                                     0 / 390

@wangyum
Copy link
Member Author

wangyum commented Oct 29, 2018

We also need to add the CombineFilters based on #22879.

@@ -171,10 +171,13 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
// "Extract PythonUDF From JoinCondition".
Batch("Check Cartesian Products", Once,
CheckCartesianProducts) :+
Batch("RewriteSubquery", Once,
Batch("Rewrite Subquery", Once,
Copy link
Member

@gatorsmile gatorsmile Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not have a good answer for this PR. Ideally, we should run the whole batch operatorOptimizationBatch. However, running the whole batch could be very time consuming. I would suggest to add a new parameter for introducing the time bound limit for each batch.

cc @maryannxue WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Do you think its a good time to revisit Natt's PR to convert subquery expressions to Joins early in the optimization process ? Perhaps then we can take advantage of all the subsequent rules firing after the subquery rewrite ?

Copy link
Contributor

@maryannxue maryannxue Oct 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile I think @dilipbiswal's suggestion is the right way to go. If you think of this subquery rewriting as another kind of de-correlation, it should be a pre-optimization rule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. That sounds also good to me. @dilipbiswal Could you take the PR #17520 over?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Sure Sean.. Let me give it a try.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98246 has finished for PR 22778 at commit db519c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Let us hold this PR and try to fix #17520 instead.

@SparkQA
Copy link

SparkQA commented Oct 31, 2018

Test build #98294 has finished for PR 22778 at commit 80bf621.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum closed this Mar 4, 2019
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/RewriteSubquerySuite.scala
@wangyum wangyum reopened this Aug 28, 2020
@SparkQA
Copy link

SparkQA commented Aug 28, 2020

Test build #127986 has finished for PR 22778 at commit 6b2a2da.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants