[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused #24258

Closed
wants to merge 3 commits

Conversation

gatorsmile
Member

What changes were proposed in this pull request?

With this change, we can easily identify the plan difference when a subquery is reused.

When reuse is enabled, the plan looks like:

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When reuse is disabled, the plan looks like:

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```
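
For reference, a query of roughly this shape produces the plans above (a minimal sketch, assuming the suite's testData table with an integer key column):

```scala
// Two identical scalar subqueries: with reuse enabled, the second one shows
// up as ReusedSubquery in the physical plan instead of being planned twice.
val df = spark.sql(
  "SELECT (SELECT avg(key) FROM testData) + (SELECT avg(key) FROM testData)")
df.explain()
```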

How was this patch tested?

Modified the existing test.

@gatorsmile
Member Author

gatorsmile commented Mar 31, 2019

@@ -113,33 +113,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
}
}

test("Reuse Subquery") {
Member Author

moved to SubquerySuite.scala

@SparkQA

SparkQA commented Mar 31, 2019

Test build #104128 has finished for PR 24258 at commit 14f439b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 31, 2019

Test build #104127 has finished for PR 24258 at commit 2ebdcab.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class BaseSubqueryExec extends SparkPlan
  • case class SubqueryExec(name: String, child: SparkPlan)
  • case class ReusedSubqueryExec(child: BaseSubqueryExec)
  • abstract class ExecSubqueryExpression extends PlanExpression[BaseSubqueryExec]

@adrian-wang
Contributor

retest this please.

@SparkQA

SparkQA commented Mar 31, 2019

Test build #104138 has finished for PR 24258 at commit 14f439b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Normalize the outer references in the subquery plan.
val normalizedPlan = plan.transformAllExpressions {
  case OuterReference(r) => OuterReference(QueryPlan.normalizeExprId(r, attrs))
}
}.canonicalized
Member

Do we need the canonicalized plan here? It seems all the canonicalization should happen at line 69?

Contributor

+1
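
For context, a sketch of what this normalization buys (a hypothetical illustration, not code from the patch; it assumes QueryPlan.normalizeExprId rewrites an attribute's ExprId to its ordinal position in the given attribute list):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.types.IntegerType

// Two references to the "same" column, created with fresh ExprIds.
val a1 = AttributeReference("key", IntegerType)()
val a2 = a1.newInstance() // same name and type, different ExprId

// Normalizing against each plan's attribute list maps both to the same
// positional ExprId, so plans that differ only in fresh IDs canonicalize
// to equal plans and can be detected as reusable.
val n1 = QueryPlan.normalizeExprId(a1, Seq(a1))
val n2 = QueryPlan.normalizeExprId(a2, Seq(a2))
assert(n1 == n2)
```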

@@ -25,7 +25,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 import org.apache.spark.{AccumulatorSuite, SparkException}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
 import org.apache.spark.sql.catalyst.util.StringUtils
-import org.apache.spark.sql.execution.{aggregate, ScalarSubquery, SubqueryExec}
+import org.apache.spark.sql.execution.aggregate
Member

nit: we can remove this import by changing aggregate.HashAggregateExec -> HashAggregateExec in line 263?


override def executeCollect(): Array[InternalRow] = {
  child.executeCollect()
}
Member

super nit:

  protected override def doPrepare(): Unit = child.prepare()
  protected override def doExecute(): RDD[InternalRow] = child.execute()
  override def executeCollect(): Array[InternalRow] = child.executeCollect()

@maropu
Member

maropu commented Apr 1, 2019

btw, is it not worth filing a new JIRA for this refactoring?

"dataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size"),
"collectTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to collect"))
abstract class BaseSubqueryExec extends SparkPlan {
def name: String
Member

Do we need name in this base class for subquery exec?

Member Author

Both ReusedSubqueryExec and SubqueryExec have the name field.
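
For readers following along, a simplified sketch of the hierarchy this patch introduces (class signatures from the SparkQA summary above; the bodies are abbreviated and assumed, not the exact patch code — the real SubqueryExec, for instance, computes its result asynchronously and tracks metrics):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.SparkPlan

// Common parent: every subquery node, original or reused, exposes a name.
abstract class BaseSubqueryExec extends SparkPlan {
  def name: String
  def child: SparkPlan
  override def children: Seq[SparkPlan] = child :: Nil
  override def output: Seq[Attribute] = child.output
}

// The physical subquery itself.
case class SubqueryExec(name: String, child: SparkPlan) extends BaseSubqueryExec {
  protected override def doExecute(): RDD[InternalRow] = child.execute()
}

// A marker node wrapping an already-planned subquery; it delegates
// everything, including name, to the node it reuses.
case class ReusedSubqueryExec(child: BaseSubqueryExec) extends BaseSubqueryExec {
  override def name: String = child.name
  protected override def doPrepare(): Unit = child.prepare()
  protected override def doExecute(): RDD[InternalRow] = child.execute()
  override def executeCollect(): Array[InternalRow] = child.executeCollect()
}
```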

 plan transformAllExpressions {
   case sub: ExecSubqueryExpression =>
-    val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
+    val sameSchema =
+      subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[BaseSubqueryExec]())
Contributor

unnecessary change?

Member Author

Changed it to BaseSubqueryExec.
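
For context, the surrounding rule looks roughly like this after the change (a sketch reconstructed from the diff above; details such as the config check and import locations are assumptions, not verbatim patch code):

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{BaseSubqueryExec, ExecSubqueryExpression, ReusedSubqueryExec, SparkPlan}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
  def apply(plan: SparkPlan): SparkPlan = {
    if (!conf.subqueryReuseEnabled) {
      return plan
    }
    // Group candidate subqueries by schema, then reuse the first plan that
    // produces the same result instead of executing it a second time.
    val subqueries = mutable.HashMap[StructType, ArrayBuffer[BaseSubqueryExec]]()
    plan transformAllExpressions {
      case sub: ExecSubqueryExpression =>
        val sameSchema =
          subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[BaseSubqueryExec]())
        val sameResult = sameSchema.find(_.sameResult(sub.plan))
        if (sameResult.isDefined) {
          // This is what surfaces as ReusedSubquery in the plan output.
          sub.withNewPlan(ReusedSubqueryExec(sameResult.get))
        } else {
          sameSchema += sub.plan
          sub
        }
    }
  }
}
```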

@cloud-fan
Contributor

Shall we create a new JIRA? This improves the SQL web UI.

@gatorsmile gatorsmile changed the title [SPARK-27279][SQL][FOLLOW-UP] Reuse subquery should compare child plan of SubqueryExec [SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused Apr 4, 2019
@SparkQA

SparkQA commented Apr 5, 2019

Test build #104306 has finished for PR 24258 at commit 7036cfa.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 5, 2019

Test build #4691 has finished for PR 24258 at commit 7036cfa.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Apr 5, 2019

retest this please

@SparkQA

SparkQA commented Apr 5, 2019

Test build #104314 has finished for PR 24258 at commit 7036cfa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

Thanks! Merged to master.

@gatorsmile gatorsmile closed this in 5678e68 Apr 5, 2019
j-baker pushed a commit to palantir/spark that referenced this pull request Jan 25, 2020
[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused

Closes apache#24258 from gatorsmile/followupSPARK-27279.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>