[SPARK-27393][SQL] Show ReusedSubquery in the plan when the subquery is reused #24258
Conversation
```diff
@@ -113,33 +113,6 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     }
   }

-  test("Reuse Subquery") {
```
moved to SubquerySuite.scala
Test build #104128 has finished for PR 24258 at commit

Test build #104127 has finished for PR 24258 at commit

retest this please.

Test build #104138 has finished for PR 24258 at commit
```scala
// Normalize the outer references in the subquery plan.
val normalizedPlan = plan.transformAllExpressions {
  case OuterReference(r) => OuterReference(QueryPlan.normalizeExprId(r, attrs))
}
}.canonicalized
```
Do we need the canonicalized plan here? It seems all the canonicalization should happen in line 69?
+1
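For context, a rough sketch of what this normalization buys: without rewriting the ExprIds inside `OuterReference`, two otherwise-identical correlated subqueries would never compare equal, so the reuse rule could never match them. The wrapper object and method name below are hypothetical; `attrs` follows the diff.

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeSeq, OuterReference}
import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical helper illustrating the canonicalization step in the diff above.
object SubqueryCanonicalization {
  def canonicalize(plan: SparkPlan, attrs: AttributeSeq): SparkPlan = {
    // Rewrite outer references to position-based ExprIds so that two plans
    // that are equal up to attribute naming canonicalize identically.
    val normalizedPlan = plan.transformAllExpressions {
      case OuterReference(r) => OuterReference(QueryPlan.normalizeExprId(r, attrs))
    }
    normalizedPlan.canonicalized
  }
}
```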
```diff
@@ -25,7 +25,7 @@ import java.util.concurrent.atomic.AtomicBoolean
 import org.apache.spark.{AccumulatorSuite, SparkException}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}
 import org.apache.spark.sql.catalyst.util.StringUtils
-import org.apache.spark.sql.execution.{aggregate, ScalarSubquery, SubqueryExec}
+import org.apache.spark.sql.execution.aggregate
```
nit: we can remove this import by changing `aggregate.HashAggregateExec` -> `HashAggregateExec` in line 263?
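Spelled out, the nit amounts to the following (a hypothetical before/after; "line 263" refers to the test file in the diff):

```scala
// Before: only the package is imported, so the match at line 263 must qualify the class.
import org.apache.spark.sql.execution.aggregate
// ... case agg: aggregate.HashAggregateExec => ...

// After: import the class directly, and the package import becomes unnecessary.
import org.apache.spark.sql.execution.aggregate.HashAggregateExec
// ... case agg: HashAggregateExec => ...
```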
```scala
override def executeCollect(): Array[InternalRow] = {
  child.executeCollect()
}
```
super nit:

```scala
protected override def doPrepare(): Unit = child.prepare()
protected override def doExecute(): RDD[InternalRow] = child.execute()
override def executeCollect(): Array[InternalRow] = child.executeCollect()
```

btw, is it not worth filing a new JIRA for this refactoring?
"dataSize" -> SQLMetrics.createSizeMetric(sparkContext, "data size"), | ||
"collectTime" -> SQLMetrics.createTimingMetric(sparkContext, "time to collect")) | ||
abstract class BaseSubqueryExec extends SparkPlan { | ||
def name: String |
We need `name` in this base class for subquery exec?
Both `ReusedSubqueryExec` and `SubqueryExec` have the `name`.
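A condensed sketch of the hierarchy under discussion (simplified from the PR; most SparkPlan members are elided or stubbed):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.execution.{SparkPlan, UnaryExecNode}

// Both concrete subquery nodes carry a display name, which is why `name`
// can live on the abstract base class and EXPLAIN can print
// "Subquery subquery240" / "ReusedSubquery Subquery subquery240".
abstract class BaseSubqueryExec extends SparkPlan {
  def name: String
  def child: SparkPlan
  override def output: Seq[Attribute] = child.output
}

// Runs the subquery once and caches its result (body condensed here).
case class SubqueryExec(name: String, child: SparkPlan)
  extends BaseSubqueryExec with UnaryExecNode {
  protected override def doExecute(): RDD[InternalRow] = child.execute()
}

// Wraps an already-planned SubqueryExec and delegates everything to it,
// so the shared subquery is executed only once.
case class ReusedSubqueryExec(child: SubqueryExec) extends BaseSubqueryExec {
  override def name: String = child.name
  override def children: Seq[SparkPlan] = Nil
  protected override def doExecute(): RDD[InternalRow] = child.execute()
  override def executeCollect(): Array[InternalRow] = child.executeCollect()
}
```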
```diff
 plan transformAllExpressions {
   case sub: ExecSubqueryExpression =>
-    val sameSchema = subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[SubqueryExec]())
+    val sameSchema =
+      subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[BaseSubqueryExec]())
```
unnecessary change?
change it to `BaseSubqueryExec`
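For context, a sketch of the full rule this hunk belongs to, closely following the surrounding diff; `conf.subqueryReuseEnabled` is assumed to be the SQLConf accessor for subquery reuse:

```scala
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{BaseSubqueryExec, ExecSubqueryExpression, ReusedSubqueryExec, SparkPlan, SubqueryExec}
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types.StructType

// Sketch of the reuse rule: group physical subqueries by schema, and when a
// later subquery's plan gives the same result as an earlier one, swap in a
// ReusedSubqueryExec wrapper instead of executing it twice.
case class ReuseSubquery(conf: SQLConf) extends Rule[SparkPlan] {
  def apply(plan: SparkPlan): SparkPlan = {
    if (!conf.subqueryReuseEnabled) {
      return plan
    }
    // Schema -> candidate subqueries with that schema.
    val subqueries = mutable.HashMap[StructType, ArrayBuffer[BaseSubqueryExec]]()
    plan transformAllExpressions {
      case sub: ExecSubqueryExpression =>
        val sameSchema =
          subqueries.getOrElseUpdate(sub.plan.schema, ArrayBuffer[BaseSubqueryExec]())
        val sameResult = sameSchema.find(_.sameResult(sub.plan))
        if (sameResult.isDefined) {
          sub.withNewPlan(ReusedSubqueryExec(sameResult.get.asInstanceOf[SubqueryExec]))
        } else {
          sameSchema += sub.plan
          sub
        }
    }
  }
}
```

Keying the map by schema keeps the `sameResult` comparisons cheap: only subqueries that could possibly produce the same result are ever compared.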
Shall we create a new JIRA? This improves the SQL web UI.
Test build #104306 has finished for PR 24258 at commit

Test build #4691 has finished for PR 24258 at commit

retest this please

Test build #104314 has finished for PR 24258 at commit
Thanks! Merged to master.
…is reused

With this change, we can easily identify the plan difference when subquery is reused.

When the reuse is enabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When the reuse is disabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

Modified the existing test.

Closes apache#24258 from gatorsmile/followupSPARK-27279.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?

With this change, we can easily identify the plan difference when subquery is reused.

When the reuse is enabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery240 + ReusedSubquery Subquery subquery240) AS (scalarsubquery() + scalarsubquery())#253]
   :  :- Subquery subquery240
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#250])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#256, count#257L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- ReusedSubquery Subquery subquery240
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

When the reuse is disabled, the plan looks like

```
== Physical Plan ==
CollectLimit 1
+- *(1) Project [(Subquery subquery286 + Subquery subquery287) AS (scalarsubquery() + scalarsubquery())#299]
   :  :- Subquery subquery286
   :  :  +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#296])
   :  :     +- Exchange SinglePartition
   :  :        +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#302, count#303L])
   :  :           +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :  :              +- Scan[obj#12]
   :  +- Subquery subquery287
   :     +- *(2) HashAggregate(keys=[], functions=[avg(cast(key#13 as bigint))], output=[avg(key)#298])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_avg(cast(key#13 as bigint))], output=[sum#306, count#307L])
   :              +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13]
   :                 +- Scan[obj#12]
   +- *(1) SerializeFromObject
      +- Scan[obj#12]
```

How was this patch tested?

Modified the existing test.
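A hedged way to reproduce the plans shown above: the query shape is inferred from the plan output, `testData` is the usual SQLTestData fixture, and the config key is assumed from SQLConf's SUBQUERY_REUSE_ENABLED.

```scala
// Assumes an active SparkSession `spark` with the testData(key, value) table registered.
val query = spark.sql(
  """SELECT (SELECT avg(key) FROM testData) + (SELECT avg(key) FROM testData)
    |FROM testData
    |LIMIT 1""".stripMargin)

// With spark.sql.execution.reuseSubquery=true (the default), the second scalar
// subquery should print as "ReusedSubquery Subquery subqueryNNN" (NNN is a
// run-dependent id); with it set to false, two independent Subquery nodes
// appear, as in the second plan above.
query.explain()
```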