[SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec #29415
Conversation
Test build #127371 has finished for PR 29415 at commit
@@ -99,17 +99,14 @@ trait DataSourceScanExec extends LeafExecNode {

/** Physical plan node for scanning data from a relation. */
case class RowDataSourceScanExec(
    fullOutput: Seq[Attribute],
can you find out the PR that added it? I can't quite remember why we have it.
It was introduced in #18600 for plan equality comparison.
I manually printed out the two canonicalized plans for df1 and df2 in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68 to check my change.
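A minimal sketch of how such a check can be done, assuming df1 and df2 are the DataFrames built in that suite:

// Hypothetical check (df1/df2 assumed from RowDataSourceStrategySuite):
// print the canonicalized physical plans and compare them by eye.
println(df1.queryExecution.executedPlan.canonicalized)
println(df2.queryExecution.executedPlan.canonicalized)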
Before my change:
*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#25]
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
+- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int>
*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#52]
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
+- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#2] PushedFilters: [], ReadSchema: struct<none:int,none:int>
After my change:
*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#25]
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
+- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int>
*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#52]
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
+- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,C#2] PushedFilters: [], ReadSchema: struct<A:int,C:int>
fullOutput seems to have no actual usage except for plan comparison. If we can make sure we don't break that, it looks OK to remove fullOutput.
@@ -143,7 +140,6 @@ case class RowDataSourceScanExec(
  // Don't care about `rdd` and `tableIdentifier` when canonicalizing.
  override def doCanonicalize(): SparkPlan =
    copy(
      fullOutput.map(QueryPlan.normalizeExpressions(_, fullOutput)),
don't we need to normalize output now?
FileSourceScanExec does it as well.
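For context, QueryPlan.normalizeExpressions rewrites exprIds to position-based ids relative to a given attribute list, which is what makes canonicalized plans comparable across queries. A minimal, self-contained sketch of its effect (attribute names a and b are illustrative):

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.QueryPlan
import org.apache.spark.sql.types.IntegerType

val a = AttributeReference("a", IntegerType)()
val b = AttributeReference("b", IntegerType)()
val full = Seq(a, b)

// Each attribute's exprId is replaced by its ordinal in `full` and the name
// is canonicalized away, so this prints roughly: List(none#0, none#1)
println(full.map(QueryPlan.normalizeExpressions(_, full)))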
We may need to add requiredSchema to RowDataSourceScanExec.
Sorry, I didn't know that we need to use the normalized exprId in the canonicalized plan. If we do, then probably we can't remove fullOutput from RowDataSourceScanExec, because using the normalized pruned output would cause problems. For example, in https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/RowDataSourceStrategySuite.scala#L68, normalizing the pruned output will give none#0,none#1 for both df1 and df2, and then both of them have exactly the same plan:
*(2) HashAggregate(keys=[none#0], functions=[min(none#0)], output=[none#0, #0])
+- Exchange hashpartitioning(none#0, 5), true, [id=#110]
+- *(1) HashAggregate(keys=[none#0], functions=[partial_min(none#1)], output=[none#0, none#4])
+- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [none#0,none#1] PushedFilters: [], ReadSchema: struct<none:int,none:int>
Then df1.union(df2) takes the ReusedExchange code path, since both plans are equal:
== Physical Plan ==
Union
:- *(2) HashAggregate(keys=[a#0], functions=[min(b#1)], output=[a#0, min(b)#12])
: +- Exchange hashpartitioning(a#0, 5), true, [id=#34]
: +- *(1) HashAggregate(keys=[a#0], functions=[partial_min(b#1)], output=[a#0, min#28])
: +- *(1) Scan JDBCRelation(TEST.INTTYPES) [numPartitions=1] [A#0,B#1] PushedFilters: [], ReadSchema: struct<A:int,B:int>
+- *(4) HashAggregate(keys=[a#0], functions=[min(c#2)], output=[a#0, min(c)#24])
+- ReusedExchange [a#0, min#30], Exchange hashpartitioning(a#0, 5), true, [id=#34]
The union result will be
+---+------+
| a|min(b)|
+---+------+
| 1| 2|
| 1| 2|
+---+------+
instead of
+---+------+
| a|min(b)|
+---+------+
| 1| 2|
| 1| 3|
+---+------+
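A hypothetical repro of the wrong-reuse scenario, assuming a one-row JDBC table TEST.INTTYPES(a, b, c) containing (1, 2, 3) as in the suite, and a placeholder url for the test database:

import java.util.Properties
import org.apache.spark.sql.functions.min

val df1 = spark.read.jdbc(url, "TEST.INTTYPES", new Properties())
  .groupBy("a").agg(min("b"))
val df2 = spark.read.jdbc(url, "TEST.INTTYPES", new Properties())
  .groupBy("a").agg(min("c"))

// If the two scans canonicalize to the same plan even though they prune
// different columns, df2's aggregate reuses df1's exchange and the union
// wrongly returns min(b) twice instead of min(b) and min(c).
df1.union(df2).show()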
Yea, that's why I propose to add requiredSchema, like what FileSourceScanExec does. But I'm not sure how hard it is.
@cloud-fan I added requiredSchema, could you please take a look to see if that's what you want?
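One assumed benefit of carrying requiredSchema on the node: the ReadSchema string shown in the plans above can be derived directly from the pruned schema, for example:

import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Illustrative only: a pruned schema for columns A and B.
val requiredSchema = StructType(Seq(
  StructField("A", IntegerType),
  StructField("B", IntegerType)))

// Prints struct<A:int,B:int>, matching the ReadSchema in the EXPLAIN output.
println(requiredSchema.catalogString)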
Test build #127416 has finished for PR 29415 at commit
thanks, merging to master!
Thanks! @cloud-fan @viirya
What changes were proposed in this pull request?
Remove fullOutput from RowDataSourceScanExec.
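A paraphrased sketch of the node's parameter shapes before and after this change (not the exact Spark signatures):

import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.types.StructType

// Before: the unpruned output plus the indices of the columns kept after
// pruning (shape paraphrased for illustration).
case class BeforeShape(fullOutput: Seq[Attribute], requiredColumnsIndex: Seq[Int]) {
  def output: Seq[Attribute] = requiredColumnsIndex.map(fullOutput)
}

// After: only the pruned output, plus the pruned schema so that column names
// survive for display purposes such as ReadSchema.
case class AfterShape(output: Seq[Attribute], requiredSchema: StructType)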
Why are the changes needed?
RowDataSourceScanExec requires the full output instead of the scan output after column pruning. However, in the v2 code path we don't have the full output anymore, so we just pass the pruned output. RowDataSourceScanExec.fullOutput is actually meaningless, so we should remove it.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests