
[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions #27530

Closed

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

This is a small follow-up for #27400. This PR makes an empty LocalTableScanExec return an RDD without partitions.

Why are the changes needed?

It is a bit unexpected that the RDD contains partitions when there is no work to do. It can also save a bit of work when this scan is used in a more complex plan.
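To make the overhead concrete: an RDD built with `parallelize` always has `numSlices` partitions (defaulting to `defaultParallelism`), even when the data is empty, while `emptyRDD` has none. A minimal sketch, assuming a throwaway local `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("empty-scan-demo"))

// Empty data, but the RDD still carries defaultParallelism partitions,
// each of which Spark schedules as an (empty) task.
val viaParallelize = sc.parallelize(Seq.empty[Int])
println(viaParallelize.partitions.length) // 4 under local[4]

// An EmptyRDD has no partitions, so no tasks are launched at all.
val viaEmpty = sc.emptyRDD[Int]
println(viaEmpty.partitions.length) // 0

sc.stop()
```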

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test to SparkPlanSuite.
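A hedged sketch of what such a check can look like (the exact name and setup of the test merged into `SparkPlanSuite` may differ; an empty local relation is planned as a `LocalTableScanExec`, and the partition count of the executed plan's RDD is the property under test):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("SPARK-30780-sketch").getOrCreate()

// An empty DataFrame is backed by an empty local relation.
val plan = spark.emptyDataFrame.queryExecution.executedPlan

// With this patch applied, executing the plan yields an RDD with no partitions.
assert(plan.execute().getNumPartitions == 0)

spark.stop()
```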

@maropu maropu changed the title [SPARK-30780] Empty LocalTableScan should use RDD without partitions [SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions Feb 11, 2020

SparkQA commented Feb 11, 2020

Test build #118179 has finished for PR 27530 at commit 5d5fd4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Review of the change to `rdd` in `LocalTableScanExec`:

```scala
// Before:
private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

// After:
@transient private lazy val rdd: RDD[InternalRow] = {
  if (rows.isEmpty) {
    sqlContext.sparkContext.emptyRDD
  } else {
    val numSlices = math.min(unsafeRows.length, sqlContext.sparkContext.defaultParallelism)
    sqlContext.sparkContext.parallelize(unsafeRows, numSlices)
  }
}
```

Member (on `rows.isEmpty`): Maybe `unsafeRows.isEmpty`? Otherwise I have to look at the difference between `unsafeRows` and `rows`.

Contributor (Author): This way we avoid materializing the `unsafeRows` lazy val.

Member (on the `parallelize` call): Just in case, does it make sense to put this code (handling empty rows) inside of `parallelize`?

Contributor: `parallelize` needs to respect the `numSlices` parameter, even if the data is empty.
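The last point can be checked directly with plain RDD calls: `numSlices` is part of `parallelize`'s contract, so empty input still yields that many empty partitions, which is why the empty-row special case belongs in `LocalTableScanExec` rather than in `parallelize` itself. A minimal sketch, assuming a throwaway local `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("numSlices-demo"))

// parallelize honors the requested slice count even when the data is empty,
// so downstream stages would still schedule one (empty) task per slice.
val r = sc.parallelize(Seq.empty[Int], numSlices = 4)
println(r.partitions.length) // 4

sc.stop()
```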


SparkQA commented Feb 12, 2020

Test build #118260 has finished for PR 27530 at commit 6d46dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member)

Merged to master and branch-3.0 to be consistent with #27400.

HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020
[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions

### What changes were proposed in this pull request?
This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.

### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions when there is no work to do. It can also save a bit of work when this scan is used in a more complex plan.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added test to `SparkPlanSuite`.

Closes #27530 from hvanhovell/SPARK-30780.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b25359c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020