
[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions #27530

Closed

Conversation

hvanhovell (Contributor)

What changes were proposed in this pull request?

This is a small follow-up for #27400. This PR makes an empty LocalTableScanExec return an RDD without partitions.

Why are the changes needed?

It is a bit unexpected that the RDD contains partitions when there is no work to do. It can also save a bit of work when this scan is used in a more complex plan.
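To make the overhead concrete: an RDD built with `parallelize` always has `numSlices` partitions (defaulting to `defaultParallelism`), even when the data is empty, while `emptyRDD` has none. A minimal sketch, assuming a throwaway local `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("empty-scan-demo"))

// Empty data, but the RDD still carries defaultParallelism partitions,
// each of which Spark schedules as an (empty) task.
val viaParallelize = sc.parallelize(Seq.empty[Int])
println(viaParallelize.partitions.length) // 4 under local[4]

// An EmptyRDD has no partitions, so no tasks are launched at all.
val viaEmpty = sc.emptyRDD[Int]
println(viaEmpty.partitions.length) // 0

sc.stop()
```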

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added test to SparkPlanSuite.
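A hedged sketch of what such a check can look like (the exact name and setup of the test merged into `SparkPlanSuite` may differ; an empty local relation is planned as a `LocalTableScanExec`, and the partition count of the executed plan's RDD is the property under test):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("SPARK-30780-sketch").getOrCreate()

// An empty DataFrame is backed by an empty local relation.
val plan = spark.emptyDataFrame.queryExecution.executedPlan

// With this patch applied, executing the plan yields an RDD with no partitions.
assert(plan.execute().getNumPartitions == 0)

spark.stop()
```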

@maropu maropu changed the title [SPARK-30780] Empty LocalTableScan should use RDD without partitions [SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions Feb 11, 2020

SparkQA commented Feb 11, 2020

Test build #118179 has finished for PR 27530 at commit 5d5fd4f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


Review of the change to `rdd` in `LocalTableScanExec`:

```scala
// Before:
private lazy val rdd = sqlContext.sparkContext.parallelize(unsafeRows, numParallelism)

// After:
@transient private lazy val rdd: RDD[InternalRow] = {
  if (rows.isEmpty) {
    sqlContext.sparkContext.emptyRDD
  } else {
    val numSlices = math.min(unsafeRows.length, sqlContext.sparkContext.defaultParallelism)
    sqlContext.sparkContext.parallelize(unsafeRows, numSlices)
  }
}
```

Member (on `rows.isEmpty`): Maybe `unsafeRows.isEmpty`? Otherwise I have to look at the difference between `unsafeRows` and `rows`.

Contributor (Author): This way we avoid materializing the `unsafeRows` lazy val.

Member (on the `parallelize` call): Just in case, does it make sense to put this code (handling empty rows) inside of `parallelize`?

Contributor: `parallelize` needs to respect the `numSlices` parameter, even if the data is empty.
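The last point can be checked directly with plain RDD calls: `numSlices` is part of `parallelize`'s contract, so empty input still yields that many empty partitions, which is why the empty-row special case belongs in `LocalTableScanExec` rather than in `parallelize` itself. A minimal sketch, assuming a throwaway local `SparkContext`:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[4]").setAppName("numSlices-demo"))

// parallelize honors the requested slice count even when the data is empty,
// so downstream stages would still schedule one (empty) task per slice.
val r = sc.parallelize(Seq.empty[Int], numSlices = 4)
println(r.partitions.length) // 4

sc.stop()
```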


SparkQA commented Feb 12, 2020

Test build #118260 has finished for PR 27530 at commit 6d46dec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member)

Merged to master and branch-3.0 to be consistent with #27400.

HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020
[SPARK-30780][SQL] Empty LocalTableScan should use RDD without partitions

### What changes were proposed in this pull request?
This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.

### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions when there is no work to do. It can also save a bit of work when this scan is used in a more complex plan.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added test to `SparkPlanSuite`.

Closes #27530 from hvanhovell/SPARK-30780.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit b25359c)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020