
[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation #27400

Closed
wants to merge 1 commit

Conversation

hvanhovell
Contributor

What changes were proposed in this pull request?

This PR makes SparkSession.emptyDataFrame use an empty local relation instead of an empty RDD. This allows the optimizer to recognize it as an empty relation, which creates the opportunity for more aggressive optimizations.
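
As a rough, hedged illustration (not part of the PR diff; the object name is made up for the example), a query built on emptyDataFrame can now be collapsed by the optimizer because the underlying relation is visibly empty:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Minimal sketch, assuming a local Spark build that includes this change.
object EmptyDataFrameOptimizationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("empty-df-demo").getOrCreate()
    // Build a trivial query on top of the empty DataFrame.
    val df = spark.emptyDataFrame.withColumn("x", lit(1))
    // Backed by a LocalRelation, the optimizer can see the relation is empty;
    // the old RDD-backed version only exposed an opaque scan over an empty RDD.
    println(df.queryExecution.optimizedPlan)
    spark.stop()
  }
}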

Why are the changes needed?

It allows the optimizer to handle empty DataFrames better.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test case to DataFrameSuite.
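
The sketch below approximates such a test; the suite name and assertions are assumptions rather than the actual DataFrameSuite addition, and it relies on Spark's internal test helpers (QueryTest, SharedSparkSession), so it only builds inside the Spark repository:

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite; the real test case was added to DataFrameSuite.
class EmptyDataFrameSketchSuite extends QueryTest with SharedSparkSession {
  test("emptyDataFrame is backed by a LocalRelation") {
    // The logical plan should now be a LocalRelation rather than an RDD scan.
    assert(spark.emptyDataFrame.queryExecution.logical.isInstanceOf[LocalRelation])
    // It should still behave as an empty, zero-column DataFrame.
    assert(spark.emptyDataFrame.schema.isEmpty)
    checkAnswer(spark.emptyDataFrame, Nil)
  }
}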

@hvanhovell
Contributor Author

cc @HyukjinKwon can you take a look?

@cloud-fan
Contributor

makes sense, +1

@SparkQA

SparkQA commented Jan 30, 2020

Test build #117560 has finished for PR 27400 at commit 4bab400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-lazy val emptyDataFrame: DataFrame = {
-  createDataFrame(sparkContext.emptyRDD[Row].setName("empty"), StructType(Nil))
-}
+lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
Member

Suggested change:
-lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
+lazy val emptyDataFrame: DataFrame = emptyDataset(RowEncoder(new StructType()))

?

Contributor Author

Why would that be better? This is basically re-implementing Dataset.ofRows.

Member

This is arguable, for sure. DataFrame is Dataset[Row], and we already have a function that can construct an empty Dataset[T], so why not just reuse it? emptyDataFrame is a kind of specialization of emptyDataset.

Member
@HyukjinKwon Jan 31, 2020

If we take this suggestion, maybe: emptyDataset(RowEncoder(StructType(Nil))).

Contributor

Dataset.ofRows(self, LocalRelation()) looks simpler as I don't need to jump into another method when reading the code.
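
For reference, a hedged side-by-side of the two alternatives discussed in this thread. Both Dataset.ofRows and LocalRelation are internal (private[sql]) APIs, so this sketch only compiles inside Spark's own org.apache.spark.sql package; the object name is made up:

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.types.StructType

object EmptyDataFrameAlternatives {
  // The approach taken by this PR: build the DataFrame directly on an empty LocalRelation.
  def viaOfRows(spark: SparkSession): DataFrame =
    Dataset.ofRows(spark, LocalRelation())

  // The suggested alternative: reuse emptyDataset with a Row encoder for an empty schema.
  def viaEmptyDataset(spark: SparkSession): DataFrame =
    spark.emptyDataset(RowEncoder(StructType(Nil)))
}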

Member
@HyukjinKwon left a comment

+1 from me too. I think it makes sense.

Member
@viirya left a comment

+1

@cloud-fan
Contributor

thanks, merging to master!

@HyukjinKwon
Member

Let me merge instead. Seems to be hitting a network issue (?).

HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020
…ions

### What changes were proposed in this pull request?
This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.

### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions if there is no work to do. It can also save a bit of work when this is used in a more complex plan.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added test to `SparkPlanSuite`.

Closes #27530 from hvanhovell/SPARK-30780.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
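
A hedged illustration of the follow-up's effect (the object name is made up, and the expected value assumes a build that includes SPARK-30780; the real check lives in SparkPlanSuite):

import org.apache.spark.sql.SparkSession

object EmptyLocalTableScanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("empty-scan-demo").getOrCreate()
    // emptyDataFrame now plans to an empty LocalTableScanExec, whose RDD
    // should have zero partitions after this follow-up.
    val numPartitions = spark.emptyDataFrame.rdd.getNumPartitions
    println(s"partitions of emptyDataFrame RDD: $numPartitions") // expected: 0
    spark.stop()
  }
}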
HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020 (cherry picked from commit b25359c)

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020