
[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation #27400

Closed
wants to merge 1 commit

Conversation

hvanhovell
Contributor

What changes were proposed in this pull request?

This PR makes SparkSession.emptyDataFrame use an empty local relation instead of an empty RDD. This allows the optimizer to recognize it as an empty relation, which creates the opportunity for more aggressive optimizations.
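
As a rough, hedged illustration (not part of the PR diff; the object name is made up for the example), a query built on emptyDataFrame can now be collapsed by the optimizer because the underlying relation is visibly empty:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Minimal sketch, assuming a local Spark build that includes this change.
object EmptyDataFrameOptimizationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("empty-df-demo").getOrCreate()
    // Build a trivial query on top of the empty DataFrame.
    val df = spark.emptyDataFrame.withColumn("x", lit(1))
    // Backed by a LocalRelation, the optimizer can see the relation is empty;
    // the old RDD-backed version only exposed an opaque scan over an empty RDD.
    println(df.queryExecution.optimizedPlan)
    spark.stop()
  }
}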

Why are the changes needed?

It allows the optimizer to handle empty DataFrames better.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test case to DataFrameSuite.
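
The sketch below approximates such a test; the suite name and assertions are assumptions rather than the actual DataFrameSuite addition, and it relies on Spark's internal test helpers (QueryTest, SharedSparkSession), so it only builds inside the Spark repository:

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical suite; the real test case was added to DataFrameSuite.
class EmptyDataFrameSketchSuite extends QueryTest with SharedSparkSession {
  test("emptyDataFrame is backed by a LocalRelation") {
    // The logical plan should now be a LocalRelation rather than an RDD scan.
    assert(spark.emptyDataFrame.queryExecution.logical.isInstanceOf[LocalRelation])
    // It should still behave as an empty, zero-column DataFrame.
    assert(spark.emptyDataFrame.schema.isEmpty)
    checkAnswer(spark.emptyDataFrame, Nil)
  }
}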

@hvanhovell
Contributor Author

cc @HyukjinKwon can you take a look?

@cloud-fan
Contributor

makes sense, +1

@SparkQA

SparkQA commented Jan 30, 2020

Test build #117560 has finished for PR 27400 at commit 4bab400.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

-lazy val emptyDataFrame: DataFrame = {
-  createDataFrame(sparkContext.emptyRDD[Row].setName("empty"), StructType(Nil))
-}
+lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
Member

Suggested change:
-lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
+lazy val emptyDataFrame: DataFrame = emptyDataset(RowEncoder(new StructType()))

?

Contributor Author

Why would that be better? This is basically re-implementing Dataset.ofRows.

Member

This is arguable, for sure. DataFrame is Dataset[Row], and we already have a function that can construct an empty Dataset[T], so why not just reuse it? emptyDataFrame is a kind of specialization of emptyDataset.

Member
@HyukjinKwon Jan 31, 2020

If we take this suggestion, maybe: emptyDataset(RowEncoder(StructType(Nil))).

Contributor

Dataset.ofRows(self, LocalRelation()) looks simpler as I don't need to jump into another method when reading the code.
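
For reference, a hedged side-by-side of the two alternatives discussed in this thread. Both Dataset.ofRows and LocalRelation are internal (private[sql]) APIs, so this sketch only compiles inside Spark's own org.apache.spark.sql package; the object name is made up:

package org.apache.spark.sql

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation
import org.apache.spark.sql.types.StructType

object EmptyDataFrameAlternatives {
  // The approach taken by this PR: build the DataFrame directly on an empty LocalRelation.
  def viaOfRows(spark: SparkSession): DataFrame =
    Dataset.ofRows(spark, LocalRelation())

  // The suggested alternative: reuse emptyDataset with a Row encoder for an empty schema.
  def viaEmptyDataset(spark: SparkSession): DataFrame =
    spark.emptyDataset(RowEncoder(StructType(Nil)))
}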

Member
@HyukjinKwon left a comment

+1 from me too. I think it makes sense.

Member
@viirya left a comment

+1

@cloud-fan
Contributor

thanks, merging to master!

@HyukjinKwon
Member

Let me merge instead. Seems to be hitting a network issue (?).

HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020
…ions

### What changes were proposed in this pull request?
This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions.

### Why are the changes needed?
It is a bit unexpected that the RDD contains partitions if there is no work to do. It can also save a bit of work when this is used in a more complex plan.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added test to `SparkPlanSuite`.

Closes #27530 from hvanhovell/SPARK-30780.

Authored-by: herman <herman@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
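
A hedged illustration of the follow-up's effect (the object name is made up, and the expected value assumes a build that includes SPARK-30780; the real check lives in SparkPlanSuite):

import org.apache.spark.sql.SparkSession

object EmptyLocalTableScanDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("empty-scan-demo").getOrCreate()
    // emptyDataFrame now plans to an empty LocalTableScanExec, whose RDD
    // should have zero partitions after this follow-up.
    val numPartitions = spark.emptyDataFrame.rdd.getNumPartitions
    println(s"partitions of emptyDataFrame RDD: $numPartitions") // expected: 0
    spark.stop()
  }
}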
HyukjinKwon pushed a commit that referenced this pull request Feb 12, 2020 (cherry picked from commit b25359c)

sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020