-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation #27400
Conversation
cc @HyukjinKwon can you take a look? |
makes sense, +1 |
Test build #117560 has finished for PR 27400 at commit
|
lazy val emptyDataFrame: DataFrame = { | ||
createDataFrame(sparkContext.emptyRDD[Row].setName("empty"), StructType(Nil)) | ||
} | ||
lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation()) | |
lazy val emptyDataFrame: DataFrame = emptyDataset(RowEncoder(new StructType())) |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would that be better? This is basically re-implementing Dataset.ofRows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is arguable for sure. DataFrame is Dataset[Row]. We already have a function which can construct empty Dataset[T], why not just re-use it. emptyDataFrame
is some kind of specialization of emptyDataset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we take this suggestion, maybe: emptyDataset(RowEncoder(StructType(Nil)))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dataset.ofRows(self, LocalRelation())
looks simpler as I don't need to jump into another method when reading the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 from me too. I think it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
thanks, merging to master! |
Let me merge instead. Seems hitting a network issue (?). |
…ions ### What changes were proposed in this pull request? This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions. ### Why are the changes needed? It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to `SparkPlanSuite`. Closes #27530 from hvanhovell/SPARK-30780. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ions ### What changes were proposed in this pull request? This is a small follow-up for #27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions. ### Why are the changes needed? It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to `SparkPlanSuite`. Closes #27530 from hvanhovell/SPARK-30780. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> (cherry picked from commit b25359c) Signed-off-by: HyukjinKwon <gurwls223@apache.org>
…ions ### What changes were proposed in this pull request? This is a small follow-up for apache#27400. This PR makes an empty `LocalTableScanExec` return an `RDD` without partitions. ### Why are the changes needed? It is a bit unexpected that the RDD contains partitions if there is not work to do. It also can save a bit of work when this is used in a more complex plan. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added test to `SparkPlanSuite`. Closes apache#27530 from hvanhovell/SPARK-30780. Authored-by: herman <herman@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR makes
SparkSession.emptyDataFrame
use an empty local relation instead of an empty RDD. This allows to optimizer to recognize this as an empty relation, and creates the opportunity to do some more aggressive optimizations.Why are the changes needed?
It allows us to optimize empty dataframes better.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test case to
DataFrameSuite
.