[SPARK-30671][SQL] emptyDataFrame should use a LocalRelation #27400

Closed · wants to merge 1 commit
@@ -274,9 +274,7 @@ class SparkSession private(
* @since 2.0.0
*/
@transient
-  lazy val emptyDataFrame: DataFrame = {
-    createDataFrame(sparkContext.emptyRDD[Row].setName("empty"), StructType(Nil))
-  }
+  lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
Member:

Suggested change:
-  lazy val emptyDataFrame: DataFrame = Dataset.ofRows(self, LocalRelation())
+  lazy val emptyDataFrame: DataFrame = emptyDataset(RowEncoder(new StructType()))

?

Contributor Author:

Why would that be better? This is basically re-implementing Dataset.ofRows.

Member:

This is arguable, for sure. DataFrame is Dataset[Row], and we already have a function that can construct an empty Dataset[T], so why not just reuse it? emptyDataFrame is a specialization of emptyDataset.

Member (@HyukjinKwon, Jan 31, 2020):

If we take this suggestion, maybe: emptyDataset(RowEncoder(StructType(Nil))).

Contributor:

Dataset.ofRows(self, LocalRelation()) looks simpler, as I don't need to jump into another method when reading the code.


  /**
   * Creates a new [[Dataset]] of type T containing zero elements.
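For context, a sketch of what the change preserves: both the old RDD-backed definition and the new LocalRelation-backed one yield a DataFrame with an empty schema and zero rows. This is illustrative only and assumes a local Spark 3.x installation; the object name, master URL, and `check()` helper below are not part of the patch.

```scala
// Sketch only; assumes Spark 3.x on the classpath. Names here are illustrative.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

object EmptyDataFrameSketch {
  // Returns true when the old-style and new-style empty DataFrames agree
  // on schema (no columns) and contents (no rows).
  def check(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("empty-df").getOrCreate()
    try {
      // The old definition, written out by hand: an RDD-backed empty DataFrame.
      val rddBacked =
        spark.createDataFrame(spark.sparkContext.emptyRDD[Row], StructType(Nil))
      // What spark.emptyDataFrame returns after this patch: backed by a
      // LocalRelation with no output and no rows.
      val localBacked = spark.emptyDataFrame
      rddBacked.schema == localBacked.schema &&
        rddBacked.count() == 0 && localBacked.count() == 0
    } finally spark.stop()
  }
}
```

The difference is not in the result but in the logical plan: an RDD-backed relation has unknown contents at planning time, while a LocalRelation carries its (empty) rows in the plan itself, so the optimizer can reason about it.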
@@ -33,7 +33,7 @@ import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.catalyst.expressions.Uuid
import org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation
-import org.apache.spark.sql.catalyst.plans.logical.{OneRowRelation, Union}
+import org.apache.spark.sql.catalyst.plans.logical.{LocalRelation, OneRowRelation, Union}
import org.apache.spark.sql.execution.{FilterExec, QueryExecution, WholeStageCodegenExec}
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.aggregate.HashAggregateExec
@@ -2287,6 +2287,17 @@ class DataFrameSuite extends QueryTest
}
assert(err.getMessage.contains("cannot resolve '`d`'"))
}

+  test("emptyDataFrame should be foldable") {
+    val emptyDf = spark.emptyDataFrame.withColumn("id", lit(1L))
+    val joined = spark.range(10).join(emptyDf, "id")
+    joined.queryExecution.optimizedPlan match {
+      case LocalRelation(Seq(id), Nil, _) =>
+        assert(id.name == "id")
+      case _ =>
+        fail("emptyDataFrame should be foldable")
+    }
+  }
}

case class GroupByKey(a: Int, b: Int)
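The payoff of a LocalRelation-backed emptyDataFrame is optimizer visibility: because the relation's rows are known at planning time, the optimizer can fold operators over it, so a join against an empty DataFrame can collapse to an empty LocalRelation — which is exactly what the added test asserts. Below is a user-side sketch of the same check, assuming Spark 3.x with this patch applied; the object name and `joinIsFolded()` helper are illustrative, not part of the PR.

```scala
// Sketch only; assumes Spark 3.x with this patch on the classpath.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.catalyst.plans.logical.LocalRelation

object FoldableEmptySketch {
  // Returns true when the optimizer collapsed the whole join into a
  // LocalRelation, i.e. it could see at planning time that the
  // emptyDataFrame side contributes no rows.
  def joinIsFolded(): Boolean = {
    val spark = SparkSession.builder()
      .master("local[1]").appName("foldable-empty").getOrCreate()
    try {
      val emptyDf = spark.emptyDataFrame.withColumn("id", lit(1L))
      val joined = spark.range(10).join(emptyDf, "id")
      joined.queryExecution.optimizedPlan.isInstanceOf[LocalRelation]
    } finally spark.stop()
  }
}
```

With the old RDD-backed definition this folding could not happen, since the optimizer had no way to prove the RDD was empty without running it.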