[SPARK-23553][TESTS] Tests should not assume the default value of spark.sql.sources.default
#20705
Conversation
@@ -526,7 +526,7 @@ object SQLConf {
   val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
     .doc("The default data source to use in input/output.")
     .stringConf
-    .createWithDefault("parquet")
+    .createWithDefault("orc")
This is for testing purposes during reviews.
Can you change it back?
Yep. It's back now, @gatorsmile.
@@ -739,15 +739,15 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
       withTempPath { dir =>
         df.write.format("parquet").partitionBy(partitionColumns.map(_.name): _*).save(dir.toString)
         val fields = schema.map(f => Column(f.name).cast(f.dataType))
-        checkAnswer(spark.read.load(dir.toString).select(fields: _*), row)
+        checkAnswer(spark.read.parquet(dir.toString).select(fields: _*), row)
Since this is `ParquetPartitionDiscoverySuite`, `parquet` is more appropriate than `load`.
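To make the distinction concrete, here is a minimal illustrative sketch (spark-shell style; the path is hypothetical) of how the generic and format-specific readers differ:

```scala
// Generic reader: the format comes from spark.sql.sources.default.
val df1 = spark.read.load("/tmp/data")
// Format-specific reader: always parquet, regardless of the default.
val df2 = spark.read.parquet("/tmp/data")
// Equivalent explicit form of df2.
val df3 = spark.read.format("parquet").load("/tmp/data")
```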
       // This testcase verifies that setting `hive.default.fileformat` has no impact on
       // the target table's fileformat in case of CTAS.
       assert(sessionState.conf.defaultDataSourceName === "parquet")
-      checkRelation(tableName = table, isDataSourceTable = true, format = "parquet")
+      checkRelation(tableName = table, isDataSourceTable = true, format = dataSourceFormat)
     }
   }
Previously, `spark.sql.sources.default=orc` with `hive.default.fileformat=textfile` was not tested properly.
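A rough sketch of the scenario now covered, assuming the suite mixes in `SQLTestUtils` (for `withSQLConf`) and that CTAS conversion is enabled via `spark.sql.hive.convertCTAS`, as in the surrounding Hive suite:

```scala
// With CTAS conversion on, the created table is a data source table whose
// format follows spark.sql.sources.default (e.g. orc), not hive.default.fileformat.
withSQLConf("spark.sql.hive.convertCTAS" -> "true",
            "hive.default.fileformat" -> "textfile") {
  sql("CREATE TABLE ctas_tbl AS SELECT 1 AS a")
  // checkRelation(..., format = sessionState.conf.defaultDataSourceName)
}
```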
Test build #87849 has finished for PR 20705 at commit
Test build #87851 has finished for PR 20705 at commit
->>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
-...     opt2=1, opt3='str')
+>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
+...     opt1=True, opt2=1, opt3='str')
Unlike the other changes, this one differs slightly from the original semantics. As an alternative approach, we can add the following if we need to keep the original `spark.read.load`:

spark.conf.set("spark.sql.sources.default", "parquet")
Test build #87861 has finished for PR 20705 at commit
@dongjoon-hyun, I agree with the idea in general, but just to be clear: do you target to improve the test coverage for both when `spark.sql.sources.default` is `parquet` and when it is `orc`? If not, how about we set `spark.sql.sources.default` explicitly in the tests that depend on it?
@HyukjinKwon. Yep. Actually, I had a plan for some suitable test suites. So, in general, I generalized the test cases instead of pinning a format. Your idea also sounds good. I'll try to do that for the format-specific suites.
Could you change the default value to `json`? Can all the tests pass?
Sure, @gatorsmile. So far, I didn't update the PR according to @HyukjinKwon's suggestion. I'll try `json` first.
Test build #87916 has finished for PR 20705 at commit
@gatorsmile and @HyukjinKwon.

scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").save("/tmp/json")

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").saveAsTable("map")
18/03/02 21:13:49 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `default`.`map` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> spark.table("map").printSchema
root
 |-- a: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)

scala> spark.table("map").show
18/03/02 21:14:12 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)

For JSON format, could you confirm this, @HyukjinKwon?
Since we verified the JSON result, I'll update the PR to address @HyukjinKwon's comment.
python/pyspark/sql/readwriter.py
@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
         :param options: all other string options

+        >>> spark.conf.set("spark.sql.sources.default", "parquet")
         >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
The built-in test data is `parquet`.
@@ -57,6 +57,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
   val timeZone = TimeZone.getDefault()
   val timeZoneId = timeZone.getID

+  protected override def beforeAll(): Unit = {
+    super.beforeAll()
+    spark.conf.set(SQLConf.DEFAULT_DATA_SOURCE_NAME.key, "parquet")
Since this is `ParquetPartitionDiscoverySuite`, the test cases' assumption is legitimate.
Test build #87925 has finished for PR 20705 at commit
Retest this please.
Test build #87926 has finished for PR 20705 at commit
Retest this please.
Test build #87928 has finished for PR 20705 at commit
For #20705 (comment), yup. JSON uses string for keys in MapType.
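A minimal spark-shell sketch of that behavior (path hypothetical): JSON object keys are strings on disk, so an integer-keyed map does not round-trip through the JSON source.

```scala
import spark.implicits._

// Write a map with an integer key; JSON serializes the key as a field name.
Seq(Map(1 -> 2)).toDF("m").write.mode("overwrite").json("/tmp/map_json")

// Schema inference reads it back as a struct with a *string* field name "1".
spark.read.json("/tmp/map_json").printSchema()
// root
//  |-- m: struct (nullable = true)
//  |    |-- 1: long (nullable = true)
```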
How about we explicitly set `spark.sql.sources.default` to `parquet` for both test cases in #20705 (comment) too?
python/pyspark/sql/readwriter.py
@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
         :param options: all other string options

+        >>> spark.conf.set("spark.sql.sources.default", "parquet")
Can we just call `format('parquet')` like the doctest for JSON below?
Yep. That was my first commit here. I'll roll this back.
@@ -2150,7 +2150,8 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

   test("data source table created in InMemoryCatalog should be able to read/write") {
     withTable("tbl") {
-      sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
+      val provider = spark.sessionState.conf.defaultDataSourceName
Hm... how about just explicitly setting `spark.sql.sources.default` to `parquet` in all places rather than using the default? If it's set to, for example, `text`, this test fails. I thought it's a bit odd that a test depends on a default value.
This is `SQLQuerySuite`, and the test case is correctly testing its purpose. Every data source has its own capabilities and limitations. Your example is only the `text` data source's limitation of supporting a single-column schema, isn't it? The other formats, `csv`/`json`/`orc`/`parquet`, will pass this specific test.
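For reference, a small spark-shell sketch of that `text` limitation (paths hypothetical):

```scala
// OK: the text source accepts exactly one string column.
spark.range(3).selectExpr("CAST(id AS STRING) AS value")
  .write.format("text").mode("overwrite").save("/tmp/one_col")

// Fails: more than one column raises an AnalysisException
// ("Text data source supports only a single column ...").
// spark.range(3).selectExpr("id", "CAST(id AS STRING) AS value")
//   .write.format("text").save("/tmp/two_cols")
```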
So far, the purpose of this PR is to set the default once in `SQLConf.scala` in order to test a new data source and find its limitations, instead of touching every test suite. BTW, `spark.sql.sources.default=parquet` doesn't help this existing code because the SQL has the fixed string `USING parquet`.
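The generalization pattern applied throughout the PR, sketched (mirroring the `SQLQuerySuite` hunk above):

```scala
// Derive the provider from the session conf instead of hard-coding "parquet",
// so the test exercises whatever spark.sql.sources.default points at.
val provider = spark.sessionState.conf.defaultDataSourceName
sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")
```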
For the above two JSON failures, they are limitations of the JSON data source itself. Instead, I think we had better file two JIRA issues as JSON improvements if that is feasible.
Test build #87937 has finished for PR 20705 at commit
Retest this please.
Test build #87967 has finished for PR 20705 at commit
@@ -2476,7 +2477,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     withTempDir { dir =>
       val parquetDir = new File(dir, "parquet").getCanonicalPath
       spark.range(10).withColumn("_col", $"id").write.partitionBy("_col").save(parquetDir)
Since the data format may not be parquet, maybe the directory name should be more generic, like `dataDir`.
Thank you for the review, @bersprockets.
Test build #88103 has finished for PR 20705 at commit
@@ -591,7 +591,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }

   test("Pre insert nullability check (ArrayType)") {
-    withTable("arrayInParquet") {
+    withTable("array") {
It would be good, maybe in a future cleanup, to replace all these repeated string literals (e.g., "array" 7 times, "map" 7 times) with a variable name.
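A sketch of that cleanup (hypothetical table name and schema):

```scala
// Hoist the repeated literal into one value so renames touch a single line.
val tbl = "array"
withTable(tbl) {
  sql(s"CREATE TABLE $tbl(a ARRAY<INT>) USING json")
  // ... the repeated usages below would all reference tbl ...
}
```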
       checkAnswer(
-        sql("SELECT p.c1, c2 FROM insertParquet p"),
+        sql("SELECT p.c1, c2 FROM t p"),
         (70 to 79).map(i => Row(i, s"str$i")))
Curious about why the test named "SPARK-8156: create table to specific database by 'use dbname'" still has a hard-coded format of parquet. Is it testing functionality that is orthogonal to the format, maybe? I changed the hard-coded format to json, orc, and csv, and each time that test passed.
Similarly with:
Suite: org.apache.spark.sql.sources.SaveLoadSuite
Test: SPARK-23459: Improve error message when specified unknown column in partition columns
That is because this PR minimally changed only the test cases causing failures. We cannot generalize all test cases in one huge PR across all modules; that would make it difficult to backport other commits. The main goal of this PR is improving testability for new data sources.
For example, `SPARK-8156: create table to specific database by 'use dbname'` writes to parquet but reads with SQL, not via `read.load`, so it doesn't fail. It's not a generalized test case, but it's also not harmful.
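Sketched concretely (path hypothetical): only the generic `read.load` consults `spark.sql.sources.default`, so a test that writes with an explicit format and reads via SQL or a format-specific reader keeps passing when the default changes.

```scala
val path = "/tmp/spark_8156_demo"
spark.range(5).write.mode("overwrite").format("parquet").save(path) // explicit format
spark.read.parquet(path).count()  // fine regardless of the default
spark.read.load(path).count()     // would fail if spark.sql.sources.default != parquet
```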
Retest this please.
Test build #88167 has finished for PR 20705 at commit
Test build #88240 has finished for PR 20705 at commit
The failure is irrelevant to this PR.
Retest this please.
Test build #88245 has finished for PR 20705 at commit
LGTM
Thanks! Merged to master/2.3
Thank you, @gatorsmile, @HyukjinKwon, @bersprockets.
What changes were proposed in this pull request?

Currently, some tests have an assumption that `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but that assumption makes it difficult to test new data source formats. This PR aims to:
- make the test suites more robust and easy to use when testing new data sources in the future, and
- test the new native ORC data source with the full existing Apache Spark test coverage.

As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted.

How was this patch tested?

Pass the Jenkins with updated tests.
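As a usage sketch of how this change helps (spark-shell style; path hypothetical), a new data source can now be exercised against the generalized suites simply by flipping the default:

```scala
// Point the default at the data source under test, then run the generic paths.
spark.conf.set("spark.sql.sources.default", "orc")
spark.range(3).write.mode("overwrite").save("/tmp/default_fmt") // written as ORC
spark.read.load("/tmp/default_fmt").show()                      // read back as ORC
```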