
[SPARK-23553][TESTS] Tests should not assume the default value of spark.sql.sources.default #20705

Closed
wants to merge 7 commits into master from dongjoon-hyun:SPARK-23553

Conversation

@dongjoon-hyun (Member) commented Mar 1, 2018

What changes were proposed in this pull request?

Currently, some tests assume that spark.sql.sources.default=parquet. That assumption is correct today, but it makes it difficult to test new data source formats.

This PR aims to

  • Make the test suites more robust and make it easy to test new data sources in the future.
  • Test new native ORC data source with the full existing Apache Spark test coverage.

As an example, the PR uses spark.sql.sources.default=orc during review. The value should be reverted to parquet when this PR is accepted.
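For context on why flipping this one value is enough: unqualified DataFrameWriter.save and DataFrameReader.load calls pick up whatever spark.sql.sources.default is set to, so a single conf change retargets every such call site. A minimal sketch, assuming an active `spark` session; `path` is a hypothetical temp directory:

// Sketch only: `path` is a hypothetical temp directory.
spark.conf.set("spark.sql.sources.default", "orc")
spark.range(10).write.mode("overwrite").save(path) // writes ORC, not Parquet
val df = spark.read.load(path)                     // reads back via the same default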

How was this patch tested?

Pass the Jenkins with updated tests.

@@ -526,7 +526,7 @@ object SQLConf {
   val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
     .doc("The default data source to use in input/output.")
     .stringConf
-    .createWithDefault("parquet")
+    .createWithDefault("orc")
@dongjoon-hyun (Member Author), Mar 1, 2018:
This change is for testing purposes during review only.

Member:
Can you change it back?

Member Author:
Yep. It's back now, @gatorsmile.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23553][TESTS] Tests should not assume the default value of spark.sql.sources.default [SPARK-23553][TESTS][WIP] Tests should not assume the default value of spark.sql.sources.default Mar 1, 2018
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-23553][TESTS][WIP] Tests should not assume the default value of spark.sql.sources.default [SPARK-23553][TESTS] Tests should not assume the default value of spark.sql.sources.default Mar 1, 2018
@@ -739,15 +739,15 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
   withTempPath { dir =>
     df.write.format("parquet").partitionBy(partitionColumns.map(_.name): _*).save(dir.toString)
     val fields = schema.map(f => Column(f.name).cast(f.dataType))
-    checkAnswer(spark.read.load(dir.toString).select(fields: _*), row)
+    checkAnswer(spark.read.parquet(dir.toString).select(fields: _*), row)
Member Author:
Since this is ParquetPartitionDiscoverySuite, spark.read.parquet is more appropriate than the generic spark.read.load.

  // This testcase verifies that setting `hive.default.fileformat` has no impact on
  // the target table's fileformat in case of CTAS.
- assert(sessionState.conf.defaultDataSourceName === "parquet")
- checkRelation(tableName = table, isDataSourceTable = true, format = "parquet")
+ checkRelation(tableName = table, isDataSourceTable = true, format = dataSourceFormat)
  }
}
Member Author:
Previously, the combination spark.sql.sources.default=orc with hive.default.fileformat=textfile was not tested properly.
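The generalized check above presumably derives dataSourceFormat from the session conf rather than hard-coding parquet. A minimal sketch of that pattern, not necessarily the exact PR code; `table` comes from the surrounding test:

// Sketch: compute the expected format from the configured default instead
// of assuming it is parquet.
val dataSourceFormat = sessionState.conf.defaultDataSourceName
checkRelation(tableName = table, isDataSourceTable = true, format = dataSourceFormat)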

@SparkQA commented Mar 1, 2018

Test build #87849 has finished for PR 20705 at commit 427b6f0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 1, 2018

Test build #87851 has finished for PR 20705 at commit eb62c2f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

->>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
-...     opt2=1, opt3='str')
+>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
+...     opt1=True, opt2=1, opt3='str')
@dongjoon-hyun (Member Author), Mar 1, 2018:
Unlike the other changes, this one differs somewhat from the original semantics.
As an alternative, we can add the following if we need to keep the original spark.read.load:

spark.conf.set("spark.sql.sources.default", "parquet")

@SparkQA commented Mar 2, 2018

Test build #87861 has finished for PR 20705 at commit 5192e6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

@dongjoon-hyun, I agree with the idea in general, but just to be clear: do you aim to improve test coverage here for both cases, with spark.sql.sources.default set to orc and to parquet?

If not, how about explicitly setting spark.sql.sources.default to parquet where it's required, for now? A sketch of that pattern follows.
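A minimal sketch of this suggestion, assuming a suite that mixes in Spark's SQLTestUtils (which provides withSQLConf); `dir` and `expectedRows` are hypothetical:

// Pin the default source for one test so it no longer depends on the global
// default; withSQLConf restores the previous value when the block exits.
withSQLConf(SQLConf.DEFAULT_DATA_SOURCE_NAME.key -> "parquet") {
  checkAnswer(spark.read.load(dir.toString), expectedRows)
}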

@dongjoon-hyun (Member Author):

@HyukjinKwon, yep. Actually, I had a plan for some suitable test suites, so in general I generalized parquet to load/save.

Your idea also sounds good. I'll apply it to ParquetPartitionDiscoverySuite.scala and the Python part according to your advice. Thanks!

@gatorsmile (Member):

Could you change the default value to json? Do all the tests pass?

@dongjoon-hyun (Member Author) commented Mar 3, 2018:

Sure, @gatorsmile. I haven't updated the PR according to @HyukjinKwon's suggestion yet; I'll try the JSON test first on the code as-is.

@SparkQA commented Mar 3, 2018

Test build #87916 has finished for PR 20705 at commit 3ec9309.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

@gatorsmile and @HyukjinKwon: the two failures are due to limitations of the current JSON data source implementation. Here we can see that the test suites correctly exercise the target data source.

  1. resolveRelation for a FileFormat DataSource without userSchema scan filesystem only once
    For the JSON source, the statistic count becomes 2.
  2. Pre insert nullability check (MapType)
    Since the JSON source saves map keys as strings, a ClassCastException is raised when the given user or table schema differs.
scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").save("/tmp/json")

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").saveAsTable("map")
18/03/02 21:13:49 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `default`.`map` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> spark.table("map").printSchema
root
 |-- a: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)

scala> spark.table("map").show
18/03/02 21:14:12 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
        at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)

Could you confirm this for the JSON format, @HyukjinKwon?

@dongjoon-hyun (Member Author):

Since we've verified the JSON result, I'll update the PR to address @HyukjinKwon's comment.

@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
         or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
     :param options: all other string options

+    >>> spark.conf.set("spark.sql.sources.default", "parquet")
     >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
Member Author:
The built-in test data is parquet.

@@ -57,6 +57,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
   val timeZone = TimeZone.getDefault()
   val timeZoneId = timeZone.getID

+  protected override def beforeAll(): Unit = {
+    super.beforeAll()
+    spark.conf.set(SQLConf.DEFAULT_DATA_SOURCE_NAME.key, "parquet")
Member Author:
Since this is ParquetPartitionDiscoverySuite, the test cases' assumption of parquet is legitimate.
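The hunk above is truncated (its @@ header says ten lines were added), so presumably the suite also restores the conf on teardown. A sketch of the usual pattern, under that assumption:

// Sketch: mirror beforeAll with a matching cleanup so other suites are not
// affected by the pinned default.
protected override def afterAll(): Unit = {
  try {
    spark.conf.unset(SQLConf.DEFAULT_DATA_SOURCE_NAME.key)
  } finally {
    super.afterAll()
  }
}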

@SparkQA commented Mar 3, 2018

Test build #87925 has finished for PR 20705 at commit 144460d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Mar 3, 2018

Test build #87926 has finished for PR 20705 at commit 144460d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Mar 3, 2018

Test build #87928 has finished for PR 20705 at commit 144460d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member):

For #20705 (comment), yup. JSON uses strings for keys in MapType.

@HyukjinKwon (Member) left a comment:
How about we explicitly set spark.sql.sources.default to parquet for both test cases in #20705 (comment) too?

@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
         or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
     :param options: all other string options

+    >>> spark.conf.set("spark.sql.sources.default", "parquet")
Member:
Can we just call format('parquet') like the doctest for JSON below?

Member Author:
Yep. That was my first commit here. I'll roll this back.

@@ -2150,7 +2150,8 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

   test("data source table created in InMemoryCatalog should be able to read/write") {
     withTable("tbl") {
-      sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
+      val provider = spark.sessionState.conf.defaultDataSourceName
@HyukjinKwon (Member), Mar 3, 2018:
Hm, how about just explicitly setting spark.sql.sources.default to parquet in all places rather than relying on the default? If it's set to, for example, text, this test fails. It seems a bit odd for a test to depend on a default value.

Member Author:
This is SQLQuerySuite, and the test case correctly tests its purpose. Every data source has its own capabilities and limitations. Your example is just the text data source's limitation of supporting only a single-column schema, isn't it? The other sources (csv, json, orc, parquet) will pass this specific test.

Member Author:
So far, the purpose of this PR is to set the value once in SQLConf.scala in order to test a new data source and find its limitations, instead of touching every test suite.

BTW, spark.sql.sources.default=parquet doesn't help this existing code, because the SQL statement contains the fixed string USING parquet. A sketch of the generalized form follows.
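A sketch of the generalized form the diff above moves toward, assuming the test body continues with an interpolated statement:

// Sketch: build the statement from the configured default instead of the
// fixed `USING parquet` literal.
val provider = spark.sessionState.conf.defaultDataSourceName
sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")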

@dongjoon-hyun (Member Author):

The two JSON failures above are in MetastoreDataSourcesSuite and PartitionedTablePerfStatsSuite. Those suites are not designed for parquet; they are general test cases. If we explicitly set the default to parquet, we would have to change it again and again whenever we test a new data source.

> How about we explicitly set spark.sql.sources.default to parquet for both test cases in #20705 (comment) too?

Instead, I think we had better file two JIRA issues as JSON improvements, if that is feasible.

@SparkQA commented Mar 3, 2018

Test build #87937 has finished for PR 20705 at commit d9d2564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Mar 5, 2018

Test build #87967 has finished for PR 20705 at commit d9d2564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -2476,7 +2477,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
withTempDir { dir =>
val parquetDir = new File(dir, "parquet").getCanonicalPath
spark.range(10).withColumn("_col", $"id").write.partitionBy("_col").save(parquetDir)
Contributor:
Since the data format may not be parquet, maybe the directory name should be more generic, like dataDir.
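A sketch of that rename (hedged; the actual change in the PR may differ, and java.io.File is assumed imported as elsewhere in the suite):

// Sketch: a format-neutral directory name for the generalized test.
withTempDir { dir =>
  val dataDir = new File(dir, "data").getCanonicalPath
  spark.range(10).withColumn("_col", $"id").write.partitionBy("_col").save(dataDir)
}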

Member Author:
Thank you for the review, @bersprockets.

@SparkQA commented Mar 9, 2018

Test build #88103 has finished for PR 20705 at commit 159489c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -591,7 +591,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }

   test("Pre insert nullability check (ArrayType)") {
-    withTable("arrayInParquet") {
+    withTable("array") {
Contributor:
It would be good, maybe in a future cleanup, to replace all these repeated string literals (e.g., "array" 7 times, "map" 7 times) with a variable.

   checkAnswer(
-    sql("SELECT p.c1, c2 FROM insertParquet p"),
+    sql("SELECT p.c1, c2 FROM t p"),
     (70 to 79).map(i => Row(i, s"str$i")))
Contributor:
Curious why the test named "SPARK-8156: create table to specific database by 'use dbname'" still has a hard-coded format of parquet. Is it perhaps testing functionality that is orthogonal to the format?

I changed the hard-coded format to json, orc, and csv, and each time the test passed.

Similarly with
Suite: org.apache.spark.sql.sources.SaveLoadSuite
Test: SPARK-23459: Improve error message when specified unknown column in partition columns

@dongjoon-hyun (Member Author), Mar 9, 2018:
That is because this PR minimally changed only the test cases that were causing failures. We cannot generalize all test cases in one huge one-shot PR across all modules; that would make it difficult to backport other commits. The main goal of this PR is to improve testability for new data sources.

For example, although "SPARK-8156: create table to specific database by 'use dbname'" writes parquet, it reads with SQL rather than read.load, so it doesn't fail. It's not a generalized test case, but it isn't particularly harmful either.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Mar 12, 2018

Test build #88167 has finished for PR 20705 at commit 159489c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 14, 2018

Test build #88240 has finished for PR 20705 at commit 2975aff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

The failure is unrelated to this PR:

 org.apache.spark.sql.execution.streaming.RateSourceV2Suite.basic microbatch execution

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Mar 15, 2018

Test build #88245 has finished for PR 20705 at commit 2975aff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member):

LGTM

@gatorsmile (Member):

Thanks! Merged to master/2.3

asfgit pushed a commit that referenced this pull request Mar 16, 2018

[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #20705 from dongjoon-hyun/SPARK-23553.

(cherry picked from commit 5414abc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 5414abc Mar 16, 2018
@dongjoon-hyun (Member Author):
Thank you, @gatorsmile, @HyukjinKwon, @bersprockets.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-23553 branch March 16, 2018 18:22
mstewart141 pushed a commit to mstewart141/spark that referenced this pull request Mar 24, 2018

[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#20705 from dongjoon-hyun/SPARK-23553.
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018

[SPARK-23553][TESTS] Tests should not assume the default value of `spark.sql.sources.default`

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes apache#20705 from dongjoon-hyun/SPARK-23553.

(cherry picked from commit 5414abc)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>