[SPARK-23553][TESTS] Tests should not assume the default value of spark.sql.sources.default
#20705
Conversation
@@ -526,7 +526,7 @@ object SQLConf {
   val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
     .doc("The default data source to use in input/output.")
     .stringConf
-    .createWithDefault("parquet")
+    .createWithDefault("orc")
This is for testing purposes during reviews.
Can you change it back?
Yep. It's back now, @gatorsmile.
@@ -739,15 +739,15 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
       withTempPath { dir =>
         df.write.format("parquet").partitionBy(partitionColumns.map(_.name): _*).save(dir.toString)
         val fields = schema.map(f => Column(f.name).cast(f.dataType))
-        checkAnswer(spark.read.load(dir.toString).select(fields: _*), row)
+        checkAnswer(spark.read.parquet(dir.toString).select(fields: _*), row)
Since this is `ParquetPartitionDiscoverySuite`, `parquet` is more appropriate than `load`.
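To make the distinction concrete, here is a minimal illustrative sketch (spark-shell style; the path is hypothetical) of how the generic and format-specific readers differ:

```scala
// Generic reader: the format comes from spark.sql.sources.default.
val df1 = spark.read.load("/tmp/data")
// Format-specific reader: always parquet, regardless of the default.
val df2 = spark.read.parquet("/tmp/data")
// Equivalent explicit form of df2.
val df3 = spark.read.format("parquet").load("/tmp/data")
```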
       // This testcase verifies that setting `hive.default.fileformat` has no impact on
       // the target table's fileformat in case of CTAS.
       assert(sessionState.conf.defaultDataSourceName === "parquet")
-      checkRelation(tableName = table, isDataSourceTable = true, format = "parquet")
+      checkRelation(tableName = table, isDataSourceTable = true, format = dataSourceFormat)
     }
   }
Previously, `spark.sql.sources.default=orc` with `hive.default.fileformat=textfile` was not tested properly.
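A rough sketch of the scenario now covered, assuming the suite mixes in `SQLTestUtils` (for `withSQLConf`) and that CTAS conversion is enabled via `spark.sql.hive.convertCTAS`, as in the surrounding Hive suite:

```scala
// With CTAS conversion on, the created table is a data source table whose
// format follows spark.sql.sources.default (e.g. orc), not hive.default.fileformat.
withSQLConf("spark.sql.hive.convertCTAS" -> "true",
            "hive.default.fileformat" -> "textfile") {
  sql("CREATE TABLE ctas_tbl AS SELECT 1 AS a")
  // checkRelation(..., format = sessionState.conf.defaultDataSourceName)
}
```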
Test build #87849 has finished for PR 20705 at commit
Test build #87851 has finished for PR 20705 at commit
->>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
-...     opt2=1, opt3='str')
+>>> df = spark.read.format("parquet").load('python/test_support/sql/parquet_partitioned',
+...     opt1=True, opt2=1, opt3='str')
Unlike the other changes, this one differs slightly from the original semantics. As an alternative approach, we can add the following if we need to keep the original `spark.read.load`:

spark.conf.set("spark.sql.sources.default", "parquet")
Test build #87861 has finished for PR 20705 at commit
@dongjoon-hyun, I agree with the idea in general, but just to be clear: do you target to improve the test coverage for both when `spark.sql.sources.default` is `parquet` and when it is `orc`? If not, how about we set `spark.sql.sources.default` explicitly in the tests that depend on it?
@HyukjinKwon. Yep. Actually, I had a plan for some suitable test suites. So, in general, I generalized the test cases instead of pinning a format. Your idea also sounds good. I'll try to do that for the format-specific suites.
Could you change the default value to `json`? Can all the tests pass?
Sure, @gatorsmile. So far, I didn't update the PR according to @HyukjinKwon's suggestion. I'll try `json` first.
Test build #87916 has finished for PR 20705 at commit
@gatorsmile and @HyukjinKwon.

scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").save("/tmp/json")

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> (Tuple1(Map(1 -> (null: Integer))) :: Nil).toDF("a").write.mode("overwrite").saveAsTable("map")
18/03/02 21:13:49 WARN HiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider json. Persisting data source table `default`.`map` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.

scala> spark.read.json("/tmp/json").printSchema
root
 |-- a: struct (nullable = true)
 |    |-- 1: string (nullable = true)

scala> spark.table("map").printSchema
root
 |-- a: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)

scala> spark.table("map").show
18/03/02 21:14:12 ERROR Executor: Exception in task 0.0 in stage 10.0 (TID 10)
java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)

For JSON format, could you confirm this, @HyukjinKwon?
Since we verified the JSON result, I'll update the PR to address @HyukjinKwon's comment.
python/pyspark/sql/readwriter.py
@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
         :param options: all other string options

+        >>> spark.conf.set("spark.sql.sources.default", "parquet")
         >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
The built-in test data is `parquet`.
@@ -57,6 +57,16 @@ class ParquetPartitionDiscoverySuite extends QueryTest with ParquetTest with Sha
   val timeZone = TimeZone.getDefault()
   val timeZoneId = timeZone.getID

+  protected override def beforeAll(): Unit = {
+    super.beforeAll()
+    spark.conf.set(SQLConf.DEFAULT_DATA_SOURCE_NAME.key, "parquet")
Since this is `ParquetPartitionDiscoverySuite`, the test cases' assumption is legitimate.
Test build #87925 has finished for PR 20705 at commit
Retest this please.
Test build #87926 has finished for PR 20705 at commit
Retest this please.
Test build #87928 has finished for PR 20705 at commit
For #20705 (comment), yup. JSON uses string for keys in MapType.
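A minimal spark-shell sketch of that behavior (path hypothetical): JSON object keys are strings on disk, so an integer-keyed map does not round-trip through the JSON source.

```scala
import spark.implicits._

// Write a map with an integer key; JSON serializes the key as a field name.
Seq(Map(1 -> 2)).toDF("m").write.mode("overwrite").json("/tmp/map_json")

// Schema inference reads it back as a struct with a *string* field name "1".
spark.read.json("/tmp/map_json").printSchema()
// root
//  |-- m: struct (nullable = true)
//  |    |-- 1: long (nullable = true)
```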
How about we explicitly set `spark.sql.sources.default` to `parquet` for both test cases in #20705 (comment) too?
python/pyspark/sql/readwriter.py
@@ -147,6 +147,7 @@ def load(self, path=None, format=None, schema=None, **options):
             or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
         :param options: all other string options

+        >>> spark.conf.set("spark.sql.sources.default", "parquet")
Can we just call `format('parquet')` like the doctest for JSON below?
Yep. That was my first commit here. I'll roll this back.
@@ -2150,7 +2150,8 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

   test("data source table created in InMemoryCatalog should be able to read/write") {
     withTable("tbl") {
-      sql("CREATE TABLE tbl(i INT, j STRING) USING parquet")
+      val provider = spark.sessionState.conf.defaultDataSourceName
Hm... how about just explicitly setting `spark.sql.sources.default` to `parquet` in all places rather than using the default? If it's set to, for example, `text`, this test fails. I thought it's a bit odd that a test depends on a default value.
This is `SQLQuerySuite`, and the test case is correctly testing its purpose. Every data source has its own capabilities and limitations. Your example is only the `text` data source's limitation of supporting a single-column schema, isn't it? The other formats, `csv`/`json`/`orc`/`parquet`, will pass this specific test.
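For reference, a small spark-shell sketch of that `text` limitation (paths hypothetical):

```scala
// OK: the text source accepts exactly one string column.
spark.range(3).selectExpr("CAST(id AS STRING) AS value")
  .write.format("text").mode("overwrite").save("/tmp/one_col")

// Fails: more than one column raises an AnalysisException
// ("Text data source supports only a single column ...").
// spark.range(3).selectExpr("id", "CAST(id AS STRING) AS value")
//   .write.format("text").save("/tmp/two_cols")
```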
So far, the purpose of this PR is to set the default once in `SQLConf.scala` in order to test a new data source and find its limitations, instead of touching every test suite. BTW, `spark.sql.sources.default=parquet` doesn't help this existing code because the SQL has the fixed string `USING parquet`.
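The generalization pattern applied throughout the PR, sketched (mirroring the `SQLQuerySuite` hunk above):

```scala
// Derive the provider from the session conf instead of hard-coding "parquet",
// so the test exercises whatever spark.sql.sources.default points at.
val provider = spark.sessionState.conf.defaultDataSourceName
sql(s"CREATE TABLE tbl(i INT, j STRING) USING $provider")
```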
For the above two JSON failures, they are limitations of the JSON data source itself. Instead, I think we had better file two JIRA issues as JSON improvements if that is feasible.
Test build #87937 has finished for PR 20705 at commit
Retest this please.
Test build #87967 has finished for PR 20705 at commit
@@ -2476,7 +2477,7 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
     withTempDir { dir =>
       val parquetDir = new File(dir, "parquet").getCanonicalPath
       spark.range(10).withColumn("_col", $"id").write.partitionBy("_col").save(parquetDir)
Since the data format may not be parquet, maybe the directory name should be more generic, like `dataDir`.
Thank you for the review, @bersprockets.
Test build #88103 has finished for PR 20705 at commit
@@ -591,7 +591,7 @@ class MetastoreDataSourcesSuite extends QueryTest with SQLTestUtils with TestHiv
   }

   test("Pre insert nullability check (ArrayType)") {
-    withTable("arrayInParquet") {
+    withTable("array") {
It would be good, maybe in a future cleanup, to replace all these repeated string literals (e.g., "array" 7 times, "map" 7 times) with a variable name.
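A sketch of that cleanup (hypothetical table name and schema):

```scala
// Hoist the repeated literal into one value so renames touch a single line.
val tbl = "array"
withTable(tbl) {
  sql(s"CREATE TABLE $tbl(a ARRAY<INT>) USING json")
  // ... the repeated usages below would all reference tbl ...
}
```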
       checkAnswer(
-        sql("SELECT p.c1, c2 FROM insertParquet p"),
+        sql("SELECT p.c1, c2 FROM t p"),
         (70 to 79).map(i => Row(i, s"str$i")))
Curious about why the test named "SPARK-8156: create table to specific database by 'use dbname'" still has a hard-coded format of parquet. Is it testing functionality that is orthogonal to the format, maybe? I changed the hard-coded format to json, orc, and csv, and each time that test passed.
Similarly with:
Suite: org.apache.spark.sql.sources.SaveLoadSuite
Test: SPARK-23459: Improve error message when specified unknown column in partition columns
That is because this PR minimally changed only the test cases causing failures. We cannot generalize all test cases in one huge PR across all modules; that would make it difficult to backport other commits. The main goal of this PR is improving testability for new data sources.
For example, `SPARK-8156: create table to specific database by 'use dbname'` writes to parquet but reads with SQL, not via `read.load`, so it doesn't fail. It's not a generalized test case, but it's also not harmful.
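Sketched concretely (path hypothetical): only the generic `read.load` consults `spark.sql.sources.default`, so a test that writes with an explicit format and reads via SQL or a format-specific reader keeps passing when the default changes.

```scala
val path = "/tmp/spark_8156_demo"
spark.range(5).write.mode("overwrite").format("parquet").save(path) // explicit format
spark.read.parquet(path).count()  // fine regardless of the default
spark.read.load(path).count()     // would fail if spark.sql.sources.default != parquet
```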
Retest this please.
Test build #88167 has finished for PR 20705 at commit
Test build #88240 has finished for PR 20705 at commit
The failure is irrelevant to this PR.
Retest this please.
Test build #88245 has finished for PR 20705 at commit
LGTM
Thanks! Merged to master/2.3
Thank you, @gatorsmile, @HyukjinKwon, @bersprockets.
What changes were proposed in this pull request?

Currently, some tests have an assumption that `spark.sql.sources.default=parquet`. In fact, that is a correct assumption, but that assumption makes it difficult to test new data source formats. This PR aims to:
- make the test suites more robust and easy to use when testing new data sources in the future, and
- test the new native ORC data source with the full existing Apache Spark test coverage.

As an example, the PR uses `spark.sql.sources.default=orc` during reviews. The value should be `parquet` when this PR is accepted.

How was this patch tested?

Pass the Jenkins with updated tests.
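As a usage sketch of how this change helps (spark-shell style; path hypothetical), a new data source can now be exercised against the generalized suites simply by flipping the default:

```scala
// Point the default at the data source under test, then run the generic paths.
spark.conf.set("spark.sql.sources.default", "orc")
spark.range(3).write.mode("overwrite").save("/tmp/default_fmt") // written as ORC
spark.read.load("/tmp/default_fmt").show()                      // read back as ORC
```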