[SPARK-21769] [SQL] Add a table-specific option for always respecting schemas inferred/controlled by Spark SQL #19003

gatorsmile · 2017-08-20T03:43:28Z

What changes were proposed in this pull request?

For Hive-serde tables, we always respect the schema stored in the Hive metastore, because other engines that share the same metastore could have altered it. Thus, when the Spark-side schema and the metastore schema differ (ignoring nullability and case), we trust the metastore-controlled schema. However, in some scenarios, the Hive metastore can also INCORRECTLY overwrite the schema when the table's serde differs from the metastore's built-in serdes.

The proposed solution is to introduce a table-specific option for such scenarios. For a specific table, users can make Spark always respect the Spark-inferred/controlled schema instead of trusting the metastore-controlled schema. By default, we still trust the Hive metastore-controlled schema.
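
To make the decision rule above concrete, here is a minimal, self-contained sketch in plain Scala. It is not the code in this patch: the `Field` model, the `SchemaChoice` object, and the option key `respectSparkSchema` are all assumptions made for illustration, and the real change operates on Spark's own schema and catalog types.

```scala
// Hypothetical, simplified model of a column; Spark's real schema types are richer.
final case class Field(name: String, dataType: String, nullable: Boolean)

object SchemaChoice {
  // Assumed option key, used here for illustration only.
  val RespectSparkSchemaKey = "respectSparkSchema"

  // Read the table-specific flag with a case-insensitive key lookup.
  // Defaults to false, i.e. trust the metastore-controlled schema.
  def respectSparkSchema(tableOptions: Map[String, String]): Boolean =
    tableOptions
      .collectFirst {
        case (k, v) if k.equalsIgnoreCase(RespectSparkSchemaKey) => v.toBoolean
      }
      .getOrElse(false)

  // Reduce a schema to (lower-cased name, type) pairs so that
  // nullability and name case do not affect the comparison.
  private def normalize(schema: Seq[Field]): Seq[(String, String)] =
    schema.map(f => (f.name.toLowerCase(java.util.Locale.ROOT), f.dataType))

  def sameIgnoringCaseAndNullability(a: Seq[Field], b: Seq[Field]): Boolean =
    normalize(a) == normalize(b)

  // Use the Spark-side schema when it matches the metastore schema
  // (up to case and nullability) or when the table opts in;
  // otherwise fall back to the metastore schema.
  def chooseSchema(
      metastoreSchema: Seq[Field],
      sparkSchema: Seq[Field],
      tableOptions: Map[String, String]): Seq[Field] =
    if (sameIgnoringCaseAndNullability(metastoreSchema, sparkSchema) ||
        respectSparkSchema(tableOptions)) {
      sparkSchema
    } else {
      metastoreSchema
    }
}
```

With no option set, a genuine mismatch resolves to the metastore schema; setting the (assumed) flag to `true` on the table makes the same mismatch resolve to the Spark-side schema.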

How was this patch tested?

Added a cross-version test case

SparkQA · 2017-08-20T06:24:52Z

Test build #80882 has finished for PR 19003 at commit 4c7349f.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SourceOptions(

SparkQA · 2017-08-20T13:39:18Z

Test build #80899 has finished for PR 19003 at commit 4c7349f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SourceOptions(

gatorsmile · 2017-08-20T23:16:30Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/client/VersionsSuite.scala

+             |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
+             |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
+             |LOCATION '$location'
+             |TBLPROPERTIES ('avro.schema.literal' = '$avroSchema')


For an example like this one, which requires users to set TBLPROPERTIES, it sounds like we are unable to use the CREATE TABLE USING command. cc @cloud-fan

There was a discussion about whether we should add TBLPROPERTIES, and we decided not to add it. I'm totally fine with adding it if it's necessary.

gatorsmile · 2017-08-22T00:25:18Z

cc @sameeragarwal @cloud-fan

cloud-fan · 2017-08-22T06:13:32Z

LGTM

gatorsmile · 2017-08-22T17:03:12Z

retest this please

sameeragarwal · 2017-08-22T18:01:39Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SourceOptions.scala

+import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
+
+/**
+ * Options for the Parquet data source.


nit: update docs

sameeragarwal · 2017-08-22T18:02:35Z

LGTM, thanks! Are these table properties documented somewhere?

gatorsmile · 2017-08-22T18:28:02Z

We might need a dedicated section for documenting all the table-specific conf options.

SparkQA · 2017-08-22T19:50:30Z

Test build #80999 has finished for PR 19003 at commit 4c7349f.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class SourceOptions(

sameeragarwal · 2017-08-22T20:05:34Z

Merging to master, thanks!

SparkQA · 2017-08-22T21:05:35Z

Test build #81001 has finished for PR 19003 at commit 36339c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

fix.

4c7349f

fix.

36339c8

gatorsmile closed this Aug 22, 2017

gatorsmile reopened this Aug 22, 2017

asfgit closed this in 01a8e46 Aug 22, 2017
