[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider #26913

Closed
wants to merge 14 commits

Conversation

@brkyvz (Contributor) commented Dec 16, 2019

What changes were proposed in this pull request?

This PR introduces SupportsCatalogOptions as an interface for TableProvider. Through SupportsCatalogOptions, V2 data sources can implement the two methods extractIdentifier and extractCatalog to support the creation and existence checks of tables without requiring a formal TableCatalog implementation.

We currently don't support all SaveModes for DataSourceV2 in DataFrameWriter.save. The idea is that, eventually, file-based tables written with DataFrameWriter.save(path) will create a PathIdentifier whose name is path, and the V2SessionCatalog will be able to perform file system checks at path to support the ErrorIfExists and Ignore save modes.
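As a rough, hypothetical sketch (not code from this PR), a connector could implement the two methods along these lines; the "table"/"catalog" option names and the namespace are made up:

import org.apache.spark.sql.connector.catalog.{Identifier, SupportsCatalogOptions}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical mixin: resolve the table identifier and catalog name directly
// from the DataFrameReader/Writer options, without a full TableCatalog.
trait MyCatalogOptions extends SupportsCatalogOptions {
  override def extractIdentifier(options: CaseInsensitiveStringMap): Identifier =
    Identifier.of(Array("default"), options.get("table"))

  override def extractCatalog(options: CaseInsensitiveStringMap): String =
    options.getOrDefault("catalog", "my_catalog")
}

The remaining TableProvider methods are left abstract here; a real source would still have to implement them.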

Why are the changes needed?

To support all save modes for V2 data sources with DataFrameWriter. Since we can now support table creation, we will also be able to provide partitioning information when the table is first created.
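For illustration only (the format name "mysource" and the "table" option are made up), the end goal is that a write like the following could use any save mode and carry partitioning information on first creation:

import org.apache.spark.sql.functions.col

spark.range(10).withColumn("part", col("id") % 2)
  .write
  .format("mysource")
  .partitionBy("part")
  .option("table", "t1")
  .mode("ignore")           // Ignore and ErrorIfExists become usable for V2 sources
  .save()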

Does this PR introduce any user-facing change?

Yes, it introduces a new interface, SupportsCatalogOptions.

How was this patch tested?

Tests will be added once the interface is vetted.

@brkyvz changed the title from "[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" to "[RFC][SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" on Dec 16, 2019
@brkyvz (Contributor, Author) commented Dec 16, 2019

cc @rdblue @cloud-fan

@SparkQA commented Dec 16, 2019

Test build #115411 has finished for PR 26913 at commit 0a87228.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 17, 2019

Test build #115469 has finished for PR 26913 at commit 1578f6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Dec 19, 2019

Mostly looks good. The only real blocker is using catalogManager instead of calling Catalogs.load.

@SparkQA commented Dec 20, 2019

Test build #115595 has finished for PR 26913 at commit 33abbd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 20, 2019

Test build #115631 has finished for PR 26913 at commit b94bfc5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz changed the title from "[RFC][SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" to "[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" on Dec 20, 2019
@SparkQA commented Dec 20, 2019

Test build #115635 has finished for PR 26913 at commit 33ae658.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2019

Test build #115637 has finished for PR 26913 at commit 746e0d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

It's good to see that we are looking at the big picture: supporting a user-specified schema, a catalog passed in the options, and schema/partitioning kept in the built-in metastore.

For "1 TableProvider", I agree with the expectation here, and I'd say getTable(properties) is the simplest way to satisfy it.

For "2 SupportsExternalMetadata", I think it can itself serve as a marker, so we don't need SupportsUserSpecifiedSchema:

- spark.table(...): the metastore schema + partitioning info + properties are passed in to create the Table
- spark.read.load(...): call inferSchema + inferPartitioning, then pass the inferred schema + partitioning + DataFrame options in to create the Table
- spark.read.schema().load(...): call inferPartitioning, then pass the user schema + inferred partitioning + DataFrame options in to create the Table

For data sources, I don't think they care where the schema/partitioning comes from: it can be inferred by the source itself, specified by end users, or come from Spark's built-in catalog. In getTable(schema, partitioning, properties), they need to validate the schema/partitioning anyway.

For "3 SupportsCatalogOptions", sounds good to me, but it would be better if we could fail with a "user-specified schema is not supported" error, like TableProvider does.

assert(table.name() === s"$namespace.t1", "Table identifier was wrong")
assert(table.partitioning().length === partitionBy.length, "Partitioning did not match")
assert(table.partitioning().map(_.references().head.fieldNames().head) === partitionBy,
  "Partitioning was incorrect")
@rdblue (Contributor) commented Dec 23, 2019

These assertions are probably easier if you use the extractors:

table.partitioning.head match {
  case IdentityTransform(FieldReference(field)) =>
    assert(field === Seq(partitionBy.head))
  case _ =>
    fail(...)
}

@rdblue (Contributor) commented Dec 23, 2019

I noted a few things, but overall +1.

@rdblue (Contributor) commented Dec 23, 2019

Retest this please.

@rdblue (Contributor) commented Dec 23, 2019

One more thing: can we keep the discussion about the TableProvider API changes in one place? Right now we've been talking about it on PR #26868; I think that's a good place to continue, and this PR is independent of it.

@SparkQA commented Dec 24, 2019

Test build #115656 has finished for PR 26913 at commit 12f4ce4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Dec 24, 2019

retest this please

@SparkQA commented Dec 24, 2019

Test build #115671 has finished for PR 26913 at commit 12f4ce4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Dec 24, 2019

retest this please

@SparkQA commented Dec 25, 2019

Test build #115750 has finished for PR 26913 at commit 12f4ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Jan 4, 2020

@cloud-fan Any more comments on this? Shall we merge this?

@brkyvz (Contributor, Author) commented Jan 9, 2020

retest this please

 * topic name, etc. It's an immutable case-insensitive string-to-string map.
 */
default String extractCatalog(CaseInsensitiveStringMap options) {
  return null;
Contributor

shall we by default return CatalogManager.SESSION_CATALOG_NAME instead of null?
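A caller-side sketch of the two defaults under discussion (hasCatalog and dsOptions are assumed names from the surrounding read/write path, with hasCatalog being a SupportsCatalogOptions provider):

import org.apache.spark.sql.connector.catalog.CatalogManager

// With a null default, Spark itself has to fall back to the session catalog;
// defaulting to SESSION_CATALOG_NAME would move that choice into the interface.
val catalogName = Option(hasCatalog.extractCatalog(dsOptions))
  .getOrElse(CatalogManager.SESSION_CATALOG_NAME)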

case Some(schema) => provider.getTable(dsOptions, schema)
case _ => provider.getTable(dsOptions)
val table = provider match {
case hasCatalog: SupportsCatalogOptions =>
Contributor

Let's fail if the user specifies a schema.
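A rough sketch of the suggested behavior, not the merged code; the surrounding names (provider, dsOptions, userSpecifiedSchema, source, sparkSession) are assumed from the DataFrameReader context, and asTableCatalog comes from CatalogV2Implicits:

val table = provider match {
  case hasCatalog: SupportsCatalogOptions =>
    // Reject a user-specified schema when the table is resolved through catalog options.
    require(userSpecifiedSchema.isEmpty,
      s"$source does not support user specified schema")
    val ident = hasCatalog.extractIdentifier(dsOptions)
    val catalogName = Option(hasCatalog.extractCatalog(dsOptions))
      .getOrElse(CatalogManager.SESSION_CATALOG_NAME)
    sparkSession.sessionState.catalogManager.catalog(catalogName)
      .asTableCatalog
      .loadTable(ident)
  case _ =>
    userSpecifiedSchema match {
      case Some(schema) => provider.getTable(dsOptions, schema)
      case _ => provider.getTable(dsOptions)
    }
}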

val partitioning = partitioningColumns.map { colNames =>
  colNames.map(name => IdentityTransform(FieldReference(name)))
}.getOrElse(Seq.empty[Transform])
val bucketing = bucketColumnNames.map { cols =>
Contributor

shall we call CatalogV2Implicits.BucketSpecHelper.asTransform?
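A hedged sketch of that suggestion (numBuckets, bucketColumnNames, and sortColumnNames are the surrounding DataFrameWriter fields):

import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

// Build a BucketSpec and reuse the existing asTransform helper instead of
// hand-rolling the bucket transform.
val bucketing = (numBuckets, bucketColumnNames) match {
  case (Some(n), Some(cols)) =>
    Seq(BucketSpec(n, cols, sortColumnNames.getOrElse(Nil)).asTransform)
  case _ => Seq.empty
}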

@cloud-fan (Contributor)

LGTM except 3 comments

@SparkQA commented Jan 9, 2020

Test build #116397 has finished for PR 26913 at commit 12f4ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Jan 9, 2020

Thanks @rdblue and @cloud-fan. Merging to master.

@asfgit closed this in f8d5957 on Jan 9, 2020
@SparkQA commented Jan 9, 2020

Test build #116402 has finished for PR 26913 at commit 963133e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
