[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider #26913

Closed
wants to merge 14 commits

Conversation

@brkyvz (Contributor) commented Dec 16, 2019

What changes were proposed in this pull request?

This PR introduces SupportsCatalogOptions as an interface for TableProvider. Through SupportsCatalogOptions, V2 data sources can implement the two methods extractIdentifier and extractCatalog to support the creation and existence checks of tables without requiring a formal TableCatalog implementation.

We currently don't support all SaveModes for DataSourceV2 in DataFrameWriter.save. The idea is that, eventually, file-based tables written with DataFrameWriter.save(path) will create a PathIdentifier whose name is path, and the V2SessionCatalog will be able to perform file system checks at path to support the ErrorIfExists and Ignore save modes.
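As a rough, hypothetical sketch (not code from this PR), a connector could implement the two methods along these lines; the "table"/"catalog" option names and the namespace are made up:

import org.apache.spark.sql.connector.catalog.{Identifier, SupportsCatalogOptions}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical mixin: resolve the table identifier and catalog name directly
// from the DataFrameReader/Writer options, without a full TableCatalog.
trait MyCatalogOptions extends SupportsCatalogOptions {
  override def extractIdentifier(options: CaseInsensitiveStringMap): Identifier =
    Identifier.of(Array("default"), options.get("table"))

  override def extractCatalog(options: CaseInsensitiveStringMap): String =
    options.getOrDefault("catalog", "my_catalog")
}

The remaining TableProvider methods are left abstract here; a real source would still have to implement them.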

Why are the changes needed?

To support all save modes for V2 data sources with DataFrameWriter. Since we can now support table creation, we will also be able to provide partitioning information when the table is first created.
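For illustration only (the format name "mysource" and the "table" option are made up), the end goal is that a write like the following could use any save mode and carry partitioning information on first creation:

import org.apache.spark.sql.functions.col

spark.range(10).withColumn("part", col("id") % 2)
  .write
  .format("mysource")
  .partitionBy("part")
  .option("table", "t1")
  .mode("ignore")           // Ignore and ErrorIfExists become usable for V2 sources
  .save()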

Does this PR introduce any user-facing change?

Yes, it introduces a new interface, SupportsCatalogOptions.

How was this patch tested?

Tests will be added once the interface is vetted.

@brkyvz changed the title from "[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" to "[RFC][SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" on Dec 16, 2019
@brkyvz (Contributor, Author) commented Dec 16, 2019

cc @rdblue @cloud-fan

@SparkQA commented Dec 16, 2019

Test build #115411 has finished for PR 26913 at commit 0a87228.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 17, 2019

Test build #115469 has finished for PR 26913 at commit 1578f6c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue (Contributor) commented Dec 19, 2019

Mostly looks good. The only real blocker is using catalogManager instead of calling Catalogs.load.

@SparkQA commented Dec 20, 2019

Test build #115595 has finished for PR 26913 at commit 33abbd5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 20, 2019

Test build #115631 has finished for PR 26913 at commit b94bfc5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz changed the title from "[RFC][SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" to "[SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider" on Dec 20, 2019
@SparkQA commented Dec 20, 2019

Test build #115635 has finished for PR 26913 at commit 33ae658.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 21, 2019

Test build #115637 has finished for PR 26913 at commit 746e0d1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

It's good to see that we are looking at the big picture: supporting a user-specified schema, a catalog passed in the options, and schema/partitioning kept in the built-in metastore.

For "1 TableProvider", I agree with the expectation here, and I'd say getTable(properties) is the simplest way to satisfy it.

For "2 SupportsExternalMetadata", I think it can itself serve as a marker, so we don't need SupportsUserSpecifiedSchema:

- spark.table(...): the metastore schema + partitioning info + properties are passed in to create the Table
- spark.read.load(...): call inferSchema + inferPartitioning, then pass the inferred schema + partitioning + DataFrame options in to create the Table
- spark.read.schema().load(...): call inferPartitioning, then pass the user schema + inferred partitioning + DataFrame options in to create the Table

For data sources, I don't think they care where the schema/partitioning comes from: it can be inferred by the source itself, specified by end users, or come from Spark's built-in catalog. In getTable(schema, partitioning, properties), they need to validate the schema/partitioning anyway.

For "3 SupportsCatalogOptions", sounds good to me, but it would be better if we could fail with a "user-specified schema is not supported" error, like TableProvider does.

assert(table.name() === s"$namespace.t1", "Table identifier was wrong")
assert(table.partitioning().length === partitionBy.length, "Partitioning did not match")
assert(table.partitioning().map(_.references().head.fieldNames().head) === partitionBy,
  "Partitioning was incorrect")
@rdblue (Contributor) commented Dec 23, 2019

These assertions are probably easier if you use the extractors:

table.partitioning.head match {
  case IdentityTransform(FieldReference(field)) =>
    assert(field === Seq(partitionBy.head))
  case _ =>
    fail(...)
}

@rdblue (Contributor) commented Dec 23, 2019

I noted a few things, but overall +1.

@rdblue (Contributor) commented Dec 23, 2019

Retest this please.

@rdblue (Contributor) commented Dec 23, 2019

One more thing: can we keep the discussion about the TableProvider API changes in one place? Right now we've been talking about it on PR #26868; I think that's a good place to continue, and this PR is independent of it.

@SparkQA commented Dec 24, 2019

Test build #115656 has finished for PR 26913 at commit 12f4ce4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Dec 24, 2019

retest this please

@SparkQA commented Dec 24, 2019

Test build #115671 has finished for PR 26913 at commit 12f4ce4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Dec 24, 2019

retest this please

@SparkQA commented Dec 25, 2019

Test build #115750 has finished for PR 26913 at commit 12f4ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Jan 4, 2020

@cloud-fan Any more comments on this? Shall we merge this?

@brkyvz (Contributor, Author) commented Jan 9, 2020

retest this please

 * topic name, etc. It's an immutable case-insensitive string-to-string map.
 */
default String extractCatalog(CaseInsensitiveStringMap options) {
  return null;
Contributor

shall we by default return CatalogManager.SESSION_CATALOG_NAME instead of null?
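A caller-side sketch of the two defaults under discussion (hasCatalog and dsOptions are assumed names from the surrounding read/write path, with hasCatalog being a SupportsCatalogOptions provider):

import org.apache.spark.sql.connector.catalog.CatalogManager

// With a null default, Spark itself has to fall back to the session catalog;
// defaulting to SESSION_CATALOG_NAME would move that choice into the interface.
val catalogName = Option(hasCatalog.extractCatalog(dsOptions))
  .getOrElse(CatalogManager.SESSION_CATALOG_NAME)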

case Some(schema) => provider.getTable(dsOptions, schema)
case _ => provider.getTable(dsOptions)
val table = provider match {
case hasCatalog: SupportsCatalogOptions =>
Contributor

Let's fail if the user specifies a schema.
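A rough sketch of the suggested behavior, not the merged code; the surrounding names (provider, dsOptions, userSpecifiedSchema, source, sparkSession) are assumed from the DataFrameReader context, and asTableCatalog comes from CatalogV2Implicits:

val table = provider match {
  case hasCatalog: SupportsCatalogOptions =>
    // Reject a user-specified schema when the table is resolved through catalog options.
    require(userSpecifiedSchema.isEmpty,
      s"$source does not support user specified schema")
    val ident = hasCatalog.extractIdentifier(dsOptions)
    val catalogName = Option(hasCatalog.extractCatalog(dsOptions))
      .getOrElse(CatalogManager.SESSION_CATALOG_NAME)
    sparkSession.sessionState.catalogManager.catalog(catalogName)
      .asTableCatalog
      .loadTable(ident)
  case _ =>
    userSpecifiedSchema match {
      case Some(schema) => provider.getTable(dsOptions, schema)
      case _ => provider.getTable(dsOptions)
    }
}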

val partitioning = partitioningColumns.map { colNames =>
  colNames.map(name => IdentityTransform(FieldReference(name)))
}.getOrElse(Seq.empty[Transform])
val bucketing = bucketColumnNames.map { cols =>
Contributor

shall we call CatalogV2Implicits.BucketSpecHelper.asTransform?
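A hedged sketch of that suggestion (numBuckets, bucketColumnNames, and sortColumnNames are the surrounding DataFrameWriter fields):

import org.apache.spark.sql.catalyst.catalog.BucketSpec
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._

// Build a BucketSpec and reuse the existing asTransform helper instead of
// hand-rolling the bucket transform.
val bucketing = (numBuckets, bucketColumnNames) match {
  case (Some(n), Some(cols)) =>
    Seq(BucketSpec(n, cols, sortColumnNames.getOrElse(Nil)).asTransform)
  case _ => Seq.empty
}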

@cloud-fan (Contributor)

LGTM except 3 comments

@SparkQA commented Jan 9, 2020

Test build #116397 has finished for PR 26913 at commit 12f4ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@brkyvz (Contributor, Author) commented Jan 9, 2020

Thanks @rdblue and @cloud-fan. Merging to master.

@asfgit closed this in f8d5957 on Jan 9, 2020
@SparkQA commented Jan 9, 2020

Test build #116402 has finished for PR 26913 at commit 963133e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
