
[SPARK-29665][SQL] Refine the TableProvider Interface #26868

Closed
wants to merge 2 commits into master from the provider2 branch

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Dec 12, 2019

What changes were proposed in this pull request?

Instead of having several overloads of the getTable method in TableProvider, it's better to have two explicit methods, inferSchema and inferPartitioning, plus a single getTable method that takes everything: schema, partitioning, and properties. (A sketch of the resulting interface shape follows the list below.)

This PR also adds a supportsExternalMetadata method to TableProvider, to indicate whether the source supports external table metadata. If this flag is false:

  1. spark.read.schema... is disallowed and fails.
  2. When we support creating v2 tables in the session catalog, Spark only keeps table properties in the catalog.
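For reference, here is a rough sketch of the interface shape described above. The signatures are approximated from this description rather than copied from the patch:

import java.util.Map;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public interface TableProvider {
  // Infer the table schema from the given options (e.g. by listing files).
  StructType inferSchema(CaseInsensitiveStringMap options);

  // Infer the table partitioning; unpartitioned sources return an empty array.
  default Transform[] inferPartitioning(CaseInsensitiveStringMap options) {
    return new Transform[0];
  }

  // Single getTable method that takes everything: schema, partitioning, properties.
  Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties);

  // Whether Spark may pass in externally stored metadata, such as a
  // user-specified schema or schema/partitioning tracked by a catalog.
  default boolean supportsExternalMetadata() {
    return false;
  }
}

With this shape, a fixed-schema source implements inferSchema trivially, while sources that accept external metadata additionally override supportsExternalMetadata.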

Why are the changes needed?

API improvement.

Does this PR introduce any user-facing change?

no

How was this patch tested?

existing tests

@cloud-fan cloud-fan force-pushed the provider2 branch 2 times, most recently from e7f1884 to 5aecc51 on December 12, 2019 at 15:51
@SparkQA

SparkQA commented Dec 12, 2019

Test build #115244 has finished for PR 26868 at commit 5aecc51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])

@cloud-fan
Contributor Author

cc @rdblue @brkyvz

@rdblue
Contributor

rdblue commented Dec 19, 2019

Overall, I'm -1 on this PR.

This confuses a few things. For example, it confuses user-specified schemas with the SupportsExternalMetadata case. Support for a user-specified schema does require the ability to pass the schema in, but the methods added by SupportsExternalMetadata are actually required when using a "generic" metastore, to pass in the metastore's schema and partitioning. (As a separate discussion, I think that user-specified schema support should be an explicit trait to signal support.)

This also confuses the use case where "a data source has a metastore". If a source should go to a metastore, then we want to encourage the use of SupportsCatalogOptions (#26913) to instantiate the table correctly using a catalog. That approach also fixes the currently unsupported SaveModes. TableProvider.getTable would not be helpful in this case.

The main use case for the TableProvider API is to support tables that do not use a catalog (sources with a catalog should go through SupportsCatalogOptions). Clearly, we need SupportsExternalMetadata.getTable for the case where a generic metastore needs to create a table. The question is whether we also need TableProvider.getTable(properties).

So far, the only argument I've heard in favor of having both getTable methods is that it is easier to port the file sources. I don't think it is worth adding an extra interface that must be implemented in order to use the built-in Spark catalog, just to avoid some trouble with the file source implementations. This makes the API more difficult to understand, and I think people will be surprised when they can't use the built-in generic catalog even though they've implemented TableProvider.

@cloud-fan
Contributor Author

@rdblue I think there is some misunderstanding here. I did this change to make the API simpler for non-file sources.

Let's walk through some common data sources:

  1. Simple sources that report fixed schemas, like the Kafka source. For them, a simple getTable(properties) is best.
  2. Sources that can infer their schema or accept a user-specified schema, like the file sources. For them, I agree with you that having explicit infer methods is better, so this PR introduces SupportsExternalMetadata with the infer methods and getTable(schema, partitioning, properties).
  3. Sources that have their own metastore, like a JDBC source. For them, I also agree with you that SupportsCatalogOptions ([SPARK-29219][SQL] Introduce SupportsCatalogOptions for TableProvider #26913) is a good solution (a rough sketch follows below).

If we have a source that implements SupportsCatalogOptions, it would be confusing to also have to implement the infer methods. With this PR, SupportsExternalMetadata is a mixin, so such a source doesn't have to implement them.
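For case 3 above, here is a rough sketch of what such a source could look like, assuming the SupportsCatalogOptions API proposed in #26913 (extractIdentifier/extractCatalog). The JdbcCatalogSource class name and the option keys are purely illustrative:

import java.util.Map;
import org.apache.spark.sql.connector.catalog.Identifier;
import org.apache.spark.sql.connector.catalog.SupportsCatalogOptions;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Illustrative only: a JDBC-like source that delegates table resolution to a
// registered catalog, so it never has to implement the infer methods itself.
class JdbcCatalogSource implements SupportsCatalogOptions {
  @Override
  public Identifier extractIdentifier(CaseInsensitiveStringMap options) {
    // Assumes an option such as dbtable=db.tbl and maps it to a catalog identifier.
    String[] parts = options.get("dbtable").split("\\.");
    return Identifier.of(new String[]{parts[0]}, parts[1]);
  }

  @Override
  public String extractCatalog(CaseInsensitiveStringMap options) {
    // Name of the catalog plugin, configured via spark.sql.catalog.<name>.
    return options.getOrDefault("catalog", "jdbc_catalog");
  }

  // The inherited TableProvider methods are not expected to be exercised when
  // Spark resolves the table through the extracted catalog; stubs are provided
  // only so the sketch is self-contained.
  @Override
  public StructType inferSchema(CaseInsensitiveStringMap options) {
    throw new UnsupportedOperationException("Resolved through the catalog");
  }

  @Override
  public Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties) {
    throw new UnsupportedOperationException("Resolved through the catalog");
  }
}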

@rdblue
Contributor

rdblue commented Dec 20, 2019

@cloud-fan, what is your rationale for saying "For [sources like Kafka], a simple getTable(properties) is the best."? You didn't give any argument why that is the case.

My understanding is that the existing Kafka source has a static schema, so inferSchema is easy to implement. For other Kafka implementations, people may use a schema store in which case the lookup is fairly easy (although we would encourage building a Kafka catalog in this case). I don't see how the two inference methods make it more difficult to implement in this case.

Having two getTable calls does make the API more confusing. If I want to store a Kafka stream in the built-in generic catalog, we agree that the catalog should pass the schema and partitioning to TableProvider.getTable (your point 2). That means that both getTable(properties) and getTable(schema, partitioning, properties) must be implemented. And if an author doesn't implement the optional SupportsExternalMetadata interface, the source won't work with a generic metastore. I think that's a significant problem and surprising behavior.

Like I said, we clearly need getTable(schema, partitioning, properties). I don't think that we actually need a second variant, especially when implementing only the "simple" version doesn't work with the built-in metastore.

@cloud-fan
Contributor Author

what is your rationale for saying "For [sources like Kafka], a simple getTable(properties) is the best."?

We can see this by looking at the code, but let me explain it here as well.

class KafkaProvider implements TableProvider {
  Table getTable(properties) {
    return new KafkaTable(properties)
  }
}

class KafkaTable implements Table {
  StructType schema() {
    return the_fixed_schema;
  }

  Transform[] partitioning() {
    return new Transform[0];
  }

  ScanBuilder ...
  WriteBuilder ...
}

This is simpler than the version below, as we don't need to worry about whether the passed-in schema and partitioning are wrong.

class KafkaProvider implements TableProvider {
  StructType inferSchema() {
    return the_fixed_schema;
  }

  Transform[] inferPartitioning() {
    return new Transform[0];
  }

  Table getTable(schema, partitioning, properties) {
    assert(schema == the_fixed_schema)
    assert(partitioning.isEmpty)
    return new KafkaTable(schema, properties)
  }
}

class KafkaTable(schema) implements Table {
  StructType schema() {
    return this.schema;
  }

  Transform[] partitioning() {
    return new Transform[0];
  }

  ScanBuilder ...
  WriteBuilder ...
}

If I want to store a Kafka stream in the built-in generic catalog, we agree that catalog should pass the schema and partitioning to TableProvider.getTable (Your point 2.). That means that both getTable(properties) and getTable(schema, partitioning, properties) must be implemented.

In the last sync, I think we agreed that we should have a "flag" to let Spark not store the schema/partitioning in the built-in generic catalog. SupportsExternalMetadata is that flag. If a source doesn't implement SupportsExternalMetadata, then Spark won't store the schema/partitioning in the built-in catalog. When we scan this table, Spark just calls getTable(properties) and asks the source to report the schema/partitioning.
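To make this concrete, here is a hypothetical sketch (not Spark code) of the resolution logic described above. BaseProvider and ExternalMetadataProvider are illustrative stand-ins for the TableProvider/SupportsExternalMetadata split discussed in this thread, and the resolve helper is invented for the example:

import java.util.Map;
import org.apache.spark.sql.connector.catalog.Table;
import org.apache.spark.sql.connector.expressions.Transform;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Stand-in for the base TableProvider: the source owns its own metadata.
interface BaseProvider {
  Table getTable(CaseInsensitiveStringMap options);
}

// Stand-in for the SupportsExternalMetadata mixin: the infer methods plus the
// getTable variant that accepts schema/partitioning from outside.
interface ExternalMetadataProvider extends BaseProvider {
  StructType inferSchema(CaseInsensitiveStringMap options);
  Transform[] inferPartitioning(CaseInsensitiveStringMap options);
  Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties);
}

final class TableResolutionSketch {
  static Table resolve(BaseProvider provider, CaseInsensitiveStringMap options) {
    if (provider instanceof ExternalMetadataProvider) {
      // External metadata is allowed: the schema/partitioning could come from
      // the user or a catalog; this sketch simply falls back to inference.
      ExternalMetadataProvider p = (ExternalMetadataProvider) provider;
      StructType schema = p.inferSchema(options);
      Transform[] partitioning = p.inferPartitioning(options);
      return p.getTable(schema, partitioning, options.asCaseSensitiveMap());
    }
    // Otherwise Spark stores no schema/partitioning in the built-in catalog and
    // just asks the source for the table; the returned Table reports its own
    // schema() and partitioning().
    return provider.getTable(options);
  }
}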

@cloud-fan
Contributor Author

I've added a note about how TableProvider should work with the built-in generic catalog in the "Why are the changes needed?" section.

@rdblue
Contributor

rdblue commented Dec 23, 2019

@cloud-fan, thanks for bringing up a flag to not store schema and partitioning in the catalog. I don't recall that discussion, but maybe I misinterpreted what was said. I had thought that not implementing SupportsExternalMetadata would prevent a generic catalog from tracking the table, but you're right that it could just ignore the schema and partitioning in the catalog. But even with that cleared up, I don't think having two getTable methods is a good idea.

It would be surprising if the simplest source implementation could be stored in the built-in catalog, but defaulted to opting out of tracking schema and partitioning with that catalog. Having the schema and partitioning tracked by the catalog is a major reason to use the built-in catalog. Implementations with a different source of truth will plug in a catalog, so the main benefit of using the generic catalog is to track a source's metadata. I think that implementing SupportsExternalMetadata would be far more common than not, so it should be the default. (There's even reason for a Kafka source to use this for partitioning or a record schema; it's just not what the built-in Kafka source does.)

Thanks for posting the pseudo-code for Kafka. I see what you're saying, but the second option is not so arduous that we must provide a simplification. And we also have to consider a similar situation for a source that does implement SupportsExternalMetadata. Should Spark provide a default getTable(properties) that throws an exception? Or do sources implement a getTable method that probably can't return a table if it is called because the schema is missing? Having two methods is worse for the more common case.

In short, I think the more common case is sources that accept the schema from the metastore. It will be simpler to understand and use the API if we don't have two getTable methods. And it isn't unreasonable for sources that don't use an external schema to either check the one passed in or ignore it -- after all, Spark will guarantee that the schema is either the one from inferSchema or the one from the metastore (or user-specified if that's supported).

*/
Table getTable(CaseInsensitiveStringMap options);
Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties);
Contributor Author

@cloud-fan cloud-fan Dec 24, 2019

Shall we use CaseInsensitiveStringMap as the table properties here? People can easily get the original case-sensitive map via asCaseSensitiveMap.

cc @rdblue @brkyvz

Contributor

I think this needs to be case-preserving, because it'll come from the metastore, right?

Contributor Author

Yes, let's keep it as it is.
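For illustration, a minimal example of the trade-off discussed here, assuming the behavior of org.apache.spark.sql.util.CaseInsensitiveStringMap (it wraps the original map and exposes it again via asCaseSensitiveMap):

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

class CasePreservationDemo {
  public static void main(String[] args) {
    // Key casing as it might arrive from a metastore.
    Map<String, String> original = new HashMap<>();
    original.put("Path", "/data/t1");

    CaseInsensitiveStringMap options = new CaseInsensitiveStringMap(original);
    System.out.println(options.get("path"));                   // /data/t1 (lookups ignore case)
    System.out.println(options.asCaseSensitiveMap().keySet()); // [Path]   (original casing preserved)
  }
}

The original casing survives, but every consumer has to remember to call asCaseSensitiveMap before persisting keys, which is presumably why the plain, case-preserving Map was kept here.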

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115744 has finished for PR 26868 at commit f08e40b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

@SparkQA

SparkQA commented Dec 24, 2019

Test build #115745 has finished for PR 26868 at commit c0481a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

* limitations under the License.
*/

package org.apache.spark.sql.internal.connector
Contributor Author

org.apache.spark.sql.internal is a private package (as defined in project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages).

Contributor Author

I picked this instead of org.apache.spark.sql.execution.datasources.v2 because that package only exists in sql/core, and it would be weird to see an execution package in catalyst.

@@ -173,15 +173,13 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
case _ => None
}
ds match {
case provider: TableProvider =>
// file source v2 does not support streaming yet.
Contributor Author

File source v2 has never been supported in streaming, as FileTable.capabilities doesn't include the streaming capabilities.

The reason I check it earlier: we now call TableProvider.inferSchema before checking the table capabilities, so the error becomes different if the path is not specified.

It's weird that the file source reports different errors between batch and streaming when the path is not specified; we should unify them. Here I just don't want to be blocked by an existing problem. cc @gengliangwang

@SparkQA

SparkQA commented Dec 27, 2019

Test build #115848 has finished for PR 26868 at commit 4cc3742.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

@SparkQA

SparkQA commented Dec 27, 2019

Test build #115852 has finished for PR 26868 at commit 59b0de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

@SparkQA

SparkQA commented Jan 13, 2020

Test build #116653 has finished for PR 26868 at commit 86963e9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

@SparkQA

SparkQA commented Jan 14, 2020

Test build #116674 has finished for PR 26868 at commit ffb0b29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class PartitionTypeHelper(colNames: Seq[String])
  • trait SimpleTableProvider extends TableProvider
  • class NoopDataSource extends SimpleTableProvider with DataSourceRegister
  • class RateStreamProvider extends SimpleTableProvider with DataSourceRegister
  • class TextSocketSourceProvider extends SimpleTableProvider with DataSourceRegister with Logging

schema: StructType,
partitioning: Array[Transform],
properties: util.Map[String, String]): Table = {
assert(partitioning.isEmpty)
Contributor

Why is this an assertion? Wouldn't this be the API to create a table in DataFrameWriter with partitioning?

Contributor

Never mind, this isn't used for file-based tables.

if (provider.isInstanceOf[FileDataSourceV2]) {
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
val partitioning = partitioningColumns.getOrElse(Nil).asTransforms
provider.getTable(df.schema, partitioning, dsOptions.asCaseSensitiveMap())
Contributor

df.schema.asNullable?

Contributor Author

good catch!

userSpecifiedSchema: Option[StructType]): Table = {
userSpecifiedSchema match {
case Some(schema) =>
if (provider.supportsExternalMetadata()) {
Contributor

Can we name this supportsUserSpecifiedSchema if it's only going to throw an error for this case?

Contributor Author

This is for future-proofing. The schema may come not only from a user-specified schema, but also from the Spark catalog (the CREATE TABLE USING case).

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116949 has finished for PR 26868 at commit 518f35a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 21, 2020

Test build #117192 has finished for PR 26868 at commit 3f1de1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@brkyvz brkyvz left a comment

LGTM. We should do some cleanup later, but let's get this breaking interface change in.


def getTable(options: CaseInsensitiveStringMap): Table

private[this] var loadedTable: Table = _
Contributor

Wouldn't this cause issues if you load two different Kafka tables? Shouldn't this be an options-to-table map? I'd probably turn this into a Guava cache with an expiration so that you don't leak anything.

Contributor

Never mind, TableProviders have to be classes, not singleton objects. This should be fine.

// following reads would fail.
if (provider.isInstanceOf[FileDataSourceV2]) {
import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
val partitioning = partitioningColumns.getOrElse(Nil).asTransforms
Contributor

you can use partitioningAsV2 here instead

@SparkQA

SparkQA commented Jan 30, 2020

Test build #117566 has finished for PR 26868 at commit ec1d682.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2020

Test build #117567 has finished for PR 26868 at commit 2ceb46e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 9f42be2 Jan 31, 2020
@gatorsmile gatorsmile changed the title [SPARK-29665][SQL] refine the TableProvider interface [SPARK-29665][SQL] Refine the TableProvider Interface Feb 7, 2020