[SPARK-29665][SQL] Refine the TableProvider Interface #26868
Conversation
Force-pushed from e7f1884 to 5aecc51.
Test build #115244 has finished for PR 26868 at commit
Overall, I'm -1 on this PR. This confuses a few things. For example, this confuses user-specified schema with the SupportsExternalMetadata case. Support for user-specified schema does require the ability to pass the schema in, but the methods added by SupportsExternalMetadata are actually required when using a "generic" metastore to pass in the metastore's schema and partitioning. (As a separate discussion, I think that user-specified schema should be an explicit trait to signal support.)

This also confuses the use case where "a datasource has metastore". If a source should go to a metastore, then we want to encourage the use of … The main use case for the … So far, the only argument I've heard in favor of having both …
@rdblue I think there is some misunderstanding here. I did this change to make the API simpler for non-file sources. Let's walk through some common data sources:
If we have a source that implements …
@cloud-fan, what is your rationale for saying "For [sources like Kafka], a simple getTable(properties) is the best."? You didn't give any argument why that is the case.

My understanding is that the existing Kafka source has a static schema, so inferSchema is easy to implement. For other Kafka implementations, people may use a schema store in which case the lookup is fairly easy (although we would encourage building a Kafka catalog in this case). I don't see how the two inference methods make it more difficult to implement in this case. Having two …

Like I said, we clearly need …
We can tell by looking at the code, but let me explain it here as well.
This is simpler than the one below, as we don't need to worry about whether the passed-in schema and partitioning are wrong.
In the last sync, I think we agreed that we should have a "flag" to let Spark not store the schema/partitioning in the built-in generic catalog.
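A purely illustrative Scala sketch (not the pseudo-code from this exchange) of the two getTable shapes being compared, for a Kafka-like source with a fixed schema; KafkaLikeShapes, kafkaSchema and buildTable are hypothetical stand-ins:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

object KafkaLikeShapes {
  // Stand-in for the fixed read schema of a Kafka-like source.
  private val kafkaSchema: StructType = new StructType().add("value", StringType)

  // Placeholder for building the concrete Table implementation.
  private def buildTable(props: util.Map[String, String]): Table = ???

  // Shape 1: only options come in; the source derives its own (fixed) schema.
  def getTable(options: CaseInsensitiveStringMap): Table =
    buildTable(options)

  // Shape 2: schema and partitioning come in, so the source has to validate them.
  def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    // Simplified validation; a real source would also check field types.
    require(schema.fieldNames.sameElements(kafkaSchema.fieldNames),
      "Kafka-like tables have a fixed schema")
    require(partitioning.isEmpty, "Kafka-like tables do not support partitioning")
    buildTable(properties)
  }
}
```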
I've added a note about how …
@cloud-fan, thanks for bringing up a flag to not store schema and partitioning in the catalog. I don't recall that discussion, but maybe I misinterpreted what was said. I had thought that not implementing …

It would be surprising that the simplest source implementation can be stored in the built-in catalog, but defaults to opting out of tracking schema and partitioning with that catalog. Having schema and partitioning tracked by the catalog is a major reason to use the built-in catalog. Implementations with a different source of truth will plug in a catalog, so the main benefit of using the generic catalog is to track a source's metadata. I think that implementing …

Thanks for posting the pseudo-code for Kafka. I see what you're saying, but the second option is not so arduous that we must provide a simplification. And we also have to consider a similar situation for a source that does implement …

In short, I think the more common case is sources that accept the schema from the metastore. It will be simpler to understand and use the API if we don't have two …
*/
Table getTable(CaseInsensitiveStringMap options);
Table getTable(StructType schema, Transform[] partitioning, Map<String, String> properties);
I think this needs to be case-preserving, because it'll come from the metastore, right?
Yes, let's keep it as it is.
Test build #115744 has finished for PR 26868 at commit
Test build #115745 has finished for PR 26868 at commit
* limitations under the License.
*/
package org.apache.spark.sql.internal.connector |
org.apache.spark.sql.internal is a private package (as defined in project/SparkBuild.scala#Unidoc#ignoreUndocumentedPackages).
I picked this instead of org.apache.spark.sql.execution.datasources.v2 because that package is only in sql/core and it's weird to see an execution package in catalyst.
@@ -173,15 +173,13 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
    case _ => None
  }
  ds match {
    case provider: TableProvider =>
      // file source v2 does not support streaming yet.
File source v2 is never supported in streaming, as FileTable.capabilities doesn't include streaming capabilities. The reason I check it earlier is that we now call TableProvider.inferSchema before checking table capabilities, and the error becomes different if the path is not specified.
It's weird that the file source reports different errors between batch and streaming when the path is not specified. We should unify them. Here I just don't want to be blocked by an existing problem. cc @gengliangwang
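A rough sketch of the kind of up-front routing described here (TableProvider and FileDataSourceV2 are real Spark types; the object and helper names are hypothetical, not the PR's actual code):

```scala
import org.apache.spark.sql.connector.catalog.TableProvider
import org.apache.spark.sql.execution.datasources.v2.FileDataSourceV2

object StreamingSourceRouting {
  // Send file sources down the v1 path before any v2 handling, so that
  // TableProvider.inferSchema is never reached for them and a missing path
  // produces the same error in streaming as in batch.
  def useV2Path(source: Any): Boolean = source match {
    case provider: TableProvider if !provider.isInstanceOf[FileDataSourceV2] => true
    case _ => false
  }
}
```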
Test build #115848 has finished for PR 26868 at commit
Test build #115852 has finished for PR 26868 at commit
Test build #116653 has finished for PR 26868 at commit
Test build #116674 has finished for PR 26868 at commit
    schema: StructType,
    partitioning: Array[Transform],
    properties: util.Map[String, String]): Table = {
  assert(partitioning.isEmpty)
why is this an assertion? Wouldn't this be the API to create a table in DataFrameWriter with partitioning?
nvm, this isn't used for FileBased tables
if (provider.isInstanceOf[FileDataSourceV2]) {
  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
  val partitioning = partitioningColumns.getOrElse(Nil).asTransforms
  provider.getTable(df.schema, partitioning, dsOptions.asCaseSensitiveMap())
df.schema.asNullable?
good catch!
    userSpecifiedSchema: Option[StructType]): Table = {
  userSpecifiedSchema match {
    case Some(schema) =>
      if (provider.supportsExternalMetadata()) {
can we name this supportsUserSpecifiedSchema if this is only going to throw an error for this case?
This is for future-proofing. The schema may not only be user-specified, but may also come from the Spark catalog (the CREATE TABLE USING case).
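A sketch of the branching being discussed, assuming the refined TableProvider methods from this PR; the object name, helper name and error message are illustrative, not the exact code:

```scala
import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

object ProviderResolution {
  def getTableFromProvider(
      provider: TableProvider,
      options: CaseInsensitiveStringMap,
      userSpecifiedSchema: Option[StructType]): Table = {
    userSpecifiedSchema match {
      case Some(schema) if provider.supportsExternalMetadata() =>
        // External metadata is accepted: hand the schema straight to the source.
        provider.getTable(schema, Array.empty[Transform], options.asCaseSensitiveMap())
      case Some(_) =>
        // The flag is false, so an externally supplied schema is rejected.
        throw new UnsupportedOperationException(
          s"${provider.getClass.getSimpleName} does not support a user-specified schema")
      case None =>
        // No external metadata: let the source infer its own schema and partitioning.
        provider.getTable(
          provider.inferSchema(options),
          provider.inferPartitioning(options),
          options.asCaseSensitiveMap())
    }
  }
}
```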
Test build #116949 has finished for PR 26868 at commit
Test build #117192 has finished for PR 26868 at commit
LGTM. We should do some cleanup later, but let's get this breaking interface in.
def getTable(options: CaseInsensitiveStringMap): Table

private[this] var loadedTable: Table = _
Wouldn't this cause issues if you load two different Kafka tables? Shouldn't this be an options-to-table map? I'd probably turn this into a Guava cache with an expiration so that you don't leak anything.
nvm, TableProviders have to be classes, not singleton objects. This should be fine
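An illustrative sketch of the single-slot caching pattern shown above, relying on the fact that Spark creates a fresh provider instance per read/write; the trait name is hypothetical:

```scala
import org.apache.spark.sql.connector.catalog.Table
import org.apache.spark.sql.util.CaseInsensitiveStringMap

trait CachedTableLoading {
  // Supplied by the concrete source.
  def getTable(options: CaseInsensitiveStringMap): Table

  // One slot is enough: a provider instance serves a single read/write,
  // so it never sees the options of two different tables.
  private[this] var loadedTable: Table = _

  protected def getOrLoadTable(options: CaseInsensitiveStringMap): Table = {
    if (loadedTable == null) {
      loadedTable = getTable(options)
    }
    loadedTable
  }
}
```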
// following reads would fail.
if (provider.isInstanceOf[FileDataSourceV2]) {
  import org.apache.spark.sql.connector.catalog.CatalogV2Implicits._
  val partitioning = partitioningColumns.getOrElse(Nil).asTransforms
you can use partitioningAsV2 here instead
Test build #117566 has finished for PR 26868 at commit
Test build #117567 has finished for PR 26868 at commit
thanks for the review, merging to master!
What changes were proposed in this pull request?
Instead of having several overloads of the getTable method in TableProvider, it's better to have two explicit methods, inferSchema and inferPartitioning, together with a single getTable method that takes everything: schema, partitioning and properties.

This PR also adds a supportsExternalMetadata method to TableProvider, to indicate whether the source supports external table metadata. If this flag is false, Spark will not accept externally provided schema/partitioning (such as a user-specified schema) for the source.

Why are the changes needed?
API improvement.
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing tests
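For reference, a minimal Scala sketch of a source written against the refined interface; the class name, schema and table construction are placeholders, only the TableProvider method names come from this PR:

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.{StringType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class ExampleProvider extends TableProvider {

  // Used when no schema is supplied from outside (user or catalog).
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("value", StringType)

  // inferPartitioning has a default (no partitioning), so it is not overridden here.

  // Single entry point: Spark passes in the schema and partitioning it resolved,
  // whether inferred above or provided externally.
  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = {
    // Placeholder: build the concrete Table for the given metadata.
    ???
  }

  // Opt in to accepting schema/partitioning from outside; the default is false.
  override def supportsExternalMetadata(): Boolean = true
}
```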