[SPARK-30214][SQL] A new framework to resolve v2 commands #26847

yaooqinn · 2019-12-11T07:46:20Z

What changes were proposed in this pull request?

Since we have a v2 adapter for the v1 catalog (V2SessionCatalog), all the table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples:

DROP NAMESPACE: only needs to know the name of the namespace.
DESC NAMESPACE: needs to look up the namespace and get its metadata, but this is done during execution.
DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
DESC TABLE: needs to look up the table and get its metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed.

For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and describes the root namespace under testcat, while DESC TABLE testcat fails if there is no table testcat under the current catalog. This is because a namespace can be named [], but a table can't. Each command should explicitly specify whether it operates on a namespace or a table.
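To make the difference concrete, here is a minimal Scala sketch of how the single-part name testcat could resolve differently as a namespace and as a table. It is a toy model, not Spark's implementation: `Catalog`, `resolveNamespace`, `resolveTable` and the sample catalog contents are all hypothetical names introduced for illustration.

```scala
// Toy model of name resolution; none of these types are Spark's real API.
object NameResolutionSketch {
  // A catalog and the tables it holds, keyed by (namespace, table name).
  final case class Catalog(name: String, tables: Set[(Seq[String], String)])

  private val testcat = Catalog("testcat", Set((Seq("ns1"), "t")))
  private val sessionCatalog = Catalog("spark_catalog", Set((Seq("default"), "people")))

  // Catalogs addressable by name, plus the session's (assumed) current position.
  private val registered: Map[String, Catalog] = Map(testcat.name -> testcat)
  private val currentCatalog = sessionCatalog
  private val currentNamespace = Seq("default")

  // Namespace resolution: a leading catalog name is consumed, and an empty
  // remainder is valid because it denotes the root namespace [].
  def resolveNamespace(parts: Seq[String]): (String, Seq[String]) =
    parts.headOption.flatMap(registered.get) match {
      case Some(catalog) => (catalog.name, parts.tail)
      case None          => (currentCatalog.name, parts)
    }

  // Table resolution: a single-part name is always a table name in the current
  // catalog and namespace, so it is never interpreted as a catalog.
  def resolveTable(parts: Seq[String]): Either[String, (String, Seq[String], String)] =
    parts match {
      case Seq()     => Left("Empty table name")
      case Seq(name) => lookup(currentCatalog, currentNamespace, name)
      case _ =>
        registered.get(parts.head) match {
          case Some(catalog) => lookup(catalog, parts.tail.init, parts.last)
          case None          => lookup(currentCatalog, parts.init, parts.last)
        }
    }

  // Succeeds only if the catalog actually contains the table.
  private def lookup(
      catalog: Catalog,
      namespace: Seq[String],
      table: String): Either[String, (String, Seq[String], String)] =
    if (catalog.tables.contains((namespace, table))) Right((catalog.name, namespace, table))
    else Left(s"Table not found: ${(catalog.name +: namespace :+ table).mkString(".")}")
}
```

In this sketch, `resolveNamespace(Seq("testcat"))` yields the catalog testcat with an empty namespace, while `resolveTable(Seq("testcat"))` returns a `Left` because the current catalog has no table with that name.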

In this Pull Request, we introduce a new framework to resolve v2 commands:

1. The parser creates logical plans or commands with UnresolvedNamespace/UnresolvedTable/UnresolvedView/UnresolvedRelation. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations.)
2. The analyzer converts
2.1 UnresolvedNamespace to ResolvedNamespace (contains catalog and namespace identifier)
2.2 UnresolvedTable to ResolvedTable (contains catalog, identifier and Table)
2.3 UnresolvedView to ResolvedView (will be added later when we migrate view commands)
2.4 UnresolvedRelation to relation.
3. An extra analyzer rule matches commands with V1Table and converts them to the corresponding v1 commands. This will be added later when we migrate existing commands.
4. The planner matches commands and converts them to the corresponding physical nodes.
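The flow above can be sketched end to end in plain Scala. This is only an illustration under simplifying assumptions: the node names UnresolvedNamespace, ResolvedNamespace, UnresolvedTable, ResolvedTable, CommentOnNamespace and CommentOnTable come from this thread, but their fields here are simplified, and `analyze`, `plan`, the two `...Exec` classes, the fixed catalog name and the `knownTables` map are hypothetical stand-ins rather than Spark's real API.

```scala
// Toy parser -> analyzer -> planner pipeline; not Spark's actual classes.
object ResolutionPipelineSketch {
  sealed trait Plan
  final case class UnresolvedNamespace(parts: Seq[String]) extends Plan
  final case class ResolvedNamespace(catalog: String, namespace: Seq[String]) extends Plan
  final case class UnresolvedTable(parts: Seq[String]) extends Plan
  // The real ResolvedTable carries a Table; a schema string stands in for it here.
  final case class ResolvedTable(catalog: String, ident: Seq[String], schema: String) extends Plan
  final case class CommentOnNamespace(child: Plan, comment: String) extends Plan
  final case class CommentOnTable(child: Plan, comment: String) extends Plan

  // Hypothetical physical nodes produced by the planner.
  sealed trait Exec
  final case class SetNamespaceCommentExec(catalog: String, namespace: Seq[String], comment: String) extends Exec
  final case class SetTableCommentExec(catalog: String, ident: Seq[String], comment: String) extends Exec

  // Assumed environment: one catalog and the tables it knows about.
  private val catalogName = "testcat"
  private val knownTables: Map[Seq[String], String] = Map(Seq("ns", "t") -> "id INT")

  // Analyzer: namespaces only need the catalog and name; tables need a lookup.
  def analyze(plan: Plan): Either[String, Plan] = plan match {
    case UnresolvedNamespace(parts) => Right(ResolvedNamespace(catalogName, parts))
    case UnresolvedTable(parts) =>
      knownTables.get(parts)
        .map(schema => ResolvedTable(catalogName, parts, schema))
        .toRight(s"Table not found: ${parts.mkString(".")}")
    case CommentOnNamespace(child, c) => analyze(child).map(CommentOnNamespace(_, c))
    case CommentOnTable(child, c)     => analyze(child).map(CommentOnTable(_, c))
    case resolved                     => Right(resolved)
  }

  // Planner: only commands over resolved children can become physical nodes.
  def plan(p: Plan): Either[String, Exec] = p match {
    case CommentOnNamespace(ResolvedNamespace(cat, ns), c) => Right(SetNamespaceCommentExec(cat, ns, c))
    case CommentOnTable(ResolvedTable(cat, id, _), c)      => Right(SetTableCommentExec(cat, id, c))
    case other                                             => Left(s"Cannot plan: $other")
  }
}
```

The point of the sketch is that a new command only needs a logical node and one planner case; resolution of its child is shared with every other command.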

We also introduce brand-new v2 commands, the COMMENT ON syntaxes, to illustrate how to work with the newly added framework.

COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...

Details about the comment syntaxes:
With the new design of catalog v2, some properties become reserved, e.g. location and comment. We are going to disallow setting reserved properties directly through dbproperties or tblproperties, to avoid conflicts with their related subclauses or dedicated commands.

The COMMENT ON syntaxes follow the practice of PostgreSQL and Presto:

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html

The basic ideas of the new framework mostly came from the discussion below with @cloud-fan: #26847 (comment).

Why are the changes needed?

To make it easier to add new v2 commands, and easier to unify the table relation behavior.

Does this PR introduce any user-facing change?

Yes, it adds new syntax.

How was this patch tested?

Added unit tests.

yaooqinn · 2019-12-11T07:51:19Z

cc @cloud-fan @maropu, thanks for reviewing this.

yaooqinn · 2019-12-11T07:53:45Z

An earlier discussion can be found in #26806. Thanks again.

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

SparkQA · 2019-12-11T09:29:53Z

Test build #115159 has finished for PR 26847 at commit a005713.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class CommentOnNamespace(namespace: Seq[String], comment: String) extends Command
case class CommentOnTable(namespace: Seq[String], comment: String) extends Command

SparkQA · 2019-12-11T10:58:02Z

Test build #115164 has finished for PR 26847 at commit 7c45c9b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

cloud-fan · 2019-12-11T12:03:13Z

This is a new command (no legacy v1 command), and is a good chance to discuss how commands should be resolved and executed ideally.

Since we have a v2 adapter for v1 catalog (V2SessionCatalog), all the table/namespace commands can be implemented via v2 APIs.

Usually a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples:

DROP NAMESPACE: only needs to know the name of the namespace.
DESC NAMESPACE: needs to look up the namespace and get its metadata, but this is done during execution.
DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
DESC TABLE: needs to look up the table and get its metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do the lookup during execution if needed. For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and describes the root namespace under testcat, while DESC TABLE testcat fails if there is no table testcat under the current catalog. This is because a namespace can be named [], but a table can't. Each command should explicitly specify whether it operates on a namespace or a table.

Here is my proposal: introduce UnresolvedNamespace and UnresolvedV2Relation, which can be resolved to ResolvedNamespace(catalog, nameParts) and ResolvedV2Relation(catalog, v2Relation). The parser creates commands containing either UnresolvedNamespace or UnresolvedV2Relation, and the planner converts commands with ResolvedNamespace or ResolvedV2Relation to physical plans.

Note: there is already an UnresolvedV2Relation. We should rename it to something else.

What do you think? @brkyvz @rdblue @imback82

SparkQA · 2019-12-11T12:42:23Z

Test build #115186 has finished for PR 26847 at commit 9b272b6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-11T15:40:37Z

Test build #115181 has finished for PR 26847 at commit e489e62.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-12-11T17:27:30Z

PS @yaooqinn you have almost 20 pull requests open. Can you review some that aren't likely to be reviewed or merged and close them?

yaooqinn · 2019-12-14T13:42:03Z

SHOW CURRENT NAMESPACE statement
SHOW NAMESPACES statement

These two target v2 only but are still built on the ParsedStatement trait. @cloud-fan, we may need to change them.

yaooqinn · 2019-12-17T14:25:48Z

Hi @cloud-fan, I have roughly implemented your proposal #26847 (comment) in this pull request. Would you mind taking a look? Thanks very much.

SparkQA · 2019-12-17T18:29:42Z

Test build #115460 has finished for PR 26847 at commit 7df9407.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-17T18:30:27Z

Test build #115459 has finished for PR 26847 at commit 3e37941.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
case class ResolvedV2Table(catalog: TableCatalog, tableIdentifier: Identifier)
case class CommentOnNamespace(child: NamespaceNode, comment: String) extends Command
case class CommentOnTable(child: TableNode, comment: String) extends Command

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NamespaceNode.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableNode.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

cloud-fan · 2019-12-18T14:53:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NamespaceNode.scala

+case class ResolvedNamespace(catalog: SupportsNamespaces, namespace: Seq[String])
+  extends NamespaceNode
+
+case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends NamespaceNode {


BTW, we should give a nice error message in CheckAnalysis when we hit UnresolvedNamespace and UnresolvedV2Table.
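A rough sketch of what such a check could look like, written against a toy plan tree rather than Spark's `CheckAnalysis`. The node names UnresolvedNamespace and UnresolvedV2Table are taken from the comment above; `Node`, `Command`, `errors` and the exact message wording are assumptions made for illustration only.

```scala
// Toy analysis check that reports leftover unresolved nodes; not Spark's CheckAnalysis.
object CheckAnalysisSketch {
  sealed trait Node { def children: List[Node] }
  final case class UnresolvedNamespace(parts: Seq[String]) extends Node {
    def children: List[Node] = Nil
  }
  final case class UnresolvedV2Table(parts: Seq[String]) extends Node {
    def children: List[Node] = Nil
  }
  final case class ResolvedNamespace(catalog: String, namespace: Seq[String]) extends Node {
    def children: List[Node] = Nil
  }
  // A generic command wrapping whatever the parser or analyzer produced.
  final case class Command(name: String, children: List[Node]) extends Node

  // Walk the tree and turn every leftover unresolved node into a readable message.
  def errors(node: Node): List[String] = {
    val own = node match {
      case UnresolvedNamespace(parts) => List(s"Namespace not found: ${parts.mkString(".")}")
      case UnresolvedV2Table(parts)   => List(s"Table not found: ${parts.mkString(".")}")
      case _                          => Nil
    }
    own ++ node.children.flatMap(errors)
  }
}
```

A fully resolved command produces an empty list, so the check can fail analysis only when at least one message is returned.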

imback82 · 2019-12-30T19:09:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -732,6 +742,11 @@ class Analyzer(
        lookupTempView(ident)
          .map(view => i.copy(table = view))
          .getOrElse(i)
+      case u @ UnresolvedTable(ident) =>
+        lookupTempView(ident).foreach { _ =>
+          u.failAnalysis(s"${ident.quoted} is a view not table.")


Shouldn't this be "is a temp view not table"?

imback82 · 2019-12-30T19:34:00Z

sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala

+    // unset this config to use the default v2 session catalog.
+    spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
+    // Session catalog is used.
+    sql(s"CREATE NAMESPACE ns")


The s interpolator is not needed.

viirya · 2019-12-30T19:53:04Z

I think it is good to unify the relation resolution. A consistent approach to that sounds like a good idea. One question: do we need to add the new COMMENT ON syntax here too?

viirya · 2019-12-30T20:04:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

+      case UnresolvedNamespace(CatalogAndNamespace(catalog, ns)) =>
+        ResolvedNamespace(catalog.asNamespaceCatalog, ns)
+    }


Does this mean the ParsedStatement from the parser will switch to using UnresolvedNamespace? Currently, the catalogs in statements are resolved in ResolveCatalogs. Will we need to refactor ResolveCatalogs due to this change?

Yes, see
#26847 (comment)

We can have an extra rule to catch commands with v1 relation, and convert them to v1 commands. This can help us get rid of the duplicated code between ResolveCatalogs and ResolveSessionCatalog
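A minimal sketch of such a rule, assuming a toy plan model: once the table is already resolved, the rule only has to check whether it is a v1 table and swap in the v1 command. V1Table and ResolvedTable are names from this thread; `V2Table`, `DescribeTable`, `DescribeTableV1Command` and `rewriteV1` are hypothetical and exist only for this example.

```scala
// Toy version of "catch commands with a v1 relation and convert them to v1 commands".
object V1FallbackRuleSketch {
  sealed trait Table
  final case class V1Table(provider: String) extends Table
  final case class V2Table(name: String) extends Table

  sealed trait Plan
  final case class ResolvedTable(catalog: String, ident: Seq[String], table: Table) extends Plan
  // A v2 command over an already-resolved child (hypothetical name).
  final case class DescribeTable(child: Plan) extends Plan
  // The legacy command the rule falls back to (hypothetical name).
  final case class DescribeTableV1Command(ident: Seq[String]) extends Plan

  // Runs after table resolution: only commands whose table is a V1Table are rewritten.
  def rewriteV1(plan: Plan): Plan = plan match {
    case DescribeTable(ResolvedTable(_, ident, _: V1Table)) => DescribeTableV1Command(ident)
    case other                                              => other
  }
}
```

Because the lookup happens once in the shared resolution step, a rule like this would not need to repeat the catalog and identifier handling in every command.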

yaooqinn · 2019-12-31T02:43:23Z

I think it is good to unify the relation resolution. A consistent approach to that sounds like a good idea. One question: do we need to add the new COMMENT ON syntax here too?

We add the COMMENT ON syntax to modify the comment because comment has become a reserved property, like location. The purposes are 1) to eliminate the ambiguity between the comment property and the COMMENT subclause in the CREATE syntax, so that later only the subclause is valid; and 2) to prevent unexpected changes to reserved properties through the ALTER ... SET PROPERTIES syntax, since every reserved property should have its own dedicated syntax for modification. Also, if these properties are changed by systems other than Spark, we can ignore them on this basis.
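The intended behavior can be sketched as follows. This is an assumption-laden illustration, not Spark code: the reserved set contains only the two properties named in this thread (location and comment), and `validateUserProperties`, `commentOn` and the error text are hypothetical.

```scala
// Toy model of reserved properties: generic property syntax rejects them,
// while a dedicated command is the only way to change them.
object ReservedPropertiesSketch {
  // Only the two reserved keys mentioned in the discussion; the real list may differ.
  val reserved: Set[String] = Set("location", "comment")

  // What a generic "SET PROPERTIES"-style path would do with user-supplied keys.
  def validateUserProperties(props: Map[String, String]): Either[String, Map[String, String]] = {
    val offending = props.keySet.intersect(reserved).toList.sorted
    if (offending.isEmpty) Right(props)
    else Left(s"Cannot set reserved properties directly: ${offending.mkString(", ")}")
  }

  // What a dedicated COMMENT ON-style command would do: it owns the reserved key.
  def commentOn(existing: Map[String, String], comment: String): Map[String, String] =
    existing + ("comment" -> comment)
}
```

With this split, user-defined keys pass through validation unchanged, and the comment can only change through the dedicated path.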

SparkQA · 2019-12-31T03:23:59Z

Test build #115978 has finished for PR 26847 at commit 9d10239.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-12-31T07:30:40Z

Test build #115980 has finished for PR 26847 at commit c391212.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-03T03:20:58Z

COMMENT ON is used to demonstrate the new framework, and show how easy it is to implement a command.

SparkQA · 2020-01-03T07:59:04Z

Test build #116067 has finished for PR 26847 at commit d20b1b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-01-03T08:09:19Z

thanks, merging to master!

cloud-fan · 2020-01-03T08:09:45Z

@imback82 Let's start unifying the relation resolution!

imback82 · 2020-01-03T16:58:58Z

Cool! Thanks @yaooqinn and @cloud-fan!

…n framework

### What changes were proposed in this pull request?

#26847 introduced new framework for resolving catalog/namespaces. This PR proposes to integrate commands that need to resolve namespaces into the new framework.

### Why are the changes needed?

This is one of the work items for moving into the new resolution framework. Resolving v1/v2 tables with the new framework will be followed up in different PRs.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests should cover the changes.

Closes #27095 from imback82/unresolved_ns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

gatorsmile · 2020-01-09T07:27:13Z

@imback82 COMMENT ON is a good example, but the code changes for implementing COMMENT ON are not small, and they might make the code review harder. I think we can follow what @rdblue and @brkyvz suggested above and separate the actual changes from COMMENT ON.

How about reverting it and submitting two separate PRs?

cloud-fan · 2020-01-09T14:20:10Z

COMMENT ON takes only about 60 LOC, excluding tests. At least it didn't make my review harder.

This framework was blocking many things. Now we have several commits/open PRs depending on it:
314e70f
b2ed6d0
#26921
#26775

I don't think it's realistic to revert it now. In general, we should split PRs into smaller ones, but there are always exceptions, e.g. #24798 (comment)

If you have different ideas about the framework, please leave comments here, and we'll address them in followups.

gatorsmile · 2020-01-09T16:37:08Z

I would -1 on this comment: #24798 (comment). We should avoid making such an exception.

Here is my proposal for such cases: we keep the original giant PR (which might contain the related refactoring, new changes and good use cases). When we decide to merge it after reviewing the high-level ideas/solutions, we split it and open multiple smaller PRs, which refer to the original PR for more context.

Does that sound reasonable to all of you?

rdblue · 2020-01-09T16:55:34Z

+1 for @gatorsmile's approach of using a WIP PR for initial review and later separating features from refactoring and other changes.

I think that this should be reverted because there was a standing -1. That -1 was also echoed by @brkyvz and the same issue was later pointed out by @viirya. I think it was inappropriate to commit this in the first place, so it should be reverted.

I also think that we should take a closer look at the proposal for how resolution should work, and whether we want to target that for the 3.0 release given that we have several issues that we want to solve before 3.0 and the resolution works as it is now.

cloud-fan · 2020-01-10T04:53:24Z

I like the proposal from @gatorsmile . Let's put it in the review guideline and make it a policy. But the new policy should apply to new PRs only, not already merged PRs.

@rdblue do you have some concrete concerns about what's wrong with the new framework? Both @imback82 and I have taken a close look, and we think it's the right direction to go. If the new framework does have some major flaws, then let's revert it.

Note that the DDL/DML command resolution framework is new in 3.0. The v1 commands just look up relations by themselves. If we all agree to use this new framework, we should make sure 3.0 and 3.1 use the same framework; otherwise it will be really painful to backport bug fixes in the future.

Currently, we have a v2 adapter for v1 catalog (`V2SessionCatalog`), all the table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it needs to operate, but different commands have different requirements about what to resolve. A few examples:

- `DROP NAMESPACE`: only need to know the name of the namespace.
- `DESC NAMESPACE`: need to lookup the namespace and get metadata, but is done during execution
- `DROP TABLE`: need to do lookup and make sure it's a table not (temp) view.
- `DESC TABLE`: need to lookup the table and get metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed.

For tables, mostly commands need the analyzer to do lookup.

Note that, table and namespace have a difference: `DESC NAMESPACE testcat` works and describes the root namespace under `testcat`, while `DESC TABLE testcat` fails if there is no table `testcat` under the current catalog. It's because namespaces can be named [], but tables can't. The commands should explicitly specify it needs to operate on namespace or table.

In this Pull Request, we introduce a new framework to resolve v2 commands:

1. parser creates logical plans or commands with `UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations)
2. analyzer converts
2.1 `UnresolvedNamespace` to `ResolvedNamespace` (contains catalog and namespace identifier)
2.2 `UnresolvedTable` to `ResolvedTable` (contains catalog, identifier and `Table`)
2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands)
2.4 `UnresolvedRelation` to relation.
3. an extra analyzer rule to match commands with `V1Table` and converts them to corresponding v1 commands. This will be added later when we migrate existing commands
4. planner matches commands and converts them to the corresponding physical nodes.

We also introduce brand new v2 commands - the `comment` syntaxes to illustrate how to work with the newly added framework.

```sql
COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...
```

Details about the `comment` syntaxes:
As the new design of catalog v2, some properties become reserved, e.g. `location`, `comment`. We are going to disable setting reserved properties by dbproperties or tblproperties directly to avoid conflicts with their related subClause or specific commands.

They are the best practices from PostgreSQL and presto.

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html

Mostly, the basic thoughts of the new framework came from the discussions below with cloud-fan, apache/spark#26847 (comment),

To make it easier to add new v2 commands, and easier to unify the table relation behavior.

yes, add new syntax

add uts.

Closes #26847 from yaooqinn/SPARK-30214.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

[SPARK-30214][SQL] Support COMMENT ON syntax

a005713

cloud-fan reviewed Dec 11, 2019

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

naming

7c45c9b

yaooqinn added 2 commits December 11, 2019 19:02

Merge branch 'master' into SPARK-30214

024ab39

fix tests

e489e62

maropu reviewed Dec 11, 2019

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

comments

9b272b6

dongjoon-hyun added the SQL label Dec 12, 2019

yaooqinn added 4 commits December 17, 2019 19:04

impl UnresolvedNamespace

85617cd

megre master

e33a200

impl UnresolvedV2Table

3e37941

Merge branch 'master' into SPARK-30214

7df9407