[SPARK-30214][SQL] A new framework to resolve v2 commands #26847

Closed
wants to merge 35 commits into from

yaooqinn
Member

@yaooqinn yaooqinn commented Dec 11, 2019

What changes were proposed in this pull request?

Because we now have a v2 adapter for the v1 catalog (V2SessionCatalog), all table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples:

  • DROP NAMESPACE: only needs to know the name of the namespace.
  • DESC NAMESPACE: needs to look up the namespace and get its metadata, but this is done during execution.
  • DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
  • DESC TABLE: needs to look up the table and get its metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do the lookup during execution if needed.

For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and describes the root namespace under testcat, while DESC TABLE testcat fails if there is no table testcat under the current catalog. This is because a namespace can be named [], but a table can't. Each command should explicitly specify whether it operates on a namespace or a table.
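The namespace/table difference can be sketched with a small self-contained model. This is only an illustration, not Spark's resolution code: `NamespaceRef`, `TableRef`, `splitCatalog`, and the toy session state (`catalogs`, `currentCatalog`) are hypothetical names introduced here.

```scala
object IdentifierSketch {
  // Hypothetical, simplified stand-ins; not Spark's real classes.
  final case class NamespaceRef(catalog: String, namespace: Seq[String])
  final case class TableRef(catalog: String, namespace: Seq[String], name: String)

  // Assumed toy session state: the registered catalogs and the current catalog.
  val catalogs: Set[String] = Set("testcat")
  val currentCatalog: String = "spark_catalog"

  // Split off a leading catalog name if the first part names a catalog.
  private def splitCatalog(parts: Seq[String]): (String, Seq[String]) =
    parts match {
      case head +: tail if catalogs.contains(head) => (head, tail)
      case _ => (currentCatalog, parts)
    }

  // A namespace may be empty: `testcat` alone means the root namespace [].
  def asNamespace(parts: Seq[String]): NamespaceRef = {
    val (catalog, rest) = splitCatalog(parts)
    NamespaceRef(catalog, rest)
  }

  // A table always needs a name, so a lone `testcat` is read as a table
  // named `testcat` in the current catalog, never as a catalog.
  def asTable(parts: Seq[String]): TableRef = {
    require(parts.nonEmpty, "a table identifier needs at least one part")
    if (parts.length == 1) TableRef(currentCatalog, Seq.empty, parts.head)
    else {
      val (catalog, rest) = splitCatalog(parts)
      TableRef(catalog, rest.init, rest.last)
    }
  }
}
```

With this model, `asNamespace(Seq("testcat"))` yields the root namespace of `testcat`, while `asTable(Seq("testcat"))` yields a table reference that a later lookup would fail to find.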

In this Pull Request, we introduce a new framework to resolve v2 commands:

  1. The parser creates logical plans or commands with UnresolvedNamespace/UnresolvedTable/UnresolvedView/UnresolvedRelation. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations.)
  2. The analyzer converts
    2.1 UnresolvedNamespace to ResolvedNamespace (contains the catalog and namespace identifier)
    2.2 UnresolvedTable to ResolvedTable (contains the catalog, identifier and Table)
    2.3 UnresolvedView to ResolvedView (will be added later when we migrate view commands)
    2.4 UnresolvedRelation to relation.
  3. An extra analyzer rule matches commands with V1Table and converts them to the corresponding v1 commands. This will be added later when we migrate existing commands.
  4. The planner matches commands and converts them to the corresponding physical nodes.
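The parse/analyze/plan flow above can be sketched in miniature. This is a toy model under stated assumptions, not the code in this PR: the node names mirror the ones in the list, but here catalogs are plain strings, the `tables` map stands in for a real catalog lookup, and the "physical node" is just a descriptive string.

```scala
object ResolutionFrameworkSketch {
  // Simplified stand-ins for the plan nodes named above.
  sealed trait Node
  final case class UnresolvedNamespace(nameParts: Seq[String]) extends Node
  final case class ResolvedNamespace(catalog: String, namespace: Seq[String]) extends Node
  final case class UnresolvedTable(nameParts: Seq[String]) extends Node
  final case class ResolvedTable(catalog: String, ident: Seq[String], schema: String) extends Node

  // Step 1: the parser builds commands that hold an unresolved child.
  sealed trait Command
  final case class CommentOnNamespace(child: Node, comment: String) extends Command
  final case class CommentOnTable(child: Node, comment: String) extends Command

  // Hypothetical toy catalog: table identifier -> schema string.
  val catalogName = "testcat"
  val tables: Map[Seq[String], String] = Map(Seq("ns", "t") -> "id INT")

  // Step 2: the analyzer swaps unresolved nodes for resolved ones.
  // Namespaces are not looked up; tables are.
  def analyze(cmd: Command): Command = cmd match {
    case CommentOnNamespace(UnresolvedNamespace(parts), c) =>
      CommentOnNamespace(ResolvedNamespace(catalogName, parts), c)
    case CommentOnTable(UnresolvedTable(parts), c) if tables.contains(parts) =>
      CommentOnTable(ResolvedTable(catalogName, parts, tables(parts)), c)
    case other => other // left unresolved; a later check should report it
  }

  // Step 4: the planner only matches fully resolved commands.
  def plan(cmd: Command): String = cmd match {
    case CommentOnNamespace(ResolvedNamespace(cat, ns), c) =>
      s"SetNamespaceComment($cat, ${ns.mkString(".")}, $c)"
    case CommentOnTable(ResolvedTable(cat, ident, _), c) =>
      s"SetTableComment($cat, ${ident.mkString(".")}, $c)"
    case other =>
      throw new IllegalStateException(s"unresolved command: $other")
  }
}
```

Step 3 (the v1 fallback rule) is omitted here because the description says it will be added later.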

We also introduce brand new v2 commands, the COMMENT ON syntaxes, to illustrate how to work with the newly added framework.

COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...

Details about the comment syntaxes:
With the new catalog v2 design, some properties become reserved, e.g. location and comment. We are going to disallow setting reserved properties directly via dbproperties or tblproperties, to avoid conflicts with their related subclauses or dedicated commands.

These syntaxes follow best practices from PostgreSQL and Presto.

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html

The basic ideas of the new framework mostly came from the discussion below with @cloud-fan, #26847 (comment).

Why are the changes needed?

To make it easier to add new v2 commands, and easier to unify the table relation behavior.

Does this PR introduce any user-facing change?

Yes, it adds new syntax.

How was this patch tested?

Added unit tests.

@yaooqinn
Member Author

cc @cloud-fan @maropu, thanks for reviewing this.

@yaooqinn
Member Author

A preliminary discussion can be found in #26806. Thanks again.

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115159 has finished for PR 26847 at commit a005713.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class CommentOnNamespace(namespace: Seq[String], comment: String) extends Command
  • case class CommentOnTable(namespace: Seq[String], comment: String) extends Command

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115164 has finished for PR 26847 at commit 7c45c9b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Dec 11, 2019

This is a new command (no legacy v1 command), and is a good chance to discuss how commands should be resolved and executed ideally.

Since we have a v2 adapter for v1 catalog (V2SessionCatalog), all the table/namespace commands can be implemented via v2 APIs.

Usually a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples:

  1. DROP NAMESPACE: only needs to know the name of the namespace.
  2. DESC NAMESPACE: needs to look up the namespace and get its metadata, but this is done during execution.
  3. DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
  4. DESC TABLE: needs to look up the table and get its metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do the lookup during execution if needed. For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and describes the root namespace under testcat, while DESC TABLE testcat fails if there is no table testcat under the current catalog. This is because a namespace can be named [], but a table can't. Each command should explicitly specify whether it operates on a namespace or a table.

Here is my proposal: introduce UnresolvedNamespace and UnresolvedV2Relation, which can be resolved to ResolvedNamespace(catalog, nameParts) and ResolvedV2Relation(catalog, v2Relation). The parser creates commands containing either UnresolvedNamespace or UnresolvedV2Relation, and the planner converts commands with ResolvedNamespace or ResolvedV2Relation to physical plans.

Note: there is already an UnresolvedV2Relation. We should rename it to something else.

What do you think? @brkyvz @rdblue @imback82

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115186 has finished for PR 26847 at commit 9b272b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 11, 2019

Test build #115181 has finished for PR 26847 at commit e489e62.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Dec 11, 2019

PS @yaooqinn you have almost 20 pull requests open. Can you review some that aren't likely to be reviewed or merged and close them?

@yaooqinn
Member Author

SHOW CURRENT NAMESPACE statement
SHOW NAMESPACES statement

These two statements target v2 only but still use the ParsedStatement trait. @cloud-fan, they may need to change.

@yaooqinn
Member Author

yaooqinn commented Dec 17, 2019

Hi @cloud-fan, I have roughly implemented your proposal #26847 (comment) in this pull request. Would you mind taking a look? Thanks very much.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115460 has finished for PR 26847 at commit 7df9407.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2019

Test build #115459 has finished for PR 26847 at commit 3e37941.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • case class ResolvedV2Table(catalog: TableCatalog, tableIdentifier: Identifier)
  • case class CommentOnNamespace(child: NamespaceNode, comment: String) extends Command
  • case class CommentOnTable(child: TableNode, comment: String) extends Command

case class ResolvedNamespace(catalog: SupportsNamespaces, namespace: Seq[String])
extends NamespaceNode

case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends NamespaceNode {
Contributor

BTW, we should give a nice error message in CheckAnalysis when we hit UnresolvedNamespace and UnresolvedV2Table.
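A rough sketch of what such a check could look like, assuming a simplified plan tree. The node classes, the `firstError` helper, and the error wording are all hypothetical; Spark's actual CheckAnalysis works on real logical plans and may phrase its messages differently.

```scala
object CheckAnalysisSketch {
  // Toy plan tree; every node exposes its children for traversal.
  sealed trait Node { def children: Seq[Node] }
  final case class UnresolvedNamespace(nameParts: Seq[String]) extends Node {
    val children = Seq.empty[Node]
  }
  final case class UnresolvedTable(nameParts: Seq[String]) extends Node {
    val children = Seq.empty[Node]
  }
  final case class ResolvedNamespace(catalog: String, namespace: Seq[String]) extends Node {
    val children = Seq.empty[Node]
  }
  final case class Command(name: String, children: Seq[Node]) extends Node

  // Walk the tree and return a readable message for the first node that
  // the analyzer left unresolved, or None if the plan is fully resolved.
  def firstError(plan: Node): Option[String] = plan match {
    case UnresolvedNamespace(parts) =>
      Some(s"Namespace not found: ${parts.mkString(".")}")
    case UnresolvedTable(parts) =>
      Some(s"Table not found: ${parts.mkString(".")}")
    case other =>
      other.children.iterator.map(firstError).collectFirst { case Some(msg) => msg }
  }
}
```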

@@ -732,6 +742,11 @@ class Analyzer(
          lookupTempView(ident)
            .map(view => i.copy(table = view))
            .getOrElse(i)
        case u @ UnresolvedTable(ident) =>
          lookupTempView(ident).foreach { _ =>
            u.failAnalysis(s"${ident.quoted} is a view not table.")
Contributor

Shouldn't this be "is a temp view not table"?
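For illustration, the check under discussion can be modeled without Spark as below. The `tempViews` and `tables` registries and the `resolveTable` helper are hypothetical, and identifiers are joined with plain dots rather than Spark's quoting; the error text uses the wording suggested in this comment.

```scala
object TempViewCheckSketch {
  // Hypothetical registries standing in for the session's temp views and tables.
  val tempViews: Set[Seq[String]] = Set(Seq("v"))
  val tables: Set[Seq[String]] = Set(Seq("ns", "t"))

  // Temp views are checked first, so a command that requires a table
  // fails with a specific message instead of resolving to the view.
  def resolveTable(ident: Seq[String]): Either[String, Seq[String]] = {
    val name = ident.mkString(".")
    if (tempViews.contains(ident)) Left(s"$name is a temp view not table.")
    else if (tables.contains(ident)) Right(ident)
    else Left(s"Table not found: $name")
  }
}
```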

// unset this config to use the default v2 session catalog.
spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key)
// Session catalog is used.
sql(s"CREATE NAMESPACE ns")
Contributor

The s interpolator is not needed.

Member Author

thanks

@viirya
Member

viirya commented Dec 30, 2019

I think it is good to unify the relation resolution. A consistent approach to doing that sounds like a good idea. One question: do we need to add the new COMMENT ON syntax here too?

Comment on lines +728 to +730
case UnresolvedNamespace(CatalogAndNamespace(catalog, ns)) =>
ResolvedNamespace(catalog.asNamespaceCatalog, ns)
}
Member

Does this mean the ParsedStatement from the parser will switch to using UnresolvedNamespace? Currently, the catalogs in statements are resolved in ResolveCatalogs. Will we need to refactor ResolveCatalogs because of this change?

Member Author

Yes, see
#26847 (comment)

We can have an extra rule to catch commands with a v1 relation and convert them to v1 commands. This can help us get rid of the duplicated code between ResolveCatalogs and ResolveSessionCatalog.
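A minimal sketch of that extra rule, assuming toy plan classes rather than Spark's. `DescribeTable` and `DescribeTableV1` are hypothetical command names chosen for the example, and `V1Table` is reduced to a name.

```scala
object V1FallbackSketch {
  sealed trait Table
  final case class V1Table(name: String) extends Table // stand-in for a session catalog table
  final case class V2Table(name: String) extends Table

  sealed trait Plan
  final case class ResolvedTable(catalog: String, table: Table) extends Plan
  final case class DescribeTable(child: Plan) extends Plan          // v2 logical command
  final case class DescribeTableV1(tableName: String) extends Plan  // hypothetical v1 command

  // The single extra rule: once the shared resolution has produced a
  // ResolvedTable, only commands whose table is a V1Table are rewritten.
  def convertToV1(plan: Plan): Plan = plan match {
    case DescribeTable(ResolvedTable(_, V1Table(name))) => DescribeTableV1(name)
    case other => other
  }
}
```

Because the lookup happens once, before this rule runs, the v1 and v2 paths no longer need their own copies of the catalog resolution logic.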

@yaooqinn
Member Author

I think it is good to unify the relation resolution. A consistent approach to do that sounds a good idea. One question is do we need to add new syntax comment on here too?

We add the COMMENT ON syntax to modify the comment because comment has become a reserved property, like location. The purposes are: 1) to remove the ambiguity between the comment property and the COMMENT subclause in the CREATE syntax, so that later only the subclause is valid; 2) to prevent unexpected changes to reserved properties through the ALTER ... SET PROPERTIES syntax, since every reserved property should have its own dedicated syntax for modification. Also, if these properties are changed by systems outside Spark, we can ignore them accordingly.
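The intended split between generic properties and reserved ones can be sketched like this. It is a simplified model, not the PR's code: the `reserved` set lists only the two keys named in this thread, and `setProperties` and `commentOn` are hypothetical helpers standing in for ALTER ... SET PROPERTIES and COMMENT ON.

```scala
object ReservedPropertiesSketch {
  // Only the two reserved keys mentioned in this thread; the real list may differ.
  val reserved: Set[String] = Set("comment", "location")

  // Models ALTER ... SET PROPERTIES: reserved keys are rejected outright.
  def setProperties(
      current: Map[String, String],
      changes: Map[String, String]): Either[String, Map[String, String]] = {
    val bad = changes.keySet.intersect(reserved)
    if (bad.nonEmpty)
      Left(s"Cannot set reserved properties directly: ${bad.toSeq.sorted.mkString(", ")}")
    else Right(current ++ changes)
  }

  // Models COMMENT ON ... IS ...: the dedicated path that may change the comment.
  def commentOn(current: Map[String, String], comment: String): Map[String, String] =
    current + ("comment" -> comment)
}
```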

@SparkQA

SparkQA commented Dec 31, 2019

Test build #115978 has finished for PR 26847 at commit 9d10239.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 31, 2019

Test build #115980 has finished for PR 26847 at commit c391212.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

COMMENT ON is used to demonstrate the new framework, and show how easy it is to implement a command.

@SparkQA

SparkQA commented Jan 3, 2020

Test build #116067 has finished for PR 26847 at commit d20b1b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in c49388a Jan 3, 2020
@cloud-fan
Contributor

@imback82 Let's start unifying the relation resolution!

@imback82
Contributor

imback82 commented Jan 3, 2020

Cool! Thanks @yaooqinn and @cloud-fan!

cloud-fan pushed a commit that referenced this pull request Jan 7, 2020
…n framework

### What changes were proposed in this pull request?

#26847 introduced a new framework for resolving catalogs/namespaces. This PR proposes to integrate commands that need to resolve namespaces into the new framework.

### Why are the changes needed?

This is one of the work items for moving into the new resolution framework. Resolving v1/v2 tables with the new framework will be followed up in different PRs.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests should cover the changes.

Closes #27095 from imback82/unresolved_ns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@gatorsmile
Member

@imback82 COMMENT ON is a good example, but the code changes for implementing COMMENT ON are not small. It might make the code review harder. I think we can follow what @rdblue and @brkyvz suggested above to separate the actual changes from COMMENT ON?

How about reverting it and submitting two separate PRs?

@cloud-fan
Contributor

COMMENT ON only takes about 60 LOC excluding tests, so at least it doesn't make my review harder.

This framework was blocking many things. Now we have several commits/open PRs depending on it:
314e70f
b2ed6d0
#26921
#26775

I don't think it's realistic to revert it now. In general, we should separate PR into smaller ones, but there are always exceptions, e.g. #24798 (comment)

If you have different ideas about the framework, please leave comments here, and we'll address them in followups.

@gatorsmile
Member

gatorsmile commented Jan 9, 2020

I would -1 on this comment: #24798 (comment). We should avoid making such an exception.

Below is my proposal for such cases: we can keep the original giant PR (which might contain the related refactoring, new changes and good use cases). When we decide to merge it after reviewing the high-level ideas/solutions, we can split it and open multiple smaller PRs, in which we can refer to the original PR for more context.

Does that sound reasonable to all of you?

@rdblue
Contributor

rdblue commented Jan 9, 2020

+1 for @gatorsmile's approach of using a WIP PR for initial review and later separating features from refactoring and other changes.

I think that this should be reverted because there was a standing -1. That -1 was also echoed by @brkyvz and the same issue was later pointed out by @viirya. I think it was inappropriate to commit this in the first place, so it should be reverted.

I also think that we should take a closer look at the proposal for how resolution should work, and whether we want to target that for the 3.0 release given that we have several issues that we want to solve before 3.0 and the resolution works as it is now.

@cloud-fan
Contributor

I like the proposal from @gatorsmile . Let's put it in the review guideline and make it a policy. But the new policy should apply to new PRs only, not already merged PRs.

@rdblue do you have some concrete concerns about what's wrong with the new framework? Both @imback82 and I have taken a close look, and we think it's the right direction to go. If the new framework does have some major flaws, then let's revert it.

Note that, the DDL/DML command resolution framework is new in 3.0. The v1 commands just look up relation by themselves. If we all agree to use this new framework, we should make sure 3.0 and 3.1 use the same framework, otherwise it's really painful to backport bug fixes in the future.

rdblue pushed a commit to Netflix/spark that referenced this pull request Jan 21, 2020
Because we now have a v2 adapter for the v1 catalog (`V2SessionCatalog`), all table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it operates on, but different commands have different requirements about what to resolve. A few examples:

  - `DROP NAMESPACE`: only needs to know the name of the namespace.
  - `DESC NAMESPACE`: needs to look up the namespace and get its metadata, but this is done during execution.
  - `DROP TABLE`: needs to do a lookup and make sure it's a table, not a (temp) view.
  - `DESC TABLE`: needs to look up the table and get its metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do the lookup during execution if needed.

For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: `DESC NAMESPACE testcat` works and describes the root namespace under `testcat`, while `DESC TABLE testcat` fails if there is no table `testcat` under the current catalog. This is because a namespace can be named [], but a table can't. Each command should explicitly specify whether it operates on a namespace or a table.

In this Pull Request, we introduce a new framework to resolve v2 commands:
1. The parser creates logical plans or commands with `UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations.)
2. The analyzer converts
2.1 `UnresolvedNamespace` to `ResolvedNamespace` (contains the catalog and namespace identifier)
2.2 `UnresolvedTable` to `ResolvedTable` (contains the catalog, identifier and `Table`)
2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands)
2.4 `UnresolvedRelation` to relation.
3. An extra analyzer rule matches commands with `V1Table` and converts them to the corresponding v1 commands. This will be added later when we migrate existing commands.
4. The planner matches commands and converts them to the corresponding physical nodes.

We also introduce brand new v2 commands, the `comment` syntaxes, to illustrate how to work with the newly added framework.
```sql
COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...
```
Details about the `comment` syntaxes:
With the new catalog v2 design, some properties become reserved, e.g. `location` and `comment`. We are going to disallow setting reserved properties directly via dbproperties or tblproperties, to avoid conflicts with their related subclauses or dedicated commands.

These syntaxes follow best practices from PostgreSQL and Presto.

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html

The basic ideas of the new framework mostly came from the discussion below with cloud-fan, apache/spark#26847 (comment).

To make it easier to add new v2 commands, and easier to unify the table relation behavior.

Yes, it adds new syntax.

Added unit tests.

Closes #26847 from yaooqinn/SPARK-30214.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>