-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-30214][SQL] A new framework to resolve v2 commands #26847
Conversation
cc @cloud-fan @maropu, thanks for reviewing this. |
a pre-discussion might be found here #26806 thanks again. |
sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Outdated
Show resolved
Hide resolved
Test build #115159 has finished for PR 26847 at commit
|
Test build #115164 has finished for PR 26847 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
Outdated
Show resolved
Hide resolved
This is a new command (no legacy v1 command), and is a good chance to discuss how commands should be resolved and executed ideally. Since we have a v2 adapter for v1 catalog ( Usually a command needs to know which catalog it needs to operate, but different commands have different requirements about what to resolve. A few examples:
For namespaces, analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed. For tables, mostly commands need analyzer to do lookup. Note that, table and namespace have a difference: Here is my proposal: introduce Note: there is already a |
Test build #115186 has finished for PR 26847 at commit
|
Test build #115181 has finished for PR 26847 at commit
|
PS @yaooqinn you have almost 20 pull requests open. Can you review some that aren't likely to be reviewed or merged and close them? |
These 2 are only v2 targeted, but with |
Hi, @cloud-fan, I have roughly implemented your proposal #26847 (comment) in this pull request, mind to take a look? thanks very much. |
Test build #115460 has finished for PR 26847 at commit
|
Test build #115459 has finished for PR 26847 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NamespaceNode.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableNode.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TableNode.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala
Outdated
Show resolved
Hide resolved
case class ResolvedNamespace(catalog: SupportsNamespaces, namespace: Seq[String]) | ||
extends NamespaceNode | ||
|
||
case class UnresolvedNamespace(multipartIdentifier: Seq[String]) extends NamespaceNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, we should give nice error message in CheckAnalysis
when we hit UnresolvedNamespace
and UnresolvedV2Table
.
@@ -732,6 +742,11 @@ class Analyzer( | |||
lookupTempView(ident) | |||
.map(view => i.copy(table = view)) | |||
.getOrElse(i) | |||
case u @ UnresolvedTable(ident) => | |||
lookupTempView(ident).foreach { _ => | |||
u.failAnalysis(s"${ident.quoted} is a view not table.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be is a temp view not table
?
// unset this config to use the default v2 session catalog. | ||
spark.conf.unset(V2_SESSION_CATALOG_IMPLEMENTATION.key) | ||
// Session catalog is used. | ||
sql(s"CREATE NAMESPACE ns") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s
is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks
I think it is good to unify the relation resolution. A consistent approach to do that sounds a good idea. One question is do we need to add new syntax |
case UnresolvedNamespace(CatalogAndNamespace(catalog, ns)) => | ||
ResolvedNamespace(catalog.asNamespaceCatalog, ns) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean the ParsedStatement
from parser will turn to use UnresolvedNamespace
? Currently, the catalogs in statements are resolved at ResolveCatalogs
. Will we need to refactor ResolveCatalogs
due to this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see
#26847 (comment)
We can have an extra rule to catch commands with v1 relation, and convert them to v1 commands. This can help us get rid of the duplicated code between ResolveCatalogs and ResolveSessionCatalog
Similar to
We add |
Test build #115978 has finished for PR 26847 at commit
|
Test build #115980 has finished for PR 26847 at commit
|
COMMENT ON is used to demonstrate the new framework, and show how easy it is to implement a command. |
Test build #116067 has finished for PR 26847 at commit
|
thanks, merging to master! |
@imback82 Let's start unifying the relation resolution! |
Cool! Thanks @yaooqinn and @cloud-fan! |
…n framework ### What changes were proposed in this pull request? #26847 introduced new framework for resolving catalog/namespaces. This PR proposes to integrate commands that need to resolve namespaces into the new framework. ### Why are the changes needed? This is one of the work items for moving into the new resolution framework. Resolving v1/v2 tables with the new framework will be followed up in different PRs. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover the changes. Closes #27095 from imback82/unresolved_ns. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@imback82 COMMENT ON is a good example, but the code changes for implementing COMMENT ON are not small. It might make the code review harder. I think we can follow what @rdblue and @brkyvz suggested above to separate the actual changes from COMMENT ON? How about reverting it and submitted two separate PRs? |
The COMMENT ON only takes about 60 LOC except for tests, it doesn't make my review harder at least. This framework was blocking many things. Now we have several commits/open PRs depending on it: I don't think it's realistic to revert it now. In general, we should separate PR into smaller ones, but there are always exceptions, e.g. #24798 (comment) If you have different ideas about the framework, please leave comments here, and we'll address them in followups. |
I would -1 on this comment #24798 (comment) We should avoid making such an exception. Below is my proposal for such cases, we can keep the original giant PR [which might contain the related refactoring, new changes and good use cases] . When we deciding to merge it after reviewing the high level ideas/solutions, we can split it and open multiple smaller PRs, in which we can refer to the original PR for more context info. Does that sound reasonable to all of you? |
+1 for @gatorsmile's approach of using a WIP PR for initial review and later separating features from refactoring and other changes. I think that this should be reverted because there was a standing -1. That -1 was also echoed by @brkyvz and the same issue was later pointed out by @viirya. I think it was inappropriate to commit this in the first place, so it should be reverted. I also think that we should take a closer look at the proposal for how resolution should work, and whether we want to target that for the 3.0 release given that we have several issues that we want to solve before 3.0 and the resolution works as it is now. |
I like the proposal from @gatorsmile . Let's put it in the review guideline and make it a policy. But the new policy should apply to new PRs only, not already merged PRs. @rdblue do you have some concrete concerns about what's wrong with the new framework? Both @imback82 and I have taken a close look, and we think it's the right direction to go. If the new framework does have some major flaws, then let's revert it. Note that, the DDL/DML command resolution framework is new in 3.0. The v1 commands just look up relation by themselves. If we all agree to use this new framework, we should make sure 3.0 and 3.1 use the same framework, otherwise it's really painful to backport bug fixes in the future. |
Currently, we have a v2 adapter for v1 catalog (`V2SessionCatalog`), all the table/namespace commands can be implemented via v2 APIs. Usually, a command needs to know which catalog it needs to operate, but different commands have different requirements about what to resolve. A few examples: - `DROP NAMESPACE`: only need to know the name of the namespace. - `DESC NAMESPACE`: need to lookup the namespace and get metadata, but is done during execution - `DROP TABLE`: need to do lookup and make sure it's a table not (temp) view. - `DESC TABLE`: need to lookup the table and get metadata. For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed. For tables, mostly commands need the analyzer to do lookup. Note that, table and namespace have a difference: `DESC NAMESPACE testcat` works and describes the root namespace under `testcat`, while `DESC TABLE testcat` fails if there is no table `testcat` under the current catalog. It's because namespaces can be named [], but tables can't. The commands should explicitly specify it needs to operate on namespace or table. In this Pull Request, we introduce a new framework to resolve v2 commands: 1. parser creates logical plans or commands with `UnresolvedNamespace`/`UnresolvedTable`/`UnresolvedView`/`UnresolvedRelation`. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations) 2. analyzer converts 2.1 `UnresolvedNamespace` to `ResolvesNamespace` (contains catalog and namespace identifier) 2.2 `UnresolvedTable` to `ResolvedTable` (contains catalog, identifier and `Table`) 2.3 `UnresolvedView` to `ResolvedView` (will be added later when we migrate view commands) 2.4 `UnresolvedRelation` to relation. 3. an extra analyzer rule to match commands with `V1Table` and converts them to corresponding v1 commands. This will be added later when we migrate existing commands 4. planner matches commands and converts them to the corresponding physical nodes. We also introduce brand new v2 commands - the `comment` syntaxes to illustrate how to work with the newly added framework. ```sql COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ... COMMENT ON TABLE ... IS ... ``` Details about the `comment` syntaxes: As the new design of catalog v2, some properties become reserved, e.g. `location`, `comment`. We are going to disable setting reserved properties by dbproperties or tblproperites directly to avoid confliction with their related subClause or specific commands. They are the best practices from PostgreSQL and presto. https://www.postgresql.org/docs/12/sql-comment.html https://prestosql.io/docs/current/sql/comment.html Mostly, the basic thoughts of the new framework came from the discussions bellow with cloud-fan, apache/spark#26847 (comment), To make it easier to add new v2 commands, and easier to unify the table relation behavior. yes, add new syntax add uts. Closes #26847 from yaooqinn/SPARK-30214. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Currently, we have a v2 adapter for v1 catalog (
V2SessionCatalog
), all the table/namespace commands can be implemented via v2 APIs.Usually, a command needs to know which catalog it needs to operate, but different commands have different requirements about what to resolve. A few examples:
DROP NAMESPACE
: only need to know the name of the namespace.DESC NAMESPACE
: need to lookup the namespace and get metadata, but is done during executionDROP TABLE
: need to do lookup and make sure it's a table not (temp) view.DESC TABLE
: need to lookup the table and get metadata.For namespaces, the analyzer only needs to find the catalog and the namespace name. The command can do lookup during execution if needed.
For tables, mostly commands need the analyzer to do lookup.
Note that, table and namespace have a difference:
DESC NAMESPACE testcat
works and describes the root namespace undertestcat
, whileDESC TABLE testcat
fails if there is no tabletestcat
under the current catalog. It's because namespaces can be named [], but tables can't. The commands should explicitly specify it needs to operate on namespace or table.In this Pull Request, we introduce a new framework to resolve v2 commands:
UnresolvedNamespace
/UnresolvedTable
/UnresolvedView
/UnresolvedRelation
. (CREATE TABLE still keeps Seq[String], as it doesn't need to look up relations)2.1
UnresolvedNamespace
toResolvesNamespace
(contains catalog and namespace identifier)2.2
UnresolvedTable
toResolvedTable
(contains catalog, identifier andTable
)2.3
UnresolvedView
toResolvedView
(will be added later when we migrate view commands)2.4
UnresolvedRelation
to relation.V1Table
and converts them to corresponding v1 commands. This will be added later when we migrate existing commandsWe also introduce brand new v2 commands - the
comment
syntaxes to illustrate how to work with the newly added framework.Details about the
comment
syntaxes:As the new design of catalog v2, some properties become reserved, e.g.
location
,comment
. We are going to disable setting reserved properties by dbproperties or tblproperites directly to avoid confliction with their related subClause or specific commands.They are the best practices from PostgreSQL and presto.
https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html
Mostly, the basic thoughts of the new framework came from the discussions bellow with @cloud-fan, #26847 (comment),
Why are the changes needed?
To make it easier to add new v2 commands, and easier to unify the table relation behavior.
Does this PR introduce any user-facing change?
yes, add new syntax
How was this patch tested?
add uts.