[SPARK-41405][SQL] Centralize the column resolution logic #38888
Conversation
// is fully resolved, similar to the rule `ResolveAggregateFunctions`. However, Aggregate
// with GROUPING SETS is marked as unresolved and many analyzer rules can't apply to
// UnresolvedHaving because its child is not resolved. Here we explicitly resolve columns
// and subqueries of UnresolvedHaving so that the rewrite works in most cases.
This follows the previous code and has the same issues as before. For example:

create temp view t as select 1 a, 2 b, 3d c;
select max(a) from t group by grouping sets ((b, c), (b + c)) having b + c > 0;

org.apache.spark.sql.AnalysisException: Column 'b' does not exist. Did you mean one of the following? [max(a)]

This fails because `b + c` needs type coercion to be resolved, which will never happen as Aggregate with GROUPING SETS is marked as unresolved. Spark therefore never learns that `b + c` is actually the grouping expression and cannot rewrite HAVING.
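For reference, the same failure as a self-contained snippet; this is only a sketch assuming a plain SparkSession named `spark`, with the SQL taken verbatim from the comment above:

```scala
// Repro of the HAVING + GROUPING SETS limitation discussed above (sketch; assumes an
// existing SparkSession `spark`). The second statement is expected to throw
// AnalysisException: Column 'b' does not exist, because `b + c` (int + double) needs
// type coercion before it can be matched against the grouping expression, and type
// coercion never runs while the Aggregate with GROUPING SETS is still unresolved.
spark.sql("create temp view t as select 1 a, 2 b, 3d c")
spark.sql("select max(a) from t group by grouping sets ((b, c), (b + c)) having b + c > 0")
// By contrast, per the discussion, a HAVING condition that needs no coercion
// (e.g. `having b > 1`) is expected to resolve via the explicit UnresolvedHaving handling.
```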
// the missing attributes from the descendant node to the current node, and project them away
// at the end via an extra Project.
case s @ Sort(order, _, child) if !s.resolved || s.missingInput.nonEmpty =>
  val resolvedNoOuter = order.map(resolveExpressionByPlanOutput(_, child))
I didn't use `resolveExpressionByPlanChildren`, to follow the previous code: https://github.com/apache/spark/pull/38888/files#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47L1469 . I'm not sure if it would make a difference, but I just want to be safe.
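To illustrate the Sort missing-attribute handling that the snippet and comment above deal with, here is a hedged example; the view name and data are made up, and it assumes a SparkSession `spark`:

```scala
// Sketch only: ORDER BY references `salary`, which is not in the SELECT list. The analyzer
// resolves it against the child, adds it as a missing attribute to the Sort's input, and
// projects it away again with an extra Project on top, as the code comment above describes.
spark.sql("create temp view emp as select 'alice' as name, 1000 as salary")
spark.sql("select name from emp order by salary").explain(true)
```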
 * or resolved attributes which are missing from child output. This method tries to find the
 * missing attributes and add them into the projection.
 */
private def resolveExprsAndAddMissingAttrs(
// If it has been tried to be resolved but failed, mark it as unresolved so that other rules can
// try to resolve it again.
The purpose of this expression is to hold the original column name for a resolved column, so that the column resolution can be undone. With this new `hasTried`, does it become something that is resolved but also failed to resolve?
Yes, I'll update the classdoc later. Now this expression can be used to undo column resolution, or redo it with a different priority.
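To make the discussion concrete, here is a deliberately simplified, hypothetical sketch of such an expression; it is not Spark's actual `TempResolvedColumn` definition, just an illustration of the two roles mentioned above (undoing a tentative resolution, and retrying it with a different priority):

```scala
// Hypothetical sketch, not Spark's real class: keep both the tentative resolution result
// and the original name parts, so the resolution can be undone; `hasTried` records that a
// resolution attempt with the default priority already failed and should be retried with a
// different priority (e.g. as a grouping column or an outer reference).
final case class TentativelyResolvedColumn[E](
    resolved: E,               // the expression this column was tentatively resolved to
    nameParts: Seq[String],    // the original, unresolved column name parts
    hasTried: Boolean = false) // true once the default-priority resolution has failed
```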
> This PR centralizes the column resolution logic into a single rule

I don't see a new rule added by this, but two rules were removed. Which single rule are you referring to? `ResolveReferences`?

Yes, I've updated the PR description to make it clear.
@@ -149,9 +149,9 @@ org.apache.spark.sql.AnalysisException
"queryContext" : [ {
The error class:

"_LEGACY_ERROR_TEMP_2422" : {
  "message" : [
    "grouping expressions sequence is empty, and '<sqlExpr>' is not an aggregate function. Wrap '<aggExprs>' in windowing function(s) or wrap '<sqlExpr>' in first() (or first_value) if you don't care which value you get."
  ]
},

The query context actually becomes more accurate.
 * resolution with a different priority if the analyzer has tried to resolve it with the default
 * priority before but failed.
Suggested change:
  * resolution with a different priority if the analyzer has tried to resolve it with the default
- * priority before but failed.
+ * priority before but failed (i.e. `hasTried` is true).
 * `ResolveAggregationFunctions` will replace [[TempResolvedColumn]] with [[AttributeReference]] if
 * it's inside aggregate functions or group expressions, or mark it as `hasTried` otherwise, hoping
I'm not sure I read this correctly: `hasTried` will be set to true if the expression hosting `TempResolvedColumn` cannot be resolved, OR if it is not inside aggregate functions or group expressions?
If it is not inside aggregate functions or group expressions. Let me rephrase the doc a bit more.
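A hedged SQL-level illustration of the behavior described above (view and column names are made up; assumes a SparkSession `spark`):

```scala
// Sketch: `store` appears in GROUP BY, so its temporarily resolved column can be replaced
// with a real attribute reference and the HAVING condition resolves.
spark.sql("create temp view sales as select 1 as store, 10 as amount")
spark.sql("select sum(amount) from sales group by store having store > 0").show()
// `amount` below is neither a grouping expression nor inside an aggregate function, so its
// tentative resolution is marked as tried-and-failed; unless another rule (e.g. outer
// reference resolution in a subquery) can resolve it, analysis is expected to fail.
// spark.sql("select sum(amount) from sales group by store having amount > 0")
```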
  .map(_.asInstanceOf[NamedExpression])
a.copy(resolvedGroupingExprs, resolvedAggExprsWithOuter, a.child)

// Special case for Project as it supports literal column alias.
Suggested change:
- // Special case for Project as it supports literal column alias.
+ // Special case for Project as it supports Lateral column alias.
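For context, a small example of the lateral column alias feature the corrected comment refers to; this is a sketch with made-up names, assuming a SparkSession `spark` and that implicit lateral column alias resolution is enabled (the default in recent versions):

```scala
// `bonus` is defined earlier in the same SELECT list and referenced laterally,
// which is the Project special case mentioned above.
spark.sql("create temp view staff as select 'alice' as name, 1000 as salary")
spark.sql("select salary * 0.1 as bonus, salary + bonus as total from staff").show()
```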
So far it looks good to me, although there are some changes I haven't read through yet.
@@ -547,8 +547,7 @@ class LateralColumnAliasSuite extends LateralColumnAliasSuiteBase {

  test("Lateral alias of a complex type") {
    // test both Project and Aggregate
    // TODO(anchovyu): re-enable aggregate tests when fixed the having issue
This bug is fixed with this refactor
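As a hedged illustration of what "lateral alias of a complex type" means (names made up; assumes a SparkSession `spark` with lateral column alias resolution enabled):

```scala
// `s` is a struct defined in the same SELECT list; `s.x` and `s.y` are lateral references
// into that complex-typed alias. Per the discussion above, the Aggregate variant of this
// test was previously disabled and is re-enabled by this refactor.
spark.sql("select named_struct('x', 1, 'y', 2) as s, s.x + s.y as total").show()
```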
@viirya can you take another look when you have time? Thanks!
I will take another look today.
// We should resolve the references normally based on child (agg.output) first.
val maybeResolved = resolveExpressionByPlanOutput(cond, agg)
resolveOperatorWithAggregate(Seq(maybeResolved), agg, (newExprs, newChild) => {
case Filter(cond, agg: Aggregate) if agg.resolved && cond.resolved =>
Suggested change:
- case Filter(cond, agg: Aggregate) if agg.resolved && cond.resolved =>
+ case Filter(cond, agg: Aggregate) if agg.resolved && !cond.resolved =>
?
Oh, never mind, I got it after reading the existing code.
Yea, and I added a comment to mention this: https://github.com/apache/spark/pull/38888/files#diff-ed19f376a63eba52eea59ca71f3355d4495fad4fad4db9a3324aade0d4986a47R2829
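A hedged example of why the condition is resolved against `agg.output` first (table and alias names are made up; assumes a SparkSession `spark`):

```scala
// The alias `m` exists only in the Aggregate's output, not in its child, so resolving the
// HAVING condition against agg.output first is what makes this query work.
spark.sql("create temp view t2 as select 1 as a, 2 as b")
spark.sql("select b, max(a) as m from t2 group by b having m > 0").show()
```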
Looks good to me. I think this is much clearer than before.
The last commit is just a minor code simplification.
Thanks for the review, merging to master!
Sorry for missing this earlier, late LGTM. Changes like this are a good step toward moving analysis logic closer to one pass. Ideally we could e.g. start making …
### What changes were proposed in this pull request?
This is a followup of #38888. When I searched for all the matches of `UnresolvedAttribute`, I found that there are still a few rules doing column resolution:
1. ResolveAggAliasInGroupBy
2. ResolveGroupByAll
3. ResolveOrderByAll
4. ResolveDefaultColumns

This PR merges the first 3 into `ResolveReferences`. The last one will be done in a separate PR, as it's more complicated. To avoid making the rule `ResolveReferences` bigger and bigger, this PR pulls out the resolution code for `Aggregate` into a separate virtual rule (only used by `ResolveReferences`). The same applies to `Sort`. We can refactor and add more virtual rules later.

### Why are the changes needed?
It's problematic to not centralize all the column resolution logic, as the execution order of the rules is not reliable. It actually leads to a regression after #38888: `select a from t where exists (select 1 as a group by a)`. The `group by a` should be resolved as `1 as a`, but now it's resolved as the outer reference `a`. This is because `ResolveReferences` runs before `ResolveAggAliasInGroupBy` and resolves outer references too early.

### Does this PR introduce _any_ user-facing change?
Fixes a bug, but the bug is not released yet.

### How was this patch tested?
new tests

Closes #39508 from cloud-fan/column.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR refactors the default column value resolution so that we don't need an extra DS v2 API for external v2 sources. The general idea is to split the default column value resolution into two parts:
1. Resolve the column "DEFAULT" to the column default expression. This applies to `Project`/`UnresolvedInlineTable` under `InsertIntoStatement`, and assignment expressions in `UpdateTable`/`MergeIntoTable`.
2. Fill missing columns with column default values for the input query. This does not apply to UPDATE and the non-INSERT actions of MERGE, as they use the column from the target table as the default value.

The first part should be done for all the data sources, as it's part of column resolution. The second part should not be applied to v2 data sources with `ACCEPT_ANY_SCHEMA`, as they are free to define how to handle missing columns.

More concretely, this PR:
1. Puts the column "DEFAULT" resolution logic in the rule `ResolveReferences`, with two new virtual rules. This follows #38888.
2. Puts the missing column handling in `TableOutputResolver`, which is shared by both the v1 and v2 insertion resolution rules. External v2 data sources can add custom catalyst rules to deal with missing columns themselves.
3. Removes the old rule `ResolveDefaultColumns`. Note that, with the refactor, we no longer need to manually look up the table. We will deal with column default values after the target table of INSERT/UPDATE/MERGE is resolved.
4. Removes the rule `ResolveUserSpecifiedColumns` and merges it into `PreprocessTableInsertion`. These two rules both resolve v1 insertion, and it's tricky to reason about their interactions. It's clearer to resolve the insertion in one pass.

### Why are the changes needed?
Code cleanup and removing an unneeded DS v2 API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated tests

Closes #41262 from cloud-fan/def-val.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
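A hedged SQL-level illustration of the two parts of default column value resolution described in the commit message above (table name is made up; assumes a SparkSession `spark` and a data source that supports column defaults, e.g. the built-in v1 file sources):

```scala
// Part 1: the DEFAULT keyword in the VALUES list is resolved to the column default expression.
spark.sql("create table items (id int, qty int default 1) using parquet")
spark.sql("insert into items values (1, default)")
// Part 2: a missing column in the input query is filled with its default value.
spark.sql("insert into items (id) values (2)")
spark.sql("select * from items").show()
```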
### What changes were proposed in this pull request?
This PR is a major refactor of how Spark resolves columns. Today, the column resolution logic is placed in several rules, which makes it hard to understand. It's also very fragile to maintain the resolution precedence, as you have to carefully deal with the interactions between these rules.

This PR centralizes the column resolution logic into a single rule: the existing `ResolveReferences` rule, so that we no longer need to worry about the interactions between multiple rules. The detailed resolution precedence is also documented.

### Why are the changes needed?
code cleanup

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
existing tests