Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-41630][SQL] Support implicit lateral column alias resolution on Project #38776

Closed
wants to merge 18 commits into from

Conversation

anchovYu
Copy link
Contributor

@anchovYu anchovYu commented Nov 23, 2022

What changes were proposed in this pull request?

This PR implements a new feature: Implicit lateral column alias on Project case, controlled by spark.sql.lateralColumnAlias.enableImplicitResolution temporarily (default false now, but will turn on this conf once the feature is completely merged).

Lateral column alias

View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias.
There are two main cases to support: LCA in Project, and LCA in Aggregate.

-- LCA in Project. The base_salary references an attribute defined by a previous alias
SELECT salary AS base_salary, base_salary + bonus AS total_salary
FROM employee

-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus)
FROM employee
GROUP BY dept

This implicit lateral column alias (no explicit keyword, e.g. lateral.base_salary) should be supported.

High level design

This PR defines a new Resolution rule, ResolveLateralColumnAlias to resolve the implicit lateral column alias, covering the Project case.
It introduces a new leaf node NamedExpression, LateralColumnAliasReference, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias.

The whole process is generally divided into two phases:

  1. recognize resolved lateral alias, wrap the attributes referencing them with LateralColumnAliasReference.
  2. when the whole operator is resolved, unwrap LateralColumnAliasReference. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project.

For example:

// Before
Project [age AS a, 'a + 1]
+- Child

// After phase 1
Project [age AS a, lateralalias(a) + 1]
+- Child

// After phase 2
Project [a, a + 1]
+- Project [child output, age AS a]
   +- Child

Resolution order

Given this new rule, the name resolution order will be (higher -> lower):

local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..)

There is a recent refactor that moves the creation of OuterReference in the Resolution batch: #38851.
Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an OuterReference using lateral column alias, similar as an UnresolvedAttribute. If success, it strips OuterReference and also wraps it with LateralColumnAliasReference.

Why are the changes needed?

The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience.

Does this PR introduce any user-facing change?

Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged.

How was this patch tested?

Existing tests and newly added tests.

@github-actions github-actions bot added the SQL label Nov 23, 2022
@anchovYu anchovYu changed the title [WIP] Refactor Analyzer by moving several public methods to the Analyzer object [WIP][SQL] Refactor Analyzer by moving several public methods to the Analyzer object Nov 23, 2022
@anchovYu anchovYu changed the title [WIP][SQL] Refactor Analyzer by moving several public methods to the Analyzer object [WIP][SQL] Refactor Analyzer by moving several public methods to the new Analyzer object Nov 23, 2022
@anchovYu anchovYu changed the title [WIP][SQL] Refactor Analyzer by moving several public methods to the new Analyzer object [WIP][SQL] Refactor Analyzer by moving several helper public methods to the new Analyzer object Nov 23, 2022
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

(cherry picked from commit 94adb3f)
(cherry picked from commit 313b2c9)
@github-actions github-actions bot added the CORE label Nov 28, 2022
@anchovYu anchovYu changed the title [WIP][SQL] Refactor Analyzer by moving several helper public methods to the new Analyzer object [SPARK-27561][SQL] Support implicit lateral column alias resolution on Project and refactor Analyzer Nov 28, 2022
@anchovYu
Copy link
Contributor Author

@cloud-fan @gengliangwang

&& projectList.exists(_.containsPattern(LATERAL_COLUMN_ALIAS_REFERENCE)) =>
// build the map again in case the project list changes and index goes off
// TODO one risk: is there any rule that strips off /add the Alias? that the LCA is resolved
// in the beginning, but when it comes to push down, it really can't find the matching one?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we will strip top-level alias. But if other rules split Project, then an alias may become an attribute in the upper Project. Fortunately I'm not aware of such rules.

Copy link
Contributor Author

@anchovYu anchovYu Dec 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I worried that the alias is turned into an attribute in the Project list (in non-top level, I assume LCA can happen in non-top level?). Glad there isn't such case, but even there is, as this PR does, restore back to UnresolvedAttribute should be safe to just throw exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any rule to add another layer of alias?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of any

// Also, when resolving from bottom up should I worry about cases like:
// Project [b AS c, c + 1 AS d]
// +- Project [1 AS a, a AS b]
// b AS c is resolved, even b refers to an alias contains the lateral alias?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we resolve LCA bottom up, Project [1 AS a, a AS b] will become Project [a, a AS b] <- Project [1 as a] first?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, my another worry is if there is some case that a AS b becomes unresolved due to the above comment (the 1 AS a is gone, leaving only a, then a AS b can't find the alias, becoming unresolved). But the above b AS c remains resolved incorrectly .. But anyway seems it doesn't harm, will still fail with unresolved or missing input.

// Implementation notes (to-delete):
// this is a design decision whether to restore the UnresolvedAttribute, or
// directly resolve by constructing a plan and using resolveExpressionByPlanChildren
resolveByLateralAlias(lcaRef.nameParts, aliasEntry.alias).getOrElse(lcaRef)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, why not just return lcaRef.ne?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is safe to resolve it one more time, in case the alias or the field changes or is reconstructed (different exprId).

if (resolvedAttr.resolved) Some(resolvedAttr) else None
}

private def rewriteLateralColumnAlias(plan: LogicalPlan): LogicalPlan = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is quite long...Shall we reduce its size?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method is about 100 LOC after removing unnecessary TODO comments. But I can split it by phases into two methods.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think wrapLCAReference and unwrapLCAReference can move out of this method..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future refactoring of resolving columns by @cloud-fan , I will split this rule into two rules by phases for now. After that refactoring, the wrapping phase will be completely embeded into the new big column resolution rule.

// If there is chaining, don't resolve and save to future rounds
lcaRef
}
case lcaRef: LateralColumnAliasReference if !aliasMap.contains(lcaRef.nameParts.head) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a.b.c -> LCATemp(resExpr, "a.b.c", attribute=alias.toAttribute)

def wrapLCAReference(e: NamedExpression): NamedExpression = {
e.transformWithPruning(_.containsAnyPattern(UNRESOLVED_ATTRIBUTE, OUTER_REFERENCE)) {
case o: OuterReference
if aliasMap.contains(o.nameParts.map(_.head).getOrElse(o.name)) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not familiar with OuterReference. But why only check the head here? Shouldn't we check the ending name part?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general process of resolving an OuterReference is, given an UnresolvedAttribute with name parts, try to use the outer plan to resolve it. Once succeeds, it is resolved to an OuterReference. In this PR, I extend the OuterReference case class with a new nameParts field, saving the original nameParts from UnresolvedAttribute. This is because the nameParts are still needed to resolve lateral aliases.

On the reason why it is matched on the head, it is because the lateral alias reference can only be as the first str in the nameParts. For example, SELECT named_struct('a', 1) AS foo, foo.a, the foo.a has name parts ("foo", "a"), only the head foo can reference the lateral alias. It can't be a, because lateral alias reference won't have any qualifier.

}

test("Lateral alias basics - Project") {
checkAnswer(sql(s"select dept as d, d + 1 as e from $testTable where name = 'amy'"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we have a new test suite LateralColumnAliasDisabledSuite which overrides the checkAnswer method (or a new validation method) for checking exceptions?

@gengliangwang
Copy link
Member

LGTM except minor comments. Great work, @anchovYu

@@ -638,6 +638,14 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
case UnresolvedWindowExpression(_, windowSpec) =>
throw QueryCompilationErrors.windowSpecificationNotDefinedError(windowSpec.name)
})
// This should not happen, resolved Project or Aggregate should restore or resolve
// all lateral column alias references. Add check for extra safe.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we have a rule like RemoveTempResolvedColumn to restore LateralColumnAliasReference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't add it intentionally. This is because I don't want those attributes actually can be resolve as LCA but to show in the error msg as UnresolvedAttribute. Also note that unlike RemoveTempResolvedColumn, LCARef can't be directly resolved to the NamedExpression inside of it because the plan won't be right - there is no alias push down.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this should not happen, we should throw an internal error SparkThrowable.internalError, so that it can include more debug information, instead of UNRESOLVED_COLUMN

// OuterReference. Used in rule ResolveLateralColumnAlias to convert OuterReference back to
// LateralColumnAliasReference.
var nameParts: Option[Seq[String]] = None
def setNameParts(newNameParts: Option[Seq[String]]): OuterReference = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit tricky. Maybe we should invoke WrapLateralColumnAliasReference in ResolveOuterReferences, so that we don't need to re-resolve outer references and introduce this hack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed before, I feel it is not safe to do so given the current solution in ResolveOuterReference that each rule is applied only once. I made up a query (it can't run, just for demonstration):

SELECT *
FROM range(1, 7)
WHERE (
  SELECT id2
  FROM (SELECT dept * 2.0 AS id, id + 1 AS id2 FROM $testTable)) > 5
ORDER BY id

It is possible that dept * 2.0 is not resolved because it needs type conversion, so the LCA rule doesn't apply. Then it just wraps the id in id + 1 AS id2 as OuterReference.

* the current plan is needed to first try resolving the attribute by its
* children
*/
private def wrapLCARefHelper(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private def wrapLCARefHelper(
private def wrapLCARef(

// optional field, the original name parts of UnresolvedAttribute before it is resolved to
// OuterReference. Used in rule ResolveLateralColumnAlias to convert OuterReference back to
// LateralColumnAliasReference.
var nameParts: Option[Seq[String]] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have to keep a mutable state, TreeNodeTag is a better choice. Directly using var in catalyst TreeNode is strongly discouraged.

Copy link
Contributor Author

@anchovYu anchovYu Dec 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QQ: I didn't add it in the constructor of OuterReference due to binary compatibility. Is that concern valid? Actually, what is the risk to change the constructor, but also write another unapply function? This seems impossible without introducing a new object with another name, and still requires large portion of code change of pattern matching.

Copy link
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some minor comments

@anchovYu anchovYu changed the title [SPARK-27561][SQL] Support implicit lateral column alias resolution on Project and refactor Analyzer [SPARK-27561][SQL] Support implicit lateral column alias resolution on Project Dec 12, 2022
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7e9b88b Dec 13, 2022
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022
…n Project

### What changes were proposed in this pull request?
This PR implements a new feature: Implicit lateral column alias  on `Project` case, controlled by `spark.sql.lateralColumnAlias.enableImplicitResolution` temporarily (default false now, but will turn on this conf once the feature is completely merged).

#### Lateral column alias
View https://issues.apache.org/jira/browse/SPARK-27561 for more details on lateral column alias.
There are two main cases to support: LCA in Project, and LCA in Aggregate.
```sql
-- LCA in Project. The base_salary references an attribute defined by a previous alias
SELECT salary AS base_salary, base_salary + bonus AS total_salary
FROM employee

-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus)
FROM employee
GROUP BY dept
```
This **implicit** lateral column alias (no explicit keyword, e.g. `lateral.base_salary`) should be supported.

#### High level design
This PR defines a new Resolution rule, `ResolveLateralColumnAlias` to resolve the implicit lateral column alias, covering the `Project` case.
It introduces a new leaf node NamedExpression, `LateralColumnAliasReference`, as a placeholder used to hold a referenced that has been temporarily resolved as the reference to a lateral column alias.

The whole process is generally divided into two phases:
1) recognize **resolved** lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`.
 2) when the whole operator is resolved, unwrap `LateralColumnAliasReference`. For Project, it further resolves the attributes and push down the referenced lateral aliases to the new Project.

For example:
```
// Before
Project [age AS a, 'a + 1]
+- Child

// After phase 1
Project [age AS a, lateralalias(a) + 1]
+- Child

// After phase 2
Project [a, a + 1]
+- Project [child output, age AS a]
   +- Child
```

#### Resolution order
Given this new rule, the name resolution order will be (higher -> lower):
```
local table column > local metadata attribute > local lateral column alias > all others (outer reference of subquery, parameters of SQL UDF, ..)
```

There is a recent refactor that moves the creation of `OuterReference` in the Resolution batch: apache#38851.
Because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`.

### Why are the changes needed?
The lateral column alias is a popular feature wanted for a long time. It is supported by lots of other database vendors (Redshift, snowflake, etc) and provides a better user experience.

### Does this PR introduce _any_ user-facing change?
Yes, as shown in the above example, it will be able to resolve lateral column alias. I will write the migration guide or release note when most PRs of this feature are merged.

### How was this patch tested?
Existing tests and newly added tests.

Closes apache#38776 from anchovYu/SPARK-27561-refactor.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@anchovYu anchovYu changed the title [SPARK-27561][SQL] Support implicit lateral column alias resolution on Project [SPARK-41630][SQL] Support implicit lateral column alias resolution on Project Dec 20, 2022
@anchovYu
Copy link
Contributor Author

Hi @gengliangwang , I opened subtask for Project in the original parent JIRA: https://issues.apache.org/jira/browse/SPARK-41630. I also updated the title of this PR. Could you help me update the status of that JIRA (assignee, resolved, etc)? Thanks!

cloud-fan pushed a commit that referenced this pull request Dec 21, 2022
…n Aggregate

### What changes were proposed in this pull request?

This PR implements the implicit lateral column alias on `Aggregate` case. For example,
```sql
-- LCA in Aggregate. The avg_salary references an attribute defined by a previous alias
SELECT dept, average(salary) AS avg_salary, avg_salary + average(bonus)
FROM employee
GROUP BY dept
```

The high level implementation idea is to insert the `Project` node above, and falling back to the resolution of lateral alias of Project code path in the last PR.

* Phase 1: recognize resolved lateral alias, wrap the attributes referencing them with `LateralColumnAliasReference`
* Phase 2: when the `Aggregate` operator is resolved, it goes through the whole aggregation list, extracts the aggregation expressions and grouping expressions to keep them in this `Aggregate` node, and add a `Project` above with the original output. It doesn't do anything on `LateralColumnAliasReference`, but completely leave it to the Project in the future turns of this rule.

Example:
```
 // Before rewrite:
 Aggregate [dept#14] [dept#14 AS a#12, 'a + 1, avg(salary#16) AS b#13, 'b + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 1:
 Aggregate [dept#14] [dept#14 AS a#12, lca(a) + 1, avg(salary#16) AS b#13, lca(b) + avg(bonus#17)]
 +- Child [dept#14,name#15,salary#16,bonus#17]

 // After phase 2:
 Project [dept#14 AS a#12, lca(a) + 1, avg(salary)#26 AS b#13, lca(b) + avg(bonus)#27]
 +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
     +- Child [dept#14,name#15,salary#16,bonus#17]

 // Now the problem falls back to the lateral alias resolution in Project.
 // After future rounds of this rule:
 Project [a#12, a#12 + 1, b#13, b#13 + avg(bonus)#27]
 +- Project [dept#14 AS a#12, avg(salary)#26 AS b#13]
    +- Aggregate [dept#14] [avg(salary#16) AS avg(salary)#26, avg(bonus#17) AS avg(bonus)#27, dept#14]
       +- Child [dept#14,name#15,salary#16,bonus#17]
```

Similar as the last PR (#38776), because lateral column alias has higher resolution priority than outer reference, it will try to resolve an `OuterReference` using lateral column alias, similar as an `UnresolvedAttribute`. If success, it strips `OuterReference` and also wraps it with `LateralColumnAliasReference`.

### Why are the changes needed?
Similar as stated in #38776.

### Does this PR introduce _any_ user-facing change?

Yes, as shown in the above example, it will be able to resolve lateral column alias in Aggregate.

### How was this patch tested?

Existing tests and newly added tests.

Closes #39040 from anchovYu/SPARK-27561-agg.

Authored-by: Xinyi Yu <xinyi.yu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants