Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result #28294

Closed
wants to merge 7 commits into from

Conversation

xuanyuanking
Copy link
Member

What changes were proposed in this pull request?

Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).

Why are the changes needed?

The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:

  1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved.
  2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved.
  3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:

SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10

The query's result is

+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+

But if we add CAST, it will return an empty result.

SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10

Does this PR introduce any user-facing change?

Yes, bug fix for cast in having aggregate expressions.

How was this patch tested?

New UT added.

@@ -238,13 +238,13 @@ class Analyzer(
ResolveNaturalAndUsingJoin ::
ResolveOutputRelation ::
ExtractWindowExpressions ::
ResolveTimeZone(conf) ::
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will be reverted after #28288 merged.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, merged in ca90e19

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the help! Reverted in c75fe08

@SparkQA
Copy link

SparkQA commented Apr 22, 2020

Test build #121615 has finished for PR 28294 at commit 0acfc57.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AggregateWithHaving(

@SparkQA
Copy link

SparkQA commented Apr 22, 2020

Test build #121628 has finished for PR 28294 at commit f04db8e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2020

Test build #121633 has finished for PR 28294 at commit 81c9d47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -583,6 +583,16 @@ case class Aggregate(
}
}

case class AggregateWithHaving(
Copy link
Member

@maropu maropu Apr 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename this into UnresolvedHaving, then move it into unresolved.scala? I personally think HAVING always comes with Aggregate, so the name doesn't need to include Aggregate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move it into unresolved.scala?

Yeah, make sense, will change it to unresolved.scala.

Could we rename this into UnresolvedHaving

Since the group by not always come with Aggregate, it can also be GroupingSets, we only handle the Aggregate part with AggregateWithHaving. So maybe let's keep it AggregateWithHaving? WDYT :)

@@ -2033,62 +2036,11 @@ class Analyzer(
*/
object ResolveAggregateFunctions extends Rule[LogicalPlan] {
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
case f @ Filter(cond, agg @ Aggregate(grouping, originalAggExprs, child)) if agg.resolved =>
case AggregateWithHaving(cond, agg: Aggregate) if agg.resolved =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you leave some comments here about why we need this special handling for aggregate with having?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, done in c75fe08.

@SparkQA
Copy link

SparkQA commented Apr 23, 2020

Test build #121659 has finished for PR 28294 at commit c75fe08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AggregateWithHaving(

@xuanyuanking
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Apr 24, 2020

Test build #121705 has finished for PR 28294 at commit d4a011c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Copy link
Member Author

The failure test CliSuite.SPARK-26321 Should not split semicolon within quoted string literals can pass locally, it seems a flaky test caused by connecting timeout.

@xuanyuanking
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 24, 2020

Test build #121718 has finished for PR 28294 at commit d4a011c.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Copy link
Member Author

retest this please

/**
* Represents unresolved aggregate with having clause, it is turned by the analyzer into a Filter.
*/
case class AggregateWithHaving(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UnresolvedHaving?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also discussed the naming here: #28294 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does GroupingSets have this bug as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GroupingSets don't have the same bug here.
But I test the 2 below queries:

select sum(a) as b, '2020-01-01' as fake
FROM VALUES (1, 10), (2, 20) AS T(a, b)
group by GROUPING SETS ((b), (a, b))
having b > 10;

and adding the cast

select sum(a) as b, cast('2020-01-01' as date) as fake
FROM VALUES (1, 10), (2, 20) AS T(a, b)
group by GROUPING SETS ((b), (a, b))
having b > 10;

Both queries return empty results, it seems the HAVING after GROUPING SET resolve expression in select clauses first. Maybe it's another bug, I think we should follow the resolving strategy of table scheme first and then select clause?

case aggregate: Aggregate =>
AggregateWithHaving(predicate, aggregate)
case _ =>
Filter(predicate, plan)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we also create having here? This is for global aggregate, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also for GROUPING SET.

@SparkQA
Copy link

SparkQA commented Apr 24, 2020

Test build #121741 has finished for PR 28294 at commit d4a011c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 26, 2020

Test build #121839 has finished for PR 28294 at commit d4ac6d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121864 has finished for PR 28294 at commit d4ac6d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121880 has finished for PR 28294 at commit d4ac6d7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Apr 27, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121884 has finished for PR 28294 at commit d4ac6d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 27, 2020

Test build #121903 has finished for PR 28294 at commit 18d857f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.0!

cloud-fan pushed a commit that referenced this pull request Apr 28, 2020
…rong result

### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).

### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```

### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.

### How was this patch tested?
New UT added.

Closes #28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6ed2dfb)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this in 6ed2dfb Apr 28, 2020
@cloud-fan
Copy link
Contributor

@xuanyuanking can you send a backport PR for 2.4?

@xuanyuanking
Copy link
Member Author

Thanks for the review, will submit a backport soon.

@xuanyuanking xuanyuanking deleted the SPARK-31519 branch April 28, 2020 13:30
xuanyuanking added a commit to xuanyuanking/spark that referenced this pull request Apr 28, 2020
…rong result

Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).

The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```

Yes, bug fix for cast in having aggregate expressions.

New UT added.

Closes apache#28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
maropu pushed a commit that referenced this pull request Jan 5, 2021
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696 we support HAVING without GROUP BY means global aggregate
But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`   .
This PR fix this issue and add UT.

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 5, 2021
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696 we support HAVING without GROUP BY means global aggregate
But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`   .
This PR fix this issue and add UT.

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit e279ed3)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 6, 2021
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696 we support HAVING without GROUP BY means global aggregate
But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`   .
This PR fix this issue and add UT.

NOTE: This backport comes from #31039

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31050 from AngersZhuuuu/SPARK-34012-2.4.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 6, 2021
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696 we support HAVING without GROUP BY means global aggregate
But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`   .
This PR fix this issue and add UT.

NOTE: This backport comes from #31039

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31049 from AngersZhuuuu/SPARK-34012-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants