[SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result #28294

xuanyuanking · 2020-04-22T10:18:36Z

What changes were proposed in this pull request?

Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
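As a rough illustration of the shape of this change, here is a minimal toy model in plain Scala. It is not Spark code: `Plan`, `Relation`, the string-typed fields, `withHaving` and `resolveHaving` are hypothetical stand-ins, and only the case split on `Aggregate` mirrors the AstBuilder.scala hunk quoted in the review thread. The real analyzer also resolves the HAVING condition before producing the Filter, which this sketch skips.

```scala
object HavingNodeSketch {
  // Toy logical plan, loosely named after the Catalyst operators mentioned above.
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Aggregate(groupBy: Seq[String], output: Seq[String], child: Plan) extends Plan
  final case class Filter(condition: String, child: Plan) extends Plan

  // Placeholder node that only exists between parsing and analysis.
  final case class AggregateWithHaving(havingCondition: String, child: Aggregate) extends Plan

  // Parser side: wrap an Aggregate in the placeholder, anything else in a plain Filter.
  def withHaving(condition: String, plan: Plan): Plan = plan match {
    case agg: Aggregate => AggregateWithHaving(condition, agg)
    case other          => Filter(condition, other)
  }

  // Analyzer side: replace the placeholder with Filter(..., Aggregate(...)).
  def resolveHaving(plan: Plan): Plan = plan match {
    case AggregateWithHaving(cond, agg) => Filter(cond, agg)
    case other                          => other
  }

  def main(args: Array[String]): Unit = {
    val agg = Aggregate(Seq("b"), Seq("SUM(a) AS b"), Relation("T"))
    val parsed = withHaving("b > 10", agg)
    println(parsed)
    println(resolveHaving(parsed))
  }
}
```

The point of the dedicated node is that no generic Filter rule can match it, so only the rule that knows about HAVING gets to resolve the condition.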

Why are the changes needed?

The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:

1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3, as the Aggregate operator is still unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.
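The ordering problem described above can be replayed with a small simulation. This is a sketch under simplifying assumptions, not the Catalyst rule executor: `PlanState` and its three fields are hypothetical, each rule is reduced to the one effect the description attributes to it, and the batch order is taken from the text (ResolveReferences, ResolveFunctions, ResolveAggregateFunctions, then ResolveTimeZone).

```scala
import scala.annotation.tailrec

object RuleOrderSketch {
  // Only the flags the description talks about; not Spark's LogicalPlan.
  final case class PlanState(
      functionsResolved: Boolean,
      castResolved: Boolean,
      havingResolvedBy: Option[String]) {
    // The Aggregate counts as resolved once its functions and its CAST are resolved.
    def aggregateResolved: Boolean = functionsResolved && castResolved
  }

  type Rule = PlanState => PlanState

  // Resolves the Filter against the Aggregate's output, i.e. the select-list alias.
  val resolveReferences: Rule = s =>
    if (s.aggregateResolved && s.havingResolvedBy.isEmpty)
      s.copy(havingResolvedBy = Some("select-list alias"))
    else s

  val resolveFunctions: Rule = _.copy(functionsResolved = true)

  // Resolves the Filter against the grouping columns, but only over a resolved Aggregate.
  val resolveAggregateFunctions: Rule = s =>
    if (s.aggregateResolved && s.havingResolvedBy.isEmpty)
      s.copy(havingResolvedBy = Some("grouping column"))
    else s

  val resolveTimeZone: Rule = _.copy(castResolved = true)

  val batch: Seq[Rule] =
    Seq(resolveReferences, resolveFunctions, resolveAggregateFunctions, resolveTimeZone)

  // Run the batch repeatedly until a full round changes nothing.
  @tailrec
  def analyze(state: PlanState): PlanState = {
    val next = batch.foldLeft(state)((s, rule) => rule(s))
    if (next == state) state else analyze(next)
  }

  def main(args: Array[String]): Unit = {
    // No CAST in the select list: nothing is waiting for ResolveTimeZone.
    println(analyze(PlanState(functionsResolved = false, castResolved = true, havingResolvedBy = None)))
    // With the CAST: the Aggregate only becomes resolved at the end of round one.
    println(analyze(PlanState(functionsResolved = false, castResolved = false, havingResolvedBy = None)))
  }
}
```

In this model the query without CAST is bound by ResolveAggregateFunctions in the first round, while the query with CAST misses that window and is bound by ResolveReferences at the start of the second round.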

See the demo below:

SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10

The query's result is

+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+

But if we add CAST, it will return an empty result.

SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
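To see why the two bindings of `b` give different answers, the demo data can be evaluated by hand in plain Scala. This is only a sketch with hypothetical names (`Group`, `boundToGroupingColumn`, `boundToAlias`) and no Spark involved: it groups the two rows by `T.b`, computes `SUM(a)`, and then applies `b > 10` once against the grouping column and once against the select-list alias.

```scala
object HavingResolutionSketch {
  // Rows of T(a, b) from the demo query.
  val rows: Seq[(Int, Int)] = Seq((1, 10), (2, 20))

  // One row per group: the grouping key T.b and SUM(a), which the query also names b.
  final case class Group(groupingB: Int, sumA: Int)

  val groups: Seq[Group] = rows
    .groupBy { case (_, b) => b }
    .map { case (b, rs) => Group(b, rs.map(_._1).sum) }
    .toSeq
    .sortBy(_.groupingB)

  // HAVING b > 10 with b bound to the grouping column T.b.
  val boundToGroupingColumn: Seq[Int] = groups.filter(_.groupingB > 10).map(_.sumA)

  // HAVING b > 10 with b bound to the alias SUM(a) AS b.
  val boundToAlias: Seq[Int] = groups.filter(_.sumA > 10).map(_.sumA)

  def main(args: Array[String]): Unit = {
    println(s"bound to grouping column: $boundToGroupingColumn")
    println(s"bound to select alias:    $boundToAlias")
  }
}
```

Binding to the grouping column keeps the group with `T.b = 20`, whose `SUM(a)` is 2, which matches the first result table. Binding to the alias compares the sums 1 and 2 against 10 and keeps nothing, which matches the empty result seen with CAST.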

Does this PR introduce any user-facing change?

Yes, bug fix for cast in having aggregate expressions.

How was this patch tested?

New UT added.

xuanyuanking · 2020-04-22T10:20:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -238,13 +238,13 @@ class Analyzer(
      ResolveNaturalAndUsingJoin ::
      ResolveOutputRelation ::
      ExtractWindowExpressions ::
+      ResolveTimeZone(conf) ::


This change will be reverted after #28288 is merged.

ok, merged in ca90e19

Thanks for the help! Reverted in c75fe08

SparkQA · 2020-04-22T10:45:00Z

Test build #121615 has finished for PR 28294 at commit 0acfc57.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AggregateWithHaving(

SparkQA · 2020-04-22T14:53:42Z

Test build #121628 has finished for PR 28294 at commit f04db8e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-22T19:06:02Z

Test build #121633 has finished for PR 28294 at commit 81c9d47.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-23T00:31:55Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

@@ -583,6 +583,16 @@ case class Aggregate(
  }
 }

+case class AggregateWithHaving(


Could we rename this into UnresolvedHaving, then move it into unresolved.scala? I personally think HAVING always comes with Aggregate, so the name doesn't need to include Aggregate.

> move it into unresolved.scala?

Yeah, makes sense, will move it to unresolved.scala.

> Could we rename this into UnresolvedHaving

Since GROUP BY doesn't always come with Aggregate (it can also be GroupingSets), we only handle the Aggregate part with AggregateWithHaving. So maybe let's keep it as AggregateWithHaving? WDYT :)

maropu · 2020-04-23T00:52:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -2033,62 +2036,11 @@ class Analyzer(
   */
  object ResolveAggregateFunctions extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-      case f @ Filter(cond, agg @ Aggregate(grouping, originalAggExprs, child)) if agg.resolved =>
+      case AggregateWithHaving(cond, agg: Aggregate) if agg.resolved =>


Could you leave some comments here about why we need this special handling for aggregate with having?

Sure, done in c75fe08.

SparkQA · 2020-04-23T11:23:10Z

Test build #121659 has finished for PR 28294 at commit c75fe08.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AggregateWithHaving(

xuanyuanking · 2020-04-24T00:53:53Z

retest this please.

SparkQA · 2020-04-24T04:32:49Z

Test build #121705 has finished for PR 28294 at commit d4a011c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking · 2020-04-24T05:05:31Z

The failing test, CliSuite.SPARK-26321 Should not split semicolon within quoted string literals, passes locally. It seems to be a flaky test caused by a connection timeout.

xuanyuanking · 2020-04-24T05:05:41Z

retest this please

SparkQA · 2020-04-24T07:05:02Z

Test build #121718 has finished for PR 28294 at commit d4a011c.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking · 2020-04-24T08:21:29Z

retest this please

cloud-fan · 2020-04-24T08:31:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+/**
+ * Represents unresolved aggregate with having clause, it is turned by the analyzer into a Filter.
+ */
+case class AggregateWithHaving(


UnresolvedHaving?

We also discussed the naming here: #28294 (comment)

Does GroupingSets have this bug as well?

I think GroupingSets doesn't have the same bug here.
But I tested the two queries below:

select sum(a) as b, '2020-01-01' as fake FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING SETS ((b), (a, b)) having b > 10;

and adding the cast

select sum(a) as b, cast('2020-01-01' as date) as fake FROM VALUES (1, 10), (2, 20) AS T(a, b) group by GROUPING SETS ((b), (a, b)) having b > 10;

Both queries return empty results; it seems that HAVING after GROUPING SETS resolves against the expressions in the SELECT clause first. Maybe it's another bug. I think we should resolve against the table schema first and then the SELECT clause?

cloud-fan · 2020-04-24T08:35:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+      case aggregate: Aggregate =>
+        AggregateWithHaving(predicate, aggregate)
+      case _ =>
+        Filter(predicate, plan)


what if we also create having here? This is for global aggregate, right?

This is also for GROUPING SET.

SparkQA · 2020-04-24T13:53:49Z

Test build #121741 has finished for PR 28294 at commit d4a011c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-26T09:39:26Z

Test build #121839 has finished for PR 28294 at commit d4ac6d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking · 2020-04-27T03:04:57Z

retest this please

SparkQA · 2020-04-27T06:46:08Z

Test build #121864 has finished for PR 28294 at commit d4ac6d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-27T06:55:00Z

retest this please

SparkQA · 2020-04-27T07:05:02Z

Test build #121880 has finished for PR 28294 at commit d4ac6d7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2020-04-27T07:16:32Z

retest this please

SparkQA · 2020-04-27T09:25:24Z

Test build #121884 has finished for PR 28294 at commit d4ac6d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-04-27T19:08:41Z

Test build #121903 has finished for PR 28294 at commit 18d857f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-04-28T08:11:39Z

thanks, merging to master/3.0!

…rong result

Closes #28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6ed2dfb)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

cloud-fan · 2020-04-28T08:12:26Z

@xuanyuanking can you send a backport PR for 2.4?

xuanyuanking · 2020-04-28T13:30:13Z

Thanks for the review, will submit a backport soon.

…rong result

Closes apache#28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?

In #22696 we made HAVING without GROUP BY mean a global aggregate. But since HAVING was treated as Filter before, this caused a lot of analysis errors. After #28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but this broke the original logic of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`. This PR fixes the issue and adds a UT.

### Why are the changes needed?

Keep the behavior consistent with the migration guide.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added UT.

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit e279ed3)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide

NOTE: This backport comes from #31039

Closes #31050 from AngersZhuuuu/SPARK-34012-2.4.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide

NOTE: This backport comes from #31039

Closes #31049 from AngersZhuuuu/SPARK-34012-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

Add new logic node AggregateWithHaving

0acfc57

probot-autolabeler bot added the SQL label Apr 22, 2020

fix

f04db8e

fix PlanParserSuite and refactory

81c9d47

address comments and ut fix

c75fe08

UT fix for the window function

d4a011c

cloud-fan mentioned this pull request Apr 24, 2020

[SPARK-31334][SQL] Don't ResolveReference/ResolveMissingReference when Filter condition with aggregate expression #28107

Closed

address comment

d4ac6d7

fix test

18d857f

cloud-fan closed this in 6ed2dfb Apr 28, 2020

xuanyuanking deleted the SPARK-31519 branch April 28, 2020 13:30

AngersZhuuuu mentioned this pull request Jan 5, 2021

[SPARK-34012][SQL] Keep behavior consistent when conf spark.sql.legacy.parser.havingWithoutGroupByAsWhere is true with migration guide #31039

Closed
