-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result #28294
Conversation
@@ -238,13 +238,13 @@ class Analyzer( | |||
ResolveNaturalAndUsingJoin :: | |||
ResolveOutputRelation :: | |||
ExtractWindowExpressions :: | |||
ResolveTimeZone(conf) :: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will be reverted after #28288 merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, merged in ca90e19
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the help! Reverted in c75fe08
Test build #121615 has finished for PR 28294 at commit
|
Test build #121628 has finished for PR 28294 at commit
|
Test build #121633 has finished for PR 28294 at commit
|
@@ -583,6 +583,16 @@ case class Aggregate( | |||
} | |||
} | |||
|
|||
case class AggregateWithHaving( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we rename this into UnresolvedHaving
, then move it into unresolved.scala
? I personally think HAVING
always comes with Aggregate
, so the name doesn't need to include Aggregate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move it into unresolved.scala?
Yeah, make sense, will change it to unresolved.scala.
Could we rename this into UnresolvedHaving
Since the group by
not always come with Aggregate
, it can also be GroupingSets
, we only handle the Aggregate part with AggregateWithHaving
. So maybe let's keep it AggregateWithHaving
? WDYT :)
...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala
Outdated
Show resolved
Hide resolved
@@ -2033,62 +2036,11 @@ class Analyzer( | |||
*/ | |||
object ResolveAggregateFunctions extends Rule[LogicalPlan] { | |||
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp { | |||
case f @ Filter(cond, agg @ Aggregate(grouping, originalAggExprs, child)) if agg.resolved => | |||
case AggregateWithHaving(cond, agg: Aggregate) if agg.resolved => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you leave some comments here about why we need this special handling for aggregate with having?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, done in c75fe08.
Test build #121659 has finished for PR 28294 at commit
|
retest this please. |
Test build #121705 has finished for PR 28294 at commit
|
The failure test |
retest this please |
Test build #121718 has finished for PR 28294 at commit
|
retest this please |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
Outdated
Show resolved
Hide resolved
/** | ||
* Represents unresolved aggregate with having clause, it is turned by the analyzer into a Filter. | ||
*/ | ||
case class AggregateWithHaving( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UnresolvedHaving
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also discussed the naming here: #28294 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does GroupingSets
have this bug as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think GroupingSets
don't have the same bug here.
But I test the 2 below queries:
select sum(a) as b, '2020-01-01' as fake
FROM VALUES (1, 10), (2, 20) AS T(a, b)
group by GROUPING SETS ((b), (a, b))
having b > 10;
and adding the cast
select sum(a) as b, cast('2020-01-01' as date) as fake
FROM VALUES (1, 10), (2, 20) AS T(a, b)
group by GROUPING SETS ((b), (a, b))
having b > 10;
Both queries return empty results, it seems the HAVING after GROUPING SET resolve expression in select clauses first. Maybe it's another bug, I think we should follow the resolving strategy of table scheme first and then select clause?
case aggregate: Aggregate => | ||
AggregateWithHaving(predicate, aggregate) | ||
case _ => | ||
Filter(predicate, plan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we also create having here? This is for global aggregate, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also for GROUPING SET.
Test build #121741 has finished for PR 28294 at commit
|
Test build #121839 has finished for PR 28294 at commit
|
retest this please |
Test build #121864 has finished for PR 28294 at commit
|
retest this please |
Test build #121880 has finished for PR 28294 at commit
|
retest this please |
Test build #121884 has finished for PR 28294 at commit
|
Test build #121903 has finished for PR 28294 at commit
|
thanks, merging to master/3.0! |
…rong result ### What changes were proposed in this pull request? Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)). ### Why are the changes needed? The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way as it relies on rule execution order: 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved. 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved. 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns. In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. See the demo below: ``` SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` The query's result is ``` +---+----------+ | b| fake| +---+----------+ | 2|2020-01-01| +---+----------+ ``` But if we add CAST, it will return an empty result. ``` SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` ### Does this PR introduce any user-facing change? Yes, bug fix for cast in having aggregate expressions. ### How was this patch tested? New UT added. Closes #28294 from xuanyuanking/SPARK-31519. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6ed2dfb) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@xuanyuanking can you send a backport PR for 2.4? |
Thanks for the review, will submit a backport soon. |
…rong result Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)). The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator. It works for simple cases in a very tricky way as it relies on rule execution order: 1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved. 2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved. 3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns. In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns. See the demo below: ``` SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` The query's result is ``` +---+----------+ | b| fake| +---+----------+ | 2|2020-01-01| +---+----------+ ``` But if we add CAST, it will return an empty result. ``` SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10 ``` Yes, bug fix for cast in having aggregate expressions. New UT added. Closes apache#28294 from xuanyuanking/SPARK-31519. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit e279ed3) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. NOTE: This backport comes from #31039 ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31050 from AngersZhuuuu/SPARK-34012-2.4. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. NOTE: This backport comes from #31039 ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31049 from AngersZhuuuu/SPARK-34012-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.
It works for simple cases in a very tricky way as it relies on rule execution order:
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.
See the demo below:
The query's result is
But if we add CAST, it will return an empty result.
Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.
How was this patch tested?
New UT added.