-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-25708][SQL] HAVING without GROUP BY means global aggregate #22696
Conversation
@@ -73,3 +73,9 @@ where b.z != b.z; | |||
-- SPARK-24369 multiple distinct aggregations having the same argument set | |||
SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*) | |||
FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y); | |||
|
|||
-- SPARK-25708 HAVING without GROUP BY means global aggregate | |||
SELECT 1 FROM range(10) HAVING true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before the fix, this returns 10 rows
SELECT 1 FROM range(10) HAVING true; | ||
|
||
-- SPARK-25708 HAVING without GROUP BY means global aggregate | ||
SELECT 1 FROM range(10) HAVING MAX(id) > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before the fix, this fails with
java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[0, bigint, false])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
nice catch! Shall we mention this in the migration guide? It is a behavior change (despite the previous was a wrong behavior), so I think warning users might be a good thing. LGTM otherwise. |
I added the release-notes label to the JIRA ticket. I am not sure if there is a migration-guide label. |
Test build #97252 has finished for PR 22696 at commit
|
@@ -108,7 +108,7 @@ class PlanParserSuite extends AnalysisTest { | |||
assertEqual("select a, b from db.c where x < 1", table("db", "c").where('x < 1).select('a, 'b)) | |||
assertEqual( | |||
"select a, b from db.c having x < 1", | |||
table("db", "c").select('a, 'b).where('x < 1)) | |||
table("db", "c").groupBy()('a, 'b).where('x < 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this query legal? Can we run such query in a test?
I read the articles here and here. One point gets my attention. Below is Postgres documentation about HAVING
without GROUP BY
:
The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. This is the same as what happens when the query contains aggregate functions but no GROUP BY clause. All the selected rows are considered to form a single group, and the SELECT list and HAVING clause can only reference table columns from within aggregate functions. Such a query will emit a single row if the HAVING condition is true, zero rows if it is not true.
Please see the bold text. Seems to me in this query, we can't have x < 1
as condition in HAVING
because x
is not within aggregate functions. ditto for a
and b
in SELECT
list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this query is invalid.
Note that this is parser suite. A lot of test cases in this suite are using invalid queries.
Test build #97254 has finished for PR 22696 at commit
|
I think we should mention this in migration guide. Although previous behavior is wrong, it might be treated as a "feature" of Spark SQL. We should explicitly let users know this change. |
LGTM |
|
||
SELECT 1 FROM range(10) HAVING MAX(id) > 0; | ||
|
||
SELECT id FROM range(10) HAVING id > 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before this fix, this returns 10 rows, now it fails.
Test build #97266 has finished for PR 22696 at commit
|
retest this please |
docs/sql-programming-guide.md
Outdated
@@ -1894,6 +1894,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see | |||
|
|||
- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`. | |||
|
|||
- In Spark version 2.4 and earlier, HAVING without GROUP BY is treated as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates SQL standard, and has been fixed in Spark 3.0. Since Spark 3.0, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we backport it to 2.4?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For such a correctness issue, I think we should merge it to the 2.4 release
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to feature flag it if you port it to 2.4. People might rely on its current behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We should add a legacy SQLConf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for backporting to 2.4 with a legacy conf
Test build #97276 has finished for PR 22696 at commit
|
retest this please |
Test build #97280 has finished for PR 22696 at commit
|
Test build #97289 has finished for PR 22696 at commit
|
LGTM Thanks! Merged to master/2.4 |
According to the SQL standard, when a query contains `HAVING`, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/ However, in Spark SQL parser, we treat HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantic and lead to wrong result. This PR fixes the parser. new test Closes #22696 from cloud-fan/having. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com> (cherry picked from commit 78e1331) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request? According to the SQL standard, when a query contains `HAVING`, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/ However, in Spark SQL parser, we treat HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantic and lead to wrong result. This PR fixes the parser. ## How was this patch tested? new test Closes apache#22696 from cloud-fan/having. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@cloud-fan / @gatorsmile , just stumbled on this while investigating an issue with a query while migrating to 2.4... Seems like the fix over simplified the original intent. It should be totally ok to do something like
Having is applied on the result of The previous SQL should be interpreted as
Which is what the previous plan was doing... This is easier to see when using a window function:
The window will be generated then the filter applied on the result. You can't apply a where on Can you explain what this change fixes exactly? |
|
That sql is not valid in Oracle but this works as I described above: |
I tried |
Indeed. The following query fails in Postgresql: Seems like SQL standard is very loosly implemented across the different RDBMS, but the stanrdard indeed state clearly that HAVING requires GROUP BY: https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#having-clause Thanks for the quick followup. We will fix our queries :) |
Weird, the 2 previous comments are actually in the Future... |
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> (cherry picked from commit e279ed3) Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. NOTE: This backport comes from #31039 ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31050 from AngersZhuuuu/SPARK-34012-2.4. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In #22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after #28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. NOTE: This backport comes from #31039 ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31049 from AngersZhuuuu/SPARK-34012-3.0. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
According to the SQL standard, when a query contains
HAVING
, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/However, in Spark SQL parser, we treat HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantic and lead to wrong result. This PR fixes the parser.
How was this patch tested?
new test