[SPARK-25708][SQL] HAVING without GROUP BY means global aggregate #22696

cloud-fan · 2018-10-11T12:01:02Z

What changes were proposed in this pull request?

According to the SQL standard, when a query contains HAVING, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/

However, the Spark SQL parser treats HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantics and leads to wrong results. This PR fixes the parser.

How was this patch tested?

new test
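To make the difference concrete, here is a minimal Python sketch (not Spark code) that models the two interpretations of a HAVING clause with no GROUP BY. The function names `having_as_filter` and `having_as_global_aggregate` are hypothetical and exist only for this illustration; each row of `range(10)` is modelled as its `id` value.

```python
from typing import Callable, Iterable, List

def having_as_filter(rows: Iterable[int], pred: Callable[[int], bool]) -> List[int]:
    """Pre-fix model: HAVING behaves like WHERE and is evaluated once per input row."""
    return [1 for r in rows if pred(r)]  # SELECT 1 for every row that passes

def having_as_global_aggregate(rows: Iterable[int],
                               pred: Callable[[List[int]], bool]) -> List[int]:
    """Post-fix model: all rows form a single group and HAVING is evaluated once on it."""
    group = list(rows)
    return [1] if pred(group) else []  # one output row, or none

# SELECT 1 FROM range(10) HAVING true
before = having_as_filter(range(10), lambda r: True)
after = having_as_global_aggregate(range(10), lambda g: True)

# SELECT 1 FROM range(10) HAVING MAX(id) > 0
with_max = having_as_global_aggregate(range(10), lambda g: max(g) > 0)

print(len(before), len(after), len(with_max))
```

In this model the filter interpretation produces one output row per input row, while the global-aggregate interpretation produces at most one row for the whole input.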

cloud-fan · 2018-10-11T12:08:04Z

cc @hvanhovell @gatorsmile @viirya @mgaido91 @ueshin

cloud-fan · 2018-10-11T12:08:55Z

sql/core/src/test/resources/sql-tests/inputs/group-by.sql

@@ -73,3 +73,9 @@ where b.z != b.z;
 -- SPARK-24369 multiple distinct aggregations having the same argument set
 SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
  FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y);
+
+-- SPARK-25708 HAVING without GROUP BY means global aggregate
+SELECT 1 FROM range(10) HAVING true;


Before the fix, this returns 10 rows. With the fix, it returns a single row.

cloud-fan · 2018-10-11T12:09:44Z

sql/core/src/test/resources/sql-tests/inputs/group-by.sql

+SELECT 1 FROM range(10) HAVING true;
+
+-- SPARK-25708 HAVING without GROUP BY means global aggregate
+SELECT 1 FROM range(10) HAVING MAX(id) > 0;


before the fix, this fails with

java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[0, bigint, false])

hvanhovell

LGTM

mgaido91 · 2018-10-11T13:24:51Z

Nice catch! Shall we mention this in the migration guide? It is a behavior change (even though the previous behavior was wrong), so I think warning users might be a good thing. LGTM otherwise.

hvanhovell · 2018-10-11T13:28:22Z

I added the release-notes label to the JIRA ticket. I am not sure if there is a migration-guide label.

SparkQA · 2018-10-11T13:40:12Z

Test build #97252 has finished for PR 22696 at commit f33400d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-11T13:55:54Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/PlanParserSuite.scala

@@ -108,7 +108,7 @@ class PlanParserSuite extends AnalysisTest {
    assertEqual("select a, b from db.c where x < 1", table("db", "c").where('x < 1).select('a, 'b))
    assertEqual(
      "select a, b from db.c having x < 1",
-      table("db", "c").select('a, 'b).where('x < 1))
+      table("db", "c").groupBy()('a, 'b).where('x < 1))


Is this query legal? Can we run such a query in a test?

I read the articles here and here. One point gets my attention. Below is Postgres documentation about HAVING without GROUP BY:

The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. This is the same as what happens when the query contains aggregate functions but no GROUP BY clause. All the selected rows are considered to form a single group, and the SELECT list and HAVING clause can only reference table columns from within aggregate functions. Such a query will emit a single row if the HAVING condition is true, zero rows if it is not true.

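Note the part stating that the SELECT list and HAVING clause can only reference table columns from within aggregate functions. It seems to me that in this query we can't have `x < 1` as the condition in HAVING, because `x` is not within an aggregate function. Ditto for `a` and `b` in the SELECT list.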

Yes this query is invalid.

Note that this is the parser suite. A lot of test cases in this suite use invalid queries.
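The rule quoted from the Postgres documentation can be sketched as a small check over an expression tree. This is a toy Python model, not Catalyst code: the node classes `Col`, `Lit`, `Agg`, `Cmp` and the helper names are hypothetical and cover only the expressions that appear in this thread.

```python
from dataclasses import dataclass
from typing import List, Set, Union

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Lit:
    value: object

@dataclass(frozen=True)
class Agg:
    func: str
    arg: "Expr"

@dataclass(frozen=True)
class Cmp:
    op: str
    left: "Expr"
    right: "Expr"

Expr = Union[Col, Lit, Agg, Cmp]

def bare_columns(e: Expr) -> Set[str]:
    """Return column names referenced outside of any aggregate function."""
    if isinstance(e, Col):
        return {e.name}
    if isinstance(e, Cmp):
        return bare_columns(e.left) | bare_columns(e.right)
    return set()  # literals, and anything wrapped in an aggregate, are fine

def valid_global_aggregate(select_list: List[Expr], having: Expr) -> bool:
    """With no GROUP BY, neither SELECT nor HAVING may use a bare column."""
    return not any(bare_columns(e) for e in select_list + [having])

# select a, b from db.c having x < 1  (parses, but is not a valid query)
q1 = valid_global_aggregate([Col("a"), Col("b")], Cmp("<", Col("x"), Lit(1)))
# SELECT 1 FROM range(10) HAVING MAX(id) > 0
q2 = valid_global_aggregate([Lit(1)], Cmp(">", Agg("max", Col("id")), Lit(0)))
print(q1, q2)
```

Under this model the parser-suite query is rejected because `a`, `b`, and `x` all appear outside an aggregate, which matches the point that the parser suite only needs the query to parse.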

SparkQA · 2018-10-11T14:01:22Z

Test build #97254 has finished for PR 22696 at commit f6bbd38.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-10-11T14:31:26Z

I think we should mention this in the migration guide. Although the previous behavior is wrong, it might be treated as a "feature" of Spark SQL. We should explicitly let users know about this change.

viirya · 2018-10-11T15:20:19Z

LGTM

cloud-fan · 2018-10-11T15:20:39Z

sql/core/src/test/resources/sql-tests/inputs/group-by.sql

+
+SELECT 1 FROM range(10) HAVING MAX(id) > 0;
+
+SELECT id FROM range(10) HAVING id > 0;


Before this fix, this returned 9 rows (HAVING acted as `WHERE id > 0`, and `range(10)` yields ids 0 through 9); now it fails.

SparkQA · 2018-10-11T17:16:14Z

Test build #97266 has finished for PR 22696 at commit 78a1689.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-10-11T17:27:57Z

retest this please

cloud-fan · 2018-10-11T17:53:13Z

docs/sql-programming-guide.md

@@ -1894,6 +1894,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see

  - In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder come to not update the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.

+  - In Spark version 2.4 and earlier, HAVING without GROUP BY is treated as WHERE. This means, `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true`  and returns 10 rows. This violates SQL standard, and has been fixed in Spark 3.0. Since Spark 3.0, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` will return only one row.


shall we backport it to 2.4?

For such a correctness issue, I think we should merge it to the 2.4 release

You will need to feature flag it if you port it to 2.4. People might rely on its current behavior.

Yes. We should add a legacy SQLConf

+1 for backporting to 2.4 with a legacy conf
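A feature flag of the kind proposed here could look roughly like the following Python sketch. The conf key is the one named in the follow-up PR title further down this page; everything else is an assumption made for illustration, including the string form of the plans, the function name, and the default of "false". It is not the actual parser code.

```python
# Conf key taken from the follow-up PR title; the default value below is an assumption.
LEGACY_KEY = "spark.sql.legacy.parser.havingWithoutGroupByAsWhere"

def plan_having_without_group_by(conf: dict, select_list: str, child: str, cond: str) -> str:
    """Return an illustrative plan string for SELECT ... FROM child HAVING cond."""
    if str(conf.get(LEGACY_KEY, "false")).lower() == "true":
        # Legacy behaviour: HAVING is just a filter under the projection, like WHERE.
        return f"Project([{select_list}], Filter({cond}, {child}))"
    # New behaviour: a global aggregate (empty grouping list) with the filter on top.
    return f"Filter({cond}, Aggregate(groupBy=[], [{select_list}], {child}))"

new_plan = plan_having_without_group_by({}, "1", "range(10)", "true")
legacy_plan = plan_having_without_group_by({LEGACY_KEY: "true"}, "1", "range(10)", "true")
print(new_plan)
print(legacy_plan)
```

Keeping the switch in one place means users who rely on the old behavior can opt back in without the default semantics staying wrong.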

SparkQA · 2018-10-11T19:28:48Z

Test build #97276 has finished for PR 22696 at commit 78a1689.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-10-11T20:15:23Z

retest this please

SparkQA · 2018-10-12T00:21:26Z

Test build #97280 has finished for PR 22696 at commit 78a1689.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-10-12T05:24:29Z

Test build #97289 has finished for PR 22696 at commit b0dc140.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-10-12T07:23:23Z

LGTM

Thanks! Merged to master/2.4

According to the SQL standard, when a query contains `HAVING`, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/

However, the Spark SQL parser treats HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantics and leads to wrong results. This PR fixes the parser.

new test

Closes #22696 from cloud-fan/having.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit 78e1331)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

arkguil · 2019-05-07T17:57:53Z

@cloud-fan / @gatorsmile , just stumbled on this while investigating an issue with a query while migrating to 2.4...

It seems the fix oversimplified the original intent. It should be totally OK to do something like:

select id from range(10) having id > 5

HAVING is applied to the result of `select id from range(10)`, and since `id` is in the result set, this should not fail with "grouping expressions sequence is empty, and 'id' is not an aggregate function".

The previous SQL should be interpreted as

select id from range(10) group by id having id > 5

Which is what the previous plan was doing... This is easier to see when using a window function:

select id, max(id) over () as max_id from range(10) where id > 5 having max_id = id

The window will be generated, then the filter applied to the result. You can't apply a WHERE on `max_id`, since it is only available after `select id, max(id) over () as max_id from range(10) where id > 5` is executed.

Can you explain what this change fixes exactly?
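For readers hitting the same migration issue: the intent described above (compute the window first, then filter on its output) can be expressed without HAVING by wrapping the query in a subquery and filtering in the outer WHERE. The Python sketch below only models that evaluation order on plain lists; the SQL in the comment is a suggested rewrite, not something proposed in this thread.

```python
# Suggested SQL rewrite (an assumption, not from the thread):
#   SELECT * FROM (
#     SELECT id, max(id) OVER () AS max_id FROM range(10) WHERE id > 5
#   ) t WHERE max_id = id

# Step 1: FROM range(10) WHERE id > 5
rows = [i for i in range(10) if i > 5]

# Step 2: max(id) OVER () is the maximum over all remaining rows,
# attached to every row as max_id
max_id = max(rows)
windowed = [(i, max_id) for i in rows]

# Step 3: the outer WHERE max_id = id runs on the windowed result
result = [(i, m) for (i, m) in windowed if m == i]
print(result)
```

The filter in step 3 sees `max_id` because it runs after the inner query, which is the ordering the comment above expected from HAVING.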

cloud-fan · 2019-05-07T18:05:25Z

`select id from range(10) having id > 5`: can you try it with other databases like PostgreSQL or Oracle? I don't think this should be interpreted as `select id from range(10) group by id having id > 5` according to the SQL standard.

arkguil · 2019-05-08T12:50:29Z

That SQL is not valid in Oracle, but this works as I described above:
select t.id from (select 5 as id from dual) t having t.id >= 5

cloud-fan · 2019-05-08T13:00:31Z

I tried select t.id from (select 5 as id from dual) t having t.id >= 5 in Postgres and it fails.

arkguil · 2019-05-08T15:22:34Z

Indeed. The following query fails in PostgreSQL:
select id from (select 1 as id) t having id > 0
ERROR: column "t.id" must appear in the GROUP BY clause or be used in an aggregate function Position: 8

It seems the SQL standard is very loosely implemented across the different RDBMSs, but the standard does indeed state clearly that HAVING requires GROUP BY:

https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#having-clause

Thanks for the quick followup. We will fix our queries :)

arkguil · 2019-05-08T15:23:54Z

Weird, the 2 previous comments are actually in the Future...

…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?

In #22696 we made HAVING without GROUP BY mean a global aggregate. But since HAVING was treated as a `Filter` before, this caused a lot of analysis errors. After #28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but this broke the original logic of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`. This PR fixes this issue and adds a UT.

### Why are the changes needed?

Keep behavior consistent with the migration guide.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
(cherry picked from commit e279ed3)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

NOTE: The backports of #31039 carry the same message. They close #31050 (from AngersZhuuuu/SPARK-34012-2.4) and #31049 (from AngersZhuuuu/SPARK-34012-3.0), both authored by angerszhu and signed off by Takeshi Yamamuro.

cloud-fan force-pushed the having branch from f33400d to f6bbd38 Compare October 11, 2018 12:07

hvanhovell approved these changes Oct 11, 2018

viirya reviewed Oct 11, 2018

cloud-fan added 2 commits October 11, 2018 23:10

HAVING without GROUP BY means global aggregate

8603835

add migration guide

78a1689

cloud-fan force-pushed the having branch from f6bbd38 to 78a1689 Compare October 11, 2018 15:19

address comments

b0dc140

asfgit closed this in 78e1331 Oct 12, 2018

maropu mentioned this pull request Jan 5, 2021

[SPARK-28227][SQL] Support projection, aggregate/window functions, and lateral view in the TRANSFORM clause #29087

Closed

AngersZhuuuu mentioned this pull request Jan 5, 2021

[SPARK-34012][SQL] Keep behavior consistent when conf spark.sql.legacy.parser.havingWithoutGroupByAsWhere is true with migration guide #31039

Closed
