[SPARK-25708][SQL] HAVING without GROUP BY means global aggregate #22696

Closed
wants to merge 3 commits into from

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

According to the SQL standard, when a query contains HAVING, it indicates an aggregate operator. For more details please refer to https://blog.jooq.org/2014/12/04/do-you-really-understand-sqls-group-by-and-having-clauses/

However, the Spark SQL parser treats HAVING as a normal filter when there is no GROUP BY, which breaks SQL semantics and leads to wrong results. This PR fixes the parser.

How was this patch tested?

new test

@cloud-fan
Contributor Author

@@ -73,3 +73,9 @@ where b.z != b.z;
-- SPARK-24369 multiple distinct aggregations having the same argument set
SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y);

-- SPARK-25708 HAVING without GROUP BY means global aggregate
SELECT 1 FROM range(10) HAVING true;
Contributor Author

before the fix, this returns 10 rows

SELECT 1 FROM range(10) HAVING true;

-- SPARK-25708 HAVING without GROUP BY means global aggregate
SELECT 1 FROM range(10) HAVING MAX(id) > 0;
Contributor Author

before the fix, this fails with

java.lang.UnsupportedOperationException: Cannot evaluate expression: max(input[0, bigint, false])

Contributor

@hvanhovell hvanhovell left a comment

LGTM

@mgaido91
Contributor

Nice catch! Shall we mention this in the migration guide? It is a behavior change (even though the previous behavior was wrong), so I think warning users might be a good thing. LGTM otherwise.

@hvanhovell
Contributor

I added the release-notes label to the JIRA ticket. I am not sure if there is a migration-guide label.

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97252 has finished for PR 22696 at commit f33400d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -108,7 +108,7 @@ class PlanParserSuite extends AnalysisTest {
assertEqual("select a, b from db.c where x < 1", table("db", "c").where('x < 1).select('a, 'b))
assertEqual(
"select a, b from db.c having x < 1",
- table("db", "c").select('a, 'b).where('x < 1))
+ table("db", "c").groupBy()('a, 'b).where('x < 1))
Member

Is this query legal? Can we run such a query in a test?

I read the articles here and here. One point caught my attention. Below is the Postgres documentation about HAVING without GROUP BY:

The presence of HAVING turns a query into a grouped query even if there is no GROUP BY clause. This is the same as what happens when the query contains aggregate functions but no GROUP BY clause. All the selected rows are considered to form a single group, and the SELECT list and HAVING clause can only reference table columns from within aggregate functions. Such a query will emit a single row if the HAVING condition is true, zero rows if it is not true.

Please see the bold text. It seems to me that in this query we can't have x < 1 as the condition in HAVING, because x is not within an aggregate function. Ditto for a and b in the SELECT list.
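The restriction quoted above can be sketched as a small check over an expression tree. This is only an illustrative model in plain Python, not Spark's analyzer: the classes `Col`, `Lit`, `Agg`, `Gt` and the helpers `bare_columns` and `is_valid_global_aggregate` are hypothetical names introduced here, and only the handful of expression shapes used in this thread are modelled.

```python
from dataclasses import dataclass
from typing import List, Set, Union


@dataclass(frozen=True)
class Col:
    """A plain column reference such as id or x (hypothetical model class)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """A literal such as 1 or true."""
    value: object


@dataclass(frozen=True)
class Agg:
    """An aggregate call such as MAX(id)."""
    func: str
    child: "Expr"


@dataclass(frozen=True)
class Gt:
    """A comparison left > right."""
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, Agg, Gt]


def bare_columns(expr: Expr) -> Set[str]:
    """Return the columns referenced outside of any aggregate function."""
    if isinstance(expr, Col):
        return {expr.name}
    if isinstance(expr, Agg):
        # Columns inside an aggregate are allowed, so they are not reported.
        return set()
    if isinstance(expr, Gt):
        return bare_columns(expr.left) | bare_columns(expr.right)
    return set()  # literals reference no columns


def is_valid_global_aggregate(select_list: List[Expr], having: Expr) -> bool:
    """With HAVING and no GROUP BY, no bare column may appear in SELECT or HAVING."""
    return all(not bare_columns(e) for e in list(select_list) + [having])


# SELECT 1 FROM range(10) HAVING MAX(id) > 0  -> accepted by this model
print(is_valid_global_aggregate([Lit(1)], Gt(Agg("max", Col("id")), Lit(0))))

# select a, b from db.c having x < 1 is rejected for the same reason as this one:
# SELECT id FROM range(10) HAVING id > 0      -> rejected by this model
print(is_valid_global_aggregate([Col("id")], Gt(Col("id"), Lit(0))))
```

Under this model, the parser-suite query above is rejected because `a`, `b`, and `x` all appear outside aggregate functions.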

Contributor Author

Yes, this query is invalid.

Note that this is the parser suite. Many test cases in this suite use invalid queries.

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97254 has finished for PR 22696 at commit f6bbd38.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 11, 2018

I think we should mention this in the migration guide. Although the previous behavior is wrong, it might be treated as a "feature" of Spark SQL. We should explicitly let users know about this change.

@viirya
Member

viirya commented Oct 11, 2018

LGTM


SELECT 1 FROM range(10) HAVING MAX(id) > 0;

SELECT id FROM range(10) HAVING id > 0;
Contributor Author

Before this fix, this returned 10 rows; now it fails.

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97266 has finished for PR 22696 at commit 78a1689.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

retest this please

@@ -1894,6 +1894,8 @@ working with timestamps in `pandas_udf`s to get the best performance, see

- In PySpark, when creating a `SparkSession` with `SparkSession.builder.getOrCreate()`, if there is an existing `SparkContext`, the builder was trying to update the `SparkConf` of the existing `SparkContext` with configurations specified to the builder, but the `SparkContext` is shared by all `SparkSession`s, so we should not update them. Since 3.0, the builder no longer updates the configurations. This is the same behavior as Java/Scala API in 2.3 and above. If you want to update them, you need to update them prior to creating a `SparkSession`.

- In Spark version 2.4 and earlier, HAVING without GROUP BY is treated as WHERE. This means `SELECT 1 FROM range(10) HAVING true` is executed as `SELECT 1 FROM range(10) WHERE true` and returns 10 rows. This violates the SQL standard and has been fixed in Spark 3.0. Since Spark 3.0, HAVING without GROUP BY is treated as a global aggregate, which means `SELECT 1 FROM range(10) HAVING true` returns only one row.
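As a rough illustration of the behavior change described in this migration note, the two readings of HAVING can be modelled in plain Python. This is a sketch of the semantics only and does not call Spark; `having_as_where` and `having_as_global_aggregate` are hypothetical helper names, and each row of `range(10)` is modelled as a bare integer id.

```python
from typing import Callable, Iterable, List


def having_as_where(rows: Iterable[int], predicate: Callable[[int], bool]) -> List[int]:
    """Legacy reading: HAVING is applied like WHERE, one row at a time."""
    return [1 for row in rows if predicate(row)]


def having_as_global_aggregate(
    rows: Iterable[int], predicate: Callable[[List[int]], bool]
) -> List[int]:
    """Standard reading: all rows form a single group that yields at most one row."""
    group = list(rows)
    return [1] if predicate(group) else []


ids = range(10)  # stands in for range(10) in the SQL examples

# SELECT 1 FROM range(10) HAVING true
legacy_rows = having_as_where(ids, lambda row: True)
fixed_rows = having_as_global_aggregate(ids, lambda group: True)
print(len(legacy_rows), len(fixed_rows))  # prints "10 1"

# SELECT 1 FROM range(10) HAVING MAX(id) > 0
print(having_as_global_aggregate(ids, lambda group: max(group) > 0))  # prints "[1]"
```

The key design point in the model is the predicate's argument: under the legacy reading it sees one row, while under the global-aggregate reading it sees the whole group, which is why an aggregate such as `MAX(id)` only makes sense in the second case.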
Contributor Author

shall we backport it to 2.4?

Member

For such a correctness issue, I think we should merge it to the 2.4 release

Contributor

You will need to feature flag it if you port it to 2.4. People might rely on its current behavior.

Member

Yes. We should add a legacy SQLConf

Contributor

+1 for backporting to 2.4 with a legacy conf

@SparkQA

SparkQA commented Oct 11, 2018

Test build #97276 has finished for PR 22696 at commit 78a1689.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

@SparkQA

SparkQA commented Oct 12, 2018

Test build #97280 has finished for PR 22696 at commit 78a1689.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 12, 2018

Test build #97289 has finished for PR 22696 at commit b0dc140.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

LGTM

Thanks! Merged to master/2.4

asfgit pushed a commit that referenced this pull request Oct 12, 2018
Closes #22696 from cloud-fan/having.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit 78e1331)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in 78e1331 Oct 12, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
Closes apache#22696 from cloud-fan/having.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@arkguil

arkguil commented May 7, 2019

@cloud-fan / @gatorsmile , just stumbled on this while investigating an issue with a query while migrating to 2.4...

It seems the fix oversimplified the original intent. It should be totally OK to do something like

select id from range(10) having id > 5

HAVING is applied to the result of select id from range(10), and since id is in the result set, this should not fail with the error "grouping expressions sequence is empty, and 'id' is not an aggregate function".

The previous SQL should be interpreted as

select id from range(10) group by id having id > 5

Which is what the previous plan was doing... This is easier to see when using a window function:

select id, max(id) over () as max_id from range(10) where id > 5 having max_id = id

The window will be generated then the filter applied on the result. You can't apply a where on max_id since it is only available after select id, max(id) over () as max_id from range(10) where id > 5 is executed.

Can you explain what this change fixes exactly?

@cloud-fan
Contributor Author

Can you try select id from range(10) having id > 5 with other databases, such as PostgreSQL or Oracle? I don't think this should be interpreted as select id from range(10) group by id having id > 5 according to the SQL standard.

@arkguil

arkguil commented May 8, 2019

That SQL is not valid in Oracle, but this works as I described above:
select t.id from (select 5 as id from dual) t having t.id >= 5

@cloud-fan
Contributor Author

I tried select t.id from (select 5 as id from dual) t having t.id >= 5 in Postgres and it fails.

@arkguil

arkguil commented May 8, 2019

Indeed. The following query fails in PostgreSQL:
select id from (select 1 as id) t having id > 0
ERROR: column "t.id" must appear in the GROUP BY clause or be used in an aggregate function Position: 8

It seems the SQL standard is very loosely implemented across the different RDBMSs, but the standard does indeed state clearly that HAVING requires GROUP BY:

https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#having-clause

Thanks for the quick follow-up. We will fix our queries :)

@arkguil

arkguil commented May 8, 2019

Weird, the 2 previous comments are actually in the Future...

maropu pushed a commit that referenced this pull request Jan 5, 2021
…cy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696, HAVING without GROUP BY was changed to mean a global aggregate.
Because HAVING was previously treated as a `Filter`, that change caused many analysis errors. #28294 replaced `Filter` with `UnresolvedHaving` to solve the problem, but it broke the original logic that treats `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
This PR fixes that issue and adds a unit test.

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 5, 2021
(cherry picked from commit e279ed3)
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 6, 2021
…legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide

### What changes were proposed in this pull request?
In #22696, HAVING without GROUP BY was changed to mean a global aggregate.
Because HAVING was previously treated as a `Filter`, that change caused many analysis errors. #28294 replaced `Filter` with `UnresolvedHaving` to solve the problem, but it broke the original logic that treats `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
This PR fixes that issue and adds a unit test.

NOTE: This backport comes from #31039

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31050 from AngersZhuuuu/SPARK-34012-2.4.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu pushed a commit that referenced this pull request Jan 6, 2021

Closes #31049 from AngersZhuuuu/SPARK-34012-3.0.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>