fix: bugs when having and group by are all false #11897

Lordworms · 2024-08-09T01:40:42Z

Which issue does this PR close?

Closes #11748

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

alamb

Makes sense to me -- thank you @Lordworms

alamb · 2024-08-10T14:09:54Z

datafusion/sql/src/select.rs

-            let having_expr_post_aggr =
-                rebase_expr(having_expr, &aggr_projection_exprs, input)?;
-
+            let having_expr_post_aggr = if is_constant_expression(having_expr) {


The check needs to be for a false constant (not just a constant) right? I'll suggest a test to show this

alamb · 2024-08-10T14:10:45Z

datafusion/sqllogictest/test_files/aggregate.slt

+
+query R
+SELECT AVG(v1) FROM t1 having false;


Can we please also add negative (positive?) tests like:

SELECT AVG(v1) FROM t1 GROUP BY false having true;

And

SELECT AVG(v1) FROM t1 GROUP BY false having 1 = 1;
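
The difference between "is a constant" and "is a constant that evaluates to false" can be sketched on a toy expression type. This is only an illustration: `Expr`, `constant_bool`, and `is_constant_false` are hypothetical names, not DataFusion's types, and the body of `is_constant_expression` from the diff is not shown in this thread.

```rust
// Toy expression type for illustration only (hypothetical, not DataFusion's `Expr`).
enum Expr {
    Bool(bool),
    Int(i64),
    Column(String),
    Eq(Box<Expr>, Box<Expr>),
}

// Folds a small class of constant boolean expressions.
// Returns None when the expression is not a constant this sketch can fold.
fn constant_bool(expr: &Expr) -> Option<bool> {
    match expr {
        Expr::Bool(b) => Some(*b),
        Expr::Eq(left, right) => match (left.as_ref(), right.as_ref()) {
            (Expr::Int(a), Expr::Int(b)) => Some(a == b),
            (Expr::Bool(a), Expr::Bool(b)) => Some(a == b),
            _ => None,
        },
        _ => None,
    }
}

// True only for constants that evaluate to false, so `having true` and
// `having 1 = 1` are not treated like `having false`.
fn is_constant_false(expr: &Expr) -> bool {
    constant_bool(expr) == Some(false)
}
```

With this shape, `having false` takes the special path while `having true`, `having 1 = 1`, and any predicate that references a column do not.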

alamb

Thank you for your work on this PR @Lordworms but I am now pretty confused about these changes 🤔

In my understanding, the bug involves the HAVING clause -- the HAVING clause is like the WHERE clause except that it happens after grouping, whereas the WHERE clause happens before grouping.

This page has a nice table

So for a query like

SELECT AVG(v1) FROM t1 GROUP BY false having false;

I would expect a query plan that looks something like

Filter(expr=false) <-- this is the HAVING(false)
  GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
    TableScan(t1)

When I explained the query we can see that in fact this filter is added

> create table t1(v1 int) as values (1), (2);
0 row(s) fetched.
Elapsed 0.016 seconds.

> explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
...
| initial_logical_plan                                       | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Filter: Boolean(false)                                                                                                  |
|                                                            |     Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]]                                                            |
|                                                            |       TableScan: t1

However, then it looks like the filter is pushed down below the group by

| logical_plan after push_down_filter                        | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                             |
|                                                            |     Filter: Boolean(false)                                                                                                |
|                                                            |       TableScan: t1

Which finally results in

| logical_plan after eliminate_filter                        | Aggregate: groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                                             |
|                                                            |   EmptyRelation

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )
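
To see at which step the row count changes, here is a toy model of the three plans above. It is a sketch under stated assumptions, not DataFusion code: `Plan` and `row_count` are hypothetical, the filter predicate is a constant, and grouping is always by a constant, so a grouped aggregate yields one group when it has input rows and none otherwise.

```rust
// Toy logical plan for illustration only (hypothetical, not DataFusion's LogicalPlan).
enum Plan {
    // Table scan producing `rows` rows.
    Scan { rows: usize },
    // Filter with a constant predicate: keeps every row or none.
    Filter { keep: bool, input: Box<Plan> },
    // Aggregate; `grouped` is true when a GROUP BY <constant> is present.
    Aggregate { grouped: bool, input: Box<Plan> },
}

// Number of rows the plan produces.
fn row_count(plan: &Plan) -> usize {
    match plan {
        Plan::Scan { rows } => *rows,
        Plan::Filter { keep, input } => {
            if *keep { row_count(input) } else { 0 }
        }
        Plan::Aggregate { grouped, input } => {
            let input_rows = row_count(input);
            if *grouped {
                // Grouping by a constant: one group if there is any input, else none.
                usize::from(input_rows > 0)
            } else {
                // A global aggregate always emits exactly one row (NULL on empty input).
                1
            }
        }
    }
}

// Filter(false) above the grouped aggregate, as in initial_logical_plan.
fn initial_plan() -> Plan {
    Plan::Filter {
        keep: false,
        input: Box::new(Plan::Aggregate {
            grouped: true,
            input: Box::new(Plan::Scan { rows: 2 }),
        }),
    }
}

// Filter(false) pushed below the still-grouped aggregate.
fn after_push_down_filter() -> Plan {
    Plan::Aggregate {
        grouped: true,
        input: Box::new(Plan::Filter {
            keep: false,
            input: Box::new(Plan::Scan { rows: 2 }),
        }),
    }
}

// The final plan: groupBy=[[]] over an empty input.
fn after_group_by_removed() -> Plan {
    Plan::Aggregate {
        grouped: false,
        input: Box::new(Plan::Scan { rows: 0 }),
    }
}
```

In this model the first two plans both produce 0 rows, and only the last one, where the aggregate has lost its grouping, produces 1 row.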

alamb · 2024-08-11T11:15:06Z

datafusion/sqllogictest/test_files/aggregate.slt

+query R
+SELECT AVG(v1) FROM t1 GROUP BY false having true;
+----


I think the output here should have a single row -- it should be the same result as SELECT AVG(v1) FROM t1 GROUP BY false

I don't think this is correct anymore

Lordworms · 2024-08-11T15:11:48Z

Thank you for your work on this PR @Lordworms but I am now pretty confused about these changes 🤔

In my understanding, the bug involves the HAVING clause -- the HAVING clause is like the WHERE clause except that it happens after grouping, whereas the WHERE clause happens before grouping.

This page has a nice table
So for a query like
SELECT AVG(v1) FROM t1 GROUP BY false having false;
I would expect a query plan that looks something like
Filter(expr=false) <-- this is the HAVING(false)
  GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
    TableScan(t1)

Yes, I understand, but the gby expr is optimized into an empty set in the optimizer. To pass the gby information to the ExecutionPlan, I think I can only add an extra boolean member?

When I explained the query we can see that in fact this filter is added

> create table t1(v1 int) as values (1), (2);
0 row(s) fetched.
Elapsed 0.016 seconds.

> explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
...
| initial_logical_plan                                       | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Filter: Boolean(false)                                                                                                  |
|                                                            |     Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]]                                                            |
|                                                            |       TableScan: t1

However, then it looks like the filter is pushed down below the group by

| logical_plan after push_down_filter                        | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                             |
|                                                            |     Filter: Boolean(false)                                                                                                |
|                                                            |       TableScan: t1

Which finally results in

| logical_plan after eliminate_filter                        | Aggregate: groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                                             |
|                                                            |   EmptyRelation

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )

Also, in DuckDB those queries do return 0 rows.

Lordworms · 2024-08-11T15:12:29Z

I'll try not pushing the boolean down to see the plan. But I think in this case the key point is to distinguish whether to return an empty set or one row. To do that, we need to know whether an AggregateExec operates on a group by or not, since this query would return one row

select covar_samp(sq.column1, sq.column2) from (values (1.1, 2.2))

since it does not have a groupby anywhere.

I think the key point here for AggregateExec is to effectively distinguish whether to return an empty set or a null

I tried the same query with the pushdown filter disabled, and it still generated a row

Don't know if there is any better way than adding a new field

jayzhan211 · 2024-08-12T01:12:09Z

I think the way to handle false for group by and having expr is to rewrite the expression in the optimizer, probably with SimplifyExpr pattern matching. We rewrite it to an expr that returns empty rows if we find that the expression in group by or having evaluates to false.

btw, the issue is not strictly tied to group by + having

We should also return empty rows for these queries
select avg(a) from t group by false
select avg(a) from t group by true
select avg(a) from t having false
-- empty rows

select avg(a) from t having true
-- null
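
The expected results listed above can be written down as a small function. This is a sketch that assumes the table `t` is empty (the list does not say so explicitly) and that the GROUP BY and HAVING expressions are constants; `expected_rows` is a hypothetical helper, not part of DataFusion.

```rust
// Expected number of result rows for
//   select avg(a) from t [group by <constant>] [having <constant>]
// `having` is None when there is no HAVING clause.
fn expected_rows(input_rows: usize, has_group_by: bool, having: Option<bool>) -> usize {
    // With GROUP BY <constant> there is one group if any input rows exist, else none.
    // Without GROUP BY the whole input is a single group, even when it is empty.
    let groups = if has_group_by {
        usize::from(input_rows > 0)
    } else {
        1
    };
    // HAVING filters groups after aggregation.
    match having {
        Some(false) => 0,
        _ => groups,
    }
}
```

On an empty table this gives 0 rows for `group by false`, `group by true`, and `having false`, and a single NULL row for `having true`. On a non-empty table, `group by false having true` gives one row, the same as `group by false` alone.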

Lordworms · 2024-08-12T02:27:55Z

I think the way to handle false for group by and having expr is to rewrite the expression in the optimizer, probably with SimplifyExpr pattern matching. We rewrite it to an expr that returns empty rows if we find that the expression in group by or having evaluates to false.

btw, the issue is not strictly tied to group by + having

We should also return empty rows for these queries
select avg(a) from t group by false
select avg(a) from t group by true
select avg(a) from t having false
-- empty rows

select avg(a) from t having true
-- null

I don't think that's enough. The key here is that we need to return an empty set for those queries, but if the group by is global and the table is not empty, we cannot rewrite it like this. The only way for us to know whether to return an empty set or a null is in execution, since you don't know the number of records before that.

jayzhan211 · 2024-08-12T04:54:20Z

the only way for us to know whether to return an empty set or a null is in execution

What is the reason that we could not determine the result in the optimizer? Is there any counterexample that does not work if we rewrite the expression and that could only be determined in execution?

InList is one example that returns an empty set, and it is rewritten early in ExprSimplifier

query I
select x from t where x IN (1,2,3) AND x IN (4,5);
----

query TT
explain select x from t where x IN (1,2,3) AND x IN (4,5);
----
logical_plan EmptyRelation
physical_plan EmptyExec

Lordworms · 2024-08-12T05:02:39Z

the only way for us to know whether to return an empty set or a null is in execution

What is the reason that we could not determine the result in the optimizer? Is there any counterexample that does not work if we rewrite the expression and that could only be determined in execution?

InList is one example that returns an empty set, and it is rewritten early in ExprSimplifier
query I
select x from t where x IN (1,2,3) AND x IN (4,5);
----

query TT
explain select x from t where x IN (1,2,3) AND x IN (4,5);
----
logical_plan EmptyRelation
physical_plan EmptyExec

The point is not actually the optimizer. For example (using the official releases of DF and DuckDB)

we should generate an empty set when having is true and group_by is a constant

I was doing similar optimizations in the beginning, but after alamb asked me to add tests for when having is true, I found this problem.

I think we should control the AggExec's behaviour so I added this new field is_global_group_by. So this PR actually fixed two bugs, (1. global group_by + having false. 2. global group_by + having true)

jonahgao · 2024-08-12T06:38:39Z

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )

I think this solution is reasonable. Constant expressions can be regarded as independent of each other, that is, they are fake column references.

~~@Lordworms Perhaps you can try to fix it like this. I haven't verified it carefully yet.~~

UPDATE: #11748 can be fixed by disabling EliminateGroupByConstant. It seems that EliminateGroupByConstant is the root cause.

jonahgao · 2024-08-12T07:09:20Z

So this PR actually fixed two bugs, (1. global group_by + having false. 2. global group_by + having true)

The following case seems to be caused by EliminateGroupByConstant. ~~I think we can create a separate fix for it later.~~

DataFusion CLI v41.0.0
> SELECT AVG(v1) FROM t1 GROUP BY false;
+------------+
| avg(t1.v1) |
+------------+
|            |
+------------+
1 row(s) fetched.

In DuckDB:

D SELECT AVG(v1) FROM t1 GROUP BY false;
┌─────────┐
│ avg(v1) │
│ double  │
├─────────┤
│ 0 rows  │
└─────────┘

jayzhan211 · 2024-08-12T07:44:47Z

Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after EliminateGroupByConstant.

I think this is why you bring the is_global_group_by information down to physical layer.

I think another approach is we avoid EliminateGroupByConstant at all and we eliminate it when creating physical group by expression 🤔

jonahgao · 2024-08-12T08:47:30Z

I agree to disable EliminateGroupByConstant because it does not work correctly with empty input.
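
One possible middle ground between disabling the rule and keeping it as-is would be to drop constant group-by expressions only when a non-constant one remains, so the aggregate never turns into a global one. The sketch below is purely illustrative and was not settled on in this thread: `GroupExpr` and `eliminate_constant_group_exprs` are hypothetical names and do not reflect how EliminateGroupByConstant is implemented.

```rust
// Toy group-by expression for illustration only (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum GroupExpr {
    Column(String),
    Constant(bool),
}

// Drops constant group-by expressions, but only if at least one
// non-constant expression remains afterwards.
fn eliminate_constant_group_exprs(exprs: Vec<GroupExpr>) -> Vec<GroupExpr> {
    let has_non_constant = exprs.iter().any(|e| matches!(e, GroupExpr::Column(_)));
    if !has_non_constant {
        // Every expression is constant: removing them all would turn a grouped
        // aggregate into a global one, which emits a row even on empty input.
        return exprs;
    }
    exprs
        .into_iter()
        .filter(|e| matches!(e, GroupExpr::Column(_)))
        .collect()
}
```

Under this guard, `GROUP BY a, false` is still simplified to `GROUP BY a`, while `GROUP BY false` is left untouched and keeps its grouped semantics.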

Lordworms · 2024-08-12T16:27:16Z

EliminateGroupByConstant

Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after EliminateGroupByConstant.

I think this is why you bring the is_global_group_by information down to physical layer.

exactly

I think another approach is we avoid EliminateGroupByConstant at all and we eliminate it when creating physical group by expression 🤔

I don't think we should completely avoid this rule, since it has its own uses. For example, here, if we disable it, the plan would be like

and with it enabled, the plan is

we introduced an unnecessary Projection which I don't feel suitable, since it not only introduces another operator but also violates some basic rules (we should do projection before aggregate to minimize cost). Also, I think there should be other test cases that may require AggregateExec to emit an empty set, since right now it always returns a null with no input.

jayzhan211 · 2024-08-13T02:43:47Z

we introduced an unnecessary Projection which I don't feel suitable

I think this is optimize_projections's job.

btw, a normal query has a projection too
select avg(a) from t group by a;

Projection: avg(t.a)
      Aggregate: groupBy=[[t.a]], aggr=[[avg(CAST(t.a AS Float64))]]
        TableScan: t projection=[a]

alamb · 2024-08-14T21:35:39Z

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

github-actions · 2024-10-14T02:00:54Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Aug 9, 2024

Lordworms force-pushed the issue_11748 branch from a1205aa to edbf530 Compare August 9, 2024 01:49

alamb previously approved these changes Aug 9, 2024

alamb reviewed Aug 9, 2024

datafusion/sql/src/select.rs (outdated)

alamb mentioned this pull request Aug 9, 2024

Minor: use lit(true) and lit(false) more #11904

Merged

jonahgao reviewed Aug 9, 2024

datafusion/sql/src/select.rs (outdated)

Lordworms force-pushed the issue_11748 branch from edbf530 to ddcb045 Compare August 10, 2024 01:04

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Aug 10, 2024

alamb reviewed Aug 10, 2024

github-actions bot added the core Core DataFusion crate label Aug 10, 2024

Lordworms added 4 commits August 10, 2024 22:02

fix: bugs when having and group by are all false

3f713fa

fix check

b9928a0

adding more situation

6c379a9

fix bugs

ad7d82e

Lordworms force-pushed the issue_11748 branch from 5b8177d to ad7d82e Compare August 11, 2024 05:23

alamb reviewed Aug 11, 2024

alamb marked this pull request as draft August 14, 2024 21:35

jayzhan211 mentioned this pull request Aug 17, 2024

Invalid aggregate SQL query with HAVING can be executed without error (SQLancer-TLP) #12013

Closed

github-actions bot added the Stale PR has not had any activity for some time label Oct 14, 2024

github-actions bot closed this Oct 21, 2024
