Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: bugs when having and group by are all false #11897

Closed
wants to merge 4 commits into from

Conversation

Lordworms
Copy link
Contributor

Which issue does this PR close?

Closes #11748

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Aug 9, 2024
alamb
alamb previously approved these changes Aug 9, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me -- than you @Lordworms

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules labels Aug 10, 2024
let having_expr_post_aggr =
rebase_expr(having_expr, &aggr_projection_exprs, input)?;

let having_expr_post_aggr = if is_constant_expression(having_expr) {
Copy link
Contributor

@alamb alamb Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check needs to be for a false constant (not just a constant) right? I'll suggest a test to show this


query R
SELECT AVG(v1) FROM t1 having false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please also add negative (positive?) tests like:

SELECT AVG(v1) FROM t1 GROUP BY false having true;

And

SELECT AVG(v1) FROM t1 GROUP BY false having 1 = 1;

@github-actions github-actions bot added the core Core DataFusion crate label Aug 10, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your work on this PR @Lordworms but I am now pretty confused about these changes 🤔

In my understanding, the bug involves he HAVING clause -- the HAVING clause is like the WHERE clause except that it happens after grouping where the WHERE clause happens before grouping.

This page has a nice table

Screenshot 2024-08-11 at 7 17 33 AM

So for a query like

SELECT AVG(v1) FROM t1 GROUP BY false having false;

I would expect a query plan that looks something like

Filter(expr=false) <-- this is the HAVING(false)
  GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
    TableScan(t1)

When I explained the query we can see that in fact this filter is added

> create table t1(v1 int) as values (1), (2);
0 row(s) fetched.
Elapsed 0.016 seconds.

> explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
...
| initial_logical_plan                                       | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Filter: Boolean(false)                                                                                                  |
|                                                            |     Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]]                                                            |
|                                                            |       TableScan: t1

However, then it looks like the filter is pushed down below the group by

| logical_plan after push_down_filter                        | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                             |
|                                                            |     Filter: Boolean(false)                                                                                                |
|                                                            |       TableScan: t1

Which finally results in

| logical_plan after eliminate_filter                        | Aggregate: groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                                             |
|                                                            |   EmptyRelation

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )

query R
SELECT AVG(v1) FROM t1 GROUP BY false having true;
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the output here should have a single row -- it should be the same result as SELECT AVG(v1) FROM t1 GROUP BY false

@alamb alamb dismissed their stale review August 11, 2024 11:27

I don't think this is correct anymore

@Lordworms
Copy link
Contributor Author

Lordworms commented Aug 11, 2024

Thank you for your work on this PR @Lordworms but I am now pretty confused about these changes 🤔

In my understanding, the bug involves he HAVING clause -- the HAVING clause is like the WHERE clause except that it happens after grouping where the WHERE clause happens before grouping.

This page has a nice table

Screenshot 2024-08-11 at 7 17 33 AM So for a query like
SELECT AVG(v1) FROM t1 GROUP BY false having false;

I would expect a query plan that looks something like

Filter(expr=false) <-- this is the HAVING(false)
  GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
    TableScan(t1)

Yes, I understand, but the gby expr is optimized into an empty set in Optimizer, in order to pass the gby information to ExecutionPlan, I think I can only add an extra boolean member?

When I explained the query we can see that in fact this filter is added

> create table t1(v1 int) as values (1), (2);
0 row(s) fetched.
Elapsed 0.016 seconds.

> explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
...
| initial_logical_plan                                       | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Filter: Boolean(false)                                                                                                  |
|                                                            |     Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]]                                                            |
|                                                            |       TableScan: t1

However, then it looks like the filter is pushed down below the group by

| logical_plan after push_down_filter                        | Projection: avg(t1.v1)                                                                                                    |
|                                                            |   Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                             |
|                                                            |     Filter: Boolean(false)                                                                                                |
|                                                            |       TableScan: t1

Which finally results in

| logical_plan after eliminate_filter                        | Aggregate: groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]]                                                             |
|                                                            |   EmptyRelation

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )

dismissed their stale review

Also in duckDB, those query do returns 0 rows.
image

@Lordworms
Copy link
Contributor Author

Lordworms commented Aug 11, 2024

I'll try not to push boolean down to see the plan. but I think in this case, the key point is to distinguish whether return an empty set or one row, in order to reach that, we need to do whether a AggregateExec performs on a group by or not?, since this query would return one row

select covar_samp(sq.column1, sq.column2) from (values (1.1, 2.2))

since it does not have a groupby anywhere.

I think the key point here for AggregateExec is to effectively distinguish whether to return a empty set or a null

I tried the same query with pushdown filter diabled, it still generated a row
image

image

Don't know if there is any better way than adding a new field

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 12, 2024

I think the way to handle false for group by and having expr is rewrite the expression in optimizer, probably SimplifyExpr pattern matching. We rewrite it to expr that returns empty row if we found the expression in group by or having is evaluated to false.

btw, the issue is not strictly tied to group by + having

We should also return empty rows for these queries
select avg(a) from t group by false
select avg(a) from t group by true
select avg(a) from t having false
-- empty rows

select avg(a) from t having true
-- null

@Lordworms
Copy link
Contributor Author

Lordworms commented Aug 12, 2024

I think the way to handle false for group by and having expr is rewrite the expression in optimizer, probably SimplifyExpr pattern matching. We rewrite it to expr that returns empty row if we found the expression in group by or having is evaluated to false.

btw, the issue is not strictly tied to group by + having

We should also return empty rows for these queries select avg(a) from t group by false select avg(a) from t group by true select avg(a) from t having false -- empty rows

select avg(a) from t having true -- null

I don't think that's enough, the key here is we need to return empty set for those queries, but if group by is global and the table is not empty, we could not rewrite like this, the only way for us to know whether return an empty set or a null is in execution, since you don't know the size of the records before that.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 12, 2024

the only way for us to know whether return an empty set or a null is in execution

What is the reason that we could not determine the result in the the optimizer? Is there any counter example that does not work if we rewrite expression and could only be determined in execution?

InList is one of the example that returns empty set and it is rewritten early in ExprSimplifier

query I
select x from t where x IN (1,2,3) AND x IN (4,5);
----

query TT
explain select x from t where x IN (1,2,3) AND x IN (4,5);
----
logical_plan EmptyRelation
physical_plan EmptyExec

@Lordworms
Copy link
Contributor Author

Lordworms commented Aug 12, 2024

the only way for us to know whether return an empty set or a null is in execution

What is the reason that we could not determine the result in the the optimizer? Is there any counter example that does not work if we rewrite expression and could only be determined in execution?

InList is one of the example that returns empty set and it is rewritten early in ExprSimplifier

query I
select x from t where x IN (1,2,3) AND x IN (4,5);
----

query TT
explain select x from t where x IN (1,2,3) AND x IN (4,5);
----
logical_plan EmptyRelation
physical_plan EmptyExec

The point is not the optimizer actually, for example(using offical release of DF and duckdb)
image

we should generate an empty set when having is true and group_by is a constant

I was doing similar optimize things in the begining, but after alamb asked me to add tests when having is true. I found out this problem.

I think we should control the AggExec's behaviour so I added this new field is_global_group_by. So this PR actually fixed two bugs, (1. global group_by + having false. 2. global group_by + having true)

@jayzhan211

This comment was marked as outdated.

@jonahgao
Copy link
Member

jonahgao commented Aug 12, 2024

So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )

I think this solution is reasonable. Constant expressions can be regarded as independent of each other, that is, they are fake column references.

@Lordworms Perhaps you can try to fix it like this. I haven't verified it carefully yet.

UPDATE: #11748 can be fixed by disabling EliminateGroupByConstant. It seems that EliminateGroupByConstant is the root cause.

@jonahgao
Copy link
Member

jonahgao commented Aug 12, 2024

So this PR actually fixed two bugs, (1. global group_by + having false. 2. global group_by + having true)

The following case seems to be caused by EliminateGroupByConstant. I think we can create a separate fix for it later.

DataFusion CLI v41.0.0
> SELECT AVG(v1) FROM t1 GROUP BY false;
+------------+
| avg(t1.v1) |
+------------+
|            |
+------------+
1 row(s) fetched.

In DuckDB:

D SELECT AVG(v1) FROM t1 GROUP BY false;
┌─────────┐
│ avg(v1) │
│ double  │
├─────────┤
│ 0 rows  │
└─────────┘

@jayzhan211
Copy link
Contributor

Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after EliminateGroupByConstant.

I think this is why you bring the is_global_group_by information down to physical layer.

I think another approach is we avoid EliminateGroupByConstant at all and we eliminate it when creating physical group by expression 🤔

@jonahgao
Copy link
Member

I agree to disable EliminateGroupByConstant because it does not work correctly with empty input.

@Lordworms
Copy link
Contributor Author

Lordworms commented Aug 12, 2024

EliminateGroupByConstant

Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after EliminateGroupByConstant.

I think this is why you bring the is_global_group_by information down to physical layer.

exactly

I think another approach is we avoid EliminateGroupByConstant at all and we eliminate it when creating physical group by expression 🤔

I don't think we should completely avoid this rule since it has its own usage, for example here, if we disable it, the plan would be like
image

and with enable it, the plan is
image

we introduced an unnecessary Projection which I don't feel suitable, since it not only introduced another operator but also violates some basic rules(we should do projection before aggregate to minimize cost). Also I think there should be other test cases which may require AggregateExec to emit empty set, since right now it always returns an null with no input.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Aug 13, 2024

we introduced an unnecessary Projection which I don't feel suitable

I think this is what optimize_projections's job.

btw, normal query has projection too
select avg(a) from t group by a;

Projection: avg(t.a)
      Aggregate: groupBy=[[t.a]], aggr=[[avg(CAST(t.a AS Float64))]]
        TableScan: t projection=[a]

@alamb alamb marked this pull request as draft August 14, 2024 21:35
@alamb
Copy link
Contributor

alamb commented Aug 14, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 14, 2024
@github-actions github-actions bot closed this Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules sql SQL Planner sqllogictest SQL Logic Tests (.slt) Stale PR has not had any activity for some time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect result returned by a group by query (SQLancer-TLP)
4 participants