fix: bugs when having and group by are all false #11897
Makes sense to me -- thank you @Lordworms
let having_expr_post_aggr =
    rebase_expr(having_expr, &aggr_projection_exprs, input)?;

let having_expr_post_aggr = if is_constant_expression(having_expr) {
The check needs to be for a false constant (not just a constant), right? I'll suggest a test to show this.
query R
SELECT AVG(v1) FROM t1 having false;
Can we please also add negative (positive?) tests like:
SELECT AVG(v1) FROM t1 GROUP BY false having true;
And
SELECT AVG(v1) FROM t1 GROUP BY false having 1 = 1;
Thank you for your work on this PR @Lordworms but I am now pretty confused about these changes 🤔
In my understanding, the bug involves the HAVING clause -- the HAVING clause is like the WHERE clause, except that it is applied after grouping, whereas the WHERE clause is applied before grouping.
This page has a nice table
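To make the WHERE/HAVING ordering concrete, here is a minimal sketch in plain Rust (not DataFusion code; every function name is hypothetical). It models a global `AVG` over an integer column and applies a constant predicate either before or after the aggregate:

```rust
/// Global AVG over the rows: always yields exactly one output row,
/// whose value is NULL (`None`) when the input is empty.
fn global_avg(rows: &[i64]) -> Vec<Option<f64>> {
    if rows.is_empty() {
        vec![None]
    } else {
        let sum: i64 = rows.iter().sum();
        vec![Some(sum as f64 / rows.len() as f64)]
    }
}

/// WHERE <constant>: the predicate filters the input rows before aggregation.
fn where_then_avg(rows: &[i64], predicate: bool) -> Vec<Option<f64>> {
    let kept: Vec<i64> = rows.iter().copied().filter(|_| predicate).collect();
    global_avg(&kept)
}

/// HAVING <constant>: the predicate filters the rows produced by the aggregate.
fn avg_then_having(rows: &[i64], predicate: bool) -> Vec<Option<f64>> {
    global_avg(rows).into_iter().filter(|_| predicate).collect()
}
```

Under this toy model, `where_then_avg(&[1, 2], false)` yields one NULL row, while `avg_then_having(&[1, 2], false)` yields no rows at all, so the two placements of a constant `false` filter are not interchangeable.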

So for a query like
SELECT AVG(v1) FROM t1 GROUP BY false having false;
I would expect a query plan that looks something like
Filter(expr=false) <-- this is the HAVING(false)
GroupBy(agg=AVG(v1), gby=false) <-- this is the GROUP BY expr
TableScan(t1)
When I ran EXPLAIN on the query, we can see that this filter is in fact added:
> create table t1(v1 int) as values (1), (2);
0 row(s) fetched.
Elapsed 0.016 seconds.
> explain verbose SELECT AVG(v1) FROM t1 GROUP BY false having false;
...
| initial_logical_plan | Projection: avg(t1.v1) |
| | Filter: Boolean(false) |
| | Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(t1.v1)]] |
| | TableScan: t1
However, then it looks like the filter is pushed down below the group by
| logical_plan after push_down_filter | Projection: avg(t1.v1) |
| | Aggregate: groupBy=[[Boolean(false)]], aggr=[[avg(CAST(t1.v1 AS Float64))]] |
| | Filter: Boolean(false) |
| | TableScan: t1
Which finally results in
| logical_plan after eliminate_filter | Aggregate: groupBy=[[]], aggr=[[avg(CAST(t1.v1 AS Float64))]] |
| | EmptyRelation
So it seems a solution might be to refine the conditions under which filters can be pushed below grouping (perhaps we shouldn't push filters below grouping when there are no column references in the filter 🤔 )
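A rough sketch of what such a guard could look like, written against a toy expression type rather than DataFusion's actual `Expr` or the `push_down_filter` rule (the enum, `collect_columns`, and `can_push_below_aggregate` are all hypothetical names for illustration):

```rust
/// Toy expression tree (an assumption for illustration, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    BoolLiteral(bool),
    IntLiteral(i64),
    Eq(Box<Expr>, Box<Expr>),
}

/// Collect the names of all columns referenced by `expr`.
fn collect_columns<'a>(expr: &'a Expr, out: &mut Vec<&'a str>) {
    match expr {
        Expr::Column(name) => out.push(name.as_str()),
        Expr::BoolLiteral(_) | Expr::IntLiteral(_) => {}
        Expr::Eq(left, right) => {
            collect_columns(left, out);
            collect_columns(right, out);
        }
    }
}

/// A filter may move below the aggregate only if it references at least one
/// column and every referenced column is a GROUP BY column. Constant filters
/// such as `false` or `1 = 1` reference no columns, so they stay above.
fn can_push_below_aggregate(filter: &Expr, group_by_columns: &[&str]) -> bool {
    let mut columns = Vec::new();
    collect_columns(filter, &mut columns);
    !columns.is_empty() && columns.iter().all(|c| group_by_columns.contains(c))
}
```

With this check, `HAVING false` and `HAVING 1 = 1` would both remain above the aggregate, while a predicate on a grouping column could still be pushed down.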
query R
SELECT AVG(v1) FROM t1 GROUP BY false having true;
----
I think the output here should have a single row -- it should be the same result as SELECT AVG(v1) FROM t1 GROUP BY false
Yes, I understand, but the GROUP BY expression is optimized into an empty set in the optimizer. To pass the GROUP BY information to the ExecutionPlan, I think I can only add an extra boolean member?
I think the way to handle btw, the issue is not strictly tied to group by + having We should also return empty rows for these queries
I don't think that's enough. The key here is that we need to return an empty set for those queries, but if the group by is global and the table is not empty, we cannot rewrite it like this. The only way to know whether to return an empty set or a NULL is at execution time, since the number of records is not known before then.
What is the reason that we cannot determine the result in the optimizer? Is there any counterexample that does not work if we rewrite the expression and can only be determined at execution time? InList is one example that returns an empty set, and it is rewritten early in ExprSimplifier.
I think this solution is reasonable. Constant expressions can be regarded as independent of each other, that is, they are fake column references.
UPDATE: #11748 can be fixed by disabling |
The following case seems to be caused by DataFusion CLI v41.0.0
> SELECT AVG(v1) FROM t1 GROUP BY false;
+------------+
| avg(t1.v1) |
+------------+
| |
+------------+
1 row(s) fetched.
In DuckDB:
D SELECT AVG(v1) FROM t1 GROUP BY false;
┌─────────┐
│ avg(v1) │
│ double │
├─────────┤
│ 0 rows │
└─────────┘
Ideally group by constant should be eliminated, but the result is different when there is no row and we can't differentiate it after I think this is why you bring the I think another approach is we avoid |
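The point about empty input can be illustrated with a small standalone Rust sketch (a toy model, not DataFusion internals; both function names are made up for this example). A grouped aggregate emits one row per group, so it emits nothing when there are no rows, whereas an aggregate without GROUP BY always emits one row:

```rust
use std::collections::BTreeMap;

/// AVG with `GROUP BY false`: every input row falls into the single group
/// keyed by `false`, and one output row is produced per existing group.
fn avg_group_by_constant(rows: &[i64]) -> Vec<Option<f64>> {
    let mut groups: BTreeMap<bool, (i64, usize)> = BTreeMap::new();
    for &value in rows {
        let entry = groups.entry(false).or_insert((0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    groups
        .values()
        .map(|&(sum, count)| Some(sum as f64 / count as f64))
        .collect()
}

/// AVG without GROUP BY: exactly one output row, NULL (`None`) on empty input.
fn avg_without_group_by(rows: &[i64]) -> Vec<Option<f64>> {
    if rows.is_empty() {
        vec![None]
    } else {
        let sum: i64 = rows.iter().sum();
        vec![Some(sum as f64 / rows.len() as f64)]
    }
}
```

On non-empty input the two functions agree, which is why dropping a constant GROUP BY looks safe. They diverge only on empty input: zero rows for the grouped form, one NULL row for the global form.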
I agree to disable |
I think this is what btw, normal query has projection too
Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look.
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
Which issue does this PR close?
Closes #11748
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?