-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48503][SQL] Allow grouping on expressions in scalar subqueries, if they are bound to outer rows #47388
Conversation
// Collect the inner query expressions that are guaranteed to have a single value for each | ||
// outer row. See comment on getCorrelatedEquivalentInnerExpressions. | ||
val correlatedEquivalentExprs = getCorrelatedEquivalentInnerExpressions(query) | ||
// Unlike 'groupByCols', preserve the entire grouping expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move groupByCols
up, or this declaration + comment down? Forward comment reference here is confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! just rephrased the comment
-- Group-by expression is the same as the one we filter on - legal | ||
select *, (select count(*) from y where x1 = y1 and cast(y2 as double) = x1 + 1 | ||
group by cast(y2 as double)) from x; | ||
|
||
|
||
-- Illegal queries | ||
select * from x where (select count(*) from y where y1 > x1 group by y1) = 1; | ||
select *, (select count(*) from y where y1 + y2 = x1 group by y1) from x; | ||
select *, (select count(*) from y where x1 = y1 and y2 + 10 = x1 + 1 group by y2) from x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I'm not really sure how these new test cases reflect the changes made in the code. Could you elaborate how
Group-by expression is the same as the one we filter on
is legal and why the other one is illegal?
- edit: the idea is grouping expressions need to be present as a whole in predicates.
- Could you also more test cases for the positive and negative cases you commented in
getEquivalentToOuter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1-discussed offline
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2- added a test on expr = outer1 + outer2
To extend it a bit more, shall we allow |
We will look into this as follow-up steps, but ultimately this should be a runtime-level check (checking that the join returns at most 1 row), so that we can allow all kinds of subqueries and let runtime throw a runtime error for bad cases. |
thanks, merging to master! |
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Extends previous work in #46839, allowing the grouping expressions to be bound to outer references.
Most common example is
select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))
Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it.
Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries.
Why are the changes needed?
Extends supported subqueries
Does this PR introduce any user-facing change?
Yes, previously failing queries are now passing
How was this patch tested?
Query tests
Was this patch authored or co-authored using generative AI tooling?
No