-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48503][SQL] Fix invalid scalar subqueries with group-by on non-equivalent columns that were incorrectly allowed #46839
Conversation
plan match { | ||
case Filter(cond, child) => | ||
val correlated = AttributeSet(splitConjunctivePredicates(cond) | ||
.filter(containsOuter) // TODO: can remove this line to allow e.g. where x = 1 group by x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I intend to enable that in a separate PR, to reduce risk here.
_: SubqueryAlias => | ||
AttributeSet(plan.children.flatMap(child => getCorrelatedEquivalentInnerColumns(child))) | ||
|
||
case _ => AttributeSet.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of operators handled here is by no means comprehensive and ensuring it covers enough is tricky. I used the list in LogicalPlanVisitor as a starting point, but in my testing I discovered that e.g. SubqueryAlias also needs to be handled to cover cases with FROM subqueries inside the scalar subquery.
Suggestions on other important operators to handle or other potential approaches welcome.
(In the long run I think we need to replace this entire check with a runtime check as described in https://issues.apache.org/jira/browse/SPARK-48501, but that's highly nontrivial)
thanks, merging to master! |
…-equivalent columns that were incorrectly allowed ### What changes were proposed in this pull request? Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were previously allowed and returned wrong results. For example, this query is not legal and should give an error, but instead we incorrectly allowed it and it returns wrong results prior to this PR (full repro with table data in the jira): ``` select *, (select count(*) from y where y1 > x1 group by y1) from x; ``` It returns two rows, even though there's only one row of x. The correct result is an error, because there is more than one row returned by the scalar subquery. Another problem case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g. `where y1 + y2 = x1 group by y1`. See the comments in the code and the tests for more examples. This PR fixes the logic which checks for valid vs invalid group-bys. However, note that this new logic may block some queries that are actually valid, for example `a + 1 = outer(b)` is valid but would be rejected. Therefore, we add a conf flag that can be used to restore the legacy behavior, as well as logging for when the legacy behavior is used and differs from the new behavior. (In general, to accurately run valid queries and reject invalid queries, the check must be moved from compile-time to run-time - see https://issues.apache.org/jira/browse/SPARK-48501.) This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, this bug has existed since non-equality correlation was added (~2 years ago). ### Why are the changes needed? Fix invalid queries returning wrong results ### Does this PR introduce _any_ user-facing change? Yes, block subqueries with invalid group-bys. ### How was this patch tested? Add tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#46839 from jchen5/scalar-subq-gby. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ual to constant ### What changes were proposed in this pull request? We can enable scalar subqueries that have `group by a` if there's a predicate `a = 1`, because these predicates guarantee the group-by produces at most one row. (This builds on top of #46839 and enables shapes there were unsupported prior to that PR as well.) ### Why are the changes needed? Support valid subquery shapes. ### Does this PR introduce _any_ user-facing change? Yes, support subquery shapes. ### How was this patch tested? Unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46902 from jchen5/subq-gby-eq. Authored-by: Jack Chen <jack.chen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in #46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…, if they are bound to outer rows ### What changes were proposed in this pull request? Extends previous work in apache#46839, allowing the grouping expressions to be bound to outer references. Most common example is `select *, (select count(*) from T_inner where cast(T_inner.x as date) = T_outer.date group by cast(T_inner.x as date))` Here, we group by cast(T_inner.x as date) which is bound to an outer row. This guarantees that for every outer row, there is exactly one value of cast(T_inner.x as date), so it is safe to group on it. Previously, we required that only columns can be bound to outer expressions, thus forbidding such subqueries. ### Why are the changes needed? Extends supported subqueries ### Does this PR introduce _any_ user-facing change? Yes, previously failing queries are now passing ### How was this patch tested? Query tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47388 from agubichev/group_by_cols. Authored-by: Andrey Gubichev <andrey.gubichev@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Fixes CheckAnalysis to reject invalid scalar subquery group-bys that were previously allowed and returned wrong results.
For example, this query is not legal and should give an error, but instead we incorrectly allowed it and it returns wrong results prior to this PR (full repro with table data in the jira):
It returns two rows, even though there's only one row of x. The correct result is an error, because there is more than one row returned by the scalar subquery.
Another problem case is if the correlation condition is an equality but it's under another operator like an OUTER JOIN or UNION. Various other expressions that are not equi-joins between the inner and outer fields hit this too, e.g.
where y1 + y2 = x1 group by y1
. See the comments in the code and the tests for more examples.This PR fixes the logic which checks for valid vs invalid group-bys. However, note that this new logic may block some queries that are actually valid, for example
a + 1 = outer(b)
is valid but would be rejected. Therefore, we add a conf flag that can be used to restore the legacy behavior, as well as logging for when the legacy behavior is used and differs from the new behavior. (In general, to accurately run valid queries and reject invalid queries, the check must be moved from compile-time to run-time - see https://issues.apache.org/jira/browse/SPARK-48501.)This is a longstanding bug. The bug is in CheckAnalysis in checkAggregateInScalarSubquery. It allows grouping columns that are present in correlation predicates, but doesn’t check whether those predicates are equalities - because when that code was written, non-equality correlation wasn’t allowed. Therefore, this bug has existed since non-equality correlation was added (~2 years ago).
Why are the changes needed?
Fix invalid queries returning wrong results
Does this PR introduce any user-facing change?
Yes, block subqueries with invalid group-bys.
How was this patch tested?
Add tests
Was this patch authored or co-authored using generative AI tooling?
No