-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery #18968
Conversation
Test build #80765 has finished for PR 18968 at commit
|
Test build #80767 has finished for PR 18968 at commit
|
Test build #80769 has finished for PR 18968 at commit
|
Test build #80778 has finished for PR 18968 at commit
|
This reads misleading, actually this PR does not change |
// | ||
// Filter key#201 IN (list#200 [(value#207 = min(value)#204)]) | ||
// : +- Project [key#206, value#207] | ||
// : +- Filter (value#207 > val_9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why you pick a different example in PR description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The example in PR description is constructed later. This example is I encountered in subquery_in_having.q
. I'll make it consistent.
can you also include the SQL statement? I feel a little hard to read the query plan with list query as I'm not faimilar with this part. |
@cloud-fan Thanks for the comments. I've updated the PR description and added a SQL statement. |
Test build #80812 has finished for PR 18968 at commit
|
@viirya Hi Simon, many thanks for finding this issue. Instead of adding the compensation code in the resolve logic for in-subquery expression can we consider to move the semantic checking of comparing the count of arguments in either side of in-subquery expression to checkAnalysis instead ? We do several other checks in checkAnalysis for subquery expression. I just feel this may be a little cleaner ? For your reference, i quickly tried it here |
@dilipbiswal Thanks for comment. This issue is happened at optimization phase, the query plan is resolved after analysis and of course passes |
Moving such checks from If we want to capture the potential analysis errors introduced by the optimizer rule |
I agree that the original check should be in The additional check added by this change can be put in |
Re-thinking it, I agree that this kind of check should put in |
@viirya Isn't checkAnalysis supposed to catch such semantic errors ? In my thinking, this particular error is to make sure the left hand side number of args matching the right hand side is to catch user errors in the input query. After that is done either during analyzer or post analysis phase such as checkAnalysis , we shouldn't be doing this particular check. My reason is that , if optimizer causes a side effect such that it makes the original check invalid, we shouldn't be returning the particular error that we return today as that wouldn't mean much to the user as thats not the query he typed in , correct ? |
@dilipbiswal But we still need to detect such violation of check and report the error if optimization rules cause the side effect that makes the plan unresolved. I think we can't assume that in post analysis phase, we don't do anything that can break the integrity of the analyzed plan. |
@viirya I agree that we should report violations. The only question i had is whether we should tie this particular check to the expression being resolved or not. In the old version of the code, we used to do this particular check in the analyzer. However after doing the check, we used to change the expression to PredicateSubquery after pulling up the correlated predicates. I am fine with whatever you decide. If we keep the check, we should make it such that it reads a little better than how it reads now :-) I just wanted to give an example to help illustrate and in the process learn based on response. So today we do quite a bit of semantic checks for Subqueries and so many other operators in checkAnalysis. Say for a subquery expression, we did pass the semantic checks and later on in optimizer we violate those checks by rewriting the plans in weird ways , we don't do those checks again , right ? In other words, we don't tie those checks with the operator being resolved or not. I think this check in question is one such semantic check. |
@gatorsmile @viirya There was pr from Natt pr. Is it possible to get some feedback on the idea ? If we do this, the next step was to combine the pullup and rewrite to one single rule so then this problem wouldn't occur :-). Actually i had this change made in my local branch on top of natt's changes a while back. |
That's right, if we combine the two rules, we won't produce the unresolved |
@viirya ok. |
8dbafb2
to
99d9570
Compare
@cloud-fan Based on my understanding, I revise this change. May you look if it is what you think? Thanks. |
Thanks @viirya @cloud-fan. This looks much better. Can we not preserve the user facing error we raise today? I think the error we raise today is better for the user ? Even if we were to have another case for ListQuery but with simplified type checking, it would be worth it , no ? |
Test build #80988 has finished for PR 18968 at commit
|
Test build #80989 has finished for PR 18968 at commit
|
Right side columns: | ||
[t2.`t2a`, t2.`t2b`]. | ||
; | ||
cannot resolve '(t1.`t1a` IN (listquery(t1.`t1a`)))' due to data type mismatch: Arguments must be same type but were: IntegerType != StructType(StructField(t2a,IntegerType,false), StructField(t2b,IntegerType,false)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new message is confusing when users using the In Subquery.
@dilipbiswal @gatorsmile Regarding with the error message, I do think so. |
Maybe we can still have a case for ListQuery, but it is simpler and mainly for better message? |
SGTM |
Test build #81027 has finished for PR 18968 at commit
|
Test build #81028 has finished for PR 18968 at commit
|
} else { | ||
childOutputs.head.dataType | ||
} | ||
override lazy val resolved: Boolean = childrenResolved && plan.resolved && childOutputs.nonEmpty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we fill in childOutputs
, this ListQuery
cannot be resolved. Otherwise, to access its dataType
causes failure in In.checkInputDataTypes
.
@@ -1286,8 +1286,16 @@ class Analyzer( | |||
resolveSubQuery(s, plans)(ScalarSubquery(_, _, exprId)) | |||
case e @ Exists(sub, _, exprId) if !sub.resolved => | |||
resolveSubQuery(e, plans)(Exists(_, _, exprId)) | |||
case In(value, Seq(l @ ListQuery(sub, _, exprId))) if value.resolved && !sub.resolved => | |||
val expr = resolveSubQuery(l, plans)(ListQuery(_, _, exprId)) | |||
case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !sub.resolved => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya If we modified to
case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !l.resolved
would we still require the following case statement ? The following case looks a little
strange as we are in the resolveSubqueries routine and check for sub.resolved == true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought to change resolveSubQuery
to avoid re-analysis on a resolved plan. But since it is just once, maybe not a big deal. So finally I leave it untouched.
We could. resolveSubquery will do at least one time analysis to resolve the
subquery. It might be a waste for already resolved subquery.
Let me try if I can remove the case with little changes.
On Aug 24, 2017 2:10 PM, "Dilip Biswal" <notifications@github.com> wrote:
*@dilipbiswal* commented on this pull request.
------------------------------
In sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/an
alysis/Analyzer.scala
<#18968 (comment)>:
@@ -1286,8 +1286,16 @@ class Analyzer(
resolveSubQuery(s, plans)(ScalarSubquery(_, _, exprId))
case e @ Exists(sub, _, exprId) if !sub.resolved =>
resolveSubQuery(e, plans)(Exists(_, _, exprId))
- case In(value, Seq(l @ ListQuery(sub, _, exprId))) if
value.resolved && !sub.resolved =>
- val expr = resolveSubQuery(l, plans)(ListQuery(_, _, exprId))
+ case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if
value.resolved && !sub.resolved =>
@viirya <https://github.com/viirya> If we modified to
case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if
value.resolved && !l.resolved
would we still require the following case statement ? The following case
looks a little
strange as we are in the resolveSubqueries routine and check for
sub.resolved == true.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#18968 (review)>,
or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAEM9xQlIzMPgV5_BoB_PK8tJCXkzilfks5sbRO8gaJpZM4O5wyz>
.
|
Thanks Simon. Changes look good to me. cc @cloud-fan @gatorsmile for any additional comments. |
Test build #81071 has finished for PR 18968 at commit
|
LGTM, merging to master! |
Thanks @cloud-fan @gatorsmile @dilipbiswal |
…ild output ### What changes were proposed in this pull request? Update `ListQuery` to only store the number of columns of the original plan, instead of directly storing the original plan output attributes. ### Why are the changes needed? Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, `DeduplicateRelations` may change the plan output, and today we do not update `ListQuery.childOutputs` to keep sync. `ListQuery.childOutputs` was added by #18968 . It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan. ### Does this PR introduce _any_ user-facing change? No, there is no user-facing bug exposed. ### How was this patch tested? a new plan test Closes #40851 from cloud-fan/list_query. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…ild output ### What changes were proposed in this pull request? Update `ListQuery` to only store the number of columns of the original plan, instead of directly storing the original plan output attributes. ### Why are the changes needed? Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, `DeduplicateRelations` may change the plan output, and today we do not update `ListQuery.childOutputs` to keep sync. `ListQuery.childOutputs` was added by apache#18968 . It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan. ### Does this PR introduce _any_ user-facing change? No, there is no user-facing bug exposed. ### How was this patch tested? a new plan test Closes apache#40851 from cloud-fan/list_query. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule
PullupCorrelatedPredicates
can produce unresolved plans.For a correlated IN query looks like:
The query plan might look like:
After
PullupCorrelatedPredicates
, it produces query plan like:Because the correlated predicate involves another attribute
d#3
in subquery, it has been pulled out and added into theProject
on the top of the subquery.When
list
inIn
contains just oneListQuery
,In.checkInputDataTypes
checks if the size ofvalue
expressions matches the output size of subquery. In the above example, there is onlyvalue
expression and the subquery output has two attributesc#2, d#3
, so it fails the check andIn.resolved
returnsfalse
.We should not let
In.checkInputDataTypes
wrongly report unresolved plans to fail the structural integrity check.How was this patch tested?
Added test.