[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery #18968

viirya · 2017-08-17T04:20:50Z

What changes were proposed in this pull request?

With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule PullupCorrelatedPredicates can produce unresolved plans.

For a correlated IN query that looks like:

SELECT t1.a FROM t1
WHERE
t1.a IN (SELECT t2.c
        FROM t2
        WHERE t1.b < t2.d);

The query plan might look like:

Project [a#0]
+- Filter a#0 IN (list#4 [b#1])
   :  +- Project [c#2]
   :     +- Filter (outer(b#1) < d#3)
   :        +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]

After PullupCorrelatedPredicates, it produces query plan like:

'Project [a#0]
+- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
   :  +- Project [c#2, d#3]
   :     +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]

Because the correlated predicate references another subquery attribute, d#3, the predicate is pulled out of the subquery and d#3 is added to the Project on top of the subquery.

When the list in In contains just one ListQuery, In.checkInputDataTypes checks whether the number of value expressions matches the number of output attributes of the subquery. In the above example, there is only one value expression, while the subquery output has two attributes (c#2, d#3), so the check fails and In.resolved returns false.
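The size check described above can be illustrated with a minimal sketch. It is not the Catalyst implementation: `Attr`, `ArityCheckSketch`, and `sizesMatch` are hypothetical stand-ins that only model the comparison between the number of value expressions and the number of subquery output attributes.

```scala
// Hypothetical stand-in for a Catalyst attribute such as a#0 (not Spark's API).
final case class Attr(name: String, id: Int)

object ArityCheckSketch {
  // Models the check described above: the number of value expressions on the
  // left side of IN must equal the number of columns the subquery outputs.
  def sizesMatch(values: Seq[Attr], subqueryOutput: Seq[Attr]): Boolean =
    values.length == subqueryOutput.length

  def main(args: Array[String]): Unit = {
    val a = Attr("a", 0)
    val c = Attr("c", 2)
    val d = Attr("d", 3)

    // Before PullupCorrelatedPredicates: the subquery projects only c#2.
    println(sizesMatch(Seq(a), Seq(c)))    // true
    // After the rule: d#3 has been added to the subquery's Project.
    println(sizesMatch(Seq(a), Seq(c, d))) // false
  }
}
```

In this model the second call returns false even though the query the user wrote is valid, which is the false "unresolved" report this PR is about.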

We should not let In.checkInputDataTypes wrongly report unresolved plans to fail the structural integrity check.

How was this patch tested?

Added test.

SparkQA · 2017-08-17T06:03:33Z

Test build #80765 has finished for PR 18968 at commit 4604a08.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-17T07:04:49Z

Test build #80767 has finished for PR 18968 at commit 4a47393.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-17T07:04:50Z

Test build #80769 has finished for PR 18968 at commit f5d8ebb.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-17T10:19:03Z

Test build #80778 has finished for PR 18968 at commit 38631bb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-08-17T12:01:02Z

cc @cloud-fan @hvanhovell

cloud-fan · 2017-08-17T16:43:56Z

We should not let PullupCorrelatedPredicates produce unresolved plans to fail the structural integrity check.

This reads as misleading: this PR does not actually change PullupCorrelatedPredicates; it fixes the type checking so that it does not mistakenly report unresolved plans.

cloud-fan · 2017-08-17T16:45:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+        //
+        //   Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
+        //   :  +- Project [key#206, value#207]
+        //   :     +- Filter (value#207 > val_9)


Why did you pick a different example from the one in the PR description?

The example in the PR description was constructed later. This example is one I encountered in subquery_in_having.q. I'll make them consistent.

cloud-fan · 2017-08-17T17:16:17Z

Can you also include the SQL statement? I find the query plan with the list query a little hard to read, as I'm not familiar with this part.

viirya · 2017-08-17T23:43:33Z

@cloud-fan Thanks for the comments. I've updated the PR description and added a SQL statement.

SparkQA · 2017-08-18T02:12:39Z

Test build #80812 has finished for PR 18968 at commit 476b4ab.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2017-08-19T00:46:40Z

@viirya Hi Simon, many thanks for finding this issue. Instead of adding the compensation code to the resolve logic for the in-subquery expression, could we consider moving the semantic check that compares the number of arguments on either side of the in-subquery expression to checkAnalysis? We do several other checks in checkAnalysis for subquery expressions. I just feel this may be a little cleaner.

For your reference, i quickly tried it here

viirya · 2017-08-19T02:47:16Z

@dilipbiswal Thanks for the comment.

This issue happens in the optimization phase. The query plan is resolved after analysis and of course passes checkAnalysis. PullupCorrelatedPredicates is the rule that adjusts the subquery plan and causes the type check in the In predicate to fail.

gatorsmile · 2017-08-20T04:39:35Z

Moving such checks from In.checkInputDataTypes to checkAnalysis looks cleaner to me. What we are doing in In.checkInputDataTypes does not belong to checkInputDataTypes.

If we want to capture the potential analysis errors introduced by the optimizer rule PullupCorrelatedPredicates, we should override resolved instead of doing all these checks in checkInputDataTypes.

viirya · 2017-08-20T05:54:32Z

I agree that the original check should be in checkAnalysis instead of checkInputDataTypes.

The additional check added by this change can be put in resolved. Sounds good to me.

viirya · 2017-08-20T06:17:07Z

On second thought, I agree that this kind of check should be put in resolved. However, I doubt whether we should also put it in checkAnalysis. By doing so, we would split the resolution check of In across two places: one in checkAnalysis and one in In.resolved.

dilipbiswal · 2017-08-20T06:30:22Z

@viirya Isn't checkAnalysis supposed to catch such semantic errors? In my thinking, this particular check, which makes sure the number of arguments on the left-hand side matches the right-hand side, exists to catch user errors in the input query. Once that is done, either in the analyzer or in a post-analysis phase such as checkAnalysis, we shouldn't be doing this particular check again. My reason is that if the optimizer causes a side effect that invalidates the original check, we shouldn't return the error we return today, as it wouldn't mean much to the user because that's not the query they typed in, correct?

viirya · 2017-08-20T06:38:21Z

@dilipbiswal But we still need to detect such violations and report an error if optimization rules cause a side effect that makes the plan unresolved.

I think we can't assume that nothing in the post-analysis phase can break the integrity of the analyzed plan.

dilipbiswal · 2017-08-20T06:50:49Z

@viirya I agree that we should report violations. The only question I had is whether we should tie this particular check to the expression being resolved or not. In the old version of the code, we used to do this particular check in the analyzer. However, after doing the check, we used to change the expression to PredicateSubquery after pulling up the correlated predicates. I am fine with whatever you decide. If we keep the check, we should make it read a little better than it does now :-)

I just wanted to give an example to help illustrate, and to learn from the response in the process. Today we do quite a few semantic checks for subqueries and many other operators in checkAnalysis. Say a subquery expression passes the semantic checks, and later the optimizer violates those checks by rewriting the plans in weird ways. We don't do those checks again, right? In other words, we don't tie those checks to whether the operator is resolved or not. I think the check in question is one such semantic check.

dilipbiswal · 2017-08-20T06:58:13Z

@gatorsmile @viirya There was a PR from Natt. Is it possible to get some feedback on the idea? If we do this, the next step would be to combine the pull-up and the rewrite into one single rule, so that this problem wouldn't occur :-). Actually, I made this change in my local branch on top of Natt's changes a while back.

viirya · 2017-08-20T07:01:55Z

That's right, if we combine the two rules, we won't produce the unresolved In predicate. I just don't know whether there is a near-term plan to combine them. It sounds like a non-trivial change.

dilipbiswal · 2017-08-20T07:12:41Z

@viirya ok.

viirya · 2017-08-22T14:28:52Z

@cloud-fan I have revised this change based on my understanding. Could you take a look and see whether it is what you had in mind? Thanks.

dilipbiswal · 2017-08-22T16:16:25Z

Thanks @viirya @cloud-fan. This looks much better. Can we preserve the user-facing error we raise today? I think the error we raise today is better for the user. Even if we had to keep another case for ListQuery, with simplified type checking, it would be worth it, no?

SparkQA · 2017-08-22T16:18:34Z

Test build #80988 has finished for PR 18968 at commit a7f6816.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-22T16:59:59Z

Test build #80989 has finished for PR 18968 at commit 498bd3b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-08-22T20:45:48Z

...re/src/test/resources/sql-tests/results/subquery/negative-cases/subq-input-typecheck.sql.out

-Right side columns:
-[t2.`t2a`, t2.`t2b`].
-             ;
+cannot resolve '(t1.`t1a` IN (listquery(t1.`t1a`)))' due to data type mismatch: Arguments must be same type but were: IntegerType != StructType(StructField(t2a,IntegerType,false), StructField(t2b,IntegerType,false));


This new message is confusing for users who are using an IN subquery.

viirya · 2017-08-22T23:29:40Z

@dilipbiswal @gatorsmile Regarding the error message, I agree.

viirya · 2017-08-22T23:30:47Z

Maybe we can still have a case for ListQuery, but it is simpler and mainly for better message?

cloud-fan · 2017-08-23T00:41:25Z

SGTM

SparkQA · 2017-08-23T10:18:23Z

Test build #81027 has finished for PR 18968 at commit 66a193d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-23T10:24:52Z

Test build #81028 has finished for PR 18968 at commit dae01f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-08-24T03:36:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala

+  } else {
+    childOutputs.head.dataType
+  }
+  override lazy val resolved: Boolean = childrenResolved && plan.resolved && childOutputs.nonEmpty


Before we fill in childOutputs, this ListQuery cannot be resolved. Otherwise, accessing its dataType causes a failure in In.checkInputDataTypes.
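A minimal sketch of why `resolved` has to depend on `childOutputs` being non-empty. `Column` and `ListQuerySketch` are hypothetical stand-ins rather than Catalyst classes, data types are modelled as plain strings, and the multi-column branch of `dataType` is an assumption, since the diff above only shows the `else` branch.

```scala
// Hypothetical stand-in for an output attribute; types are plain strings here.
final case class Column(name: String, dataType: String)

// Hypothetical model of a list subquery expression that remembers the
// subquery's original output columns in childOutputs.
final case class ListQuerySketch(planResolved: Boolean, childOutputs: Seq[Column]) {
  // Not resolved until childOutputs has been filled in.
  def resolved: Boolean = planResolved && childOutputs.nonEmpty

  // Assumed shape: several columns yield a struct-like description, a single
  // column yields its own type. Calling this on empty childOutputs throws,
  // because head is undefined for an empty Seq.
  def dataType: String =
    if (childOutputs.length > 1) {
      childOutputs.map(c => s"${c.name}:${c.dataType}").mkString("struct<", ",", ">")
    } else {
      childOutputs.head.dataType
    }
}

object ListQuerySketchDemo {
  def main(args: Array[String]): Unit = {
    val pending = ListQuerySketch(planResolved = true, childOutputs = Nil)
    val filled = ListQuerySketch(planResolved = true, Seq(Column("c", "int")))

    println(pending.resolved) // false
    println(filled.resolved)  // true
    // Only read dataType once the expression reports itself as resolved.
    if (filled.resolved) println(filled.dataType) // int
  }
}
```

Callers that check `resolved` before reading `dataType` never reach the `head` call on an empty sequence.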

dilipbiswal · 2017-08-24T06:07:43Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1286,8 +1286,16 @@ class Analyzer(
          resolveSubQuery(s, plans)(ScalarSubquery(_, _, exprId))
        case e @ Exists(sub, _, exprId) if !sub.resolved =>
          resolveSubQuery(e, plans)(Exists(_, _, exprId))
-        case In(value, Seq(l @ ListQuery(sub, _, exprId))) if value.resolved && !sub.resolved =>
-          val expr = resolveSubQuery(l, plans)(ListQuery(_, _, exprId))
+        case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !sub.resolved =>


@viirya If we modified to

case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !l.resolved

would we still require the following case statement? The following case looks a little strange, as we are in the resolveSubqueries routine and check for sub.resolved == true.

I thought about changing resolveSubQuery to avoid re-analysis of an already resolved plan. But since it happens just once, it may not be a big deal, so in the end I left it untouched.
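The difference between the two guards discussed in this thread can be sketched as follows. `ListQueryModel` and `GuardSketch` are hypothetical, the `value.resolved` part of the guard is omitted for brevity, and the model assumes that a ListQuery is resolved only when its plan is resolved and its childOutputs are filled in.

```scala
// Hypothetical stand-in for ListQuery; not Spark's actual class.
final case class ListQueryModel(planResolved: Boolean, childOutputs: Seq[String]) {
  def resolved: Boolean = planResolved && childOutputs.nonEmpty
}

object GuardSketch {
  // Guard as written in the diff: looks only at the subquery plan.
  def subGuard(l: ListQueryModel): Boolean = !l.planResolved

  // Guard as suggested in the review comment: looks at the whole ListQuery.
  def listGuard(l: ListQueryModel): Boolean = !l.resolved

  def main(args: Array[String]): Unit = {
    // Plan already resolved, but childOutputs not yet filled in.
    val pending = ListQueryModel(planResolved = true, childOutputs = Nil)
    println(subGuard(pending))  // false
    println(listGuard(pending)) // true
  }
}
```

Under these assumptions, the suggested guard also matches a ListQuery whose plan is resolved but whose childOutputs are still empty, so a separate case for that situation would not be needed.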

viirya · 2017-08-24T06:27:57Z

We could. resolveSubQuery performs at least one analysis pass to resolve the subquery, which may be wasted on an already resolved subquery. Let me see if I can remove the case with few changes.

dilipbiswal · 2017-08-24T07:46:22Z

Thanks Simon. Changes look good to me. cc @cloud-fan @gatorsmile for any additional comments.

SparkQA · 2017-08-24T10:08:49Z

Test build #81071 has finished for PR 18968 at commit 9364d6e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-08-24T13:47:30Z

LGTM, merging to master!

viirya · 2017-08-24T13:51:45Z

Thanks @cloud-fan @gatorsmile @dilipbiswal

…ild output

### What changes were proposed in this pull request?

Update `ListQuery` to only store the number of columns of the original plan, instead of directly storing the original plan output attributes.

### Why are the changes needed?

Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, `DeduplicateRelations` may change the plan output, and today we do not update `ListQuery.childOutputs` to keep sync. `ListQuery.childOutputs` was added by #18968. It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan.

### Does this PR introduce _any_ user-facing change?

No, there is no user-facing bug exposed.

### How was this patch tested?

a new plan test

Closes #40851 from cloud-fan/list_query.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


viirya added 2 commits August 17, 2017 04:16

PullupCorrelatedPredicates should not produce unresolved plans.

4604a08

Add test for In expression.

f5d8ebb

viirya force-pushed the SPARK-21759 branch from 4a47393 to f5d8ebb Compare August 17, 2017 05:33

viirya mentioned this pull request Aug 17, 2017

[SPARK-21726][SQL] Check for structural integrity of the plan in Optimzer in test mode. #18956

Closed

Add test result file and fix error message margin.

38631bb

cloud-fan reviewed Aug 17, 2017


Replace example query plan.

476b4ab

viirya changed the title ~~[SPARK-21759][SQL] PullupCorrelatedPredicates should not produce unresolved plans~~ [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery Aug 17, 2017

viirya force-pushed the SPARK-21759 branch 2 times, most recently from 8dbafb2 to 99d9570 Compare August 20, 2017 07:51

Try cloud-fan's proposal.

498bd3b

viirya force-pushed the SPARK-21759 branch from a7f6816 to 498bd3b Compare August 22, 2017 15:15

gatorsmile reviewed Aug 22, 2017


Better error message for ListQuery.

dae01f1

viirya force-pushed the SPARK-21759 branch from 66a193d to dae01f1 Compare August 23, 2017 07:52

viirya commented Aug 24, 2017


dilipbiswal reviewed Aug 24, 2017


Address comment.

9364d6e

asfgit closed this in 183d4cb Aug 24, 2017

gatorsmile mentioned this pull request Feb 13, 2018

[SPARK-23316][SQL] AnalysisException after max iteration reached for IN query #20548

Closed

cloud-fan mentioned this pull request Apr 19, 2023

[SPARK-43190][SQL] ListQuery.childOutput should be consistent with child output #40851

Closed

viirya deleted the SPARK-21759 branch December 27, 2023 18:34
