
[SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery #18968

Closed
wants to merge 11 commits

Conversation

viirya
Member

@viirya viirya commented Aug 17, 2017

What changes were proposed in this pull request?

With the check for structural integrity proposed in SPARK-21726, it is found that the optimization rule PullupCorrelatedPredicates can produce unresolved plans.

For a correlated IN query that looks like:

SELECT t1.a FROM t1
WHERE
t1.a IN (SELECT t2.c
        FROM t2
        WHERE t1.b < t2.d);

The query plan might look like:

Project [a#0]
+- Filter a#0 IN (list#4 [b#1])
   :  +- Project [c#2]
   :     +- Filter (outer(b#1) < d#3)
   :        +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]

After PullupCorrelatedPredicates, it produces a query plan like:

'Project [a#0]
+- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
   :  +- Project [c#2, d#3]
   :     +- LocalRelation <empty>, [c#2, d#3]
   +- LocalRelation <empty>, [a#0, b#1]

Because the correlated predicate involves another attribute, d#3, in the subquery, that attribute has been pulled out and added to the Project on top of the subquery.

When the list in In contains just one ListQuery, In.checkInputDataTypes checks whether the number of value expressions matches the output size of the subquery. In the above example, there is only one value expression, but the subquery output has two attributes, c#2 and d#3, so the check fails and In.resolved returns false.

We should not let In.checkInputDataTypes wrongly report the plan as unresolved and fail the structural integrity check.
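The mismatch described above can be modeled with a short, self-contained sketch. This is plain Scala with hypothetical stand-in types, not the real Catalyst classes: Attribute, ListQuery, and the two check functions are illustrative names only.

```scala
// Stand-in for a resolved attribute reference such as a#0 or c#2.
case class Attribute(name: String)

// The subquery's current plan output vs. the output it had at analysis
// time, before correlated predicates were pulled up.
case class ListQuery(planOutput: Seq[Attribute], childOutputs: Seq[Attribute])

// Old behavior: compare the IN values against the (possibly widened)
// current plan output.
def checkAgainstPlanOutput(values: Seq[Attribute], l: ListQuery): Boolean =
  values.length == l.planOutput.length

// Fixed behavior: compare against the output recorded at analysis time.
def checkAgainstChildOutputs(values: Seq[Attribute], l: ListQuery): Boolean =
  values.length == l.childOutputs.length

val c = Attribute("c#2")
val d = Attribute("d#3")
// After pullup, the subquery output widened from [c#2] to [c#2, d#3],
// but the IN expression still has a single value expression (a#0).
val afterPullup = ListQuery(planOutput = Seq(c, d), childOutputs = Seq(c))

assert(!checkAgainstPlanOutput(Seq(Attribute("a#0")), afterPullup))  // wrongly fails
assert(checkAgainstChildOutputs(Seq(Attribute("a#0")), afterPullup)) // still passes
```

The sketch only illustrates the arity check; the actual fix also has to record childOutputs during analysis so it survives later plan rewrites.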

How was this patch tested?

Added test.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80765 has finished for PR 18968 at commit 4604a08.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80767 has finished for PR 18968 at commit 4a47393.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80769 has finished for PR 18968 at commit f5d8ebb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2017

Test build #80778 has finished for PR 18968 at commit 38631bb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 17, 2017

cc @cloud-fan @hvanhovell

@cloud-fan
Contributor

We should not let PullupCorrelatedPredicates produce unresolved plans to fail the structural integrity check.

This reads as misleading; this PR does not actually change PullupCorrelatedPredicates, but fixes the type checking so it does not mistakenly report unresolved plans.

//
// Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
// : +- Project [key#206, value#207]
// : +- Filter (value#207 > val_9)
Contributor

Why did you pick a different example in the PR description?

Member Author

The example in the PR description was constructed later. This one is the example I encountered in subquery_in_having.q. I'll make them consistent.

@cloud-fan
Contributor

Can you also include the SQL statement? I find it a little hard to read the query plan with the list query, as I'm not familiar with this part.

@viirya viirya changed the title [SPARK-21759][SQL] PullupCorrelatedPredicates should not produce unresolved plans [SPARK-21759][SQL] In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery Aug 17, 2017
@viirya
Member Author

viirya commented Aug 17, 2017

@cloud-fan Thanks for the comments. I've updated the PR description and added a SQL statement.

@SparkQA

SparkQA commented Aug 18, 2017

Test build #80812 has finished for PR 18968 at commit 476b4ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

dilipbiswal commented Aug 19, 2017

@viirya Hi Simon, many thanks for finding this issue. Instead of adding the compensation code in the resolve logic for the in-subquery expression, could we consider moving the semantic check that compares the number of arguments on either side of the in-subquery expression into checkAnalysis instead? We do several other checks for subquery expressions in checkAnalysis. I just feel this may be a little cleaner.

For your reference, I quickly tried it here.

@viirya
Member Author

viirya commented Aug 19, 2017

@dilipbiswal Thanks for the comment.

This issue happens at the optimization phase; the query plan is resolved after analysis and of course passes checkAnalysis. PullupCorrelatedPredicates is the rule that adjusts the subquery plan and causes the type checking of the In predicate to fail.

@gatorsmile
Member

Moving such checks from In.checkInputDataTypes to checkAnalysis looks cleaner to me. What we are doing in In.checkInputDataTypes does not belong to checkInputDataTypes.

If we want to capture the potential analysis errors introduced by the optimizer rule PullupCorrelatedPredicates , we should override resolved instead of doing all these checks in checkInputDataTypes.

@viirya
Member Author

viirya commented Aug 20, 2017

I agree that the original check should be in checkAnalysis instead of checkInputDataTypes.

The additional check added by this change can be put in resolved. Sounds good to me.

@viirya
Member Author

viirya commented Aug 20, 2017

Rethinking it, I agree that this kind of check should be put in resolved. However, I doubt whether we should put it in checkAnalysis. Doing so would split the resolution check of In into two places: one in checkAnalysis and one in In.resolved.

@dilipbiswal
Contributor

dilipbiswal commented Aug 20, 2017

@viirya Isn't checkAnalysis supposed to catch such semantic errors? In my thinking, this particular check, making sure the number of arguments on the left-hand side matches the right-hand side, exists to catch user errors in the input query. After that is done, either during analysis or in a post-analysis phase such as checkAnalysis, we shouldn't be doing this particular check again. My reasoning is that if the optimizer causes a side effect that invalidates the original check, we shouldn't return the error we return today, as it wouldn't mean much to the user: that's not the query they typed in, correct?

@viirya
Member Author

viirya commented Aug 20, 2017

@dilipbiswal But we still need to detect such a violation and report an error if an optimization rule has a side effect that makes the plan unresolved.

I don't think we can assume that nothing in the post-analysis phases will break the integrity of the analyzed plan.

@dilipbiswal
Contributor

dilipbiswal commented Aug 20, 2017

@viirya I agree that we should report violations. The only question I had was whether we should tie this particular check to whether the expression is resolved. In the old version of the code, we used to do this check in the analyzer; after doing it, we changed the expression to PredicateSubquery after pulling up the correlated predicates. I'm fine with whatever you decide. If we keep the check, we should make it read a little better than it does now :-)

I just wanted to give an example to help illustrate, and in the process learn from the responses. Today we do quite a few semantic checks for subqueries and many other operators in checkAnalysis. Say a subquery expression passed the semantic checks, and later on in the optimizer we violate those checks by rewriting the plans in weird ways; we don't run those checks again, right? In other words, we don't tie those checks to whether the operator is resolved. I think the check in question is one such semantic check.

@dilipbiswal
Contributor

dilipbiswal commented Aug 20, 2017

@gatorsmile @viirya There was a PR from Natt. Is it possible to get some feedback on that idea? If we do this, the next step would be to combine the pull-up and rewrite into one single rule, so this problem wouldn't occur :-). Actually, I had this change made in my local branch on top of Natt's changes a while back.

@viirya
Member Author

viirya commented Aug 20, 2017

That's right; if we combine the two rules, we won't produce the unresolved In predicate. I just don't know if we have a plan to combine them in the near term. It sounds like a non-trivial change.

@dilipbiswal
Contributor

@viirya ok.

@viirya viirya force-pushed the SPARK-21759 branch 2 times, most recently from 8dbafb2 to 99d9570 Compare August 20, 2017 07:51
@viirya
Member Author

viirya commented Aug 22, 2017

@cloud-fan Based on my understanding, I've revised this change. Could you take a look and see if it matches what you had in mind? Thanks.

@dilipbiswal
Contributor

dilipbiswal commented Aug 22, 2017

Thanks @viirya @cloud-fan. This looks much better. Could we preserve the user-facing error we raise today? I think the current error is better for the user. Even if we were to keep another case for ListQuery with simplified type checking, it would be worth it, no?

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80988 has finished for PR 18968 at commit a7f6816.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80989 has finished for PR 18968 at commit 498bd3b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Right side columns:
[t2.`t2a`, t2.`t2b`].
;
cannot resolve '(t1.`t1a` IN (listquery(t1.`t1a`)))' due to data type mismatch: Arguments must be same type but were: IntegerType != StructType(StructField(t2a,IntegerType,false), StructField(t2b,IntegerType,false));
Member

This new message is confusing to users who use the IN subquery.

@viirya
Member Author

viirya commented Aug 22, 2017

@dilipbiswal @gatorsmile Regarding the error message, I agree.

@viirya
Member Author

viirya commented Aug 22, 2017

Maybe we can still have a case for ListQuery, but make it simpler and mainly for a better error message?

@cloud-fan
Contributor

SGTM

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81027 has finished for PR 18968 at commit 66a193d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81028 has finished for PR 18968 at commit dae01f1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
childOutputs.head.dataType
}
override lazy val resolved: Boolean = childrenResolved && plan.resolved && childOutputs.nonEmpty
Member Author

Before we fill in childOutputs, this ListQuery cannot be resolved. Otherwise, accessing its dataType causes a failure in In.checkInputDataTypes.
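The resolution gate described here can be sketched in isolation. The following is a simplified model with invented names (ListQuerySketch, Attr, planResolved are not the real Catalyst hierarchy) of keeping a ListQuery-like expression unresolved until childOutputs has been filled in:

```scala
// Stand-in for an output attribute.
case class Attr(name: String)

case class ListQuerySketch(planResolved: Boolean, childOutputs: Seq[Attr]) {
  // Unresolved until the original subquery output has been recorded,
  // so nothing downstream ever reads dataType too early.
  lazy val resolved: Boolean = planResolved && childOutputs.nonEmpty

  // Reading dataType before resolution is the failure mode this guards.
  def dataType: String = {
    require(resolved, "dataType called on an unresolved ListQuery")
    childOutputs.head.name
  }
}

val unresolvedLq = ListQuerySketch(planResolved = true, childOutputs = Nil)
assert(!unresolvedLq.resolved)

val resolvedLq = ListQuerySketch(planResolved = true, childOutputs = Seq(Attr("c#2")))
assert(resolvedLq.resolved)
assert(resolvedLq.dataType == "c#2")
```

The point of the design is that callers check resolved before touching dataType, so the require should never actually fire in a well-behaved pipeline.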

@@ -1286,8 +1286,16 @@ class Analyzer(
resolveSubQuery(s, plans)(ScalarSubquery(_, _, exprId))
case e @ Exists(sub, _, exprId) if !sub.resolved =>
resolveSubQuery(e, plans)(Exists(_, _, exprId))
case In(value, Seq(l @ ListQuery(sub, _, exprId))) if value.resolved && !sub.resolved =>
val expr = resolveSubQuery(l, plans)(ListQuery(_, _, exprId))
case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !sub.resolved =>
Contributor

@viirya If we modified it to

case In(value, Seq(l @ ListQuery(sub, _, exprId, _))) if value.resolved && !l.resolved

would we still require the following case statement? The following case looks a little strange, as we are in the resolveSubqueries routine and checking for sub.resolved == true.

Member Author

@viirya viirya Aug 24, 2017

I thought about changing resolveSubQuery to avoid re-analyzing a resolved plan. But since it happens just once, it is maybe not a big deal, so in the end I left it untouched.

@viirya
Member Author

viirya commented Aug 24, 2017 via email

@dilipbiswal
Contributor

Thanks Simon. Changes look good to me. cc @cloud-fan @gatorsmile for any additional comments.

@SparkQA

SparkQA commented Aug 24, 2017

Test build #81071 has finished for PR 18968 at commit 9364d6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, merging to master!

@asfgit asfgit closed this in 183d4cb Aug 24, 2017
@viirya
Member Author

viirya commented Aug 24, 2017

Thanks @cloud-fan @gatorsmile @dilipbiswal

cloud-fan added a commit that referenced this pull request Apr 20, 2023
…ild output

### What changes were proposed in this pull request?

Update `ListQuery` to only store the number of columns of the original plan, instead of directly storing the original plan output attributes.

### Why are the changes needed?

Storing the plan output attributes is troublesome as we have to maintain them and keep them in sync with the plan. For example, `DeduplicateRelations` may change the plan output, and today we do not update `ListQuery.childOutputs` to keep sync.

`ListQuery.childOutputs` was added by #18968 . It's only used to track the original plan output attributes as subquery de-correlation may add more columns. We can do the same thing by storing the number of columns of the plan.

### Does this PR introduce _any_ user-facing change?

No, there is no user-facing bug exposed.

### How was this patch tested?

a new plan test

Closes #40851 from cloud-fan/list_query.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
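The core idea of this refactoring can be sketched with hypothetical stand-in types (ListQueryByCount and checkValueCount are illustrative names, not the real code): a plain column count recorded at analysis time cannot drift out of sync the way a stored attribute list can.

```scala
// Stand-in: the subquery's current output, plus the number of columns
// it had originally, before de-correlation widened it.
case class ListQueryByCount(planOutput: Seq[String], numCols: Int)

// The arity check compares against the recorded count, not the current
// (possibly widened) plan output.
def checkValueCount(values: Seq[String], l: ListQueryByCount): Boolean =
  values.length == l.numCols

// De-correlation widened the plan output from one column to two, but
// the recorded count still reflects the original single-column subquery.
val widened = ListQueryByCount(planOutput = Seq("c#2", "d#3"), numCols = 1)
assert(checkValueCount(Seq("a#0"), widened))
```

Unlike a stored attribute list, the count needs no updating when rules such as DeduplicateRelations replace the plan's output attributes.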
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Apr 21, 2023
@viirya viirya deleted the SPARK-21759 branch December 27, 2023 18:34