-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29145][SQL] Support sub-queries in join conditions #25854
Conversation
ok to test |
Test build #111015 has finished for PR 25854 at commit
|
No? It seems ths pr intends to accept a new statement in DataFrame/SQL? |
Also, can you add end-to-end tests in |
Ok, I will do this. End-to-end UT add to |
Test build #111029 has finished for PR 25854 at commit
|
Test build #111037 has finished for PR 25854 at commit
|
Test build #112100 has finished for PR 25854 at commit
|
Test build #112101 has finished for PR 25854 at commit
|
gentle ping @maropu @wangyum @cloud-fan |
@@ -602,7 +602,7 @@ trait CheckAnalysis extends PredicateHelper { | |||
|
|||
case inSubqueryOrExistsSubquery => | |||
plan match { | |||
case _: Filter | _: SupportsSubquery => // Ok | |||
case _: Filter | _: SupportsSubquery | _: Join => // Ok | |||
case _ => | |||
failAnalysis(s"IN/EXISTS predicate sub-queries can only be used in" + | |||
s" Filter and a few commands: $plan") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's update the message: Filter/Join and a few commands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's update the message:
Filter/Join and a few commands
Done
Seq(3, 4, 6, 9).toDF("id").createOrReplaceTempView("s3") | ||
|
||
checkAnswer( | ||
sql("SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select 9)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put correlated subquery in join condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we put correlated subquery in join condition?
Subquery is in join condition, LogicalPlan as below:
== Parsed Logical Plan ==
'Project ['s1.id]
+- 'Join Inner, (('s1.id = 's2.id) AND 's1.id IN (list#258 []))
: +- 'Project [unresolvedalias(9, None)]
: +- OneRowRelation
:- 'UnresolvedRelation [s1]
+- 'UnresolvedRelation [s2]
== Analyzed Logical Plan ==
id: int
Project [id#244]
+- Join Inner, ((id#244 = id#250) AND id#244 IN (list#258 []))
: +- Project [9 AS 9#259]
: +- OneRowRelation
:- SubqueryAlias `s1`
: +- Project [value#241 AS id#244]
: +- LocalRelation [value#241]
+- SubqueryAlias `s2`
+- Project [value#247 AS id#250]
+- LocalRelation [value#247]
== Optimized Logical Plan ==
Project [id#244]
+- Join Inner, (id#244 = id#250)
:- Project [value#241 AS id#244]
: +- Join LeftSemi, (value#241 = 9#259)
: :- LocalRelation [value#241]
: +- Project [9 AS 9#259]
: +- OneRowRelation
+- Project [value#247 AS id#250]
+- Join LeftSemi, (value#247 = 9#259)
:- LocalRelation [value#247]
+- Project [9 AS 9#259]
+- OneRowRelation
== Physical Plan ==
*(4) Project [id#244]
+- *(4) BroadcastHashJoin [id#244], [id#250], Inner, BuildRight
:- *(4) Project [value#241 AS id#244]
: +- *(4) BroadcastHashJoin [value#241], [9#259], LeftSemi, BuildRight
: :- *(4) LocalTableScan [value#241]
: +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#145]
: +- *(1) Project [9 AS 9#259]
: +- *(1) Scan OneRowRelation[]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#173]
+- *(3) Project [value#247 AS id#250]
+- *(3) BroadcastHashJoin [value#247], [9#259], LeftSemi, BuildRight
:- *(3) LocalTableScan [value#247]
+- ReusedExchange [9#259], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#145]
Row(1) :: Row(3) :: Nil) | ||
|
||
checkAnswer( | ||
sql("SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select id from s3)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example, do we support
SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select id from s3 where s3.id = s2.id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for example, do we support
SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select id from s3 where s3.id = s2.id)
Cann't since strategy's idempotence is broken. Seem write sql like this is not reasonable...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc @dilipbiswal
I checked with pgsql and it's supported. We need to update RewriteCorrelatedScalarSubquery
to support it in Spark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc @dilipbiswal
I checked with pgsql and it's supported. We need to update
RewriteCorrelatedScalarSubquery
to support it in Spark.
We should support it, checking on this issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to address the support in this pr? I think its ok to do in another jira. kindly ping @dilipbiswal
Test build #112140 has finished for PR 25854 at commit
|
Check whole process, you show is optimized plan, in analyzed plan, join condition is still in main join, after optimize, it was pushed down. |
@dilipbiswal |
Test build #112521 has finished for PR 25854 at commit
|
Test build #112534 has finished for PR 25854 at commit
|
@AngersZhuuuu Great.. Thanks a lot for adding the UTs. Looks good to me. |
Row(3) :: Row(9) :: Nil) | ||
|
||
checkAnswer( | ||
sql("SELECT s1.id as id2 from s1 LEFT SEMI JOIN s2 " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can you follow the format of the other tests? In multi-line cases, the format seems to be like this;
sql("""
|
| ...
...
| ) """.stripMargin)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
Can you update the title? |
Test build #112585 has finished for PR 25854 at commit
|
retest this please |
Test build #112593 has finished for PR 25854 at commit
|
Thanks! Merged to master. |
@@ -204,6 +204,154 @@ class SubquerySuite extends QueryTest with SharedSparkSession { | |||
} | |||
} | |||
|
|||
test("SPARK-29145: JOIN Condition use QueryList") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move it to SQLQueryTestSuite?
It sounds like it does not contain any test case that check the EXISTS subquery? Could you also add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move it to SQLQueryTestSuite?
It sounds like it does not contain any test case that check the EXISTS subquery? Could you also add it?
Ok, will raise a pr follow your comment.
…query/in-subquery/in-joins.sql` ### What changes were proposed in this pull request? Follow comment of #25854 (comment) ### Why are the changes needed? NO ### Does this PR introduce any user-facing change? NO ### How was this patch tested? ADD TEST CASE Closes #26406 from AngersZhuuuu/SPARK-29145-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@@ -1697,6 +1697,8 @@ class Analyzer( | |||
// Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries. | |||
case q: UnaryNode if q.childrenResolved => | |||
resolveSubQueries(q, q.children) | |||
case j: Join if j.childrenResolved => | |||
resolveSubQueries(j, Seq(j, j.left, j.right)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't recall the details, but why it's not Seq(j.left, j.right)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't recall the details, but why it's not
Seq(j.left, j.right)
?
Should be a mistake, raise a pr and remove this?
…in join conditions ### What changes were proposed in this pull request? According to discuss #25854 (comment) ### Why are the changes needed? Clean code ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #32499 from AngersZhuuuu/SPARK-29145-fix. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Support SparkSQL use iN/EXISTS with subquery in JOIN condition.
Why are the changes needed?
Support SQL use iN/EXISTS with subquery in JOIN condition.
Does this PR introduce any user-facing change?
This PR is for enable user use subquery in
JOIN
's ON condition. such as we have create three tablewe can do query like :
How was this patch tested?
ADDED UT