[SPARK-29145][SQL] Support sub-queries in join conditions #25854

AngersZhuuuu · 2019-09-19T16:04:15Z

What changes were proposed in this pull request?

Support IN/EXISTS predicates with a subquery in a JOIN condition in Spark SQL.

Why are the changes needed?

Allow SQL queries to use IN/EXISTS with a subquery in a JOIN condition.

Does this PR introduce any user-facing change?

This PR enables users to use a subquery in a JOIN's ON condition. For example, suppose we have created three tables:

CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);

We can then run a query like:

SELECT A.id FROM A JOIN B ON A.id = B.id AND A.id IN (SELECT C.id FROM C)
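
As a rough illustration of what a query of this shape returns, here is a minimal sketch that runs the same statement against an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for Spark SQL here, and the sample rows are made up for the example; nothing below comes from the patch itself.

```python
import sqlite3


def run_example():
    """Run the PR's example query on SQLite (a stand-in, not Spark)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data; the PR description does not give any rows.
    conn.executescript("""
        CREATE TABLE A(id TEXT);
        CREATE TABLE B(id TEXT);
        CREATE TABLE C(id TEXT);
        INSERT INTO A VALUES ('1'), ('2'), ('3');
        INSERT INTO B VALUES ('2'), ('3'), ('4');
        INSERT INTO C VALUES ('3'), ('5');
    """)
    rows = conn.execute(
        "SELECT A.id FROM A JOIN B ON A.id = B.id "
        "AND A.id IN (SELECT C.id FROM C)"
    ).fetchall()
    conn.close()
    return [r[0] for r in rows]


print(run_example())  # ['3']: the only id present in A, B and C
```

Only ids that satisfy both the equality and the IN predicate survive the inner join.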

How was this patch tested?

Added UT.

dongjoon-hyun · 2019-09-19T17:21:42Z

ok to test

SparkQA · 2019-09-19T17:29:02Z

Test build #111015 has finished for PR 25854 at commit 5aa2ed6.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-09-20T00:53:03Z

Does this PR introduce any user-facing change?

No? It seems this PR intends to accept a new statement in DataFrame/SQL?

maropu · 2019-09-20T00:53:38Z

Also, can you add end-to-end tests in SQLQueryTestSuite or somewhere?

AngersZhuuuu · 2019-09-20T01:30:23Z

end-to-end tests in SQLQ

OK, I will do this. The end-to-end UT has been added to SubquerySuite.

SparkQA · 2019-09-20T04:19:36Z

Test build #111029 has finished for PR 25854 at commit fa55b3a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-20T06:45:11Z

Test build #111037 has finished for PR 25854 at commit bd7c098.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-15T10:31:19Z

Test build #112100 has finished for PR 25854 at commit 3108da2.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-15T14:16:49Z

Test build #112101 has finished for PR 25854 at commit 6b58893.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2019-10-16T01:35:41Z

gentle ping @maropu @wangyum @cloud-fan

cloud-fan · 2019-10-16T04:59:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

@@ -602,7 +602,7 @@ trait CheckAnalysis extends PredicateHelper {

      case inSubqueryOrExistsSubquery =>
        plan match {
-          case _: Filter | _: SupportsSubquery => // Ok
+          case _: Filter | _: SupportsSubquery | _: Join => // Ok
          case _ =>
            failAnalysis(s"IN/EXISTS predicate sub-queries can only be used in" +
                s" Filter and a few commands: $plan")


let's update the message: Filter/Join and a few commands

Done

cloud-fan · 2019-10-16T05:00:03Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+      Seq(3, 4, 6, 9).toDF("id").createOrReplaceTempView("s3")
+
+      checkAnswer(
+        sql("SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select 9)"),


can we put correlated subquery in join condition?

Subquery is in join condition, LogicalPlan as below:

== Parsed Logical Plan ==
'Project ['s1.id]
+- 'Join Inner, (('s1.id = 's2.id) AND 's1.id IN (list#258 []))
   :  +- 'Project [unresolvedalias(9, None)]
   :     +- OneRowRelation
   :- 'UnresolvedRelation [s1]
   +- 'UnresolvedRelation [s2]

== Analyzed Logical Plan ==
id: int
Project [id#244]
+- Join Inner, ((id#244 = id#250) AND id#244 IN (list#258 []))
   :  +- Project [9 AS 9#259]
   :     +- OneRowRelation
   :- SubqueryAlias `s1`
   :  +- Project [value#241 AS id#244]
   :     +- LocalRelation [value#241]
   +- SubqueryAlias `s2`
      +- Project [value#247 AS id#250]
         +- LocalRelation [value#247]

== Optimized Logical Plan ==
Project [id#244]
+- Join Inner, (id#244 = id#250)
   :- Project [value#241 AS id#244]
   :  +- Join LeftSemi, (value#241 = 9#259)
   :     :- LocalRelation [value#241]
   :     +- Project [9 AS 9#259]
   :        +- OneRowRelation
   +- Project [value#247 AS id#250]
      +- Join LeftSemi, (value#247 = 9#259)
         :- LocalRelation [value#247]
         +- Project [9 AS 9#259]
            +- OneRowRelation

== Physical Plan ==
*(4) Project [id#244]
+- *(4) BroadcastHashJoin [id#244], [id#250], Inner, BuildRight
   :- *(4) Project [value#241 AS id#244]
   :  +- *(4) BroadcastHashJoin [value#241], [9#259], LeftSemi, BuildRight
   :     :- *(4) LocalTableScan [value#241]
   :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#145]
   :        +- *(1) Project [9 AS 9#259]
   :           +- *(1) Scan OneRowRelation[]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#173]
      +- *(3) Project [value#247 AS id#250]
         +- *(3) BroadcastHashJoin [value#247], [9#259], LeftSemi, BuildRight
            :- *(3) LocalTableScan [value#247]
            +- ReusedExchange [9#259], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#145]
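
For readers who want to check the shape of the optimized plan by hand, here is a small sketch in plain Python (no Spark involved). It compares evaluating the IN predicate inside the inner join condition with first applying a left semi join to each side and then joining on equality, which is the shape the optimized plan shows. The contents of `s1` and `s2` are hypothetical, since the thread does not list them, and the sketch only covers non-null integer ids.

```python
def inner_join_with_in(s1, s2, sub):
    """Evaluate: s1 JOIN s2 ON s1.id = s2.id AND s1.id IN (sub)."""
    return sorted(a for a in s1 for b in s2 if a == b and a in sub)


def left_semi(rows, sub):
    """Left semi join: keep each row that has at least one match in sub."""
    return [r for r in rows if r in sub]


def semi_then_inner(s1, s2, sub):
    """Semi join each side against sub first, then inner join on equality."""
    left = left_semi(s1, sub)
    right = left_semi(s2, sub)
    return sorted(a for a in left for b in right if a == b)


# Hypothetical table contents; only the subquery `select 9` is from the thread.
s1 = [1, 3, 5, 7, 9]
s2 = [1, 3, 4, 6, 9]
sub = [9]

print(inner_join_with_in(s1, s2, sub))  # [9]
print(semi_then_inner(s1, s2, sub))     # [9]
```

Filtering both sides is safe in this sketch because the equality `s1.id = s2.id` means any id rejected by the IN predicate on one side cannot contribute a joined row from the other side either.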

cloud-fan · 2019-10-16T06:19:18Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+        Row(1) :: Row(3) :: Nil)
+
+      checkAnswer(
+        sql("SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select id from s3)"),


for example, do we support
SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id IN (select id from s3 where s3.id = s2.id)

We can't, since the strategy's idempotence is broken. Writing SQL like this does not seem reasonable...

also cc @dilipbiswal

I checked with pgsql and it's supported. We need to update RewriteCorrelatedScalarSubquery to support it in Spark.

We should support it, checking on this issue.

Do we need to address this support in this PR? I think it's OK to do it in another JIRA. Kindly ping @dilipbiswal

SparkQA · 2019-10-16T07:05:02Z

Test build #112140 has finished for PR 25854 at commit 25f31dc.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

AngersZhuuuu · 2019-10-23T09:56:18Z

@AngersZhuuuu I just quickly checked the plan for following query :

query
SELECT s1.id from s1 JOIN s2 ON s1.id = s2.id and s1.id NOT IN (select id from s3)
plan
Project [id#244]
+- Join Inner, (id#244 = id#250)
  :- Project [value#241 AS id#244]
  :  +- Join LeftAnti, ((value#241 = id#256) OR isnull((value#241 = id#256)))
  :     :- LocalRelation [value#241]
  :     +- Project [value#253 AS id#256]
  :        +- LocalRelation [value#253]
  +- Project [value#247 AS id#250]
     +- Join LeftAnti, ((value#247 = id#256) OR isnull((value#247 = id#256)))
        :- LocalRelation [value#247]
        +- Project [value#253 AS id#256]
           +- LocalRelation [value#253]
That's the reason I asked to test the outer joins. Let's make sure that for outer joins we preserve the full join condition in the main join. Please add a few tests to confirm this.

Check the whole process: what you show is the optimized plan. In the analyzed plan, the join condition is still in the main join; it is pushed down only after optimization.
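
To make the outer-join concern concrete, here is a minimal sketch using SQLite via Python's `sqlite3` as a stand-in (this is not Spark, and the rows are invented for the example). It contrasts a LEFT OUTER JOIN that keeps the NOT IN predicate in its ON clause with a version where the predicate has been pushed below the join as a filter on the left side.

```python
import sqlite3


def outer_join_demo():
    """Return (rows with predicate in ON clause, rows with predicate pushed down)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data for illustration only.
    conn.executescript("""
        CREATE TABLE s1(id INTEGER);
        CREATE TABLE s2(id INTEGER);
        CREATE TABLE s3(id INTEGER);
        INSERT INTO s1 VALUES (1), (3), (5);
        INSERT INTO s2 VALUES (1), (3);
        INSERT INTO s3 VALUES (3);
    """)
    # Predicate kept in the main join's ON clause: every s1 row survives,
    # and a failed condition only turns the s2 side into NULL.
    in_join = conn.execute("""
        SELECT s1.id, s2.id FROM s1
        LEFT OUTER JOIN s2
          ON s1.id = s2.id AND s1.id NOT IN (SELECT id FROM s3)
        ORDER BY s1.id
    """).fetchall()
    # Predicate pushed below the join as a filter on s1: the row with id 3
    # is removed before the join ever sees it.
    pushed = conn.execute("""
        SELECT f.id, s2.id FROM
          (SELECT id FROM s1 WHERE id NOT IN (SELECT id FROM s3)) AS f
        LEFT OUTER JOIN s2 ON f.id = s2.id
        ORDER BY f.id
    """).fetchall()
    conn.close()
    return in_join, pushed


in_join, pushed = outer_join_demo()
print(in_join)  # [(1, 1), (3, None), (5, None)]
print(pushed)   # [(1, 1), (5, None)]
```

The two results differ, which is why a pushdown that is harmless for an inner join has to be avoided on the preserved side of an outer join.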

AngersZhuuuu · 2019-10-23T10:25:21Z

@dilipbiswal
I added more UTs and the results are OK. Does this cover all the cases you wanted?

SparkQA · 2019-10-23T11:11:34Z

Test build #112521 has finished for PR 25854 at commit 2ead378.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-23T14:17:41Z

Test build #112534 has finished for PR 25854 at commit 3db4aaf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dilipbiswal · 2019-10-23T17:52:54Z

@AngersZhuuuu Great.. Thanks a lot for adding the UTs. Looks good to me.

maropu · 2019-10-24T05:01:03Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

+        Row(3) :: Row(9) :: Nil)
+
+      checkAnswer(
+        sql("SELECT s1.id as id2 from s1 LEFT SEMI JOIN s2 " +


nit: can you follow the format of the other tests? In multi-line cases, the format seems to be like this:

sql(""" | | ... ... | ) """.stripMargin)

maropu · 2019-10-24T05:02:29Z

Can you update the title? Support sub-queries in join conditions?

SparkQA · 2019-10-24T07:05:02Z

Test build #112585 has finished for PR 25854 at commit 307802a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-10-24T07:06:24Z

retest this please

SparkQA · 2019-10-24T11:07:51Z

Test build #112593 has finished for PR 25854 at commit 307802a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2019-10-24T12:55:29Z

Thanks! Merged to master.

gatorsmile · 2019-11-05T05:02:19Z

sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala

@@ -204,6 +204,154 @@ class SubquerySuite extends QueryTest with SharedSparkSession {
    }
  }

+  test("SPARK-29145: JOIN Condition use QueryList") {


Can we move it to SQLQueryTestSuite?

It sounds like it does not contain any test case that checks the EXISTS subquery? Could you also add it?

OK, I will raise a PR following your comment.

…query/in-subquery/in-joins.sql`

### What changes were proposed in this pull request?
Follow comment of #25854 (comment)

### Why are the changes needed?
NO

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
ADD TEST CASE

Closes #26406 from AngersZhuuuu/SPARK-29145-FOLLOWUP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

cloud-fan · 2021-05-10T15:19:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

@@ -1697,6 +1697,8 @@ class Analyzer(
      // Only a few unary nodes (Project/Filter/Aggregate) can contain subqueries.
      case q: UnaryNode if q.childrenResolved =>
        resolveSubQueries(q, q.children)
+      case j: Join if j.childrenResolved =>
+        resolveSubQueries(j, Seq(j, j.left, j.right))


Can't recall the details, but why is it not Seq(j.left, j.right)?

It should be a mistake. Shall I raise a PR to remove this?

…in join conditions

### What changes were proposed in this pull request?
According to discuss #25854 (comment)

### Why are the changes needed?
Clean code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #32499 from AngersZhuuuu/SPARK-29145-fix.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

AngersZhuuuu added 9 commits September 10, 2019 17:44

save change

c3de557

Revert "save change"

2cf3153

This reverts commit c3de557.

Merge remote-tracking branch 'upstream/master'

e5cd06c

Merge remote-tracking branch 'upstream/master'

569ab8a

Merge remote-tracking branch 'upstream/master'

91d1031

TEST-SPARK-29015

6dc61e7

Revert "TEST-SPARK-29015"

f087b10

This reverts commit 6dc61e7.

Merge remote-tracking branch 'upstream/master'

5ef8dad

SUPPORT IN/EXISTS in join condition

5aa2ed6

fix scala style

fa55b3a

add end-to-end test

bd7c098

dongjoon-hyun added the SQL label Sep 20, 2019

AngersZhuuuu added 2 commits October 15, 2019 18:26

Merge branch 'master' into SPARK-29145

3108da2

remove DeleteFromTable

dd37df8

fix scala style

6b58893

cloud-fan reviewed Oct 16, 2019

fllow comment

25f31dc

cloud-fan reviewed Oct 16, 2019

Add more UT case

3db4aaf

maropu reviewed Oct 24, 2019

AngersZhuuuu changed the title ~~[SPARK-29145][SQL] Spark SQL cannot handle "NOT IN" condition when using "JOIN"~~ [SPARK-29145][SQL] Support sub-queries in join conditions Oct 24, 2019

AngersZhuuuu added 2 commits October 24, 2019 13:26

make test case sql clear

4ba7a17

format code

307802a

cloud-fan approved these changes Oct 24, 2019

maropu approved these changes Oct 24, 2019

maropu closed this in 67cf043 Oct 24, 2019

gatorsmile reviewed Nov 5, 2019

This was referenced Nov 6, 2019

[SPARK-29145][SQL][FOLLOW-UP] Move tests from SubquerySuite to subquery/in-subquery/in-joins.sql #26406

Closed

[SPARK-29800][SQL] Rewrite non-correlated EXISTS subquery use ScalaSubquery to optimize perf #26437

Closed

cloud-fan reviewed May 10, 2021

AngersZhuuuu mentioned this pull request May 11, 2021

[SPARK-29145][SQL][FOLLOWUP] Clean up code about support sub-queries in join conditions #32499

Closed
