[SPARK-12660] [SPARK-14967] [SQL] Implement Except Distinct by Left Anti Join #12736
Conversation
cc @hvanhovell Thank you for your quick fix! This PR contains the anti-join fix from #12730 to avoid test-case failures. Once #12730 is merged, I will remove the related code.
Test build #57134 has finished for PR 12736 at commit
Merged fix.
Thank you very much! I will clean up the code soon.
@@ -291,7 +291,7 @@ public void testSetOperation() {
        unioned.collectAsList());

    Dataset<String> subtracted = ds.except(ds2);
-   Assert.assertEquals(Arrays.asList("abc", "abc"), subtracted.collectAsList());
+   Assert.assertEquals(Arrays.asList("abc"), subtracted.collectAsList());
This is pretty strange. I will check the exact behavior of our current Except operator. It sounds like it does not remove duplicates.
Except apparently does not remove duplicates:
range(0, 10).registerTempTable("a")
range(5, 15).registerTempTable("b")
sql("(select * from a union all select * from a) except select * from b")
results in:
+---+
| id|
+---+
| 2|
| 2|
| 0|
| 0|
| 3|
| 3|
| 4|
| 4|
| 1|
| 1|
+---+
So that is weird.
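For reference, a standard-compliant EXCEPT (i.e. EXCEPT DISTINCT) de-duplicates the left side, so the same query should return each of 0 through 4 exactly once. A sketch of the expected output, in the spark-shell style used above (row order may vary):

sql("(select * from a union all select * from a) except select * from b").show()
// expected under EXCEPT DISTINCT:
// +---+
// | id|
// +---+
// |  0|
// |  1|
// |  2|
// |  3|
// |  4|
// +---+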
We have to decide what we should do next. This PR is doing Except Distinct. If we want to keep the existing behavior, which is Except All, we need to change the external API in Dataset. However, the current SQL interface for Except is wrong; we need to correct it at least.
The current implementation (before this PR) is somewhere between EXCEPT and EXCEPT ALL: it will remove all left-side rows that find a match (essentially eliminating duplicates among matched rows), but it does not remove duplicates where there is no match. Let's follow the principle of least surprise and create a correct EXCEPT (one that removes duplicates).
Uh, I see... Let me add a test case to cover this and ensure it will not be broken again.
This example shows why the current master is wrong.
test("except") {
  val df_left = Seq(1, 2, 2, 3, 3, 4).toDF("id")
  val df_right = Seq(1, 3).toDF("id")
  checkAnswer(
    df_left.except(df_right),
    Row(2) :: Row(2) :: Row(4) :: Nil
  )
}
For EXCEPT ALL, we should output Row(2) :: Row(2) :: Row(3) :: Row(4) :: Nil.
For EXCEPT DISTINCT, we should output Row(2) :: Row(4) :: Nil.
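To make the three semantics concrete, here is a small plain-Scala sketch (an illustration on ordinary collections, not Spark code) that reproduces the three outputs above:

// Plain-Scala models of the three semantics discussed above (illustration only).
def exceptDistinct[A](left: Seq[A], right: Seq[A]): Seq[A] =
  left.distinct.filterNot(right.toSet)                 // SQL-standard EXCEPT (DISTINCT)

def exceptAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  // keep max(countLeft - countRight, 0) copies of each value
  val rightCounts = right.groupBy(identity).map { case (v, occ) => v -> occ.size }
  left.groupBy(identity).toSeq.flatMap { case (v, occ) =>
    Seq.fill(math.max(occ.size - rightCounts.getOrElse(v, 0), 0))(v)
  }
}

def exceptCurrentMaster[A](left: Seq[A], right: Seq[A]): Seq[A] =
  left.filterNot(right.toSet)                          // drops matched rows, keeps unmatched duplicates

val l = Seq(1, 2, 2, 3, 3, 4)
val r = Seq(1, 3)
exceptCurrentMaster(l, r)  // List(2, 2, 4) -- what the current master returns
exceptAll(l, r)            // 2, 2, 3, 4    -- grouping order is unspecified
exceptDistinct(l, r)       // List(2, 4)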
Did this PR also fix the semantics of Except, or did it only add the optimization?
Yeah. After this PR, the behavior of EXCEPT is changed to the standard behavior of EXCEPT DISTINCT.
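In other words, after this PR df1.except(df2) behaves like a null-safe left anti join followed by a distinct. A rough DataFrame sketch of that equivalence (assuming spark.implicits._ is in scope and a Spark version whose DataFrame API accepts the "leftanti" join type):

val df1 = Seq(1, 2, 2, 3, 3, 4).toDF("id")
val df2 = Seq(1, 3).toDF("id")

// EXCEPT DISTINCT spelled out by hand: null-safe anti join, then de-duplicate.
val manualExcept = df1
  .join(df2, df1("id") <=> df2("id"), "leftanti")
  .distinct()

df1.except(df2).collect().toSet == manualExcept.collect().toSet  // expected: true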
Test build #57143 has finished for PR 12736 at commit
Test build #57164 has finished for PR 12736 at commit
Test build #57183 has finished for PR 12736 at commit
// Check if no Project is added
assert(r3.left.isInstanceOf[LocalRelation])
assert(r3.right.isInstanceOf[LocalRelation])
Why remove these? We didn't change the analysis of Except, right?
We added a new condition in resolved: https://github.com/apache/spark/pull/12736/files#diff-72917e7b68f0311b2fb42990e0dc616dR180. In this test case, we are trying to Except the same table, firstTable. Thus, the value of resolved now becomes false.
When we resolve WidenSetOperationTypes in
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala (Line 230 in c417cec):
left.output.length == right.output.length && !s.resolved =>
the if condition becomes true. (Before this PR, this condition was always false when you tried to except the same table, so we did nothing in that case and no Project was added.)
However, when the if condition becomes true, we execute the corresponding logic: we are able to find the common type, and then we add a Project. This is doing the same duplicate checking as what we do in
spark/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercionSuite.scala (Lines 479 to 490 in d280d1d):
val r1 = wt(Except(firstTable, secondTable)).asInstanceOf[Except]
val r2 = wt(Intersect(firstTable, secondTable)).asInstanceOf[Intersect]
checkOutput(r1.left, expectedTypes)
checkOutput(r1.right, expectedTypes)
checkOutput(r2.left, expectedTypes)
checkOutput(r2.right, expectedTypes)
// Check if a Project is added
assert(r1.left.isInstanceOf[Project])
assert(r1.right.isInstanceOf[Project])
assert(r2.left.isInstanceOf[Project])
assert(r2.right.isInstanceOf[Project])
Sorry, it is a little bit complicated to explain the whole logic. Let me know if anything is unclear.
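For anyone who wants to poke at the case being described, a minimal self-except sketch (assuming a local SparkSession; names and setup are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("self-except").getOrCreate()
import spark.implicits._

val firstTable = Seq(1, 2, 2, 3).toDF("id")

// Both sides initially share the same attribute IDs, so the Except is not
// considered resolved until the analyzer de-duplicates the attributes on one side.
val selfExcept = firstTable.except(firstTable)
selfExcept.show()         // empty result, as EXCEPT DISTINCT requires
selfExcept.explain(true)  // inspect the analyzed/optimized plans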
got it, thanks for the explanation!
LGTM. An unrelated question: how do we express the EXCEPT ALL semantics?
@cloud-fan Thank you for your review! Originally, I planned to do it after this release, so all of you would have more time to review it. If we need to implement it in this release, I can start writing both.
Test build #57227 has finished for PR 12736 at commit
thanks! merging to master!
What changes were proposed in this pull request?
Replaces a logical Except operator with a Left-anti Join operator. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
Note: the rewrite has to be done after de-duplicating the attributes of the two children; otherwise, the generated join conditions will be incorrect.
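For illustration, the rewrite can be sketched in the spark-shell style used above (table and column names are hypothetical, and the LEFT ANTI JOIN syntax is the one added in #12730); presumably null-safe equality (<=>) is used so that NULLs compare as equal in the anti-join condition:

// A query like
sql("SELECT a1, a2 FROM Tab1 EXCEPT SELECT b1, b2 FROM Tab2")
// is planned, conceptually, as
sql("SELECT DISTINCT a1, a2 FROM Tab1 LEFT ANTI JOIN Tab2 ON a1 <=> b1 AND a2 <=> b2")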
This PR also corrects the existing behavior of Except in Spark; before this PR, the behavior was the incorrect one shown in the conversation above. After this PR, the result is corrected: we strictly follow the SQL semantics of Except Distinct.
How was this patch tested?
Modified and added a few test cases to verify the optimization rule and the results of operators.