-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-23087][SQL] CheckCartesianProduct too restrictive when condition is false/null #20333
Conversation
spark.sessionState.executePlan(planNull).optimizedPlan | ||
|
||
val dfOne = df.select(lit(1).as("a")) | ||
val dfTwo = spark.range(10).select(lit(2).as("a")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a
-> b
Test build #86401 has finished for PR 20333 at commit
|
} | ||
|
||
def apply(plan: LogicalPlan): LogicalPlan = | ||
if (SQLConf.get.crossJoinEnabled) { | ||
plan | ||
} else plan transform { | ||
case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, condition) | ||
case j @ Join(left, right, Inner | LeftOuter | RightOuter | FullOuter, _) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For inner joins, we will not hit this, because it is already optimized to an empty relation. For the other outer join types, we face the exactly same issue as the condition is true. That is, the size of the join result sets is still the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you saying that the size of the result set is the same?
If you have a relation A (of size n, let's say 1M rows) in outer join with a relation B (of size m, let's say 1M rows). If the condition is true, the output relation is 1M * 1M (ie. (n * m)); if the condition is false, the result is 1M (n) for a left join, 1M (m) for a right join, 1M + 1M (m +n) for a full outer join. Therefore the size is not the same at all. But maybe you meant something different, am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. For outer join, it makes sense to remove this check
@@ -274,4 +274,18 @@ class DataFrameJoinSuite extends QueryTest with SharedSQLContext { | |||
checkAnswer(innerJoin, Row(1) :: Nil) | |||
} | |||
|
|||
test("SPARK-23087: don't throw Analysis Exception in CheckCartesianProduct when join condition " + | |||
"is false or null") { | |||
val df = spark.range(10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
withSQLConf(CROSS_JOINS_ENABLED.key -> "true") {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't it be false?
LGTM except one minor comment. |
Test build #86418 has finished for PR 20333 at commit
|
Thanks! Merged to master/2.3 |
Will address my comment in my PR. |
…on is false/null ## What changes were proposed in this pull request? CheckCartesianProduct raises an AnalysisException also when the join condition is always false/null. In this case, we shouldn't raise it, since the result will not be a cartesian product. ## How was this patch tested? added UT Author: Marco Gaido <marcogaido91@gmail.com> Closes #20333 from mgaido91/SPARK-23087. (cherry picked from commit 121dc96) Signed-off-by: gatorsmile <gatorsmile@gmail.com>
What changes were proposed in this pull request?
CheckCartesianProduct raises an AnalysisException also when the join condition is always false/null. In this case, we shouldn't raise it, since the result will not be a cartesian product.
How was this patch tested?
added UT