Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31958][SQL] normalize special floating numbers in subquery #28785

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

This is a followup of #23388 .

#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip Subquery, and subquery expressions will be handled by OptimizeSubqueries, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in OptimizeSubqueries, once later when it becomes join. This is OK as NormalizeFloatingNumbers is idempotent now.

Why are the changes needed?

fix a bug

Does this PR introduce any user-facing change?

yes, see the newly added test.

How was this patch tested?

new test

@cloud-fan
Copy link
Contributor Author

cc @hvanhovell @maropu @viirya

@@ -56,10 +56,6 @@ import org.apache.spark.sql.types._
object NormalizeFloatingNumbers extends Rule[LogicalPlan] {

def apply(plan: LogicalPlan): LogicalPlan = plan match {
// A subquery will be rewritten into join later, and will go through this rule
// eventually. Here we skip subquery, as we only need to run this rule once.
case _: Subquery => plan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding tests for the subquery case in NormalizeFloatingPointNumbersSuite, too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we can't.

This fix relies on the rule OptimizeSubqueries, which is an inner object of the class Optimizer as it needs to rerun the entire optimizer for subquery. So we can't use OptimizeSubqueries in NormalizeFloatingPointNumbersSuite.

@SparkQA
Copy link

SparkQA commented Jun 10, 2020

Test build #123771 has finished for PR 28785 at commit 4dc6413.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -56,10 +56,6 @@ import org.apache.spark.sql.types._
object NormalizeFloatingNumbers extends Rule[LogicalPlan] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it also mean This batch must be executed after the RewriteSubquery batch, which creates joins. is not definitely true now?

Copy link
Contributor Author

@cloud-fan cloud-fan Jun 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's still true, the correlated subquery becomes join, and may have new join keys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Makes sense.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master/3.0!

@cloud-fan cloud-fan closed this in 6fb9c80 Jun 11, 2020
cloud-fan added a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This is a followup of #23388 .

#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes #28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6fb9c80)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
holdenk pushed a commit to holdenk/spark that referenced this pull request Jun 25, 2020
### What changes were proposed in this pull request?

This is a followup of apache#23388 .

apache#23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that, correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, once later when it becomes join. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes apache#28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6fb9c80)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants