[SPARK-31334][SQL] Don't ResolveReference/ResolveMissingReference when Filter condition with aggregate expression #28107

AngersZhuuuu
Contributor

@AngersZhuuuu AngersZhuuuu commented Apr 3, 2020

What changes were proposed in this pull request?

As shown in the description of https://issues.apache.org/jira/browse/SPARK-31334, two queries of the same shape behave differently: when one column is of string type, Catalyst cannot analyze the query correctly.

Take this test SQL:

test("xxxxxxxx") {
  Seq(
    ("1", "3"),
    ("2", "3"),
    ("3", "6"),
    ("4", "7"),
    ("5", "9"),
    ("6", "9")
  ).toDF("a", "b").createOrReplaceTempView("testData")

  val x = sql(
    """
      | SELECT b, sum(a) as a
      | FROM testData
      | GROUP BY b
      | HAVING sum(a) > 3
    """.stripMargin)

  x.explain()
  x.show()
}

When ResolveAggregateFunctions analyzes the HAVING clause's condition, the plan is:

'Filter ('sum('a) > 3)
+- Aggregate [b#181], [b#181, sum(a#180) AS a#184L]
   +- SubqueryAlias `testdata`
      +- Project [_1#177 AS a#180, _2#178 AS b#181]
         +- LocalRelation [_1#177, _2#178]

Since a is of StringType, the Aggregate's aggregate expression is still unresolved (Sum.checkInputDataTypes() requires NumericType, but a is StringType), so ResolveAggregateFunctions makes no change to the LogicalPlan above. The sum(a) in the Filter condition is then resolved by ResolveReferences, which binds a to the Aggregate's output column a#184 instead of the input column a#180, and the error occurs.
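The name clash described above can be illustrated with a small toy model. This is only a sketch: Attr and resolveByName are hypothetical names invented for illustration and are not Spark's Catalyst classes; the expression ids are the ones shown in the plan above.

```scala
// Toy model only: these types are hypothetical, not Spark's Catalyst classes.
object HavingResolutionSketch {
  // An attribute with a name and a unique expression id, like a#180.
  final case class Attr(name: String, id: Int)

  // Resolve a column name against a plan's output, matching by name only.
  def resolveByName(name: String, output: Seq[Attr]): Option[Attr] =
    output.find(_.name == name)

  // Output of the Project below the Aggregate: a#180, b#181.
  val relationOutput: Seq[Attr] = Seq(Attr("a", 180), Attr("b", 181))

  // Output of the Aggregate: b#181 and the alias sum(a) AS a#184.
  val aggregateOutput: Seq[Attr] = Seq(Attr("b", 181), Attr("a", 184))

  // What HAVING sum(a) is meant to bind to: the Aggregate's input column.
  val intended: Option[Attr] = resolveByName("a", relationOutput)

  // What happens when the Filter is resolved against its child's output.
  val actual: Option[Attr] = resolveByName("a", aggregateOutput)
}
```

In this model, intended is a#180 while actual is a#184, which mirrors the wrong binding reported in this PR.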

Why are the changes needed?

Fixes a bug in the analyzer.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@AngersZhuuuu
Contributor Author

cc @dongjoon-hyun @wangyum

@cloud-fan
Contributor

Thanks for reporting this bug! It's fragile to rely on rule order; can we fix it more completely? My proposal: do not resolve a Filter if it contains aggregate functions. We can implement this in ResolveReferences.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Apr 3, 2020

Thanks for reporting this bug! It's fragile to rely on rule order; can we fix it more completely? My proposal: do not resolve a Filter if it contains aggregate functions. We can implement this in ResolveReferences.

Yeah, I know it's fragile to change the rule order, but I didn't have a better idea. Thanks for your suggestion; I will test it.

@AngersZhuuuu changed the title from "[SPARK-31334][SQL] TypeCoercion should before then ResolveAggregateFunctions" to "[SPARK-31334][SQL] Don't ResolveReference/ResolveMissingReference when Filter condition with aggregate expression" on Apr 3, 2020

@AngersZhuuuu
Contributor Author

Do not resolve Filter if it contains agg functions. We can implement it in ResolveReferences

Changed

@cloud-fan
Contributor

ok to test

@@ -1391,11 +1391,30 @@ class Analyzer(
notMatchedActions = newNotMatchedActions)
}

case f @ Filter(cond, _) if containsAggregate(cond) => f
Contributor

add a blank line between cases.

Contributor Author

add a blank line between cases.

Done

q.mapExpressions(resolveExpressionTopDown(_, q))
}

def containsAggregate(e: Expression): Boolean = {
Contributor

why can't we reuse ResolveAggregateFunctions.containsAggregate?

Contributor Author

why can't we reuse ResolveAggregateFunctions.containsAggregate?

Since the function here is still an UnresolvedFunction, we can't simply reuse it.
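To make the distinction concrete, here is a minimal toy sketch of the two checks. Everything in it is an assumption for illustration: Expr, UnresolvedFn, ResolvedAgg and knownAggregates are hypothetical stand-ins, not Spark's Expression, UnresolvedFunction, or function registry, and this is not necessarily how the patch implements containsAggregate.

```scala
// Toy model only: hypothetical types, not Spark's Catalyst expressions.
object ContainsAggregateSketch {
  sealed trait Expr { def children: Seq[Expr] }
  final case class Column(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class Literal(value: Int) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class GreaterThan(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  // A function call whose name has not been looked up yet.
  final case class UnresolvedFn(name: String, args: Seq[Expr]) extends Expr {
    val children: Seq[Expr] = args
  }
  // A function call that has already been resolved to an aggregate.
  final case class ResolvedAgg(name: String, args: Seq[Expr]) extends Expr {
    val children: Seq[Expr] = args
  }

  // Assumed stand-in for a function registry lookup.
  val knownAggregates: Set[String] = Set("sum", "avg", "count", "min", "max")

  // A check that only recognizes resolved aggregates.
  def containsResolvedAggregate(e: Expr): Boolean = e match {
    case _: ResolvedAgg => true
    case other => other.children.exists(containsResolvedAggregate)
  }

  // A check that also recognizes unresolved calls to known aggregate names.
  def containsAggregate(e: Expr): Boolean = e match {
    case _: ResolvedAgg => true
    case UnresolvedFn(name, _) if knownAggregates.contains(name.toLowerCase) => true
    case other => other.children.exists(containsAggregate)
  }

  // HAVING sum(a) > 3, before any function resolution has happened.
  val having: Expr = GreaterThan(UnresolvedFn("sum", Seq(Column("a"))), Literal(3))
}
```

In this sketch, the check for resolved aggregates returns false for the HAVING condition because sum is still unresolved, while the second check returns true.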

@SparkQA
SparkQA commented Apr 3, 2020

Test build #120779 has finished for PR 28107 at commit 0d81b4d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Apr 3, 2020

Test build #120778 has finished for PR 28107 at commit 24e9d4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Apr 3, 2020

Test build #120781 has finished for PR 28107 at commit a5cf877.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Apr 3, 2020

Test build #120776 has finished for PR 28107 at commit 54df9d3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Apr 4, 2020

Test build #120792 has finished for PR 28107 at commit 4a799fe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Apr 4, 2020

Test build #120798 has finished for PR 28107 at commit 5f6eef1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

Test build #120798 has finished for PR 28107 at commit 5f6eef1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This test failed because the expression in the Filter's condition was left unresolved, so the lookup of the registered UDF failed.

@SparkQA
SparkQA commented Apr 4, 2020

Test build #120799 has finished for PR 28107 at commit cc0b018.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

@cloud-fan
The other tests pass now, and I have minimized the scope of this PR's impact.
But I am worried that the handling of Filter now mixes the HAVING clause and the WHERE condition.

This makes future work more complex. Perhaps it would be better to represent the HAVING clause with a new class from the start, and convert it to a Filter in the Analyzer after handling it.
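The idea of a separate node for HAVING can be sketched with a toy plan model. All names here (Plan, HavingClause, convertHaving) are hypothetical and chosen for illustration; they are not Spark classes, and the WHERE condition "a > 0" is an invented example that is not part of the test query.

```scala
// Toy model only: hypothetical plan nodes, not Spark's logical plans.
object HavingNodeSketch {
  sealed trait Plan
  final case class Relation(name: String) extends Plan
  final case class Aggregate(groupBy: Seq[String], child: Plan) extends Plan
  // Produced for WHERE conditions.
  final case class Filter(condition: String, child: Plan) extends Plan
  // Hypothetical node produced only for HAVING conditions.
  final case class HavingClause(condition: String, child: Plan) extends Plan

  // After HAVING has been handled, rewrite it into an ordinary Filter.
  def convertHaving(plan: Plan): Plan = plan match {
    case HavingClause(cond, child) => Filter(cond, convertHaving(child))
    case Filter(cond, child) => Filter(cond, convertHaving(child))
    case Aggregate(groupBy, child) => Aggregate(groupBy, convertHaving(child))
    case r: Relation => r
  }

  // HAVING sum(a) > 3 over GROUP BY b, with an illustrative WHERE below it.
  val parsed: Plan =
    HavingClause("sum(a) > 3",
      Aggregate(Seq("b"), Filter("a > 0", Relation("testData"))))
}
```

With distinct node types, rules meant for WHERE can match Filter without also matching HAVING, and the conversion to Filter happens in one place.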

@SparkQA
SparkQA commented Apr 4, 2020

Test build #120803 has finished for PR 28107 at commit b2855bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

(6, 9)
).toDF("a", "b").createOrReplaceTempView("testData1")

checkAnswer(sql(
Contributor

does this test fail before your patch?

Contributor Author

does this test fail before your patch?

No, it won't fail; it is here for contrast.

Contributor

So this test is not qualified to reproduce the bug?

Contributor Author

So this test is not qualified to reproduce the bug?

The first SQL is used for comparison, and the second reproduces the bug.
If it isn't needed, we can just delete the first one.

Contributor

let's delete

Contributor Author

let's delete

Done

@SparkQA
SparkQA commented Apr 6, 2020

Test build #120868 has finished for PR 28107 at commit 20febce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Contributor Author

AngersZhuuuu commented Apr 20, 2020

ping @cloud-fan, is there anything else that needs updating?

@cloud-fan
Contributor

This PR exposes a more serious problem, that we rely on rules order to resolve HAVING. We can easily get correctness issues because of it.

#28294 proposes a new solution to fix this problem completely, can you take a look?

@AngersZhuuuu
Contributor Author

This PR exposes a more serious problem, that we rely on rules order to resolve HAVING. We can easily get correctness issues because of it.

#28294 proposes a new solution to fix this problem completely, can you take a look?

The PR you mentioned is similar to what I said in #28107 (comment).

It's better to do it like that. I will follow that PR and check whether it solves this PR's problem; if not, I will make changes based on it.

@AngersZhuuuu
Contributor Author

@cloud-fan
#28294 can fix this issue. Closing this PR.
