
[SPARK-21726][SQL] Check for structural integrity of the plan in Optimizer in test mode. #18956

Closed · wants to merge 6 commits

Conversation

viirya
Member

@viirya viirya commented Aug 16, 2017

What changes were proposed in this pull request?

We now have many optimization rules in the Optimizer, but there is no check that the plan keeps its structural integrity (e.g. stays resolved) as the rules run. When debugging, it is difficult to identify which rule returns an invalid plan.

It would be great if, in test mode, we could check whether a plan is still resolved after the execution of each rule, so we can catch rules that return invalid plans.

How was this patch tested?

Added tests.
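The proposed hook can be sketched, outside of Spark, as a RuleExecutor-style loop with an overridable integrity check. This is a minimal illustrative sketch: the class and method names (SimpleRuleExecutor, isPlanIntegral) are assumptions for the example, not necessarily the final Catalyst API.

```scala
// Minimal self-contained sketch of the idea (illustrative names, not the
// actual Catalyst classes):
abstract class SimpleRuleExecutor[T] {
  // Structural-integrity hook: passes everything by default; an
  // Optimizer-like subclass overrides it to check, for example, that the
  // plan is still resolved after each rule.
  protected def isPlanIntegral(plan: T): Boolean = true

  protected def rules: Seq[T => T]

  def execute(plan: T): T = rules.foldLeft(plan) { (current, rule) =>
    val result = rule(current)
    if (!isPlanIntegral(result)) {
      throw new IllegalStateException(
        "Structural integrity of the plan is broken after applying a rule.")
    }
    result
  }
}
```

The key point is that the check runs after every individual rule, so the first rule to break the plan is the one that triggers the failure.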

@@ -37,6 +37,12 @@ import org.apache.spark.sql.types._
abstract class Optimizer(sessionCatalog: SessionCatalog)
extends RuleExecutor[LogicalPlan] {

// Check for structural integrity of the plan in test mode. Currently we only check if a plan is
// still resolved after the execution of each rule.
override protected def planChecker: Option[LogicalPlan => Boolean] = Some(
Contributor

@rxin rxin Aug 16, 2017

Can we move the check for whether this is a test run in here? Then this method simply returns a boolean, and by default it returns true.

Member Author

Thanks. I will update it.

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80715 has finished for PR 18956 at commit 21d86ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80717 has finished for PR 18956 at commit 9170ceb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 16, 2017

Test build #80718 has finished for PR 18956 at commit c99011d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 16, 2017

Interesting: the existing PullupCorrelatedPredicates rule produces an unresolved plan. I'll figure out the reason.

@viirya
Member Author

viirya commented Aug 16, 2017

The reason PullupCorrelatedPredicates produces an unresolved plan:

The query causing the problem in subquery_in_having.q looks like:

select b.key, min(b.value)
from src b
group by b.key
having b.key in ( select a.key
                from src a
                where a.value > 'val_9' and a.value = min(b.value)
                )
order by b.key
;

The optimized plan looks like:

'Sort [key#201 ASC NULLS FIRST], true
+- 'Project [key#201, min(value)#204]
   +- 'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])
      :  +- Project [key#206, value#207]
      :     +- Filter (value#207 > val_9)
      :        +- InMemoryRelation [key#206, value#207], true, 5, StorageLevel(disk, memory, deserialized, 1 replicas), src
      :              +- HiveTableScan [key#0, value#1], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]
      +- Aggregate [key#201], [key#201, min(value#202) AS min(value)#204, min(value#202) AS min(value#202)#209]
         +- InMemoryRelation [key#201, value#202], true, 5, StorageLevel(disk, memory, deserialized, 1 replicas), src
               +- HiveTableScan [key#0, value#1], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]

Before PullupCorrelatedPredicates rule, the subquery in ListQuery looks like:

Project [key#206]
+- Filter ((value#207 > val_9) && (value#207 = outer(min(value)#204)))
   +- InMemoryRelation [key#206, value#207], true, 5, StorageLevel(disk, memory, deserialized, 1 replicas), src
         +- HiveTableScan [key#0, value#1], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]

At this point, the In predicate in the top Filter is resolved.

After the rule, the subquery looks like:

Project [key#206, value#207]
+- Filter (value#207 > val_9)
   +- InMemoryRelation [key#206, value#207], true, 5, StorageLevel(disk, memory, deserialized, 1 replicas), src
         +- HiveTableScan [key#0, value#1], HiveTableRelation `default`.`src`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [key#0, value#1]

Notice that value#207 has been added into the Project, and the condition value#207 = outer(min(value)#204) has been pulled out into the top Filter:

 'Filter key#201 IN (list#200 [(value#207 = min(value)#204)])

Because In.checkInputDataTypes checks whether the number of columns on the left side (key#201) matches the number of subquery output columns (key#206, value#207), the check fails and In.resolved returns false.

The unresolved Filter didn't cause a problem before, because it is later converted to a Join by the RewritePredicateSubquery rule. But it is now detected by this structural integrity check.

By modifying In.checkInputDataTypes, we can solve this issue. I'll submit another PR for it.
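The arity check described above can be reduced to a toy model. The classes here are hypothetical, written only to illustrate why In.resolved flips to false when the rule widens the subquery's Project; they are not Spark's actual In expression.

```scala
// Toy model of the check, not Spark's In expression:
case class InSubquery(leftColumns: Seq[String], subqueryOutput: Seq[String]) {
  // Mirrors the described In.checkInputDataTypes behavior: the number of
  // columns on the left must match the number of subquery output columns.
  def checkInputDataTypes: Boolean = leftColumns.size == subqueryOutput.size
  def resolved: Boolean = checkInputDataTypes
}

// Before PullupCorrelatedPredicates: Project [key#206], sizes match.
InSubquery(Seq("key#201"), Seq("key#206")).resolved
// After the rule: Project [key#206, value#207], sizes differ, so the
// predicate is no longer resolved.
InSubquery(Seq("key#201"), Seq("key#206", "value#207")).resolved
```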

@viirya
Member Author

viirya commented Aug 17, 2017

The PR that fixes the issue described in #18956 (comment) has been submitted as #18968.

@viirya
Member Author

viirya commented Aug 24, 2017

retest this please.

@viirya
Member Author

viirya commented Aug 24, 2017

#18968 is merged. This should pass the tests now.

@SparkQA

SparkQA commented Aug 24, 2017

Test build #81079 has finished for PR 18956 at commit c99011d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Aug 25, 2017

It seems there are other issues caused by the RewritePredicateSubquery rule. I'll investigate and fix them.

@viirya
Member Author

viirya commented Aug 25, 2017

RewritePredicateSubquery fails the structural integrity check because it can produce a Join with conflicting attributes in its left and right plans.

I submitted #19050 to fix it.
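The failure mode can be illustrated with a toy attribute-conflict check. The Attr type and the helper below are hypothetical stand-ins; in Catalyst, attributes carry unique expression ids, and a Join whose children expose the same id is structurally invalid.

```scala
// Toy check for the conflict described above, not Spark's actual code:
case class Attr(name: String, exprId: Int)

// A Join is structurally invalid if its left and right children expose
// attributes with the same expression id.
def hasConflictingAttributes(leftOutput: Seq[Attr],
                             rightOutput: Seq[Attr]): Boolean =
  leftOutput.map(_.exprId).toSet
    .intersect(rightOutput.map(_.exprId).toSet)
    .nonEmpty
```

An integrity check along these lines is what catches the invalid Join that RewritePredicateSubquery could produce before #19050.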

@viirya
Member Author

viirya commented Sep 6, 2017

#19050 is merged now. Let's see if any other rule still fails this structural integrity check.

@viirya
Member Author

viirya commented Sep 6, 2017

retest this please.

@SparkQA

SparkQA commented Sep 6, 2017

Test build #81460 has finished for PR 18956 at commit c99011d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81496 has finished for PR 18956 at commit e1e4aa1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class NettyMemoryMetrics implements MetricSet
  • class ByteArrayConstructor extends net.razorvine.pickle.objects.ByteArrayConstructor
  • * (4) the main class for the child
  • public class JavaFeatureHasherExample
  • sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
  • sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary

@viirya
Member Author

viirya commented Sep 7, 2017

retest this please.

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81504 has finished for PR 18956 at commit e1e4aa1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class NettyMemoryMetrics implements MetricSet
  • class ByteArrayConstructor extends net.razorvine.pickle.objects.ByteArrayConstructor
  • * (4) the main class for the child
  • public class JavaFeatureHasherExample
  • sealed trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary
  • sealed trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary

@SparkQA

SparkQA commented Sep 7, 2017

Test build #81518 has finished for PR 18956 at commit 959e315.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (!planChecker(result)) {
  val message = s"After applying rule ${rule.ruleName} in batch ${batch.name}, " +
    "the structural integrity of the plan is broken."
  throw new TreeNodeException(result, message, null)
Member

Move the exception-throwing logic into planChecker?

Member

Never mind. The message also includes the rule and batch names.

import org.apache.spark.sql.internal.SQLConf


class OptimizerSICheckerkSuite extends PlanTest {
Member

-> OptimizerStructuralIntegrityCheckerSuite

Member Author

Ok.

* `Optimizer`, so we can catch rules that return invalid plans. The check function will returns
* `false` if the given plan doesn't pass the structural integrity check.
*/
protected def planChecker(plan: TreeType): Boolean = true
Member

planChecker -> isPlanIntegral?

Member Author

Looks good.

@@ -64,6 +64,14 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
protected def batches: Seq[Batch]

/**
* Defines a check function which checks for structural integrity of the plan after the execution
* of each rule. For example, we can check whether a plan is still resolved after each rule in
* `Optimizer`, so we can catch rules that return invalid plans. The check function will returns
Member

will returns -> returns

@@ -64,6 +64,14 @@ abstract class RuleExecutor[TreeType <: TreeNode[_]] extends Logging {
protected def batches: Seq[Batch]

/**
* Defines a check function which checks for structural integrity of the plan after the execution
Member

which -> that

@gatorsmile
Member

LGTM except two minor comments

@SparkQA

SparkQA commented Sep 8, 2017

Test build #81529 has finished for PR 18956 at commit d1db7cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 8, 2017

Test build #81531 has finished for PR 18956 at commit ecdfb7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Thanks! Merged to master.

@viirya
Member Author

viirya commented Sep 8, 2017

Thanks @rxin @gatorsmile

@asfgit asfgit closed this in 6e37524 Sep 8, 2017
@viirya viirya deleted the SPARK-21726 branch December 27, 2023 18:34