[SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS More Than Once #15048 #15502

gatorsmile · 2016-10-15T15:22:44Z

What changes were proposed in this pull request?

This PR is to backport #15048 and #15459.

However, in 2.0, we do not have a unified logical node CreateTable and the analyzer rule PreWriteCheck is also different. To minimize the code changes, this PR adds a new rule AnalyzeCreateTableAsSelect. Please treat it as a new PR to review. Thanks!

As explained in #14797:

Some analyzer rules make assumptions about logical plans, and the optimizer may break these assumptions. We should not pass an optimized query plan into QueryExecution (where it will be analyzed again); otherwise we may hit some weird bugs.
For example, we have a rule for decimal calculation that promotes the precision before binary operations and uses PromotePrecision as a placeholder to indicate that the rule should not be applied twice. But an optimizer rule removes this placeholder, which breaks the assumption; the rule is then applied twice and produces a wrong result.
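The placeholder problem quoted above can be illustrated with a toy model. This is only a sketch: `Expr`, `Widen`, `Promoted`, `promote`, and `stripPlaceholders` are hypothetical names invented for illustration, not Spark's actual classes or rules. It shows an analyzer-style rule that is idempotent only while its placeholder is present.

```scala
// Toy expression tree (hypothetical; not Spark's Catalyst classes).
sealed trait Expr
case class Lit(value: Int) extends Expr
case class Mul(left: Expr, right: Expr) extends Expr
case class Widen(child: Expr) extends Expr    // stands in for a precision-widening cast
case class Promoted(child: Expr) extends Expr // placeholder meaning "already promoted"

object PlaceholderDemo {
  // Analyzer-style rule: widen both operands of Mul, unless the placeholder
  // shows that the rule has already been applied.
  def promote(e: Expr): Expr = e match {
    case Mul(_: Promoted, _: Promoted) => e
    case Mul(l, r) => Mul(Promoted(Widen(promote(l))), Promoted(Widen(promote(r))))
    case other => other
  }

  // Optimizer-style rule: the placeholder has no runtime meaning, so drop it.
  def stripPlaceholders(e: Expr): Expr = e match {
    case Promoted(c) => stripPlaceholders(c)
    case Mul(l, r) => Mul(stripPlaceholders(l), stripPlaceholders(r))
    case Widen(c) => Widen(stripPlaceholders(c))
    case lit: Lit => lit
  }

  // Counts how many widening casts the tree contains.
  def widenCount(e: Expr): Int = e match {
    case Widen(c) => 1 + widenCount(c)
    case Promoted(c) => widenCount(c)
    case Mul(l, r) => widenCount(l) + widenCount(r)
    case _: Lit => 0
  }
}
```

In this model, re-running `promote` on an analyzed tree is a no-op, but running it on the output of `stripPlaceholders` wraps each operand in a second `Widen`, which is the kind of double application the quote describes.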

We should not optimize the query in CTAS more than once. For example,

spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))

Before this PR, the results do not match:

== Results ==
!== Correct Answer - 2 ==       == Spark Answer - 2 ==
![100,100.000000000000000000]   [100,null]
 [99,99.000000000000000000]     [99,99.000000000000000000]

After this PR, the results match.

+---+----------------------+
|id |num                   |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+

In this PR, we do not treat the query in CTAS as a child. Thus, the query will not be optimized when the CTAS statement itself is optimized. However, we still need to analyze the query in order to normalize and verify the CTAS in the Analyzer. We do this in the analyzer rule PreprocessDDL because, so far, only this rule needs the analyzed plan of the query.
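A minimal sketch of why dropping the query from the children keeps it away from the optimizer, assuming a simplified plan tree whose rules only recurse through `children`. All names here (`Plan`, `Scan`, `Project`, `CtasWithChild`, `CtasLeaf`, `markOptimized`) are hypothetical and are not Spark's LogicalPlan API.

```scala
// Toy plan tree (hypothetical). Rules are applied bottom-up and reach
// only the nodes exposed through `children`.
sealed trait Plan {
  def children: Seq[Plan]
  def withChildren(newChildren: Seq[Plan]): Plan
  def transformUp(rule: PartialFunction[Plan, Plan]): Plan = {
    val rewritten = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(rewritten, identity[Plan])
  }
}

case class Scan(table: String) extends Plan {
  def children: Seq[Plan] = Nil
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

case class Project(optimized: Boolean, child: Plan) extends Plan {
  def children: Seq[Plan] = Seq(child)
  def withChildren(newChildren: Seq[Plan]): Plan = copy(child = newChildren.head)
}

// Variant 1: the query is a child, so every rule run on CTAS also rewrites it.
case class CtasWithChild(table: String, query: Plan) extends Plan {
  def children: Seq[Plan] = Seq(query)
  def withChildren(newChildren: Seq[Plan]): Plan = copy(query = newChildren.head)
}

// Variant 2: the query is an ordinary field, so tree traversal never sees it.
case class CtasLeaf(table: String, query: Plan) extends Plan {
  def children: Seq[Plan] = Nil
  def withChildren(newChildren: Seq[Plan]): Plan = this
}

object CtasDemo {
  // Stand-in for an optimizer rule.
  val markOptimized: PartialFunction[Plan, Plan] = {
    case p: Project => p.copy(optimized = true)
  }
}
```

With `CtasWithChild`, applying `markOptimized` rewrites the nested `Project`; with `CtasLeaf`, the query comes back unchanged and can be handed to a fresh analysis and optimization pass exactly once.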

How was this patch tested?

SparkQA · 2016-10-15T17:27:47Z

Test build #67018 has finished for PR 15502 at commit 9cfebc5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class AnalyzeCreateTableAsSelect(sparkSession: SparkSession) extends Rule[LogicalPlan]

gatorsmile · 2016-10-17T06:52:04Z

cc @yhuai @hvanhovell @cloud-fan I guess this needs to be merged to 2.0.2 ASAP?

cloud-fan · 2016-10-17T07:08:22Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala

@@ -510,7 +510,7 @@ private[hive] case class InsertIntoHiveTable(
    child: LogicalPlan,
    overwrite: Boolean,
    ifNotExists: Boolean)
-  extends LogicalPlan with Command {


Why is it not a command anymore?

In Command, this PR requires that the child be empty. Should we convert InsertIntoHiveTable to a non-child Command?

Just FYI, in Spark 2.1, InsertIntoTable is still a LogicalPlan instead of a Command.

Ah, I see, this command is gone in 2.1.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #15502 from gatorsmile/ctasOptimize2.0.

cloud-fan · 2016-10-17T07:30:27Z

LGTM, merging to 2.0!

gatorsmile · 2016-10-17T07:35:54Z

Thanks! Close it now.

gatorsmile added 4 commits October 13, 2016 22:16

the first set of changes

a9931a5

2nd change set

d5f9187

more comment

a658da4

rename

9cfebc5

cloud-fan reviewed Oct 17, 2016

gatorsmile closed this Oct 17, 2016

gatorsmile mentioned this pull request Oct 24, 2016

[SPARK-17409] [SQL] [FOLLOW-UP] Do Not Optimize Query in CTAS More Than Once #15459

Closed
