-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-17892] [SQL] [2.0] Do Not Optimize Query in CTAS More Than Once #15048 #15502
Conversation
Test build #67018 has finished for PR 15502 at commit
|
cc @yhuai @hvanhovell @cloud-fan I guess this needs to be merged to 2.0.2 ASAP? |
@@ -510,7 +510,7 @@ private[hive] case class InsertIntoHiveTable( | |||
child: LogicalPlan, | |||
overwrite: Boolean, | |||
ifNotExists: Boolean) | |||
extends LogicalPlan with Command { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why it's not a command anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the Command, this PR requires the child must be empty . Should we convert InsertIntoHiveTable
to a non-child Command
?
Just FYI, in Spark 2.1, InsertIntoTable
is still a LogicalPlan
instead of a Command
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah i see, this command is gone in 2.1
…15048 ### What changes were proposed in this pull request? This PR is to backport #15048 and #15459. However, in 2.0, we do not have a unified logical node `CreateTable` and the analyzer rule `PreWriteCheck` is also different. To minimize the code changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it as a new PR to review. Thanks! As explained in #14797: >Some analyzer rules have assumptions on logical plans, optimizer may break these assumption, we should not pass an optimized query plan into QueryExecution (will be analyzed again), otherwise we may some weird bugs. For example, we have a rule for decimal calculation to promote the precision before binary operations, use PromotePrecision as placeholder to indicate that this rule should not apply twice. But a Optimizer rule will remove this placeholder, that break the assumption, then the rule applied twice, cause wrong result. We should not optimize the query in CTAS more than once. For example, ```Scala spark.range(99, 101).createOrReplaceTempView("tab1") val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18)) as num FROM tab1" sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt") checkAnswer(spark.table("tab2"), sql(sqlStmt)) ``` Before this PR, the results do not match ``` == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![100,100.000000000000000000] [100,null] [99,99.000000000000000000] [99,99.000000000000000000] ``` After this PR, the results match. ``` +---+----------------------+ |id |num | +---+----------------------+ |99 |99.000000000000000000 | |100|100.000000000000000000| +---+----------------------+ ``` In this PR, we do not treat the `query` in CTAS as a child. Thus, the `query` will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rule `PreprocessDDL`, because so far only this rule needs the analyzed plan of the `query`. ### How was this patch tested? Author: gatorsmile <gatorsmile@gmail.com> Closes #15502 from gatorsmile/ctasOptimize2.0.
LGTM, merging to 2.0! |
Thanks! Close it now. |
What changes were proposed in this pull request?
This PR is to backport #15048 and #15459.
However, in 2.0, we do not have a unified logical node
CreateTable
and the analyzer rulePreWriteCheck
is also different. To minimize the code changes, this PR adds a new ruleAnalyzeCreateTableAsSelect
. Please treat it as a new PR to review. Thanks!As explained in #14797:
We should not optimize the query in CTAS more than once. For example,
Before this PR, the results do not match
After this PR, the results match.
In this PR, we do not treat the
query
in CTAS as a child. Thus, thequery
will not be optimized when optimizing CTAS statement. However, we still need to analyze it for normalizing and verifying the CTAS in the Analyzer. Thus, we do it in the analyzer rulePreprocessDDL
, because so far only this rule needs the analyzed plan of thequery
.How was this patch tested?