[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark #19344
Conversation
Test build #82167 has finished for PR 19344 at commit
@gatorsmile if you get time, please check this. Thanks.
ping
      case _ =>
    }
    // logical plan and adding up the sizes of all tables that appear in the plan.
    val planToCheck = mutable.Stack[LogicalPlan](spark.sql(queryString).queryExecution.logical)
Why not use the plan that has already been analyzed?
The analyzer rule CTESubstitution will replace With.
Oh, yeah. Since the original code did so, I just kept the same logic. But the suggestion sounds good to me, so I'll update soon. Thanks.
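The difference between the two plans can be seen on a toy CTE query. A minimal sketch, assuming a local SparkSession and a hypothetical temp view t (both introduced here only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: compare the parsed ("logical") and analyzed plans of a WITH query.
val spark = SparkSession.builder().master("local[1]").appName("cte-demo").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

val qe = spark.sql("WITH v AS (SELECT id FROM t) SELECT * FROM v").queryExecution
// qe.logical still contains a `With` node holding the CTE definition, so a
// traversal of it never reaches the relation `t` referenced only inside the
// WITH clause. qe.analyzed has already run CTESubstitution and inlined `v`,
// so traversing it does reach `t`.
println(qe.logical)
println(qe.analyzed)
spark.stop()
```

This is why traversing the analyzed plan catches relations that appear only inside WITH clauses.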
    val queryRelations = scala.collection.mutable.HashSet[String]()
-   spark.sql(queryString).queryExecution.logical.map {
+   spark.sql(queryString).queryExecution.analyzed.map {
      case UnresolvedRelation(t: TableIdentifier) =>
If the plan is successfully analyzed, UnresolvedRelation should not exist.
Test build #82280 has finished for PR 19344 at commit
Test build #82281 has finished for PR 19344 at commit
@gatorsmile ok, fixed. Also, I checked that this code can collect all the relations.
Test build #82301 has finished for PR 19344 at commit
Test build #82303 has finished for PR 19344 at commit
      case _ =>
    }
  }
  spark.sql(queryString).queryExecution.analyzed.map {
foreach
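The point of this suggestion: Catalyst's TreeNode.map applies the closure to every node and collects the results into a returned collection, which is wasted allocation when the closure is run only for its side effect on queryRelations. TreeNode.foreach traverses the same nodes and returns Unit. A hedged fragment (the Spark imports and surrounding benchmark variables are assumed from context):

```scala
// Sketch: relation names are collected via side effect, so `foreach` is the
// idiomatic traversal; `map` would build a throwaway collection of Unit/Boolean
// results for every node in the plan.
val queryRelations = scala.collection.mutable.HashSet[String]()
spark.sql(queryString).queryExecution.analyzed.foreach {
  case SubqueryAlias(name, _: LogicalRelation) => queryRelations.add(name)
  case _ => // other plan nodes carry no relation name
}
```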
  }
  spark.sql(queryString).queryExecution.analyzed.map {
    case SubqueryAlias(name, _: LogicalRelation) =>
      queryRelations.add(name)
Add another case for HiveTableRelation
IIUC ditto; HiveTableRelation never happens here.
  }
  }
  spark.sql(queryString).queryExecution.analyzed.map {
    case SubqueryAlias(name, _: LogicalRelation) =>
Why not use LogicalRelation's catalogTable? Just throw an exception if it is None. I think this benchmark will not hit None.
I checked again and found we can't use catalogTable here, because these TPCDS tables are local temporary ones (IIUC, these tables are always transformed into SubqueryAlias(LocalRelation)).
Test build #82309 has finished for PR 19344 at commit
Test build #82311 has finished for PR 19344 at commit
Test build #82314 has finished for PR 19344 at commit
retest this please
  }
  }
  spark.sql(queryString).queryExecution.analyzed.foreach {
    case SubqueryAlias(name, _: LogicalRelation) =>
Could we handle all three scenarios, even though the current code only uses temp views?
sure, will do
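A sketch of what the traversal might look like once all three scenarios are covered. This is a hedged approximation, not the merged code: the pattern shapes assume Spark 2.3-era constructors such as LogicalRelation(relation, output, catalogTable) and HiveTableRelation(tableMeta, dataCols, partitionCols).

```scala
// Sketch: collect every input relation name from the analyzed plan, handling
// the three ways a table can appear there.
val queryRelations = scala.collection.mutable.HashSet[String]()
spark.sql(queryString).queryExecution.analyzed.foreach {
  case SubqueryAlias(alias, _: LogicalRelation) =>
    queryRelations.add(alias)                          // temp view over a data source relation
  case LogicalRelation(_, _, Some(catalogTable)) =>
    queryRelations.add(catalogTable.identifier.table)  // catalog-backed data source table
  case HiveTableRelation(tableMeta, _, _) =>
    queryRelations.add(tableMeta.identifier.table)     // Hive metastore table
  case _ =>
}
```

Covering all three cases keeps the benchmark correct even if the table setup later switches from temp views to catalog or Hive tables.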
Test build #82329 has finished for PR 19344 at commit
Test build #82340 has finished for PR 19344 at commit
LGTM pending Jenkins
Test build #82341 has finished for PR 19344 at commit
btw, could we also add
These modified test cases do not follow the standards. Impala added extra (partition) predicates, so the perf results are misleading.
Thanks! Merged to master.
ok, thanks!
What changes were proposed in this pull request?
The current code ignores WITH clauses when checking input relations in TPCDS queries, which leads to inaccurate per-row processing times in benchmark results. For example, in q2, this fix catches all the input relations: web_sales, date_dim, and catalog_sales (the current code catches date_dim only). About one-third of the TPCDS queries use WITH clauses, so I think this is worth fixing.

How was this patch tested?
Manually checked.