[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark #19344
Conversation
Test build #82167 has finished for PR 19344 at commit
@gatorsmile if you get time, please check this. Thanks.
ping
      case _ =>
    }
    // logical plan and adding up the sizes of all tables that appear in the plan.
    val planToCheck = mutable.Stack[LogicalPlan](spark.sql(queryString).queryExecution.logical)
Why not use the plan that has already been analyzed?
The analyzer rule CTESubstitution will replace With.
Oh, yeah. Since the original code did so, I just kept the same logic. But the suggestion sounds good to me, so I'll update soon. Thanks.
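The difference between the two plans can be seen on a toy CTE query. A minimal sketch, assuming a local SparkSession and a hypothetical temp view t (both introduced here only for illustration):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: compare the parsed ("logical") and analyzed plans of a WITH query.
val spark = SparkSession.builder().master("local[1]").appName("cte-demo").getOrCreate()
spark.range(10).createOrReplaceTempView("t")

val qe = spark.sql("WITH v AS (SELECT id FROM t) SELECT * FROM v").queryExecution
// qe.logical still contains a `With` node holding the CTE definition, so a
// traversal of it never reaches the relation `t` referenced only inside the
// WITH clause. qe.analyzed has already run CTESubstitution and inlined `v`,
// so traversing it does reach `t`.
println(qe.logical)
println(qe.analyzed)
spark.stop()
```

This is why traversing the analyzed plan catches relations that appear only inside WITH clauses.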
    val queryRelations = scala.collection.mutable.HashSet[String]()
-   spark.sql(queryString).queryExecution.logical.map {
+   spark.sql(queryString).queryExecution.analyzed.map {
      case UnresolvedRelation(t: TableIdentifier) =>
If the plan is successfully analyzed, UnresolvedRelation should not exist.
Test build #82280 has finished for PR 19344 at commit
Test build #82281 has finished for PR 19344 at commit
@gatorsmile ok, fixed. Also, I checked that this code can collect all the relations.
Test build #82301 has finished for PR 19344 at commit
Test build #82303 has finished for PR 19344 at commit
      case _ =>
    }
  }
  spark.sql(queryString).queryExecution.analyzed.map {
foreach
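The point of this suggestion: Catalyst's TreeNode.map applies the closure to every node and collects the results into a returned collection, which is wasted allocation when the closure is run only for its side effect on queryRelations. TreeNode.foreach traverses the same nodes and returns Unit. A hedged fragment (the Spark imports and surrounding benchmark variables are assumed from context):

```scala
// Sketch: relation names are collected via side effect, so `foreach` is the
// idiomatic traversal; `map` would build a throwaway collection of Unit/Boolean
// results for every node in the plan.
val queryRelations = scala.collection.mutable.HashSet[String]()
spark.sql(queryString).queryExecution.analyzed.foreach {
  case SubqueryAlias(name, _: LogicalRelation) => queryRelations.add(name)
  case _ => // other plan nodes carry no relation name
}
```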
  }
  spark.sql(queryString).queryExecution.analyzed.map {
    case SubqueryAlias(name, _: LogicalRelation) =>
      queryRelations.add(name)
Add another case for HiveTableRelation
IIUC ditto; HiveTableRelation never happens here.
  }
  }
  spark.sql(queryString).queryExecution.analyzed.map {
    case SubqueryAlias(name, _: LogicalRelation) =>
Why not use LogicalRelation's catalogTable? Just throw an exception if it is None. I think this benchmark will not hit None.
I checked again and found we can't use catalogTable here, because these TPCDS tables are local temporary ones (IIUC, these tables are always transformed into SubqueryAlias(LocalRelation)).
Test build #82309 has finished for PR 19344 at commit
Test build #82311 has finished for PR 19344 at commit
Test build #82314 has finished for PR 19344 at commit
retest this please
  }
  }
  spark.sql(queryString).queryExecution.analyzed.foreach {
    case SubqueryAlias(name, _: LogicalRelation) =>
Could we handle all three scenarios, even though the current code only uses temp views?
sure, will do
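A sketch of what the traversal might look like once all three scenarios are covered. This is a hedged approximation, not the merged code: the pattern shapes assume Spark 2.3-era constructors such as LogicalRelation(relation, output, catalogTable) and HiveTableRelation(tableMeta, dataCols, partitionCols).

```scala
// Sketch: collect every input relation name from the analyzed plan, handling
// the three ways a table can appear there.
val queryRelations = scala.collection.mutable.HashSet[String]()
spark.sql(queryString).queryExecution.analyzed.foreach {
  case SubqueryAlias(alias, _: LogicalRelation) =>
    queryRelations.add(alias)                          // temp view over a data source relation
  case LogicalRelation(_, _, Some(catalogTable)) =>
    queryRelations.add(catalogTable.identifier.table)  // catalog-backed data source table
  case HiveTableRelation(tableMeta, _, _) =>
    queryRelations.add(tableMeta.identifier.table)     // Hive metastore table
  case _ =>
}
```

Covering all three cases keeps the benchmark correct even if the table setup later switches from temp views to catalog or Hive tables.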
Test build #82329 has finished for PR 19344 at commit
Test build #82340 has finished for PR 19344 at commit
LGTM pending Jenkins
Test build #82341 has finished for PR 19344 at commit
btw, could we also add
These modified test cases do not follow the standards. Impala added extra (partition) predicates, so the perf results are misleading.
Thanks! Merged to master.
ok, thanks!
What changes were proposed in this pull request?
The current code ignores WITH clauses when checking input relations in TPCDS queries, which leads to inaccurate per-row processing times in benchmark results. For example, in q2, this fix catches all the input relations: web_sales, date_dim, and catalog_sales (the current code catches date_dim only). About one-third of the TPCDS queries use WITH clauses, so I think this is worth fixing.

How was this patch tested?
Manually checked.