[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark #19344

Closed
wants to merge 4 commits into from

maropu
Member

@maropu maropu commented Sep 26, 2017

What changes were proposed in this pull request?

The current code ignores WITH clauses when it collects the input relations of TPCDS queries, which leads to inaccurate per-row processing times in the benchmark results. For example, in q2, this fix catches all the input relations: web_sales, date_dim, and catalog_sales (the current code catches date_dim only). One-third of the TPCDS queries use WITH clauses, so I think it is worth fixing this.
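
To make the effect on the reported numbers concrete, here is a minimal sketch of the per-row arithmetic. It assumes the benchmark divides the elapsed time by the summed row counts of the relations it found. The row counts and the elapsed time are made-up placeholders, not TPCDS measurements, and `nsPerRow` and `inputRows` are hypothetical helpers.

```scala
object PerRowTimeDemo {
  // Hypothetical helper: nanoseconds spent per input row.
  def nsPerRow(elapsedNs: Long, inputRows: Long): Double =
    elapsedNs.toDouble / inputRows

  // Made-up row counts, for illustration only.
  val tableRows: Map[String, Long] = Map(
    "web_sales" -> 700L,
    "catalog_sales" -> 1400L,
    "date_dim" -> 100L
  )

  // Sum the row counts of the relations a traversal reported.
  // `toSeq` keeps equal counts from collapsing into one set element.
  def inputRows(relations: Set[String]): Long =
    relations.toSeq.map(tableRows).sum

  def main(args: Array[String]): Unit = {
    val elapsedNs = 2200000L // made-up elapsed time
    val undercounted = nsPerRow(elapsedNs, inputRows(Set("date_dim")))
    val full = nsPerRow(elapsedNs, inputRows(tableRows.keySet))
    println(f"date_dim only: $undercounted%.1f ns/row")
    println(f"all relations: $full%.1f ns/row")
  }
}
```

With the same elapsed time, counting only one of the three relations shrinks the denominator, so the per-row figure comes out larger than it should.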

How was this patch tested?

Manually checked.

@SparkQA

SparkQA commented Sep 26, 2017

Test build #82167 has finished for PR 19344 at commit d9be37e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Sep 26, 2017

@gatorsmile if you get time, please check this. thanks.

@maropu
Member Author

maropu commented Sep 28, 2017

ping

case _ =>
}
// logical plan and adding up the sizes of all tables that appear in the plan.
val planToCheck = mutable.Stack[LogicalPlan](spark.sql(queryString).queryExecution.logical)
Member

Why not use the plan that has been analyzed?

Member

The analyzer rule CTESubstitution will replace the With node.

Member Author

Oh, yeah. Since the original code did so, I just added the logic on top of it. But the suggestion sounds good to me, so I'll update soon. Thanks.
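
The exchange above can be illustrated without Spark. The following is a toy sketch, not Catalyst code: `Relation`, `CteRef`, `Combine`, `WithCte`, and `substitute` are hypothetical stand-ins, and the query shape is only loosely modeled on q2. It assumes that CTE definitions hang off the WITH node rather than sitting among its children, so a plain child traversal of the unanalyzed plan does not see them, and that the CTEs are not recursive.

```scala
object CteRelationDemo {
  // Toy stand-ins for logical plan nodes (hypothetical, not the real Catalyst classes).
  sealed trait Plan { def children: Seq[Plan] }
  case class Relation(table: String) extends Plan { val children: Seq[Plan] = Nil }
  case class CteRef(name: String) extends Plan { val children: Seq[Plan] = Nil }
  case class Combine(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }
  // CTE definitions are stored beside the main child, not among `children`.
  case class WithCte(child: Plan, ctes: Map[String, Plan]) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }

  // Collect table names by walking only `children`.
  def relations(plan: Plan): Set[String] = plan match {
    case Relation(t) => Set(t)
    case other       => other.children.flatMap(relations).toSet
  }

  // Toy CTE substitution: inline each definition where it is referenced.
  def substitute(plan: Plan, ctes: Map[String, Plan] = Map.empty): Plan = plan match {
    case WithCte(child, defs) => substitute(child, ctes ++ defs)
    case CteRef(name)         => substitute(ctes(name), ctes)
    case Combine(l, r)        => Combine(substitute(l, ctes), substitute(r, ctes))
    case r: Relation          => r
  }

  // Loosely q2-shaped: two CTEs, and a main query that also reads date_dim directly.
  val q2: Plan = WithCte(
    Combine(CteRef("wswscs"), Relation("date_dim")),
    Map(
      "wscs"   -> Combine(Relation("web_sales"), Relation("catalog_sales")),
      "wswscs" -> Combine(CteRef("wscs"), Relation("date_dim"))
    )
  )

  def main(args: Array[String]): Unit = {
    println(relations(q2))             // relations visible before substitution
    println(relations(substitute(q2))) // relations visible after substitution
  }
}
```

Before substitution the traversal reports only `date_dim`; after substitution it also reaches `web_sales` and `catalog_sales`, which matches the behavior described in the pull request summary.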

@maropu maropu force-pushed the RespectWithInTPCDSBench branch 3 times, most recently from 0df2663 to dd84919 Compare September 28, 2017 15:42
  val queryRelations = scala.collection.mutable.HashSet[String]()
- spark.sql(queryString).queryExecution.logical.map {
+ spark.sql(queryString).queryExecution.analyzed.map {
  case UnresolvedRelation(t: TableIdentifier) =>
Member

If the plan is successfully analyzed, UnresolvedRelation should no longer appear in it.

@SparkQA

SparkQA commented Sep 28, 2017

Test build #82280 has finished for PR 19344 at commit 0df2663.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 28, 2017

Test build #82281 has finished for PR 19344 at commit dd84919.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the RespectWithInTPCDSBench branch 2 times, most recently from 3f31048 to f7359af Compare September 29, 2017 00:03
@maropu
Member Author

maropu commented Sep 29, 2017

@gatorsmile ok, fixed. Also, I checked that this code collects all the relations.

@maropu maropu force-pushed the RespectWithInTPCDSBench branch from f7359af to 489f2a2 Compare September 29, 2017 00:27
@maropu maropu changed the title from "[SPARK-22122][SQL] Respect WITH clauses to count input rows in TPCDSQueryBenchmark" to "[SPARK-22122][SQL] Use analyzed logical plans to count input rows in TPCDSQueryBenchmark" Sep 29, 2017
@SparkQA

SparkQA commented Sep 29, 2017

Test build #82301 has finished for PR 19344 at commit f7359af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 29, 2017

Test build #82303 has finished for PR 19344 at commit 489f2a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case _ =>
}
}
spark.sql(queryString).queryExecution.analyzed.map {
Member

foreach
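
The one-word suggestion reads as: replace `map` with `foreach`, since the traversal is used only for its side effect on `queryRelations`. Below is a minimal sketch of the difference, using a toy `Node` class (hypothetical) whose `foreach` and `map` only loosely mirror the shape of the plan's traversal methods.

```scala
import scala.collection.mutable

object ForeachVsMapDemo {
  case class Node(name: String, children: Seq[Node] = Nil) {
    // Pre-order traversal that applies `f` purely for its side effects.
    def foreach(f: Node => Unit): Unit = {
      f(this)
      children.foreach(_.foreach(f))
    }
    // Same traversal, but it also allocates one result per visited node.
    def map[A](f: Node => A): Seq[A] = {
      val out = mutable.ArrayBuffer.empty[A]
      foreach(n => out += f(n))
      out.toSeq
    }
  }

  val plan: Node =
    Node("Project", Seq(Node("Join", Seq(Node("web_sales"), Node("date_dim")))))

  // Leaves stand in for table relations in this toy tree.
  def leafNames(plan: Node): Set[String] = {
    val seen = mutable.HashSet[String]()
    plan.foreach {
      case Node(name, Nil) => seen.add(name)
      case _               =>
    }
    seen.toSet
  }

  def main(args: Array[String]): Unit = {
    println(leafNames(plan))
    // `map` would run the same side effects and then return a Seq nobody reads.
    val unused: Seq[Unit] = plan.map(_ => ())
    println(unused.size)
  }
}
```

Both calls visit every node, so the choice does not change which relations are collected; `foreach` simply states the intent and skips building a throwaway collection.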

}
spark.sql(queryString).queryExecution.analyzed.map {
case SubqueryAlias(name, _: LogicalRelation) =>
queryRelations.add(name)
Member

Add another case for HiveTableRelation

Member Author

IIUC ditto; HiveTableRelation never happens here.

}
}
spark.sql(queryString).queryExecution.analyzed.map {
case SubqueryAlias(name, _: LogicalRelation) =>
Member

Why not use LogicalRelation's catalogTable? Just throw an exception if it is None. I think this benchmark will not hit None.

Member Author

I checked again and found that we can't use catalogTable here because these TPCDS tables are local temporary views (IIUC these tables are always transformed into SubqueryAlias(LogicalRelation)).
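
A minimal sketch of the trade-off discussed here, using hypothetical stand-in types rather than Spark's classes. It assumes only what the thread states: `catalogTable` is optional, and it is empty for the temporary views this benchmark registers.

```scala
import scala.util.Try

object CatalogTableDemo {
  // Hypothetical stand-ins: a relation with an optional catalog entry, wrapped in an alias.
  final case class Rel(catalogTable: Option[String])
  final case class Aliased(alias: String, child: Rel)

  // The reviewer's suggestion: use the catalog name and fail loudly when it is missing.
  def nameFromCatalog(rel: Rel): String =
    rel.catalogTable.getOrElse(
      throw new IllegalStateException("relation has no catalog table"))

  // The approach in the diff: take the alias under which the view was registered.
  def nameFromAlias(aliased: Aliased): String = aliased.alias

  def main(args: Array[String]): Unit = {
    val tempView = Aliased("web_sales", Rel(catalogTable = None))
    println(Try(nameFromCatalog(tempView.child)).isFailure) // the catalog lookup fails
    println(nameFromAlias(tempView))                        // the alias still works
  }
}
```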

@maropu maropu force-pushed the RespectWithInTPCDSBench branch from 6dfb004 to 00cfb21 Compare September 29, 2017 07:11
@SparkQA

SparkQA commented Sep 29, 2017

Test build #82309 has finished for PR 19344 at commit 6dfb004.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 29, 2017

Test build #82311 has finished for PR 19344 at commit 00cfb21.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the RespectWithInTPCDSBench branch from 00cfb21 to 5691cf6 Compare September 29, 2017 10:18
@SparkQA

SparkQA commented Sep 29, 2017

Test build #82314 has finished for PR 19344 at commit 5691cf6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

retest this please

}
}
spark.sql(queryString).queryExecution.analyzed.foreach {
case SubqueryAlias(name, _: LogicalRelation) =>
Member

Could we handle all three scenarios, even though the current code only uses temp views?

Member Author

sure, will do
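
The thread does not spell out the three scenarios. One plausible reading, pieced together from the earlier comments, is: a temp view (an aliased relation), a relation that carries a catalog table, and a Hive table relation. The sketch below matches all three on toy types. Every class name here is a hypothetical stand-in, and the real Catalyst constructors take more arguments.

```scala
import scala.collection.mutable

object RelationKindsDemo {
  // Simplified stand-ins for plan nodes (hypothetical).
  sealed trait Plan { def children: Seq[Plan] }
  case class DataSourceRel(catalogTable: Option[String]) extends Plan {
    val children: Seq[Plan] = Nil
  }
  case class HiveRel(table: String) extends Plan { val children: Seq[Plan] = Nil }
  case class Alias(alias: String, child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Pre-order traversal used only for side effects.
  def foreach(plan: Plan)(f: Plan => Unit): Unit = {
    f(plan)
    plan.children.foreach(c => foreach(c)(f))
  }

  def relationNames(plan: Plan): Set[String] = {
    val names = mutable.HashSet[String]()
    foreach(plan) {
      case Alias(alias, _: DataSourceRel) => names.add(alias) // temp view
      case DataSourceRel(Some(table))     => names.add(table) // catalog-backed relation
      case HiveRel(table)                 => names.add(table) // Hive table
      case _                              =>                  // any other node
    }
    names.toSet
  }

  def main(args: Array[String]): Unit = {
    val plan = Join(
      Alias("web_sales", DataSourceRel(None)),
      Join(DataSourceRel(Some("date_dim")), HiveRel("catalog_sales"))
    )
    println(relationNames(plan))
  }
}
```

The temp-view case relies on the alias because, in this toy model as in the discussion above, such a relation has no catalog entry to read a name from.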

@SparkQA

SparkQA commented Sep 29, 2017

Test build #82329 has finished for PR 19344 at commit 5691cf6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 30, 2017

Test build #82340 has finished for PR 19344 at commit cbac959.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the RespectWithInTPCDSBench branch from cbac959 to 8d8a9ff Compare September 30, 2017 01:23
@gatorsmile
Member

LGTM pending Jenkins

@SparkQA

SparkQA commented Sep 30, 2017

Test build #82341 has finished for PR 19344 at commit 8d8a9ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Sep 30, 2017

btw, could we also add tpcds-modifiedQueries here?

@gatorsmile
Member

These modified test cases are not following the standards. Impala added extra (partition) predicates. The perf results are misleading.

@gatorsmile
Member

Thanks! Merged to master.

@asfgit asfgit closed this in c6610a9 Sep 30, 2017
@maropu
Member Author

maropu commented Sep 30, 2017

ok, thanks!
