[SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics #2860

Closed
wants to merge 5 commits into from

liancheng

This PR refines in-memory columnar table statistics:

  1. adds 2 more statistics for in-memory table columns: count and sizeInBytes

  2. adds filter pushdown support for IS NULL and IS NOT NULL.

  3. caches and propagates statistics in InMemoryRelation once the underlying cached RDD is materialized.

    Statistics are collected on the driver side with an accumulator.
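
The `IS NULL` / `IS NOT NULL` pushdown in item 2 can be pictured as pruning cached batches by their per-column statistics. The following is a minimal sketch in plain Scala with no Spark dependency; `NullFilterPruning`, `BatchStats`, `mightMatch`, and `prune` are hypothetical names chosen for illustration, not the PR's actual classes.

```scala
object NullFilterPruning {
  // Hypothetical per-batch statistics for a single column. The field names
  // follow the statistics named in the description; the type is an assumption.
  final case class BatchStats(count: Int, nullCount: Int, sizeInBytes: Long)

  sealed trait NullFilter
  case object IsNull extends NullFilter
  case object IsNotNull extends NullFilter

  // True when the batch may contain at least one matching row,
  // so it cannot be skipped.
  def mightMatch(filter: NullFilter, stats: BatchStats): Boolean = filter match {
    case IsNull    => stats.nullCount > 0
    case IsNotNull => stats.count - stats.nullCount > 0
  }

  // Keeps only the batches that still need to be scanned.
  def prune(filter: NullFilter, batches: Seq[BatchStats]): Seq[BatchStats] =
    batches.filter(mightMatch(filter, _))
}
```

With this shape, a batch whose `nullCount` is zero is skipped for `IS NULL`, and a batch whose `nullCount` equals `count` is skipped for `IS NOT NULL`.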

This PR also fixes SPARK-3914 by properly propagating in-memory statistics.
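
As a rough illustration of what propagating statistics could look like, the sketch below sums per-batch sizes once the cache is materialized and falls back to a default estimate before that. It is plain Scala with no Spark dependency. The `Statistics` shape follows the `case class Statistics(sizeInBytes: BigInt)` reported by SparkQA in this thread; `RelationStatsSketch`, `relationStatistics`, `batchSizes`, and `defaultSizeInBytes` are hypothetical names, and the fallback behavior is an assumption.

```scala
object RelationStatsSketch {
  // Same shape as the class reported by SparkQA in this thread.
  final case class Statistics(sizeInBytes: BigInt)

  // batchSizes is None until the cached data has been materialized
  // (assumption: per-batch sizes are only known after materialization).
  def relationStatistics(
      batchSizes: Option[Seq[Long]],
      defaultSizeInBytes: BigInt): Statistics =
    batchSizes match {
      // Materialized: the relation size is the sum of the batch sizes.
      case Some(sizes) => Statistics(sizes.map(BigInt(_)).sum)
      // Not materialized yet: fall back to the configured default estimate.
      case None => Statistics(defaultSizeInBytes)
    }
}
```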

val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = false)()
val nullCount = AttributeReference(a.name + ".nullCount", IntegerType, nullable = false)()
val upperBound = AttributeReference(a.name + ".upperBound", a.dataType, nullable = true)()
val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = true)()

Upper/lower bound can be null for types like string.
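
One way to read this comment: a bound is undefined when a batch holds no non-null values, which can happen for reference types such as strings. A minimal sketch in plain Scala with no Spark dependency; `StringBoundsSketch` and `bounds` are hypothetical names, and `Option` stands in for a nullable bound.

```scala
object StringBoundsSketch {
  // Returns (lowerBound, upperBound) for a string column batch.
  // Both bounds are None when the batch contains no non-null value.
  def bounds(values: Seq[Option[String]]): (Option[String], Option[String]) = {
    val present = values.flatten // drop the nulls (modelled as None)
    if (present.isEmpty) (None, None)
    else (Some(present.min), Some(present.max))
  }
}
```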

@SparkQA

SparkQA commented Oct 20, 2014

QA tests have started for PR 2860 at commit 7dc6a34.

  • This patch merges cleanly.

@liancheng liancheng changed the title [SPARK-3537][SQL] Refines in-memory columnar table statistics [SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics Oct 20, 2014
@@ -76,4 +76,24 @@ class PlannerSuite extends FunSuite {

setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, origThreshold.toString)
}

test("InMemoryRelation statistics propagation") {

Test case for SPARK-3914.

@SparkQA

SparkQA commented Oct 20, 2014

QA tests have finished for PR 2860 at commit 7dc6a34.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(sizeInBytes: BigInt)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21923/
Test FAILed.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2860 at commit a8c818d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2860 at commit a8c818d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(sizeInBytes: BigInt)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21979/
Test FAILed.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2860 at commit c5ff904.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2860 at commit c5ff904.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(sizeInBytes: BigInt)

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21981/
Test FAILed.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2860 at commit c5ff904.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2860 at commit c5ff904.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(sizeInBytes: BigInt)

@liancheng

The compilation error seems to be due to a JDK upgrade on Jenkins. Will try again later.

@liancheng

retest this please

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have started for PR 2860 at commit c5ff904.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2860 at commit c5ff904.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Statistics(sizeInBytes: BigInt)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21991/
Test PASSed.

@SparkQA

SparkQA commented Oct 22, 2014

QA tests have started for PR 2860 at commit 0cc5271.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 22, 2014

QA tests have finished for PR 2860 at commit 0cc5271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22027/
Test PASSed.

@liancheng

@marmbrus This is ready to go.

@asfgit asfgit closed this in 2838bf8 Oct 26, 2014
@liancheng liancheng deleted the propagates-in-mem-stats branch October 27, 2014 01:29
asfgit pushed a commit that referenced this pull request Oct 27, 2014
PR #2860 refines in-memory table statistics and enables broader broadcasted hash join optimization for in-memory tables. This makes `JoinSuite` fail when another test suite caches the test table `testData` and runs before `JoinSuite`, because the expected `ShuffledHashJoin`s are then optimized to `BroadcastedHashJoin` according to the collected in-memory table statistics.

This PR fixes this issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection.

Author: Cheng Lian <lian@databricks.com>

Closes #2960 from liancheng/fix-join-suite and squashes the following commits:

715b2de [Cheng Lian] Fixes caching related JoinSuite failure
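
The interaction described in the commit message above can be sketched as a size-versus-threshold decision: once a cached table reports a small, accurate size, the planner may prefer a broadcast join over a shuffled one. The following plain-Scala sketch has no Spark dependency; `JoinSelectionSketch`, `Broadcast`, `Shuffled`, and `choose` are hypothetical names, and the exact comparison used by the planner is an assumption.

```scala
object JoinSelectionSketch {
  sealed trait JoinStrategy
  case object Broadcast extends JoinStrategy // stands in for a broadcast hash join
  case object Shuffled extends JoinStrategy  // stands in for a shuffled hash join

  // Picks a broadcast join when either side's estimated size is within the
  // threshold (assumption: an inclusive comparison); otherwise shuffles.
  def choose(leftSize: BigInt, rightSize: BigInt, threshold: BigInt): JoinStrategy =
    if (leftSize <= threshold || rightSize <= threshold) Broadcast else Shuffled
}
```

Under this model, a test that expects the shuffled strategy passes while table sizes are large default estimates, and starts failing once caching replaces an estimate with a small measured size, which matches the failure mode the commit message describes.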