[SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics #2860

liancheng · 2014-10-20T17:27:40Z

This PR refines in-memory columnar table statistics:

- Adds two more statistics for in-memory table columns: `count` and `sizeInBytes`.
- Adds filter pushdown support for `IS NULL` and `IS NOT NULL`.
- Caches and propagates statistics in `InMemoryRelation` once the underlying cached RDD is materialized.

Statistics are collected on the driver side with an accumulator.

This PR also fixes SPARK-3914 by properly propagating in-memory statistics.
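
To illustrate how the `IS NULL` / `IS NOT NULL` pushdown can use per-batch statistics, here is a minimal standalone sketch. It does not use Spark's classes: `BatchColumnStats`, `NullFilter`, and `BatchPruning` are hypothetical names chosen for this example, and the real predicate-building code in the patch may be structured differently. The idea is only that a batch can be skipped when its `count` and `nullCount` prove that no row can match.

```scala
// Hypothetical per-batch statistics for a single column (illustrative names only).
final case class BatchColumnStats(count: Int, nullCount: Int, sizeInBytes: Long)

// The two null-related filters that this sketch knows how to push down.
sealed trait NullFilter
case object IsNull extends NullFilter
case object IsNotNull extends NullFilter

object BatchPruning {
  // Returns true when the batch might contain matching rows and must be scanned.
  def mightMatch(filter: NullFilter, stats: BatchColumnStats): Boolean = filter match {
    // IS NULL can only match if at least one null was recorded for the batch.
    case IsNull => stats.nullCount > 0
    // IS NOT NULL can only match if at least one non-null value exists.
    case IsNotNull => stats.count - stats.nullCount > 0
  }

  // Keeps only the batches that cannot be ruled out by their statistics.
  def batchesToScan(filter: NullFilter, batches: Seq[BatchColumnStats]): Seq[BatchColumnStats] =
    batches.filter(mightMatch(filter, _))
}
```

For example, given one batch with no nulls, one batch that is entirely null, and one mixed batch, `batchesToScan(IsNull, ...)` keeps the last two and `batchesToScan(IsNotNull, ...)` keeps the first and the last.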

liancheng · 2014-10-20T17:29:12Z

sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala

-  val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = false)()
-  val nullCount =  AttributeReference(a.name + ".nullCount", IntegerType, nullable = false)()
+  val upperBound = AttributeReference(a.name + ".upperBound", a.dataType, nullable = true)()
+  val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = true)()


Upper/lower bound can be null for types like string.
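
A small standalone sketch of why the bounds need to be nullable: if every value gathered so far is null, there is nothing to compare, so no bound exists yet. `StringStatsSketch` is a hypothetical class written for this comment, not the actual collector in `ColumnStats.scala`; it models the missing bound with `Option` where the real code stores a null in the statistics row, and its UTF-8 byte count is only an assumed stand-in for the real size accounting.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical string column statistics collector (not the Spark implementation).
final class StringStatsSketch {
  private var upper: Option[String] = None // stays None until a non-null value is seen
  private var lower: Option[String] = None
  private var nulls = 0
  private var rows = 0
  private var bytes = 0L

  def gather(value: String): Unit = {
    rows += 1
    if (value == null) {
      // A null only affects the counters; bounds and size are left untouched,
      // so no size computation is ever attempted on a null value.
      nulls += 1
    } else {
      if (upper.forall(u => value.compareTo(u) > 0)) upper = Some(value)
      if (lower.forall(l => value.compareTo(l) < 0)) lower = Some(value)
      // Assumption: size is approximated by the UTF-8 encoded length.
      bytes += value.getBytes(StandardCharsets.UTF_8).length
    }
  }

  def upperBound: Option[String] = upper
  def lowerBound: Option[String] = lower
  def nullCount: Int = nulls
  def count: Int = rows
  def sizeInBytes: Long = bytes
}
```

After gathering only null values, both `upperBound` and `lowerBound` are still `None`, which corresponds to the null bounds that `nullable = true` permits.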

SparkQA · 2014-10-20T17:34:52Z

QA tests have started for PR 2860 at commit 7dc6a34.

This patch merges cleanly.

liancheng · 2014-10-20T17:37:00Z

sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala

@@ -76,4 +76,24 @@ class PlannerSuite extends FunSuite {

    setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, origThreshold.toString)
  }
+
+  test("InMemoryRelation statistics propagation") {


Test case for SPARK-3914.

SparkQA · 2014-10-20T17:42:54Z

QA tests have finished for PR 2860 at commit 7dc6a34.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(sizeInBytes: BigInt)

AmplabJenkins · 2014-10-20T17:42:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21923/
Test FAILed.

SparkQA · 2014-10-21T08:39:53Z

QA tests have started for PR 2860 at commit a8c818d.

This patch merges cleanly.

SparkQA · 2014-10-21T08:48:38Z

QA tests have finished for PR 2860 at commit a8c818d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(sizeInBytes: BigInt)

AmplabJenkins · 2014-10-21T08:48:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21979/
Test FAILed.

SparkQA · 2014-10-21T09:14:43Z

QA tests have started for PR 2860 at commit c5ff904.

This patch merges cleanly.

SparkQA · 2014-10-21T09:51:25Z

QA tests have finished for PR 2860 at commit c5ff904.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(sizeInBytes: BigInt)

AmplabJenkins · 2014-10-21T09:51:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21981/
Test FAILed.

SparkQA · 2014-10-21T11:56:40Z

QA tests have started for PR 2860 at commit c5ff904.

This patch merges cleanly.

SparkQA · 2014-10-21T12:00:21Z

QA tests have finished for PR 2860 at commit c5ff904.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(sizeInBytes: BigInt)

liancheng · 2014-10-21T12:34:18Z

The compilation error seems to be caused by the JDK upgrade on Jenkins. Will retry later.

liancheng · 2014-10-21T13:34:03Z

retest this please

SparkQA · 2014-10-21T13:39:54Z

QA tests have started for PR 2860 at commit c5ff904.

This patch merges cleanly.

SparkQA · 2014-10-21T14:38:14Z

QA tests have finished for PR 2860 at commit c5ff904.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Statistics(sizeInBytes: BigInt)

AmplabJenkins · 2014-10-21T14:38:16Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21991/
Test PASSed.

SparkQA · 2014-10-22T07:04:42Z

QA tests have started for PR 2860 at commit 0cc5271.

This patch merges cleanly.

SparkQA · 2014-10-22T07:57:56Z

QA tests have finished for PR 2860 at commit 0cc5271.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-22T07:57:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22027/
Test PASSed.

liancheng · 2014-10-22T08:24:21Z

@marmbrus This is ready to go.

PR #2860 refines in-memory table statistics and enables broader broadcasted hash join optimization for in-memory tables. This makes `JoinSuite` fail when some test suite caches the test table `testData` and is executed before `JoinSuite`, because the expected `ShuffledHashJoin`s are optimized to `BroadcastedHashJoin` according to the collected in-memory table statistics. This PR fixes the issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection.

Author: Cheng Lian <lian@databricks.com>

Closes #2960 from liancheng/fix-join-suite and squashes the following commits:

715b2de [Cheng Lian] Fixes caching related JoinSuite failure
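
The interaction described in the #2960 commit message above can be sketched as a size check against the broadcast threshold. This is a simplified standalone model, not the planner's actual strategy code: `PlanStats`, `JoinStrategy`, `Broadcast`, `Shuffled`, and `JoinSelectionSketch` are hypothetical names, and the real planner considers more than size alone. The sketch only shows why a cached table that now reports a small `sizeInBytes` can flip the chosen join operator.

```scala
// Hypothetical stand-in for the statistics a plan node reports.
final case class PlanStats(sizeInBytes: BigInt)

// Hypothetical labels for the two join operators discussed in the message above.
sealed trait JoinStrategy
case object Broadcast extends JoinStrategy
case object Shuffled extends JoinStrategy

object JoinSelectionSketch {
  // Picks a broadcast join when either side is estimated to fit under the threshold.
  def choose(left: PlanStats, right: PlanStats, autoBroadcastThreshold: Long): JoinStrategy = {
    val threshold = BigInt(autoBroadcastThreshold)
    if (left.sizeInBytes <= threshold || right.sizeInBytes <= threshold) Broadcast
    else Shuffled
  }
}
```

With a 10 MB threshold, two sides that both report sizes well above it yield `Shuffled`. If one side is a cached table whose collected statistics report a few kilobytes, the same call yields `Broadcast`, which is the kind of change that made a test expecting a shuffled join fail once a table had been cached by an earlier suite.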

Adds more in-memory table statistics and propagates them properly

7dc6a34

liancheng reviewed Oct 20, 2014

liancheng changed the title ~~[SPARK-3537][SQL] Refines in-memory columnar table statistics~~ [SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics Oct 20, 2014

liancheng reviewed Oct 20, 2014

liancheng added 2 commits October 21, 2014 12:36

Bug fix: shouldn't call STRING.actualSize on null string value

1d01074

Refines tests

a8c818d

Fixes test table name conflict

c5ff904

Restricts visibility of o.a.s.s.c.p.l.Statistics

0cc5271

asfgit closed this in 2838bf8 Oct 26, 2014

liancheng deleted the propagates-in-mem-stats branch October 27, 2014 01:29

liancheng mentioned this pull request Oct 27, 2014

Fixes caching related JoinSuite failure #2960

Closed
