[SPARK-3537][SPARK-3914][SQL] Refines in-memory columnar table statistics #2860
val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = false)()
val nullCount = AttributeReference(a.name + ".nullCount", IntegerType, nullable = false)()
val upperBound = AttributeReference(a.name + ".upperBound", a.dataType, nullable = true)()
val lowerBound = AttributeReference(a.name + ".lowerBound", a.dataType, nullable = true)()
Upper/lower bound can be null for types like string.
QA tests have started for PR 2860 at commit
@@ -76,4 +76,24 @@ class PlannerSuite extends FunSuite {

    setConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD, origThreshold.toString)
  }

  test("InMemoryRelation statistics propagation") {
Test case for SPARK-3914.
QA tests have finished for PR 2860 at commit
Test FAILed.
QA tests have started for PR 2860 at commit
QA tests have finished for PR 2860 at commit
Test FAILed.
QA tests have started for PR 2860 at commit
QA tests have finished for PR 2860 at commit
Test FAILed.
QA tests have started for PR 2860 at commit
QA tests have finished for PR 2860 at commit
The compilation error seems to be due to JDK upgrade on Jenkins. Will try later.
retest this please
QA tests have started for PR 2860 at commit
QA tests have finished for PR 2860 at commit
Test PASSed.
QA tests have started for PR 2860 at commit
QA tests have finished for PR 2860 at commit
Test PASSed.
@marmbrus This is ready to go.
PR #2860 refines in-memory table statistics and enables broader broadcasted hash join optimization for in-memory tables. This makes `JoinSuite` fail when some other test suite caches the test table `testData` and runs before `JoinSuite`, because the expected `ShuffledHashJoin`s are then optimized to `BroadcastedHashJoin` according to the collected in-memory table statistics. This PR fixes the issue by clearing the cache before testing join operator selection. A separate test case is also added to test broadcasted hash join operator selection.

Author: Cheng Lian <lian@databricks.com>

Closes #2960 from liancheng/fix-join-suite and squashes the following commits:

715b2de [Cheng Lian] Fixes caching related JoinSuite failure
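The interaction described in the commit message above can be illustrated with a small, self-contained sketch. It assumes a planner that picks a broadcast join whenever the smaller side's estimated size fits under a threshold. All names here (`RelationStats`, `JoinSelection`, `Broadcast`, `Shuffled`) and the threshold values are hypothetical and are not Spark's actual planner API.

```scala
// Hypothetical planner sketch; these types are illustrative, not Spark classes.
sealed trait JoinStrategy
case object Broadcast extends JoinStrategy
case object Shuffled extends JoinStrategy

// Size estimate for one side of a join.
final case class RelationStats(sizeInBytes: Long)

object JoinSelection {
  // Picks a broadcast join when the smaller side fits under the threshold.
  // A non-positive threshold disables broadcasting in this sketch (an assumption).
  def choose(left: RelationStats, right: RelationStats, threshold: Long): JoinStrategy = {
    val smaller = math.min(left.sizeInBytes, right.sizeInBytes)
    if (threshold > 0 && smaller <= threshold) Broadcast else Shuffled
  }
}
```

Under this model, a table with no collected statistics would carry a large default size estimate and stay on the shuffled path. Once it is cached and accurate statistics are available, the same query can flip to a broadcast join, which is why a test that expects a shuffled join needs an uncached table.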
This PR refines in-memory columnar table statistics:

- adds 2 more statistics for in-memory table columns: `count` and `sizeInBytes`
- adds filter pushdown support for `IS NULL` and `IS NOT NULL`.
- caches and propagates statistics in `InMemoryRelation` once the underlying cached RDD is materialized.

Statistics are collected to driver side with an accumulator.
This PR also fixes SPARK-3914 by properly propagating in-memory statistics.
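
As a rough illustration of how per-batch column statistics can support filter pushdown for `IS NULL` and `IS NOT NULL`, here is a minimal sketch in plain Scala. It is not the code from this PR: `ColumnStats`, `BatchPruning`, the predicate types, and the 4-bytes-per-slot size estimate are all assumptions made for the example. The bounds are optional because a batch may contain only nulls.

```scala
// Hypothetical per-batch statistics for one Int column; not Spark's actual classes.
final case class ColumnStats(
    lowerBound: Option[Int], // None when the batch holds only nulls
    upperBound: Option[Int],
    nullCount: Int,
    count: Int,
    sizeInBytes: Long)

object ColumnStats {
  // Collects statistics for one batch. Assumes 4 bytes per value slot (illustrative only).
  def collect(batch: Seq[Option[Int]]): ColumnStats = {
    val nonNull = batch.flatten
    ColumnStats(
      lowerBound = if (nonNull.isEmpty) None else Some(nonNull.min),
      upperBound = if (nonNull.isEmpty) None else Some(nonNull.max),
      nullCount = batch.count(_.isEmpty),
      count = batch.size,
      sizeInBytes = 4L * batch.size)
  }
}

// Hypothetical predicates that can be checked against statistics alone.
sealed trait Predicate
case object IsNull extends Predicate
case object IsNotNull extends Predicate
final case class EqualTo(value: Int) extends Predicate

object BatchPruning {
  // Returns true when the batch might contain a matching row and must be scanned.
  def mightMatch(stats: ColumnStats, p: Predicate): Boolean = p match {
    case IsNull    => stats.nullCount > 0
    case IsNotNull => stats.count - stats.nullCount > 0
    case EqualTo(v) =>
      (stats.lowerBound, stats.upperBound) match {
        case (Some(lo), Some(hi)) => lo <= v && v <= hi
        case _                    => false // all-null batch: equality never matches
      }
  }
}
```

A scan would call `BatchPruning.mightMatch` once per cached batch and skip any batch for which it returns `false`. The check is conservative: a `true` result only means the batch cannot be ruled out.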