
[SQL] More aggressive defaults #3064

Closed

marmbrus wants to merge 6 commits from marmbrus/fasterDefaults

Conversation

@marmbrus (Contributor) commented Nov 3, 2014

  • Turns on compression for in-memory cached data by default.
  • Changes the default Parquet compression codec back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory).
  • Ups the in-memory columnar batch size to 10,000 rows.
  • Increases the broadcast join threshold to 10MB.
  • Uses Spark SQL's native Parquet implementation instead of the Hive one by default.
  • Caches Parquet metadata by default (configuration keys for all of the above are sketched below).
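
For reference, these defaults map onto Spark SQL configuration keys. The following is a minimal sketch assuming the Spark 1.2-era key names in SQLConf; the PR changes the built-in defaults themselves, so users do not need to set anything:

```scala
import org.apache.spark.sql.SQLContext

// Sketch only: spells out the new defaults as explicit settings.
// Key names assume the Spark 1.2-era SQLConf.
def applyNewDefaults(sqlContext: SQLContext): Unit = {
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")  // compress cached data
  sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")           // gzip instead of snappy
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  // 10,000-row batches
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (10 * 1024 * 1024).toString)                                              // 10MB threshold
  sqlContext.setConf("spark.sql.parquet.cacheMetadata", "true")               // cache Parquet metadata
  sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")        // native Parquet reader
}
```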

@marmbrus (Contributor, Author) commented Nov 3, 2014

/cc @liancheng @mateiz

@@ -109,7 +109,7 @@ private[sql] trait SQLConf {
* Hive setting: hive.auto.convert.join.noconditionaltask.size, whose default value is also 10000.
Contributor (inline review comment):

nit: comment seems slightly out of date now
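
For context, the comment flagged here annotates the auto-broadcast-join threshold in SQLConf: with the default raised to 10MB, the claim that it is "also 10000" like Hive's hive.auto.convert.join.noconditionaltask.size no longer holds. A hedged sketch of the kind of update the nit asks for (exact wording and surrounding code are assumptions, not taken from this diff):

```scala
// Sketch of SQLConf after this PR; the doc comment should stop claiming the
// two defaults match. Names assume the 1.2-era SQLConf.
/**
 * Upper bound (in bytes) on the size of tables that qualify for broadcast joins.
 * Hive setting: hive.auto.convert.join.noconditionaltask.size. Note that the
 * defaults now differ: this is 10485760 (10MB), while Hive's is 10000.
 */
def autoBroadcastJoinThreshold: Int =
  getConf(AUTO_BROADCASTJOIN_THRESHOLD, (10 * 1024 * 1024).toString).toInt
```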

@SparkQA commented Nov 3, 2014

Test build #22781 has started for PR 3064 at commit 97ee9f8.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22779/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22780/

@SparkQA commented Nov 3, 2014

Test build #506 has started for PR 3064 at commit 97ee9f8.

  • This patch merges cleanly.

@SparkQA commented Nov 3, 2014

Test build #506 has finished for PR 3064 at commit 97ee9f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2014

Test build #22781 has finished for PR 3064 at commit 97ee9f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class GenericStrategy[PhysicalPlan <: TreeNode[PhysicalPlan]] extends Logging
    • trait RunnableCommand extends logical.Command
    • case class ExecutedCommand(cmd: RunnableCommand) extends SparkPlan
    • protected case class Keyword(str: String)
    • sys.error(s"Failed to load class for data source: $provider")
    • case class EqualTo(attribute: String, value: Any) extends Filter
    • case class GreaterThan(attribute: String, value: Any) extends Filter
    • case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
    • case class LessThan(attribute: String, value: Any) extends Filter
    • case class LessThanOrEqual(attribute: String, value: Any) extends Filter
    • trait RelationProvider
    • abstract class BaseRelation
    • abstract class TableScan extends BaseRelation
    • abstract class PrunedScan extends BaseRelation
    • abstract class PrunedFilteredScan extends BaseRelation
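
For context, the classes listed above are the (then-experimental) external data sources API, not something introduced by this PR's config changes. A minimal sketch of how a custom relation plugs into that API, assuming the Spark 1.2-era signatures shown in the list (RangeScan and DefaultSource are hypothetical names):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._ // Row, SQLContext, StructType, etc. (1.2-era type aliases)
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Hypothetical relation that produces the integers [0, end) as one-column rows.
case class RangeScan(end: Int)(@transient val sqlContext: SQLContext) extends TableScan {
  override def schema: StructType =
    StructType(StructField("i", IntegerType, nullable = false) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until end).map(Row(_))
}

// Provider class loaded by name (cf. the sys.error message above); OPTIONS from
// the user's CREATE TEMPORARY TABLE ... USING ... statement arrive as `parameters`.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    RangeScan(parameters("end").toInt)(sqlContext)
}
```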

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22781/

@liancheng (Contributor)

This LGTM.

(Why does MiMa complain about interfaces introduced by the foreign data source API here?)

@marmbrus (Contributor, Author) commented Nov 3, 2014

Thanks for looking at this!

I don't think that's MiMa, as that would actually fail the build (and it is turned off for Spark SQL). The PR notification about new classes is just based on some rough git magic / string matching, as far as I know.
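
To illustrate the "rough string matching": the PR tester appears to scan added diff lines for declaration keywords, so any added line containing the word "class" can be reported as a new public class, which is how the sys.error(...) call ended up in the list above. A hypothetical sketch of such a heuristic (not the actual Jenkins script):

```scala
// Hypothetical sketch of the PR tester's heuristic, not the real test harness:
// keep added diff lines that textually look like new type declarations.
val declPattern = """(case class|class|trait|object)\s+\w+""".r

def detectNewClasses(diff: Seq[String]): Seq[String] =
  diff.filter(_.startsWith("+"))    // added lines only
      .map(_.drop(1).trim)
      .filter(line => declPattern.findFirstIn(line).isDefined)

// Purely textual matching also fires on strings that merely mention "class":
detectNewClasses(Seq("""+ sys.error(s"Failed to load class for data source: $provider")"""))
// -> reports the sys.error(...) line, matching on the words "class for"
```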

asfgit pushed a commit that referenced this pull request Nov 3, 2014
 - Turns on compression for in-memory cached data by default
 - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
 - Ups the batch size to 10,000 rows
 - Increases the broadcast threshold to 10mb.
 - Uses our parquet implementation instead of the hive one by default.
 - Cache parquet metadata by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults

(cherry picked from commit 25bef7e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 25bef7e Nov 3, 2014
@marmbrus marmbrus deleted the fasterDefaults branch November 19, 2014 02:45