
[SQL] More aggressive defaults #3064

Closed

marmbrus wants to merge 6 commits from marmbrus/fasterDefaults

Conversation

@marmbrus (Contributor) commented Nov 3, 2014

  • Turns on compression for in-memory cached data by default.
  • Changes the default Parquet compression codec back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory).
  • Ups the in-memory columnar batch size to 10,000 rows.
  • Increases the broadcast join threshold to 10MB.
  • Uses Spark SQL's native Parquet implementation instead of the Hive one by default.
  • Caches Parquet metadata by default (configuration keys for all of the above are sketched below).
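
For reference, these defaults map onto Spark SQL configuration keys. The following is a minimal sketch assuming the Spark 1.2-era key names in SQLConf; the PR changes the built-in defaults themselves, so users do not need to set anything:

```scala
import org.apache.spark.sql.SQLContext

// Sketch only: spells out the new defaults as explicit settings.
// Key names assume the Spark 1.2-era SQLConf.
def applyNewDefaults(sqlContext: SQLContext): Unit = {
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")  // compress cached data
  sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")           // gzip instead of snappy
  sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  // 10,000-row batches
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold",
    (10 * 1024 * 1024).toString)                                              // 10MB threshold
  sqlContext.setConf("spark.sql.parquet.cacheMetadata", "true")               // cache Parquet metadata
  sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")        // native Parquet reader
}
```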

@marmbrus (Contributor, Author) commented Nov 3, 2014

/cc @liancheng @mateiz

@@ -109,7 +109,7 @@ private[sql] trait SQLConf {
* Hive setting: hive.auto.convert.join.noconditionaltask.size, whose default value is also 10000.
Contributor (inline review comment):

nit: comment seems slightly out of date now
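
For context, the comment flagged here annotates the auto-broadcast-join threshold in SQLConf: with the default raised to 10MB, the claim that it is "also 10000" like Hive's hive.auto.convert.join.noconditionaltask.size no longer holds. A hedged sketch of the kind of update the nit asks for (exact wording and surrounding code are assumptions, not taken from this diff):

```scala
// Sketch of SQLConf after this PR; the doc comment should stop claiming the
// two defaults match. Names assume the 1.2-era SQLConf.
/**
 * Upper bound (in bytes) on the size of tables that qualify for broadcast joins.
 * Hive setting: hive.auto.convert.join.noconditionaltask.size. Note that the
 * defaults now differ: this is 10485760 (10MB), while Hive's is 10000.
 */
def autoBroadcastJoinThreshold: Int =
  getConf(AUTO_BROADCASTJOIN_THRESHOLD, (10 * 1024 * 1024).toString).toInt
```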

@SparkQA commented Nov 3, 2014

Test build #22781 has started for PR 3064 at commit 97ee9f8.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22779/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22780/

@SparkQA commented Nov 3, 2014

Test build #506 has started for PR 3064 at commit 97ee9f8.

  • This patch merges cleanly.

@SparkQA commented Nov 3, 2014

Test build #506 has finished for PR 3064 at commit 97ee9f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 3, 2014

Test build #22781 has finished for PR 3064 at commit 97ee9f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class GenericStrategy[PhysicalPlan <: TreeNode[PhysicalPlan]] extends Logging
    • trait RunnableCommand extends logical.Command
    • case class ExecutedCommand(cmd: RunnableCommand) extends SparkPlan
    • protected case class Keyword(str: String)
    • sys.error(s"Failed to load class for data source: $provider")
    • case class EqualTo(attribute: String, value: Any) extends Filter
    • case class GreaterThan(attribute: String, value: Any) extends Filter
    • case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
    • case class LessThan(attribute: String, value: Any) extends Filter
    • case class LessThanOrEqual(attribute: String, value: Any) extends Filter
    • trait RelationProvider
    • abstract class BaseRelation
    • abstract class TableScan extends BaseRelation
    • abstract class PrunedScan extends BaseRelation
    • abstract class PrunedFilteredScan extends BaseRelation
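
For context, the classes listed above are the (then-experimental) external data sources API, not something introduced by this PR's config changes. A minimal sketch of how a custom relation plugs into that API, assuming the Spark 1.2-era signatures shown in the list (RangeScan and DefaultSource are hypothetical names):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._ // Row, SQLContext, StructType, etc. (1.2-era type aliases)
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}

// Hypothetical relation that produces the integers [0, end) as one-column rows.
case class RangeScan(end: Int)(@transient val sqlContext: SQLContext) extends TableScan {
  override def schema: StructType =
    StructType(StructField("i", IntegerType, nullable = false) :: Nil)
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0 until end).map(Row(_))
}

// Provider class loaded by name (cf. the sys.error message above); OPTIONS from
// the user's CREATE TEMPORARY TABLE ... USING ... statement arrive as `parameters`.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    RangeScan(parameters("end").toInt)(sqlContext)
}
```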

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22781/

@liancheng (Contributor)

This LGTM.

(Why does MiMa complain about interfaces introduced by the foreign data source API here?)

@marmbrus (Contributor, Author) commented Nov 3, 2014

Thanks for looking at this!

I don't think that's MiMa, as that would actually fail the build (and it is turned off for Spark SQL). The PR notification about new classes is just based on some rough git magic / string matching, as far as I know.
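
To illustrate the "rough string matching": the PR tester appears to scan added diff lines for declaration keywords, so any added line containing the word "class" can be reported as a new public class, which is how the sys.error(...) call ended up in the list above. A hypothetical sketch of such a heuristic (not the actual Jenkins script):

```scala
// Hypothetical sketch of the PR tester's heuristic, not the real test harness:
// keep added diff lines that textually look like new type declarations.
val declPattern = """(case class|class|trait|object)\s+\w+""".r

def detectNewClasses(diff: Seq[String]): Seq[String] =
  diff.filter(_.startsWith("+"))    // added lines only
      .map(_.drop(1).trim)
      .filter(line => declPattern.findFirstIn(line).isDefined)

// Purely textual matching also fires on strings that merely mention "class":
detectNewClasses(Seq("""+ sys.error(s"Failed to load class for data source: $provider")"""))
// -> reports the sys.error(...) line, matching on the words "class for"
```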

asfgit pushed a commit that referenced this pull request Nov 3, 2014
 - Turns on compression for in-memory cached data by default
 - Changes the default parquet compression format back to gzip (we have seen more OOMs with production workloads due to the way Snappy allocates memory)
 - Ups the batch size to 10,000 rows
 - Increases the broadcast threshold to 10mb.
 - Uses our parquet implementation instead of the hive one by default.
 - Cache parquet metadata by default.

Author: Michael Armbrust <michael@databricks.com>

Closes #3064 from marmbrus/fasterDefaults and squashes the following commits:

97ee9f8 [Michael Armbrust] parquet codec docs
e641694 [Michael Armbrust] Remote also
a12866a [Michael Armbrust] Cache metadata.
2d73acc [Michael Armbrust] Update docs defaults.
d63d2d5 [Michael Armbrust] document parquet option
da373f9 [Michael Armbrust] More aggressive defaults

(cherry picked from commit 25bef7e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit asfgit closed this in 25bef7e Nov 3, 2014
@marmbrus marmbrus deleted the fasterDefaults branch November 19, 2014 02:45