SPARK-4230. Doc for spark.default.parallelism is incorrect #3107
Conversation
Test build #22921 has started for PR 3107 at commit
Test build #22921 has finished for PR 3107 at commit
Test PASSed.
@@ -563,8 +566,8 @@ Apart from these, the following properties are also available, and may be useful
     </ul>
   </td>
   <td>
-    Default number of tasks to use across the cluster for distributed shuffle operations
-    (<code>groupByKey</code>, <code>reduceByKey</code>, etc) when not set by user.
+    Default number of output partitions for operations like <code>join</code>,
Should this say "number of shuffle partitions"? It's slightly weird to me to say "output" when this refers to something that is totally internal to Spark - it's output on the map side but input on the read side. In other cases I think "output" tends to mean things like saving as HDFS data, etc.
My thinking was that Spark's APIs have no mention of the concept of a "shuffle partition" (e.g. the term is referenced nowhere on https://spark.apache.org/docs/latest/programming-guide.html), but even novice Spark users are meant to understand that every transformation has input and output RDDs and that every RDD has a number of partitions.
Maybe "Default number of partitions for the RDDs produced by operations like ..."?
Ah I see - what about "Default number of partitions in RDDs returned by join, reduceByKey..."
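For readers following the wording discussion, the property in question can be set cluster-wide in spark-defaults.conf (or per application via SparkConf). A minimal sketch - the value 100 is purely illustrative, not a recommendation:

```
# spark-defaults.conf -- illustrative value only
# When set, shuffle-producing transformations (reduceByKey, join, groupByKey)
# use this as the number of partitions in the RDDs they return,
# unless a numPartitions argument is passed explicitly.
spark.default.parallelism   100
```

When the property is unset, these operations instead default to the largest number of partitions among their parent RDDs, which is the behavior the revised doc text in this PR is trying to describe accurately.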
Had some minor wording questions.
Force-pushed from 14ca79b to 37a1d19
Test build #23129 has started for PR 3107 at commit
Test build #23129 has finished for PR 3107 at commit
Test FAILed.
Test failure looks unrelated.
LG - pulling it in.
Author: Sandy Ryza <sandy@cloudera.com>

Closes #3107 from sryza/sandy-spark-4230 and squashes the following commits:

37a1d19 [Sandy Ryza] Clear up a couple things
34d53de [Sandy Ryza] SPARK-4230. Doc for spark.default.parallelism is incorrect

(cherry picked from commit c6f4e70)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>