From 156833643d9ea1479222e9033164e92a9846351c Mon Sep 17 00:00:00 2001
From: Chen Chao
Date: Mon, 14 Apr 2014 15:39:50 +0800
Subject: [PATCH] misleading task number of groupByKey

"By default, this uses only 8 parallel tasks to do the grouping." is
misleading. Please refer to https://github.com/apache/spark/pull/389.
The actual behavior follows from this code:

    def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
      val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
      // Reuse an existing partitioner if any of the input RDDs has one.
      for (r <- bySize if r.partitioner.isDefined) {
        return r.partitioner.get
      }
      // Otherwise fall back to spark.default.parallelism if it is set,
      // or else to the largest input RDD's partition count.
      if (rdd.context.conf.contains("spark.default.parallelism")) {
        new HashPartitioner(rdd.context.defaultParallelism)
      } else {
        new HashPartitioner(bySize.head.partitions.size)
      }
    }
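For illustration, a minimal sketch of how this resolution order plays
out (hypothetical snippet; sc and data.txt are assumed here, not part
of the patch):

    // pairs has no partitioner, since map does not preserve partitioning.
    val pairs = sc.textFile("data.txt", 4).map(word => (word, 1))

    // No partitioner upstream and spark.default.parallelism unset:
    // groupByKey uses pairs.partitions.size tasks (the last branch above).
    val grouped = pairs.groupByKey()

    // An explicit numTasks argument overrides the default entirely.
    val groupedInto16 = pairs.groupByKey(16)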
---
 docs/scala-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index a07cd2e0a32a2..a9d7d87e231a6 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -189,7 +189,7 @@ The following tables list the transformations and actions currently supported (s
   groupByKey([numTasks])
   When called on a dataset of (K, V) pairs, returns a dataset of (K, Seq[V]) pairs.
-Note: By default, this uses only 8 parallel tasks to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
+Note: By default, if the RDD already has a partitioner, the number of tasks is determined by that partitioner's number of partitions; otherwise it uses the value of spark.default.parallelism if the property is set, or else the number of partitions of the RDD. You can pass an optional numTasks argument to set a different number of tasks.