[SPARK-3074] [PySpark] support groupByKey() with single huge key #1977
Conversation
QA tests have started for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Does / will the same functionality exist in Scala/Java?
I believe this is one of those few things in Spark where Python is ahead of Scala.
@sryza There are similar things in Scala, but we cannot compare Python objects in Scala, so this cannot use Scala's groupByKey() directly. All the aggregation has to be implemented in Python as well. @andrewor14, I hope PySpark can catch up with Scala.
this will reduce the memory used when merging many files together.
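The PR's own merge code is not shown in this thread, so the following is only a generic sketch of why a streaming k-way merge of sorted spill files keeps memory low: each open file contributes a single buffered record at a time. The helper names and the pickle-per-record file format are assumptions for illustration, not the patch's actual implementation.

```python
import heapq
import pickle

def read_records(path):
    # Hypothetical spill-file format: a stream of pickled (key, value) records.
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def merge_sorted_spills(paths):
    # Lazily merge many sorted spill files; memory grows with the number of
    # open files, not with the total number of records.
    streams = [read_records(p) for p in paths]
    return heapq.merge(*streams, key=lambda kv: kv[0])
```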
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Jenkins, retest this please.
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Conflicts: python/pyspark/rdd.py
QA tests have started for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have finished for PR 1977 at commit
@JoshRosen The latest comments have been addressed.
Test build #29900 has started for PR 1977 at commit
Test build #29895 has finished for PR 1977 at commit
Test PASSed.
Test build #29900 has finished for PR 1977 at commit
Test PASSed.
self.count += len(value) - 1

class GroupByKey(object):
It looks like we only directly use GroupByKey in tests, while the actual shuffle code only uses GroupListsByKey. Is this intentional?
Yes, I wanted the code and tests for ExternalList and GroupByKey to be easy to understand.
Sorry for my initial confusion regarding the external lists of lists. I think that the
@JoshRosen Thanks for the comments; it looks better now.
Test build #29921 has started for PR 1977 at commit
Test build #29921 has finished for PR 1977 at commit
Test FAILed.
Test build #29925 has started for PR 1977 at commit
Test build #29925 has finished for PR 1977 at commit
Test PASSed.
I spent a bit of time fuzz-testing this code to try to reach 100% coverage of the changes in this patch. While doing so, I think I uncovered a bug:
It looks like the
@JoshRosen Good catch! Fixed it.
Test build #29967 has started for PR 1977 at commit
Test build #29967 has finished for PR 1977 at commit
Test PASSed.
LGTM. I spent more time testing this locally, commenting out various memory threshold flags as necessary in order to get good branch coverage, and didn't find any new problems. We should definitely do performance benchmarking of this feature during the 1.4 QA period in order to quantify its impact, but that isn't a blocker to merging this now. If this does turn out to have any performance issues for certain workloads, users should be able to feature-flag it by configuring Spark with a higher spilling threshold (or we could introduce a new flag specifically to bypass this). I'm going to merge this into
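The comment above does not name an exact flag; as an assumption, the existing spark.python.worker.memory setting (which controls how much memory the Python worker uses for aggregation before spilling) is one plausible knob for raising the spilling threshold:

```python
from pyspark import SparkConf, SparkContext

# Raising the per-worker aggregation memory delays spilling, which roughly
# "feature-flags" the external code path for workloads that fit in memory.
# The 2g value is only an example, not a recommendation from this PR.
conf = SparkConf().set("spark.python.worker.memory", "2g")
sc = SparkContext(appName="higher-spill-threshold", conf=conf)
```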
Thanks a lot for testing it; it really helped us find a bug!
This patch changes groupByKey() to use an external-sort-based approach, so it can support a single huge key.
For example, it can group a dataset containing one hot key with 40 million values (strings), using 500 MB of memory for the Python worker, and finish in about 2 minutes (the hash-based approach would need 6 GB of memory).
During groupByKey(), it does an in-memory groupBy first. If the dataset does not fit in memory, the data is partitioned by hash. If one partition still does not fit in memory, it switches to a sort-based groupBy().
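A minimal way to exercise the scenario described above (one hot key with tens of millions of values and a small Python worker memory); the key name, value count, and memory setting here are illustrative and not taken from the PR's benchmark:

```python
from pyspark import SparkConf, SparkContext

# Keep the Python worker's aggregation memory small so the external
# (spilling, sort-based) grouping path is actually exercised.
conf = SparkConf().set("spark.python.worker.memory", "512m")
sc = SparkContext(appName="hot-key-groupByKey", conf=conf)

# A single hot key with a very large number of string values.
N = 40 * 1000 * 1000
rdd = sc.parallelize(range(N), 100).map(lambda i: ("hot", "value-%d" % i))

# The grouped values come back as an iterable, so they can be consumed
# lazily instead of being materialized as one giant in-memory list.
counts = rdd.groupByKey().mapValues(lambda values: sum(1 for _ in values))
print(counts.collect())  # expected: [('hot', 40000000)]

sc.stop()
```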