[SPARK-3074] [PySpark] support groupByKey() with single huge key #1977
Conversation
QA tests have started for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Does / will the same functionality exist in Scala/Java?
I believe this is one of those few things in Spark where Python is ahead of Scala.
@sryza There are similar things in Scala, but we cannot compare Python objects in Scala, so this cannot use Scala's groupByKey() directly. All the aggregation has to be implemented in Python as well. @andrewor14, I hope PySpark can catch up with Scala.
this will reduce the memory used when merging many files together.
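The PR's own merge code is not shown in this thread, so the following is only a generic sketch of why a streaming k-way merge of sorted spill files keeps memory low: each open file contributes a single buffered record at a time. The helper names and the pickle-per-record file format are assumptions for illustration, not the patch's actual implementation.

```python
import heapq
import pickle

def read_records(path):
    # Hypothetical spill-file format: a stream of pickled (key, value) records.
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def merge_sorted_spills(paths):
    # Lazily merge many sorted spill files; memory grows with the number of
    # open files, not with the total number of records.
    streams = [read_records(p) for p in paths]
    return heapq.merge(*streams, key=lambda kv: kv[0])
```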
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Jenkins, retest this please.
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
Conflicts: python/pyspark/rdd.py
QA tests have started for PR 1977 at commit
QA tests have started for PR 1977 at commit
QA tests have finished for PR 1977 at commit
QA tests have finished for PR 1977 at commit
@JoshRosen The latest comments have been addressed.
Test build #29900 has started for PR 1977 at commit
Test build #29895 has finished for PR 1977 at commit
Test PASSed.
Test build #29900 has finished for PR 1977 at commit
Test PASSed.
self.count += len(value) - 1

class GroupByKey(object):
It looks like we only directly use GroupByKey in tests, while the actual shuffle code only uses GroupListsByKey. Is this intentional?
Yes, I wanted the code and tests for ExternalList and GroupByKey to be easy to understand.
Sorry for my initial confusion regarding the external lists of lists. I think that the
@JoshRosen Thanks for the comments; it looks better now.
Test build #29921 has started for PR 1977 at commit
Test build #29921 has finished for PR 1977 at commit
Test FAILed.
Test build #29925 has started for PR 1977 at commit
Test build #29925 has finished for PR 1977 at commit
Test PASSed.
I spent a bit of time fuzz-testing this code to try to reach 100% coverage of the changes in this patch. While doing so, I think I uncovered a bug:
It looks like the
@JoshRosen Good catch! Fixed it.
Test build #29967 has started for PR 1977 at commit
Test build #29967 has finished for PR 1977 at commit
Test PASSed.
LGTM. I spent more time testing this locally, commenting out various memory threshold flags as necessary in order to get good branch coverage, and didn't find any new problems. We should definitely do performance benchmarking of this feature during the 1.4 QA period in order to quantify its impact, but that isn't a blocker to merging this now. If this does turn out to have any performance issues for certain workloads, users should be able to feature-flag it by configuring Spark with a higher spilling threshold (or we could introduce a new flag specifically to bypass this). I'm going to merge this into
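The comment above does not name an exact flag; as an assumption, the existing spark.python.worker.memory setting (which controls how much memory the Python worker uses for aggregation before spilling) is one plausible knob for raising the spilling threshold:

```python
from pyspark import SparkConf, SparkContext

# Raising the per-worker aggregation memory delays spilling, which roughly
# "feature-flags" the external code path for workloads that fit in memory.
# The 2g value is only an example, not a recommendation from this PR.
conf = SparkConf().set("spark.python.worker.memory", "2g")
sc = SparkContext(appName="higher-spill-threshold", conf=conf)
```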
Thanks a lot for testing it; it really helped us find a bug!
This patch changes groupByKey() to use an external-sort-based approach, so it can support a single huge key.
For example, it can group a dataset containing one hot key with 40 million values (strings), using 500 MB of memory for the Python worker, and finish in about 2 minutes (the hash-based approach would need 6 GB of memory).
During groupByKey(), it does an in-memory groupBy first. If the dataset does not fit in memory, the data is partitioned by hash. If one partition still does not fit in memory, it switches to a sort-based groupBy().
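A minimal way to exercise the scenario described above (one hot key with tens of millions of values and a small Python worker memory); the key name, value count, and memory setting here are illustrative and not taken from the PR's benchmark:

```python
from pyspark import SparkConf, SparkContext

# Keep the Python worker's aggregation memory small so the external
# (spilling, sort-based) grouping path is actually exercised.
conf = SparkConf().set("spark.python.worker.memory", "512m")
sc = SparkContext(appName="hot-key-groupByKey", conf=conf)

# A single hot key with a very large number of string values.
N = 40 * 1000 * 1000
rdd = sc.parallelize(range(N), 100).map(lambda i: ("hot", "value-%d" % i))

# The grouped values come back as an iterable, so they can be consumed
# lazily instead of being materialized as one giant in-memory list.
counts = rdd.groupByKey().mapValues(lambda values: sum(1 for _ in values))
print(counts.collect())  # expected: [('hot', 40000000)]

sc.stop()
```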