-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-2871] [PySpark] add key
argument for max(), min() and top(n)
#2094
Conversation
QA tests have started for PR 2094 at commit
|
QA tests have finished for PR 2094 at commit
|
""" | ||
Find the maximum item in this RDD. | ||
|
||
>>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max() | ||
@param comp: A function used to compare two elements, the builtin `cmp` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit - the buildin 'max'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think cmp
is the function used in max
or min
, so cmp
is the default value for comp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmp may be used in max, but for this func the default is on line 829. either way, a minor nitpick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, using comp
here is bit confusing. The builtin min
use key
, it will be better for Python programer, but it will be different than Scala API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already use key
in Python instead of Ordering
in Scala, so I had change it into key
.
Also , I would like to add key
to top(), will be helpful, such as:
rdd.map(lambda x: (x, 1)).reduce(add).top(20, key=itemgetter(1))
We already have ord
in Scala. Should I add this in this PR?
are you planning to add tests for these? |
@mattf thank you for reviewing this, I think the docs tests is enough, they have cover the cases w or w/o |
""" | ||
return self.reduce(min) | ||
if comp is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider default of comp=min in arg list and test for comp is not min
same for max method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
min and comp have different meanings:
>>> min(1, 2)
1
>>> cmp(1, 2)
-1
agreed re doctest. i forgot it was in use. |
QA tests have started for PR 2094 at commit
|
QA tests have finished for PR 2094 at commit
|
QA tests have started for PR 2094 at commit
|
QA tests have started for PR 2094 at commit
|
QA tests have finished for PR 2094 at commit
|
QA tests have finished for PR 2094 at commit
|
I like this updated approach of using |
comp
argument for RDD.max() and RDD.min()key
argument for max(), min() and top(n)
I've merged this into master. Thanks! |
RDD.max(key=None) param key: A function used to generate key for comparing >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0]) >>> rdd.max() 43.0 >>> rdd.max(key=str) 5.0 RDD.min(key=None) Find the minimum item in this RDD. param key: A function used to generate key for comparing >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0]) >>> rdd.min() 2.0 >>> rdd.min(key=str) 10.0 RDD.top(num, key=None) Get the top N elements from a RDD. Note: It returns the list sorted in descending order. >>> sc.parallelize([10, 4, 2, 12, 3]).top(1) [12] >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2) [6, 5] >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str) [4, 3, 2] Author: Davies Liu <davies.liu@gmail.com> Closes apache#2094 from davies/cmp and squashes the following commits: ccbaf25 [Davies Liu] add `key` to top() ad7e374 [Davies Liu] fix tests 2f63512 [Davies Liu] change `comp` to `key` in min/max dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
RDD.max(key=None)
RDD.min(key=None)
RDD.top(num, key=None)