
[SPARK-2871] [PySpark] add key argument for max(), min() and top(n) #2094

Closed
wants to merge 4 commits

Conversation

davies
Contributor

@davies davies commented Aug 22, 2014

RDD.max(key=None)

    Find the maximum item in this RDD.

    @param key: A function used to generate the key for comparing

    >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
    >>> rdd.max()
    43.0
    >>> rdd.max(key=str)
    5.0

RDD.min(key=None)

    Find the minimum item in this RDD.

    @param key: A function used to generate the key for comparing

    >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
    >>> rdd.min()
    2.0
    >>> rdd.min(key=str)
    10.0

RDD.top(num, key=None)

    Get the top N elements from an RDD.

    Note: it returns the list sorted in descending order.

    >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
    [12]
    >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
    [6, 5]
    >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
    [4, 3, 2]
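The doctests above mirror Python's builtins. As a local sketch (plain Python stand-ins for the RDD methods, no Spark needed): max/min compare elements by the key function's result, and top(n, key) behaves like heapq.nlargest.

```python
# Local-Python sketch of the semantics described in the doctests above.
# These are stand-ins for the RDD methods, not the PySpark implementation.
import heapq

# key=str compares string representations lexicographically,
# which is why "5.0" beats "43.0" and "10.0" loses to "2.0".
assert max([1.0, 5.0, 43.0, 10.0], key=str) == 5.0
assert min([2.0, 5.0, 43.0, 10.0], key=str) == 10.0

# top(n, key) on a local list is equivalent to heapq.nlargest.
assert heapq.nlargest(2, [2, 3, 4, 5, 6]) == [6, 5]
assert heapq.nlargest(3, [10, 4, 2, 12, 3], key=str) == [4, 3, 2]
```

Note that with key=str the comparison is on string form, so the numeric order and the returned order can differ, exactly as in the doctests.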

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 2094 at commit dd91e08.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 2094 at commit dd91e08.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
Find the maximum item in this RDD.

>>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
@param comp: A function used to compare two elements, the builtin `cmp`

nit - the builtin 'max'

Contributor Author

I think cmp is the function used in max or min, so cmp is the default value for comp.


cmp may be used in max, but for this function the default is on line 829. Either way, a minor nitpick.

Contributor Author

Yes, using comp here is a bit confusing. The builtin min uses key, which would be better for Python programmers, but it would differ from the Scala API.

cc @mateiz @rxin @JoshRosen

Contributor Author

We already use key in Python instead of Ordering in Scala, so I have changed it to key.

Also, I would like to add key to top(); it would be helpful, for example:

rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))

We already have ord in Scala. Should I add this in this PR?
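The count-then-top pipeline in the comment above can be sketched locally without Spark. This is a stand-in, not the PySpark code: collections.Counter plays the role of the map/reduce-by-key counting step, and heapq.nlargest plays the role of top(n, key=itemgetter(1)) (with n=2 instead of 20 for brevity).

```python
# Local sketch of the "count items, then take the most frequent" pipeline.
# Counter stands in for map(lambda x: (x, 1)) + reduce-by-key with add;
# heapq.nlargest with key=itemgetter(1) stands in for top(n, key=...).
from collections import Counter
from operator import itemgetter
import heapq

words = ["spark", "python", "spark", "scala", "spark", "python"]
counts = Counter(words).items()  # iterable of (word, count) pairs

# Take the 2 most frequent pairs, comparing by the count field.
top2 = heapq.nlargest(2, counts, key=itemgetter(1))
assert top2 == [("spark", 3), ("python", 2)]
```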

@mattf

mattf commented Aug 23, 2014

are you planning to add tests for these?

@davies
Contributor Author

davies commented Aug 23, 2014

@mattf thank you for reviewing this. I think the doc tests are enough; they cover the cases with and without comp. Which kinds of tests should be added?

"""
return self.reduce(min)
if comp is not None:

Consider a default of comp=min in the arg list and test for comp is not min.

Same for the max method.

Contributor Author

min and comp have different meanings:

>>> min(1, 2)
1
>>> cmp(1, 2)
-1
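The exchange above hinges on the difference between a comparator (cmp-style, returning -1/0/1) and a key function. As a side note not from the PR itself: functools.cmp_to_key converts a comparator into a key, which is one way to port comp-based callers to a key-based API. Since cmp was removed in Python 3, an equivalent is defined here for illustration.

```python
# A comparator returns the sign of the comparison; min returns an element.
# cmp_to_key bridges the two styles.
from functools import cmp_to_key

def cmp(a, b):
    # Equivalent of Python 2's builtin cmp, for illustration.
    return (a > b) - (a < b)

assert cmp(1, 2) == -1   # comparator: -1/0/1
assert min(1, 2) == 1    # min: the smaller element itself

# A comparator can be used wherever a key function is expected:
assert min([2.0, 5.0, 43.0], key=cmp_to_key(cmp)) == 2.0
```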

@mattf

mattf commented Aug 23, 2014

Agreed re doctest. I forgot it was in use.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit 2f63512.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit 2f63512.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ad7e374.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ccbaf25.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ccbaf25.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ad7e374.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Epydoc renders docstrings + @params kind of oddly, but I don't think it's a big deal:

[screenshot: Epydoc rendering of the docstring with @param fields]

In the long run, we might want to move to Sphinx, since that seems to be what's popular with most major Python projects.

@JoshRosen
Contributor

I like this updated approach of using key instead of a comparator, since that's a closer match to Python's min function. Can you update the PR's title and description to reflect this?

@davies davies changed the title [SPARK-2871] [PySpark] add comp argument for RDD.max() and RDD.min() [SPARK-2871] [PySpark] add key argument for max(), min() and top(n) Aug 24, 2014
@JoshRosen
Contributor

I've merged this into master. Thanks!

@asfgit asfgit closed this in db436e3 Aug 24, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2094 from davies/cmp and squashes the following commits:

ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
@davies davies deleted the cmp branch September 15, 2014 22:16