
[SPARK-2871] [PySpark] add key argument for max(), min() and top(n) #2094

Closed
wants to merge 4 commits

Conversation

davies
Contributor

@davies davies commented Aug 22, 2014

RDD.max(key=None)

    Find the maximum item in this RDD.

    @param key: A function used to generate the key for comparing

    >>> rdd = sc.parallelize([1.0, 5.0, 43.0, 10.0])
    >>> rdd.max()
    43.0
    >>> rdd.max(key=str)
    5.0

RDD.min(key=None)

    Find the minimum item in this RDD.

    @param key: A function used to generate the key for comparing

    >>> rdd = sc.parallelize([2.0, 5.0, 43.0, 10.0])
    >>> rdd.min()
    2.0
    >>> rdd.min(key=str)
    10.0

RDD.top(num, key=None)

    Get the top N elements from an RDD.

    Note: it returns the list sorted in descending order.

    >>> sc.parallelize([10, 4, 2, 12, 3]).top(1)
    [12]
    >>> sc.parallelize([2, 3, 4, 5, 6], 2).top(2)
    [6, 5]
    >>> sc.parallelize([10, 4, 2, 12, 3]).top(3, key=str)
    [4, 3, 2]
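The doctests above mirror Python's builtins. As a local sketch (plain Python stand-ins for the RDD methods, no Spark needed): max/min compare elements by the key function's result, and top(n, key) behaves like heapq.nlargest.

```python
# Local-Python sketch of the semantics described in the doctests above.
# These are stand-ins for the RDD methods, not the PySpark implementation.
import heapq

# key=str compares string representations lexicographically,
# which is why "5.0" beats "43.0" and "10.0" loses to "2.0".
assert max([1.0, 5.0, 43.0, 10.0], key=str) == 5.0
assert min([2.0, 5.0, 43.0, 10.0], key=str) == 10.0

# top(n, key) on a local list is equivalent to heapq.nlargest.
assert heapq.nlargest(2, [2, 3, 4, 5, 6]) == [6, 5]
assert heapq.nlargest(3, [10, 4, 2, 12, 3], key=str) == [4, 3, 2]
```

Note that with key=str the comparison is on string form, so the numeric order and the returned order can differ, exactly as in the doctests.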

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have started for PR 2094 at commit dd91e08.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 22, 2014

QA tests have finished for PR 2094 at commit dd91e08.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

"""
Find the maximum item in this RDD.

>>> sc.parallelize([1.0, 5.0, 43.0, 10.0]).max()
@param comp: A function used to compare two elements, the builtin `cmp`

nit - the builtin 'max'

Contributor Author

I think cmp is the function used in max or min, so cmp is the default value for comp.


cmp may be used in max, but for this function the default is on line 829. Either way, a minor nitpick.

Contributor Author

Yes, using comp here is a bit confusing. The builtin min uses key, which would be better for Python programmers, but it would differ from the Scala API.

cc @mateiz @rxin @JoshRosen

Contributor Author

We already use key in Python instead of Ordering in Scala, so I have changed it to key.

Also, I would like to add key to top(); it would be helpful, for example:

rdd.map(lambda x: (x, 1)).reduceByKey(add).top(20, key=itemgetter(1))

We already have ord in Scala. Should I add this in this PR?
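The count-then-top pipeline in the comment above can be sketched locally without Spark. This is a stand-in, not the PySpark code: collections.Counter plays the role of the map/reduce-by-key counting step, and heapq.nlargest plays the role of top(n, key=itemgetter(1)) (with n=2 instead of 20 for brevity).

```python
# Local sketch of the "count items, then take the most frequent" pipeline.
# Counter stands in for map(lambda x: (x, 1)) + reduce-by-key with add;
# heapq.nlargest with key=itemgetter(1) stands in for top(n, key=...).
from collections import Counter
from operator import itemgetter
import heapq

words = ["spark", "python", "spark", "scala", "spark", "python"]
counts = Counter(words).items()  # iterable of (word, count) pairs

# Take the 2 most frequent pairs, comparing by the count field.
top2 = heapq.nlargest(2, counts, key=itemgetter(1))
assert top2 == [("spark", 3), ("python", 2)]
```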

@mattf

mattf commented Aug 23, 2014

are you planning to add tests for these?

@davies
Contributor Author

davies commented Aug 23, 2014

@mattf thank you for reviewing this. I think the doc tests are enough; they cover the cases with and without comp. Which kinds of tests should be added?

"""
return self.reduce(min)
if comp is not None:

Consider a default of comp=min in the arg list and test for comp is not min.

Same for the max method.

Contributor Author

min and comp have different meanings:

>>> min(1, 2)
1
>>> cmp(1, 2)
-1
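The exchange above hinges on the difference between a comparator (cmp-style, returning -1/0/1) and a key function. As a side note not from the PR itself: functools.cmp_to_key converts a comparator into a key, which is one way to port comp-based callers to a key-based API. Since cmp was removed in Python 3, an equivalent is defined here for illustration.

```python
# A comparator returns the sign of the comparison; min returns an element.
# cmp_to_key bridges the two styles.
from functools import cmp_to_key

def cmp(a, b):
    # Equivalent of Python 2's builtin cmp, for illustration.
    return (a > b) - (a < b)

assert cmp(1, 2) == -1   # comparator: -1/0/1
assert min(1, 2) == 1    # min: the smaller element itself

# A comparator can be used wherever a key function is expected:
assert min([2.0, 5.0, 43.0], key=cmp_to_key(cmp)) == 2.0
```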

@mattf

mattf commented Aug 23, 2014

Agreed re doctest. I forgot it was in use.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit 2f63512.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit 2f63512.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ad7e374.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have started for PR 2094 at commit ccbaf25.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ccbaf25.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 23, 2014

QA tests have finished for PR 2094 at commit ad7e374.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Epydoc renders docstrings + @params kind of oddly, but I don't think it's a big deal:

[screenshot: Epydoc rendering of the docstring with @param fields]

In the long run, we might want to move to Sphinx, since that seems to be what's popular with most major Python projects.

@JoshRosen
Contributor

I like this updated approach of using key instead of a comparator, since that's a closer match to Python's min function. Can you update the PR's title and description to reflect this?

@davies davies changed the title [SPARK-2871] [PySpark] add comp argument for RDD.max() and RDD.min() [SPARK-2871] [PySpark] add key argument for max(), min() and top(n) Aug 24, 2014
@JoshRosen
Contributor

I've merged this into master. Thanks!

@asfgit asfgit closed this in db436e3 Aug 24, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2094 from davies/cmp and squashes the following commits:

ccbaf25 [Davies Liu] add `key` to top()
ad7e374 [Davies Liu] fix tests
2f63512 [Davies Liu] change `comp` to `key` in min/max
dd91e08 [Davies Liu] add `comp` argument for RDD.max() and RDD.min()
@davies davies deleted the cmp branch September 15, 2014 22:16