
[SPARK-2871] [PySpark] add approx API for RDD #2095

Closed
wants to merge 1 commit

Conversation

@davies (Contributor) commented Aug 22, 2014

RDD.countApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate version of count() that returns a potentially incomplete
    result within a timeout, even if not all tasks have finished.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> rdd.countApprox(1000, 1.0)
    1000

RDD.sumApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate operation to return the sum within a timeout
    or meet the confidence.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> r = sum(xrange(1000))
    >>> (rdd.sumApprox(1000) - r) / r < 0.05
    True

RDD.meanApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate operation to return the mean within a timeout
    or meet the confidence.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> r = sum(xrange(1000)) / 1000.0
    >>> (rdd.meanApprox(1000) - r) / r < 0.05
    True
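The timeout-or-confidence contract described above can be illustrated with a pure-Python toy (this is a sketch of the idea, not Spark's implementation; the function name `sum_approx` and its deadline handling are illustrative assumptions): process elements until a deadline passes, then scale the partial result up by the fraction processed, roughly the way the approximate algorithms extrapolate from the tasks that finished in time.

```python
import time

def sum_approx(items, timeout_ms):
    """Toy analogue of RDD.sumApprox: return a (possibly partial) sum
    once the deadline passes, even if not all items were processed."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    total = 0.0
    seen = 0
    for x in items:
        total += x
        seen += 1
        if time.monotonic() > deadline:
            break
    if seen == 0:
        return 0.0
    # Scale the partial sum up by the fraction of items processed,
    # analogous to extrapolating from partially completed tasks.
    return total * len(items) / seen

# With a generous timeout everything is processed and the result is exact:
print(sum_approx(list(range(1000)), 5000))
```

With a very small timeout the loop may stop early and return an extrapolated estimate instead of the exact sum, which is the trade the approximate APIs offer.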

@SparkQA commented Aug 22, 2014

QA tests have started for PR 2095 at commit e8c252b.

  • This patch merges cleanly.

@SparkQA commented Aug 22, 2014

QA tests have finished for PR 2095 at commit e8c252b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BoundedFloat(float):
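The `BoundedFloat` flagged by QA above can be sketched as a `float` subclass that also carries the confidence level and the low/high bounds of the estimate, so approximate results still behave like plain numbers in arithmetic. The constructor signature below is an assumption for illustration, not necessarily the exact class this patch adds.

```python
class BoundedFloat(float):
    """A float that also carries a confidence level and low/high bounds,
    in the spirit of the class added by this patch (signature assumed)."""
    def __new__(cls, mean, confidence, low, high):
        obj = float.__new__(cls, mean)
        obj.confidence = confidence
        obj.low = low
        obj.high = high
        return obj

# Behaves like a plain float in arithmetic, but keeps its metadata:
est = BoundedFloat(100.0, 0.95, 95.0, 105.0)
print(est + 1)         # arithmetic works as on a normal float
print(est.confidence)  # the confidence level travels with the value
```

Subclassing `float` this way means callers that only care about the point estimate can ignore the bounds entirely.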


def _to_jrdd(self):
There's a small possibility that this name could be confusing, since self._jrdd returns a JavaRDD of Python objects, whereas self._to_jrdd() returns a JavaRDD of Java objects. I would maybe rename this to something like to_java_object_rdd.

@JoshRosen (Contributor)

This looks good to me; I had one minor comment about a potentially-confusing internal name, but we can take care of that later as part of a more general Python RDD <-> Java Object RDD utility method refactoring / cleanup.

@JoshRosen (Contributor)

I've merged this into master. Thanks!

@asfgit asfgit closed this in 8df4dad Aug 24, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2095 from davies/approx and squashes the following commits:

e8c252b [Davies Liu] add approx API for RDD
@davies davies deleted the approx branch September 15, 2014 22:15