
[SPARK-2871] [PySpark] add approx API for RDD #2095

Closed
wants to merge 1 commit

Conversation

@davies (Contributor) commented Aug 22, 2014

RDD.countApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate version of count() that returns a potentially incomplete
    result within a timeout, even if not all tasks have finished.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> rdd.countApprox(1000, 1.0)
    1000

RDD.sumApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate operation to return the sum within a timeout
    or meet the confidence.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> r = sum(xrange(1000))
    >>> (rdd.sumApprox(1000) - r) / r < 0.05
    True

RDD.meanApprox(self, timeout, confidence=0.95)

    :: Experimental ::
    Approximate operation to return the mean within a timeout
    or meet the confidence.

    >>> rdd = sc.parallelize(range(1000), 10)
    >>> r = sum(xrange(1000)) / 1000.0
    >>> (rdd.meanApprox(1000) - r) / r < 0.05
    True
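The timeout-or-confidence contract described above can be illustrated with a pure-Python toy (this is a sketch of the idea, not Spark's implementation; the function name `sum_approx` and its deadline handling are illustrative assumptions): process elements until a deadline passes, then scale the partial result up by the fraction processed, roughly the way the approximate algorithms extrapolate from the tasks that finished in time.

```python
import time

def sum_approx(items, timeout_ms):
    """Toy analogue of RDD.sumApprox: return a (possibly partial) sum
    once the deadline passes, even if not all items were processed."""
    deadline = time.monotonic() + timeout_ms / 1000.0
    total = 0.0
    seen = 0
    for x in items:
        total += x
        seen += 1
        if time.monotonic() > deadline:
            break
    if seen == 0:
        return 0.0
    # Scale the partial sum up by the fraction of items processed,
    # analogous to extrapolating from partially completed tasks.
    return total * len(items) / seen

# With a generous timeout everything is processed and the result is exact:
print(sum_approx(list(range(1000)), 5000))
```

With a very small timeout the loop may stop early and return an extrapolated estimate instead of the exact sum, which is the trade the approximate APIs offer.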

@SparkQA commented Aug 22, 2014

QA tests have started for PR 2095 at commit e8c252b.

  • This patch merges cleanly.

@SparkQA commented Aug 22, 2014

QA tests have finished for PR 2095 at commit e8c252b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BoundedFloat(float):
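The `BoundedFloat` flagged by QA above can be sketched as a `float` subclass that also carries the confidence level and the low/high bounds of the estimate, so approximate results still behave like plain numbers in arithmetic. The constructor signature below is an assumption for illustration, not necessarily the exact class this patch adds.

```python
class BoundedFloat(float):
    """A float that also carries a confidence level and low/high bounds,
    in the spirit of the class added by this patch (signature assumed)."""
    def __new__(cls, mean, confidence, low, high):
        obj = float.__new__(cls, mean)
        obj.confidence = confidence
        obj.low = low
        obj.high = high
        return obj

# Behaves like a plain float in arithmetic, but keeps its metadata:
est = BoundedFloat(100.0, 0.95, 95.0, 105.0)
print(est + 1)         # arithmetic works as on a normal float
print(est.confidence)  # the confidence level travels with the value
```

Subclassing `float` this way means callers that only care about the point estimate can ignore the bounds entirely.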


def _to_jrdd(self):
There's a small possibility that this name could be confusing, since self._jrdd returns a JavaRDD of Python objects, whereas self._to_jrdd() returns a JavaRDD of Java objects. I would maybe rename this to something like to_java_object_rdd.

@JoshRosen (Contributor)

This looks good to me; I had one minor comment about a potentially-confusing internal name, but we can take care of that later as part of a more general Python RDD <-> Java Object RDD utility method refactoring / cleanup.

@JoshRosen (Contributor)

I've merged this into master. Thanks!

@asfgit asfgit closed this in 8df4dad Aug 24, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2095 from davies/approx and squashes the following commits:

e8c252b [Davies Liu] add approx API for RDD
@davies davies deleted the approx branch September 15, 2014 22:15