[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes #1878

jkbradley · 2014-08-10T18:37:26Z

Added examples for statistical summarization:

* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:

* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala

* Added 1 test for RDDs with only 1 value

RowMatrix.scala

* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
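
The two guards above can be illustrated with a plain-Python sketch that does not use Spark. The functions `num_cols` and `compute_covariance` are hypothetical stand-ins for the RowMatrix methods, they operate on a list of rows, and the error messages are invented for illustration rather than copied from the patch.

```python
def num_cols(rows):
    """Infer the column count from the first row; fail clearly when there are no rows."""
    if len(rows) == 0:
        raise ValueError("Cannot determine the number of columns: there are no rows.")
    return len(rows[0])


def compute_covariance(rows):
    """Sample covariance matrix. It divides by n - 1, so it needs at least 2 rows."""
    n = len(rows)
    if n <= 1:
        raise ValueError("Cannot compute covariance with %d row(s); need at least 2." % n)
    d = num_cols(rows)
    # Column means.
    means = [sum(r[j] for r in rows) / float(n) for j in range(d)]
    # Entry (i, j) is the sum of centered products divided by n - 1.
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / float(n - 1)
             for j in range(d)]
            for i in range(d)]


if __name__ == "__main__":
    print(compute_covariance([[1.0, 2.0], [3.0, 6.0]]))
```

Without the `n <= 1` guard, a single-row input would hit a division by zero instead of a readable error, which is presumably the motivation for the checks.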

Python SparseVector (pyspark/mllib/linalg.py)

* Added toDense() function
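
As a rough idea of what a sparse-to-dense conversion involves, here is a minimal sketch in plain Python. `TinySparseVector` and `to_dense` are hypothetical names used only for illustration; this is not the pyspark implementation, and it returns a plain list rather than whatever type the real method returns.

```python
class TinySparseVector(object):
    """Hypothetical sparse vector: a size plus parallel lists of indices and values."""

    def __init__(self, size, indices, values):
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def to_dense(self):
        """Return a list of length `size` with zeros at all unlisted positions."""
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense


if __name__ == "__main__":
    print(TinySparseVector(4, [1, 3], [1.0, 5.5]).to_dense())
```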

python/run-tests script

* Added stat.py (doc test)

CC: @mengxr @dorx Main changes were examples to show usage across APIs.

jkbradley · 2014-08-10T18:37:38Z

Q: Is the Python SparseVector.toDense() function too big an API update?

SparkQA · 2014-08-10T18:39:39Z

QA tests have started for PR 1878. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18283/consoleFull

mengxr · 2014-08-10T23:52:24Z

Jenkins, retest this please.

SparkQA · 2014-08-10T23:54:42Z

QA tests have started for PR 1878. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18291/consoleFull

mengxr · 2014-08-12T04:08:04Z

examples/src/main/python/mllib/random_and_sampled_rdds.py

+#
+
+"""
+Randomly generated and sampled RDDs.


I don't quite understand why random data generation and sampling are put in a single example file. We can demo generating random uniform/normal/Gaussian/Poisson RDDs in one example, and then stratified sampling in another (e.g., sampling based on the label to re-balance positive/negative examples).

Sure, I can separate them. I'll call them random_rdds.py and sampled_rdds.py
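
The re-balancing use case mentioned in the review can be sketched without Spark. The function below is a hypothetical pure-Python analogue of per-key (stratified) Bernoulli sampling: each (label, value) pair is kept with the probability given for its label. The name `sample_by_key`, the data, and the fractions are all assumptions made for illustration.

```python
import random


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair with probability fractions[key].

    Keys missing from `fractions` are treated as fraction 0.0 (an assumption
    of this sketch). Sample sizes are approximate, not exact.
    """
    rng = random.Random(seed)
    return [(k, v) for (k, v) in pairs if rng.random() < fractions.get(k, 0.0)]


if __name__ == "__main__":
    # 90 negative examples (label 0) and 10 positive examples (label 1).
    data = [(0, i) for i in range(90)] + [(1, i) for i in range(10)]
    # Keep all positives and roughly 1 in 9 negatives to re-balance the classes.
    sample = sample_by_key(data, {0: 1.0 / 9, 1: 1.0}, seed=42)
    for label in (0, 1):
        print(label, sum(1 for (k, _) in sample if k == label))
```

Because each pair is sampled independently, the per-label counts vary from run to run unless a seed is fixed. An exact-size variant would need an extra pass over the data.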

Split RandomAndSampledRDDs into RandomRDDGeneration and SampledRDDs. (The name RandomRDDGeneration is to avoid a naming conflict with RandomRDDs.) RandomRDDGeneration prints the first 5 samples. Did the same split for Python: random_rdd_generation.py and sampled_rdds.py. Other small updates based on code review.

jkbradley · 2014-08-17T08:23:08Z

@mengxr Thanks for the comments! Updated accordingly.

SparkQA · 2014-08-17T08:25:10Z

QA tests have started for PR 1878 at commit 32173b7.

This patch merges cleanly.

SparkQA · 2014-08-17T09:17:41Z

QA tests have finished for PR 1878 at commit 32173b7.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-08-17T17:30:13Z

QA tests have started for PR 1878 at commit 4e5d15e.

This patch merges cleanly.

SparkQA · 2014-08-17T18:16:13Z

QA tests have finished for PR 1878 at commit 4e5d15e.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

jkbradley · 2014-08-17T21:14:30Z

@mengxr It looks like the failures are in other tests; how best to proceed? With respect to the case class Params, is it OK to have them public since they are in examples? (Other examples have them public too.)

mengxr · 2014-08-18T05:40:28Z

Jenkins, test this please.

SparkQA · 2014-08-18T05:45:17Z

QA tests have started for PR 1878 at commit 60c72d9.

This patch merges cleanly.

mengxr · 2014-08-18T06:01:08Z

examples/src/main/scala/org/apache/spark/examples/mllib/SampledRDDs.scala

+    println(s"Key\tOrig\tApprox Sample\tExact Sample")
+    keyCounts.keys.toSeq.sorted.foreach { key =>
+      val origFrac = keyCounts(key) / numExamples.toDouble
+      val approxFrac = keyCountsB(key) / sizeB.toDouble


There is a chance that keyCountsB doesn't contain key. It is safer to use keyCountsB.getOrElse here.
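
The same concern applies to the Python example. A minimal sketch of the defensive lookup, using `dict.get` as the Python counterpart of `getOrElse`, could look like the following. The helper name `fraction` and the choice to report 0.0 for an empty sample are assumptions of this sketch, not necessarily what the patch does.

```python
def fraction(counts, key, total):
    """Fraction of `total` that falls under `key`.

    A key absent from `counts` contributes 0, and an empty sample (total == 0)
    yields 0.0 instead of raising ZeroDivisionError.
    """
    if total == 0:
        return 0.0
    return counts.get(key, 0) / float(total)


if __name__ == "__main__":
    key_counts = {"a": 6, "b": 3, "c": 1}   # counts in the full data set
    key_counts_b = {"a": 2, "b": 1}         # counts in a sample; "c" was never drawn
    num_examples = sum(key_counts.values())
    size_b = sum(key_counts_b.values())
    print("Key\tOrig\tApprox Sample")
    for key in sorted(key_counts):
        print("%s\t%g\t%g" % (key,
                              fraction(key_counts, key, num_examples),
                              fraction(key_counts_b, key, size_b)))
```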

mengxr · 2014-08-18T06:03:01Z

@jkbradley I tested the examples and found that tree.py is not included in run-tests.py. If we include it, it will throw errors because trainClassifier needs at least three arguments. So we need to update both the unit tests and the example code.

SparkQA · 2014-08-18T06:39:47Z

QA tests have finished for PR 1878 at commit 60c72d9.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

jkbradley · 2014-08-18T15:55:59Z

@mengxr Thanks! I'll send the tree fixes in the other PR I sent just now on treeAggregate(), and I will do the keyCount fix in this PR.

jkbradley · 2014-08-18T18:29:21Z

Jenkins, test this please.

SparkQA · 2014-08-18T18:35:22Z

QA tests have started for PR 1878 at commit dafebe2.

This patch merges cleanly.

jkbradley · 2014-08-18T18:54:58Z

@mengxr Hopefully ready pending Jenkins

SparkQA · 2014-08-18T19:26:57Z

QA tests have finished for PR 1878 at commit dafebe2.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
- case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

jkbradley · 2014-08-18T20:23:29Z

Driver suite test failed...merging with updated master and trying again.

jkbradley · 2014-08-18T21:00:26Z

Jenkins, test this please.

SparkQA · 2014-08-18T21:05:29Z

QA tests have started for PR 1878 at commit ea5c047.

This patch merges cleanly.

SparkQA · 2014-08-18T21:52:39Z

QA tests have finished for PR 1878 at commit ea5c047.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-18T22:11:12Z

Jenkins, retest this please.

SparkQA · 2014-08-18T22:15:28Z

QA tests have started for PR 1878 at commit ea5c047.

This patch merges cleanly.

SparkQA · 2014-08-18T23:11:53Z

QA tests have finished for PR 1878 at commit ea5c047.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-19T01:02:22Z

LGTM. Merged into master and branch-1.1. Thanks!

Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review. Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

(cherry picked from commit c8b16ca)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

jkbradley added 5 commits August 7, 2014 14:34

064985b Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

8195c78 Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

65e4ebc Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

ab48f6e RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

mengxr reviewed Aug 12, 2014

jkbradley added 4 commits August 13, 2014 10:44

0b7cec3 Small updates based on code review. Renamed statistical_summary.py to correlations.py

cf70b07 Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

c8c20dc Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

4e5d15e Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.

mengxr reviewed Aug 18, 2014

jkbradley added 2 commits August 18, 2014 09:06

8d1e555 Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

dafebe2 Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.

ea5c047 Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check

asfgit closed this in c8b16ca Aug 19, 2014

jkbradley deleted the mllib-stats-api-check branch August 26, 2014 17:41
