[SPARK-2850] [SPARK-2626] [mllib] MLlib stats examples + small fixes #1878

Closed
wants to merge 15 commits into from

Conversation

jkbradley
Member

Added examples for statistical summarization:

  • Scala: StatisticalSummary.scala
    ** Tests: correlation, MultivariateOnlineSummarizer
  • python: statistical_summary.py
    ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:

  • Scala: RandomAndSampledRDDs.scala
  • python: random_and_sampled_rdds.py
  • Both test:
    ** RandomRDDGenerators.normalRDD, normalVectorRDD
    ** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala

  • Added 1 test for RDDs with only 1 value

RowMatrix.scala

  • numCols(): Added check for numRows = 0, with error message.
  • computeCovariance(): Added check for numRows <= 1, with error message.
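The row-count guard described above can be sketched in plain Python. This is only an illustration of the idea, not the RowMatrix code: `column_covariance` is a hypothetical helper, and the error message wording is assumed.

```python
def column_covariance(rows):
    """Sample covariance of the columns of `rows` (a list of equal-length lists).

    Hypothetical stand-in for illustration; not Spark's implementation.
    """
    n = len(rows)
    # The sample covariance divides by (n - 1), so it is undefined for n <= 1.
    if n <= 1:
        raise ValueError(
            "Cannot compute covariance with %d row(s); need at least 2." % n)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / float(n) for j in range(d)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(d)]
            for i in range(d)]
```

Failing early with a clear message is preferable to letting the division by `n - 1` produce NaN or a less helpful error further down.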

Python SparseVector (pyspark/mllib/linalg.py)

  • Added toDense() function
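As a rough sketch of what a sparse-to-dense conversion does, assuming a vector stored as a size plus parallel index and value lists: `SimpleSparseVector` and `to_dense` below are hypothetical names for illustration and are not the pyspark/mllib/linalg.py code.

```python
class SimpleSparseVector(object):
    """Hypothetical minimal sparse vector: a size plus parallel index/value lists."""

    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def to_dense(self):
        # Start from all zeros, then fill in the stored non-zero entries.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return dense
```

For example, `SimpleSparseVector(4, [1, 3], [1.0, 5.5]).to_dense()` fills positions 1 and 3 and leaves the rest at zero.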

python/run-tests script

  • Added stat.py (doc test)

CC: @mengxr @dorx Main changes were examples to show usage across APIs.

@jkbradley
Member Author

Q: Is the Python SparseVector.toDense() function too big an API update?

@SparkQA

SparkQA commented Aug 10, 2014

QA tests have started for PR 1878. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18283/consoleFull

@mengxr
Contributor

mengxr commented Aug 10, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 10, 2014

QA tests have started for PR 1878. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18291/consoleFull

#

"""
Randomly generated and sampled RDDs.
Contributor

I don't quite understand why random data generation and sampling are combined in a single example file. We can demo generating random uniform/normal (Gaussian)/Poisson RDDs in one example, and then stratified sampling in another (e.g., sampling based on the label to re-balance positive/negative examples).
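To make the re-balancing idea concrete, here is a plain-Python sketch of per-key (stratified) Bernoulli sampling. It only mimics the spirit of `sampleByKey` on a local list; `sample_by_key`, its parameters, and the fractions used are assumptions for illustration, not Spark's implementation.

```python
import random


def sample_by_key(pairs, fractions, seed=None):
    """Keep each (key, value) pair with probability fractions[key].

    Hypothetical local stand-in; keys missing from `fractions` are dropped.
    """
    rng = random.Random(seed)
    return [(k, v) for (k, v) in pairs
            if rng.random() < fractions.get(k, 0.0)]


# Labels: 1 = positive (rare), 0 = negative (common).
examples = [(1, "a"), (0, "b"), (0, "c"), (0, "d"), (0, "e")]
# Keep every positive and roughly half of the negatives.
rebalanced = sample_by_key(examples, {1: 1.0, 0: 0.5}, seed=42)
```

Because each pair is kept independently, the per-key sample sizes are only approximately proportional to the requested fractions.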

Member Author

Sure, I can separate them. I'll call them random_rdds.py and sampled_rdds.py

Split RandomAndSampledRDDs into RandomRDDGeneration and SampledRDDs.
(The name RandomRDDGeneration is to avoid a naming conflict with RandomRDDs.)

RandomRDDGeneration prints first 5 samples

Did same split for Python: random_rdd_generation.py and sampled_rdds.py

Other small updates based on code review.
@jkbradley
Member Author

@mengxr Thanks for the comments! Updated accordingly.

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have started for PR 1878 at commit 32173b7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have finished for PR 1878 at commit 32173b7.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have started for PR 1878 at commit 4e5d15e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 17, 2014

QA tests have finished for PR 1878 at commit 4e5d15e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

@jkbradley
Member Author

@mengxr It looks like the failures are in other tests; how best to proceed? With respect to the case class Params, is it OK to have them public since they are in examples? (Other examples have them public too.)

@mengxr
Contributor

mengxr commented Aug 18, 2014

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have started for PR 1878 at commit 60c72d9.

  • This patch merges cleanly.

println(s"Key\tOrig\tApprox Sample\tExact Sample")
keyCounts.keys.toSeq.sorted.foreach { key =>
val origFrac = keyCounts(key) / numExamples.toDouble
val approxFrac = keyCountsB(key) / sizeB.toDouble
Contributor

There is a chance that keyCountsB doesn't contain key. It is safer to use keyCountsB.getOrElse here.
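For the Python counterpart of this example, the same concern can be handled with `dict.get` and a default of 0, plus a guard against an empty sample. The sketch below is illustrative only: `key_fractions` and its argument names are hypothetical and not taken from sampled_rdds.py.

```python
def key_fractions(key_counts, sample_counts):
    """Return (key, original fraction, sample fraction) rows, sorted by key.

    Hypothetical helper. Keys absent from the sample count as 0, and an
    empty sample yields fractions of 0.0 instead of dividing by zero.
    """
    total = float(sum(key_counts.values()))
    sample_total = float(sum(sample_counts.values()))
    rows = []
    for key in sorted(key_counts):
        orig_frac = key_counts[key] / total if total > 0 else 0.0
        # dict.get plays the role of getOrElse: missing keys default to 0.
        sample_frac = (sample_counts.get(key, 0) / sample_total
                       if sample_total > 0 else 0.0)
        rows.append((key, orig_frac, sample_frac))
    return rows
```

A rare key can easily be missing from a small random sample, so indexing the sample's counts directly would raise a `KeyError` on exactly the inputs the example is meant to demonstrate.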

@mengxr
Contributor

mengxr commented Aug 18, 2014

@jkbradley I tested the examples and found that tree.py is not included in run-tests.py. If we include it, it will throw errors because trainClassifier needs at least three arguments. So we need to update both the unit tests and the example code.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1878 at commit 60c72d9.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

@jkbradley
Member Author

@mengxr Thanks! I'll send the tree fixes in the other PR I sent just now on treeAggregate(), and I will do the keyCount fix in this PR.

@jkbradley
Member Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have started for PR 1878 at commit dafebe2.

  • This patch merges cleanly.

@jkbradley
Member Author

@mengxr Hopefully ready pending Jenkins

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1878 at commit dafebe2.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_linear_regression_data.txt")
    • case class Params(input: String = "data/mllib/sample_binary_classification_data.txt")

@jkbradley
Member Author

Driver suite test failed...merging with updated master and trying again.

@jkbradley
Member Author

Jenkins, test this please.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have started for PR 1878 at commit ea5c047.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1878 at commit ea5c047.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Aug 18, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have started for PR 1878 at commit ea5c047.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 18, 2014

QA tests have finished for PR 1878 at commit ea5c047.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Aug 19, 2014

LGTM. Merged into master and branch-1.1. Thanks!

asfgit pushed a commit that referenced this pull request Aug 19, 2014
Added examples for statistical summarization:
* Scala: StatisticalSummary.scala
** Tests: correlation, MultivariateOnlineSummarizer
* python: statistical_summary.py
** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

Added examples for random and sampled RDDs:
* Scala: RandomAndSampledRDDs.scala
* python: random_and_sampled_rdds.py
* Both test:
** RandomRDDGenerators.normalRDD, normalVectorRDD
** RDD.sample, takeSample, sampleByKey

Added sc.stop() to all examples.

CorrelationSuite.scala
* Added 1 test for RDDs with only 1 value

RowMatrix.scala
* numCols(): Added check for numRows = 0, with error message.
* computeCovariance(): Added check for numRows <= 1, with error message.

Python SparseVector (pyspark/mllib/linalg.py)
* Added toDense() function

python/run-tests script
* Added stat.py (doc test)

CC: mengxr dorx  Main changes were examples to show usage across APIs.

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1878 from jkbradley/mllib-stats-api-check and squashes the following commits:

ea5c047 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
dafebe2 [Joseph K. Bradley] Bug fixes for examples SampledRDDs.scala and sampled_rdds.py: Check for division by 0 and for missing key in maps.
8d1e555 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
60c72d9 [Joseph K. Bradley] Fixed stat.py doc test to work for Python versions printing nan or NaN.
b20d90a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
4e5d15e [Joseph K. Bradley] Changed pyspark/mllib/stat.py doc tests to use NaN instead of nan.
32173b7 [Joseph K. Bradley] Stats examples update.
c8c20dc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
cf70b07 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
0b7cec3 [Joseph K. Bradley] Small updates based on code review.  Renamed statistical_summary.py to correlations.py
ab48f6e [Joseph K. Bradley] RowMatrix.scala * numCols(): Added check for numRows = 0, with error message. * computeCovariance(): Added check for numRows <= 1, with error message.
65e4ebc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
8195c78 [Joseph K. Bradley] Added examples for random and sampled RDDs: * Scala: RandomAndSampledRDDs.scala * python: random_and_sampled_rdds.py * Both test: ** RandomRDDGenerators.normalRDD, normalVectorRDD ** RDD.sample, takeSample, sampleByKey
064985b [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into mllib-stats-api-check
ee918e9 [Joseph K. Bradley] Added examples for statistical summarization: * Scala: StatisticalSummary.scala ** Tests: correlation, MultivariateOnlineSummarizer * python: statistical_summary.py ** Tests: correlation (since MultivariateOnlineSummarizer has no Python API)

(cherry picked from commit c8b16ca)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in c8b16ca Aug 19, 2014
@jkbradley jkbradley deleted the mllib-stats-api-check branch August 26, 2014 17:41
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014