[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA #7963

Closed
wants to merge 7 commits into from

Conversation

MechCoder
Contributor

Singular Value Decomposition wrappers are missing in PySpark. Since the groundwork for `RowMatrix` has already been laid, writing the wrappers is straightforward. I will follow up with the PCA wrappers in another PR.

@MechCoder MechCoder changed the title [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015
@MechCoder
Contributor Author

Actually I'll add the PCA wrappers in this PR as well.
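
For readers who want to see what the wrappers look like from the Python side, here is a minimal sketch. It assumes the API as it eventually landed (`computeSVD`, `computePrincipalComponents`, and `multiply` on `RowMatrix`, with `U`, `s`, and `V` exposed on the returned decomposition); the function name `svd_and_pca_sketch` and the sample data are made up for illustration, and `sc` is assumed to be an existing SparkContext.

```python
def svd_and_pca_sketch(sc, k=2):
    """Run SVD and PCA on a tiny RowMatrix; `sc` is an existing SparkContext."""
    # Imports are local so that defining this function does not require Spark.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import RowMatrix

    # Illustrative data only: two rows with three columns each.
    rows = sc.parallelize([
        Vectors.dense([3.0, 1.0, 1.0]),
        Vectors.dense([-1.0, 3.0, 1.0]),
    ])
    mat = RowMatrix(rows)

    # Truncated SVD: mat ~= U * diag(s) * V^T, keeping the top k singular values.
    svd = mat.computeSVD(k, computeU=True)
    u = svd.U   # distributed RowMatrix of left singular vectors
    s = svd.s   # local DenseVector of singular values, in descending order
    v = svd.V   # local DenseMatrix of right singular vectors

    # PCA: the top k principal components as a local matrix, then projection.
    pc = mat.computePrincipalComponents(k)
    projected = mat.multiply(pc)

    return u.rows.collect(), s, v, pc, projected.rows.collect()
```

Calling `svd_and_pca_sketch(sc)` from a PySpark shell should return the collected factors; the exact numbers are not reproduced here because they were not verified against a running cluster.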

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39888 has finished for PR 7963 at commit 25999f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39895 has finished for PR 7963 at commit a65efbb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):

@MechCoder MechCoder force-pushed the svd_pyspark branch 3 times, most recently from 56978ae to f64a83f Compare August 5, 2015 19:07
@SparkQA

SparkQA commented Aug 5, 2015

Test build #39901 has finished for PR 7963 at commit f64a83f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

@SparkQA

SparkQA commented Aug 5, 2015

Test build #39904 has finished for PR 7963 at commit 2286bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

@MechCoder MechCoder changed the title [SPARK-6227] [WIP] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD Aug 5, 2015
@MechCoder
Contributor Author

All right, this PR is ready for review.

cc: @dusenberrymw @mengxr

@MechCoder MechCoder changed the title [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD [SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA Aug 5, 2015
@SparkQA

SparkQA commented Aug 5, 2015

Test build #39916 has finished for PR 7963 at commit 30ef817.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):
    • case class In(value: Expression, list: Seq[Expression]) extends Predicate
    • case class InSet(child: Expression, hset: Set[Any]) extends UnaryExpression with Predicate

@@ -128,6 +128,26 @@ quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.

</div>
<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.mllib.linalg import Matrix
Contributor

Minor: This import isn't needed.

@dusenberrymw
Contributor

Great work, @MechCoder! I left some very small comments, and otherwise it looks good.

@MechCoder
Contributor Author

Thanks for the reviews, I have addressed your comments. Do you have anything else?

@SparkQA

SparkQA commented Aug 10, 2015

Test build #40286 has finished for PR 7963 at commit c62e622.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SingularValueDecomposition(JavaModelWrapper):

        >>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
        [IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
        """
        return IndexedRowMatrix(self._java_matrix_wrapper.call("multiply", matrix))
Contributor

I'd check that matrix is a DenseMatrix here as well.

@dusenberrymw
Contributor

@MechCoder I'd just add the DenseMatrix checks, and then this will be great. Thanks!
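
The requested check can be sketched in plain Python. The `DenseMatrix` and `SparseMatrix` classes below are hypothetical stand-ins (not the `pyspark.mllib.linalg` classes), and `multiply_rows` is an illustrative local helper rather than the wrapper's actual code. It assumes column-major storage of the matrix values and input rows `[0, 1]` and `[2, 3]`, which is consistent with the output shown in the doctest under review.

```python
class DenseMatrix:
    """Hypothetical stand-in: a local matrix storing values in column-major order."""

    def __init__(self, num_rows, num_cols, values):
        assert len(values) == num_rows * num_cols
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = [float(v) for v in values]

    def at(self, i, j):
        # Column-major: column j occupies a contiguous run of num_rows values.
        return self.values[j * self.num_rows + i]


class SparseMatrix:
    """Hypothetical stand-in for a matrix type the wrapper would reject."""


def multiply_rows(rows, matrix):
    """Right-multiply each row vector by a local dense matrix."""
    # The guard the review asks for: fail early with a clear message
    # instead of passing an unsupported type further down the call chain.
    if not isinstance(matrix, DenseMatrix):
        raise ValueError("Only multiplication with DenseMatrix is supported.")
    result = []
    for row in rows:
        if len(row) != matrix.num_rows:
            raise ValueError("Row length must equal the matrix row count.")
        result.append([
            sum(row[i] * matrix.at(i, j) for i in range(matrix.num_rows))
            for j in range(matrix.num_cols)
        ])
    return result


# DenseMatrix(2, 2, [0, 2, 1, 3]) represents [[0, 1], [2, 3]] in column-major form.
product = multiply_rows([[0, 1], [2, 3]], DenseMatrix(2, 2, [0, 2, 1, 3]))
# product == [[2.0, 3.0], [6.0, 11.0]]
```

Raising a `ValueError` on the Python side keeps the failure close to the caller; whether the real wrapper should raise `ValueError` or `TypeError` is a design choice this sketch does not settle.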

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 27, 2016
…ted Linear Algebra Classes

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR apache#7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR apache#7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes apache#9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
@cavaunpeu

any progress on this @dusenberrymw @MechCoder? it would be really helpful if I could do matrix multiplication in pyspark.

@SparkQA

SparkQA commented May 27, 2016

Test build #59437 has finished for PR 7963 at commit 70a871d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

@cavaunpeu Thanks for the ping! I think I've addressed the pending diff comment.

It will take me some time to refresh my knowledge of the codebase. Can @MLnick or @holdenk give it a final pass?

@SparkQA

SparkQA commented May 27, 2016

Test build #59445 has finished for PR 7963 at commit 0bc6a3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

Bump?

@MLnick
Contributor

MLnick commented Jun 17, 2016

@MechCoder thanks for updating this - may need to wait until after 2.0 release for review.

@holdenk
Contributor

holdenk commented Oct 7, 2016

Now that it's past the 2.0 release, should we maybe take another look, @MLnick / @davies?

Contributor

@holdenk holdenk left a comment

Thanks for working on this and sorry this fell through the cracks post 2.0. I've left some initial comments - likely the same comments apply to the indexed one as well.

@@ -84,6 +84,25 @@ quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.

</div>
<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.mllib.linalg.distributed import RowMatrix
Contributor

Nowadays we tend to write new examples separately and then use the `include_example` syntax to bring them in.

The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

{% highlight python %}
Contributor

Same as above

@@ -303,6 +303,121 @@ def tallSkinnyQR(self, computeQ=False):
        R = decomp.call("R")
        return QRDecomposition(Q, R)

    def computeSVD(self, k, computeU=False, rCond=1e-9):
Contributor

Would be good to add a since annotation here
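
As a rough illustration of what such an annotation does, the sketch below defines a simplified, hypothetical `since` decorator that appends a Sphinx `versionadded` note to a docstring. PySpark ships its own `since` helper; this stand-in only mimics the idea so the snippet runs without Spark. The class name `RowMatrixSketch` and the version string `"2.2.0"` are assumptions for illustration.

```python
def since(version):
    """Hypothetical stand-in: append a Sphinx ``versionadded`` note to a docstring."""
    def decorator(func):
        doc = (func.__doc__ or "").rstrip()
        func.__doc__ = doc + "\n\n.. versionadded:: " + version
        return func
    return decorator


class RowMatrixSketch:
    """Illustrative class only; not the real RowMatrix wrapper."""

    @since("2.2.0")  # assumed version string, for illustration
    def computeSVD(self, k, computeU=False, rCond=1e-9):
        """Compute the top k singular values and vectors (body omitted)."""
        raise NotImplementedError
```

With the decorator applied, `RowMatrixSketch.computeSVD.__doc__` ends with `.. versionadded:: 2.2.0`, which is the line Sphinx picks up when rendering the API docs.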

For more specific details on the implementation, please refer
to the Scala documentation.

:param k: Set the number of singular values to keep.
Contributor

It might be good to copy the longer description from RowMatrix for the k param


    def computePrincipalComponents(self, k):
        """
        Computes the k principal components of the given row matrix
Contributor

It might be good to copy the warnings from RowMatrix here as well.



class SingularValueDecomposition(JavaModelWrapper):
    """Wrapper around the SingularValueDecomposition scala case class"""
Contributor

Probably add a versionAdded

@MechCoder
Contributor Author

Thanks for the reviews, @holdenk. Unfortunately, I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.

@holdenk
Contributor

holdenk commented Oct 14, 2016

@MechCoder Thanks! I'll look around and see if anyone else is interested in taking this over and bringing it to the finish line; otherwise I'll pick it up myself after OSCON :)

@HyukjinKwon
Member

Ping @MechCoder, are you able to proceed with this PR and address the comments above? If not, it might be good to close this for now.

@MLnick
Contributor

MLnick commented Apr 12, 2017

Note I revived this at #17621 based on @MechCoder's work.

@MechCoder MechCoder deleted the svd_pyspark branch May 1, 2017 04:05
ghost pushed a commit to dbtsai/spark that referenced this pull request May 3, 2017
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on apache#7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <manojkumarsivaraj334@gmail.com>
Author: Nick Pentreath <nickp@za.ibm.com>

Closes apache#17621 from MLnick/SPARK-6227-pyspark-svd-pca.
asfgit pushed a commit that referenced this pull request May 3, 2017
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only).

Based on #7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <manojkumarsivaraj334@gmail.com>
Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17621 from MLnick/SPARK-6227-pyspark-svd-pca.

(cherry picked from commit db2fb84)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@SixAlien3

@MLnick Hi, I'm interested in this PySpark wrapper for SVD. How many columns can it support? I ask because the old documentation says it can only support fewer than 1000 columns. Does the same limit apply to this wrapper?
