[SPARK-6227] [MLlib] [PySpark] Implement PySpark wrappers for SVD and PCA #7963
Conversation
Actually I'll add the PCA wrappers in this PR as well.
Test build #39888 has finished for PR 7963 at commit
Test build #39895 has finished for PR 7963 at commit
Force-pushed from 56978ae to f64a83f
Test build #39901 has finished for PR 7963 at commit
Test build #39904 has finished for PR 7963 at commit
All right, this PR is ready for review. cc: @dusenberrymw @mengxr
Test build #39916 has finished for PR 7963 at commit
@@ -128,6 +128,26 @@ quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.

</div>
<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.mllib.linalg import Matrix
Minor: This import isn't needed.
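For context, the doc section under discussion adds a Python SVD example for `RowMatrix`. A minimal sketch of what the completed snippet might look like (the data and `k` value are illustrative, and `sc` is an existing SparkContext):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)

# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)
U = svd.U  # The U factor is a RowMatrix.
s = svd.s  # The singular values are stored in a local dense vector.
V = svd.V  # The V factor is a local dense matrix.
```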
Great work, @MechCoder! I left some very small comments, and otherwise it looks good.
Thanks for the reviews, I have addressed your comments. Do you have anything else?
Test build #40286 has finished for PR 7963 at commit
        >>> mat.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
        [IndexedRow(0, [2.0,3.0]), IndexedRow(1, [6.0,11.0])]
        """
        return IndexedRowMatrix(self._java_matrix_wrapper.call("multiply", matrix))
I'd check that `matrix` is a `DenseMatrix` here as well.
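A minimal sketch of that guard, assuming `DenseMatrix` comes from `pyspark.mllib.linalg` (the error message wording is illustrative):

```python
from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.linalg.distributed import IndexedRowMatrix


# Sketch of the method under review, as it would appear inside IndexedRowMatrix:
def multiply(self, matrix):
    """Multiply this matrix by a local dense matrix on the right."""
    if not isinstance(matrix, DenseMatrix):
        raise ValueError("Only multiplication with DenseMatrix is supported.")
    return IndexedRowMatrix(self._java_matrix_wrapper.call("multiply", matrix))
```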
@MechCoder I'd just add the
…ted Linear Algebra Classes

This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR apache#7963 for SPARK-6227.

**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue, likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR. Therefore, I have marked this PR as WIP and am open to discussion.

**[3]**: Note: `multiply` and `computeSVD` are already part of PR apache#7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes apache#9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
Any progress on this, @dusenberrymw @MechCoder? It would be really helpful if I could do matrix multiplication in PySpark.
Test build #59437 has finished for PR 7963 at commit
@cavaunpeu Thanks for the ping! I think I've addressed the pending diff comment. It will take me some time to refresh my knowledge of the codebase. Can @MLnick or @holdenk give a final pass?
Test build #59445 has finished for PR 7963 at commit
Bump?
@MechCoder thanks for updating this - it may need to wait until after the 2.0 release for review.
Thanks for working on this and sorry this fell through the cracks post 2.0. I've left some initial comments - likely the same comments apply to the indexed one as well.
@@ -84,6 +84,25 @@ quick-start guide. Be sure to also include *spark-mllib* to your build file as
a dependency.

</div>
<div data-lang="python" markdown="1">
{% highlight python %}
from pyspark.mllib.linalg.distributed import RowMatrix
Nowadays we tend to write new examples separately and then use the include-example syntax to bring them in.
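For illustration, that would mean the docs reference a standalone example file via the `include_example` tag instead of an inline `highlight` block; the file path below is hypothetical:

```
{% include_example python/mllib/pca_rowmatrix_example.py %}
```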
The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.

{% highlight python %}
Same as above
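For reference, a sketch of what that doc snippet might contain (the data and `k` are illustrative, and `sc` is an existing SparkContext):

```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix

rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)

# Compute the top 4 principal components; the result is a local dense matrix.
pc = mat.computePrincipalComponents(4)

# Project the rows into the linear space spanned by the top 4 principal components.
projected = mat.multiply(pc)
```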
@@ -303,6 +303,121 @@ def tallSkinnyQR(self, computeQ=False):
        R = decomp.call("R")
        return QRDecomposition(Q, R)

    def computeSVD(self, k, computeU=False, rCond=1e-9):
Would be good to add a since annotation here
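A sketch of what that could look like, using PySpark's `since` decorator (the version string here is a placeholder for whichever release this lands in):

```python
from pyspark import since
from pyspark.mllib.linalg.distributed import DistributedMatrix


class RowMatrix(DistributedMatrix):
    # ...existing constructor and methods elided...

    @since('2.2.0')  # placeholder version string
    def computeSVD(self, k, computeU=False, rCond=1e-9):
        """Computes the singular value decomposition of this RowMatrix."""
        ...
```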
        For more specific details on the implementation, please refer to
        the Scala documentation.

        :param k: Set the number of singular values to keep.
It might be good to copy the longer description from RowMatrix for the k param
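For reference, the Scala `RowMatrix.computeSVD` documentation describes `k` roughly as follows (paraphrased from memory; the exact wording should be verified against the Scala source):

```
k: number of leading singular values to keep (0 < k <= n).
   It might return fewer than k if there are numerically zero singular values,
   or if not enough Ritz values converge before the maximum number of Arnoldi
   update iterations is reached (in case the matrix is ill-conditioned).
```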
    def computePrincipalComponents(self, k):
        """
        Computes the k principal components of the given row matrix
It might be good to copy the warnings from RowMatrix here as well.
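For reference, the Scala `RowMatrix` docs carry a note along these lines that could be copied into the Python docstring (wording approximate):

```
.. note:: This cannot be computed on matrices with more than 65535 columns.
```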
class SingularValueDecomposition(JavaModelWrapper):
    """Wrapper around the SingularValueDecomposition scala case class"""
Probably add a versionAdded
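A minimal sketch of adding that via the Sphinx `versionadded` directive in the class docstring (the version number is a placeholder):

```python
from pyspark.mllib.common import JavaModelWrapper


class SingularValueDecomposition(JavaModelWrapper):
    """
    Represents singular value decomposition (SVD) factors.

    .. versionadded:: 2.2.0
    """
```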
Thanks for the reviews, @holdenk. Unfortunately, I will not be able to work on this anytime soon. Feel free to cherry-pick the commits if you wish.
@MechCoder Thanks! I'll look around and see if anyone else is interested in taking this over and bringing it to the finish line; otherwise I'll pick it up myself after OSCON :)
Ping @MechCoder, are you able to proceed with this PR and address the comments above? If not, it might be good to close this for now.
Note I revived this at #17621 based on @MechCoder's work. |
…CA (v2)

Add PCA and SVD to PySpark's wrappers for `RowMatrix` and `IndexedRowMatrix` (SVD only). Based on apache#7963, updated.

## How was this patch tested?

New doc tests and unit tests. Ran all examples locally.

Author: MechCoder <manojkumarsivaraj334@gmail.com>
Author: Nick Pentreath <nickp@za.ibm.com>

Closes apache#17621 from MLnick/SPARK-6227-pyspark-svd-pca.
(cherry picked from commit db2fb84)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@MLnick Hi, I'm interested in this PySpark wrapper for SVD. How many columns can it support? I see in the old documentation that it only supported fewer than 1,000 columns. What about this wrapper?
Singular Value Decomposition wrappers are missing in PySpark. Since the base for a RowMatrix has been laid, writing the wrappers becomes straightforward. I will follow up with the PCA wrappers in another PR.