
SPARK-3770: Make userFeatures accessible from python #2636

Closed
wants to merge 8 commits into from

Conversation

@mdagost commented Oct 2, 2014

https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class to turn the RDD[(Int, Array[Double])] into an RDD[String], which is then accessed from the Python recommendation.py.
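
For context, the string approach can be sketched in plain Python as follows. The comma-separated format and the parse_feature_string helper are hypothetical stand-ins made up for illustration, not the patch's actual encoding:

```python
# Hypothetical sketch of the RDD[String] approach: each (id, features)
# record is flattened to a single string on the Scala side, then parsed
# back into Python types. The "id,f1,f2,..." format is an assumption
# invented for this example.
def parse_feature_string(s):
    """Parse 'id,f1,f2,...' into (int, [float, ...])."""
    parts = s.split(",")
    return int(parts[0]), [float(x) for x in parts[1:]]

pair = parse_feature_string("42,0.1,0.2,0.3")  # (42, [0.1, 0.2, 0.3])
```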

@AmplabJenkins

Can one of the admins verify this patch?

@mengxr (Contributor) commented Oct 3, 2014

@mdagost If you convert (Int, Array[Double]) to a java.util.List&lt;Object&gt; (with the id as the first element and the features as the second, without converting to a string), you should be able to get the data correctly on the Python side. If that works, could you add productFeatures as well? Thanks!

@davies (Contributor) commented Oct 3, 2014

@mdagost @mengxr We use Pyrolite to convert Java objects into Python objects, you can get the type mapping here: https://github.com/irmen/Pyrolite

So if we convert (Int, Array[Double]) into Array[Object] or JList[Object], we can get an RDD of tuple(int, array(double)) or list(int, array(double)) in Python.
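
As a rough illustration of that type mapping (plain Python standing in for the result of a Pyrolite round-trip, not an actual Spark call), a Scala (Int, Array[Double]) converted to Array[Object] would surface on the Python side roughly as:

```python
from array import array

# What a Scala (Int, Array[Double]) serialized as Array[Object] looks
# like after Pyrolite unpickles it on the Python side: a tuple of an
# int and a typed double array. The concrete values are made up for
# illustration.
user_feature = (42, array('d', [0.1, 0.2, 0.3]))

user_id, features = user_feature
```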

@mdagost (Author) commented Oct 3, 2014

I'm totally new to Spark, so sorry if these are all dumb questions.

Are you suggesting that I convert the userFeatures RDD[(Int, Array[Double])] to RDD[Array[Object]]? If so, do you want a helper function for doing that, like I did for the string helper, or should I convert the main userFeatures to be of that type?

Also, I'm sure this is dumb, but what exact type of Object are we talking about?

@davies (Contributor) commented Oct 3, 2014

We still need this wrapper, but RDD[Array[Object]] is only used for the Python API, so it's better to put it in PythonMLLibAPI. It could maybe be more general, like fromTupleRDD, which would convert any RDD[Tuple2[_, _]] into RDD[Array[Any]] (Any is Scala's counterpart to Java's Object).
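
In local-collection terms, the per-record transform such a helper would apply looks roughly like this (a plain-Python sketch; from_tuple_records is a hypothetical name, and the real helper would map over an RDD, not a list):

```python
def from_tuple_records(records):
    """Turn each 2-tuple into a 2-element list, the shape that can be
    serialized as an Array[Any] equivalent for the Python side."""
    return [[k, v] for k, v in records]

converted = from_tuple_records([(1, [0.5, 0.5]), (2, [0.1, 0.9])])
# converted == [[1, [0.5, 0.5]], [2, [0.1, 0.9]]]
```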

@MLnick (Contributor) commented Oct 4, 2014

Ideally we would want to expose the actual RDD[(Int, Array[Double])] on the PySpark side, in case they are really large; they can then be collected if need be.

Can we make use of the existing pairRDDToPython function to do the conversion?

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala#L120

@jkbradley (Member) commented

@MLnick @mdagost There are a few functions available which you could use for the serialization, but PythonRDD.javaToPython might be a good option. You can see example usage in recommendation.py

@mdagost (Author) commented Oct 7, 2014

I've been having trouble getting either PythonRDD.javaToPython or pairRDDToPython to work. But porting the general function I wrote from MatrixFactorizationModel.scala to PythonMLLibAPI is also giving me some trouble. I'll get back to it later this week and try to make some progress...

@AmplabJenkins

Can one of the admins verify this patch?

@mdagost (Author) commented Oct 20, 2014

@MLnick It doesn't look like pairRDDToPython does the trick. I tried

def userFeatures(self):
    juf = self._java_model.userFeatures()
    juf = sc._jvm.SerDeUtil.pairRDDToPython(juf, 1)
    return juf

but what comes out when I try to print the first element of the RDD is just "[[B@176fa1a5" (which looks like the default toString of a raw Java byte array, i.e. the pickled bytes are never deserialized) rather than any kind of nicely formatted Python object.

@mdagost (Author) commented Oct 20, 2014

@davies Your idea of adding something like fromTupleRDD to PythonMLLibAPI seems to be the way to go. I'm just doing some cleanup and will push userFeatures and productFeatures in just a bit.

Michelangelo D'Agostino added 3 commits October 20, 2014 11:13
…d it to expose the MF userFeatures and productFeatures in python.
… no longer needed now that we added the fromTuple2RDD function.
@mengxr (Contributor) commented Oct 21, 2014

@mdagost Thanks for working on the SerDe! I tested it locally and it works correctly, but the unit tests for the added methods are missing. Do you mind adding them? You can follow

https://github.com/mdagost/spark/blob/mf_user_features/python/pyspark/mllib/recommendation.py#L55

Basically, we want to verify that userFeatures/productFeatures returns an RDD of key-value pairs with the correct number of records, and that for each record the feature dimension is correct.
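
In outline, the requested checks amount to something like this (a plain-Python sketch over made-up stand-in data; the real test would collect userFeatures() from a trained model, as in recommendation.py):

```python
# Stand-in for model.userFeatures().collect(): (user id, feature vector)
# pairs. The values are invented for illustration.
user_features = [(0, [0.6, 0.3]), (1, [0.2, 0.8]), (2, [0.9, 0.1])]

expected_num_users, expected_rank = 3, 2
# Correct number of records, and correct dimension for every record.
assert len(user_features) == expected_num_users
assert all(len(f) == expected_rank for _, f in user_features)
```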

@mdagost (Author) commented Oct 21, 2014

Whoops. Forgot the tests :) I'll work on those today.

@mdagost (Author) commented Oct 21, 2014

@mengxr Unit tests are added. I get some unrelated test failures locally (everything in recommendation.py, including the new stuff, passes).

@mengxr (Contributor) commented Oct 21, 2014

this is ok to test

@mengxr (Contributor) commented Oct 21, 2014

test this please

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2636 at commit c98f9e2.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2636 at commit c98f9e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21994/

@asfgit closed this in 1a623b2 Oct 21, 2014
7 participants