
SPARK-3770: Make userFeatures accessible from python #2636

Closed
wants to merge 8 commits into from

Conversation

@mdagost commented Oct 2, 2014

https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from Python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class to turn the RDD[(Int, Array[Double])] into an RDD[String], which is then accessed from the Python recommendation.py.
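
For context, the string approach can be sketched in plain Python as follows. The comma-separated format and the parse_feature_string helper are hypothetical stand-ins made up for illustration, not the patch's actual encoding:

```python
# Hypothetical sketch of the RDD[String] approach: each (id, features)
# record is flattened to a single string on the Scala side, then parsed
# back into Python types. The "id,f1,f2,..." format is an assumption
# invented for this example.
def parse_feature_string(s):
    """Parse 'id,f1,f2,...' into (int, [float, ...])."""
    parts = s.split(",")
    return int(parts[0]), [float(x) for x in parts[1:]]

pair = parse_feature_string("42,0.1,0.2,0.3")  # (42, [0.1, 0.2, 0.3])
```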

@AmplabJenkins

Can one of the admins verify this patch?

@mengxr (Contributor) commented Oct 3, 2014

@mdagost If you convert (Int, Array[Double]) to a java.util.List&lt;Object&gt; (with the id as the first element and the features as the second, without converting to a string), you should be able to get the data correctly on the Python side. If that works, could you add productFeatures as well? Thanks!

@davies (Contributor) commented Oct 3, 2014

@mdagost @mengxr We use Pyrolite to convert Java objects into Python objects, you can get the type mapping here: https://github.com/irmen/Pyrolite

So if we convert (Int, Array[Double]) into Array[Object] or JList[Object], we can get an RDD of tuple(int, array(double)) or list(int, array(double)) in Python.
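
As a rough illustration of that type mapping (plain Python standing in for the result of a Pyrolite round-trip, not an actual Spark call), a Scala (Int, Array[Double]) converted to Array[Object] would surface on the Python side roughly as:

```python
from array import array

# What a Scala (Int, Array[Double]) serialized as Array[Object] looks
# like after Pyrolite unpickles it on the Python side: a tuple of an
# int and a typed double array. The concrete values are made up for
# illustration.
user_feature = (42, array('d', [0.1, 0.2, 0.3]))

user_id, features = user_feature
```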

@mdagost (Author) commented Oct 3, 2014

I'm totally new to Spark, so sorry if these are all dumb questions.

Are you suggesting that I convert the userFeatures RDD[(Int, Array[Double])] to RDD[Array[Object]]? If so, do you want a helper function for doing that, like I did for the string helper, or should I convert the main userFeatures to be of that type?

Also, I'm sure this is dumb, but what exact type of Object are we talking about?

@davies (Contributor) commented Oct 3, 2014

We still need this wrapper, but RDD[Array[Object]] is only used for the Python API, so it's better to put it in PythonMLLibAPI. It could maybe be more general, like fromTupleRDD, which would convert any RDD[Tuple2[_, _]] into RDD[Array[Any]] (Any is Scala's counterpart to Java's Object).
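
In local-collection terms, the per-record transform such a helper would apply looks roughly like this (a plain-Python sketch; from_tuple_records is a hypothetical name, and the real helper would map over an RDD, not a list):

```python
def from_tuple_records(records):
    """Turn each 2-tuple into a 2-element list, the shape that can be
    serialized as an Array[Any] equivalent for the Python side."""
    return [[k, v] for k, v in records]

converted = from_tuple_records([(1, [0.5, 0.5]), (2, [0.1, 0.9])])
# converted == [[1, [0.5, 0.5]], [2, [0.1, 0.9]]]
```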

@MLnick (Contributor) commented Oct 4, 2014

Ideally we would want to expose the actual RDD[(Int, Array[Double])] on the PySpark side, in case they are really large; they can then be collected if need be.

Can we make use of the existing pairRDDToPython function to do the conversion?

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala#L120

@jkbradley (Member) commented

@MLnick @mdagost There are a few functions available which you could use for the serialization, but PythonRDD.javaToPython might be a good option. You can see example usage in recommendation.py

@mdagost (Author) commented Oct 7, 2014

I've been having trouble getting either PythonRDD.javaToPython or pairRDDToPython to work. But porting the general function I wrote from MatrixFactorizationModel.scala to PythonMLLibAPI is also giving me some trouble. I'll get back to it later this week and try to make some progress...

@AmplabJenkins

Can one of the admins verify this patch?

@mdagost (Author) commented Oct 20, 2014

@MLnick It doesn't look like pairRDDToPython does the trick. I tried

def userFeatures(self):
    juf = self._java_model.userFeatures()
    juf = sc._jvm.SerDeUtil.pairRDDToPython(juf, 1)
    return juf

but what comes out when I try to print the first element of the RDD is just "[[B@176fa1a5" (which looks like the default toString of a raw Java byte array, i.e. the pickled bytes are never deserialized) rather than any kind of nicely formatted Python object.

@mdagost (Author) commented Oct 20, 2014

@davies Your idea of adding something like fromTupleRDD to PythonMLLibAPI seems to be the way to go. I'm just doing some cleanup and will push userFeatures and productFeatures in just a bit.

Michelangelo D'Agostino added 3 commits October 20, 2014 11:13
…d it to expose the MF userFeatures and productFeatures in python.
… no longer needed now that we added the fromTuple2RDD function.
@mengxr (Contributor) commented Oct 21, 2014

@mdagost Thanks for working on the SerDe! I tested it locally and it works correctly, but the unit tests for the added methods are missing. Do you mind adding them? You can follow

https://github.com/mdagost/spark/blob/mf_user_features/python/pyspark/mllib/recommendation.py#L55

Basically, we want to verify that userFeatures/productFeatures returns an RDD of key-value pairs with the correct number of records, and that for each record the feature dimension is correct.
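
In outline, the requested checks amount to something like this (a plain-Python sketch over made-up stand-in data; the real test would collect userFeatures() from a trained model, as in recommendation.py):

```python
# Stand-in for model.userFeatures().collect(): (user id, feature vector)
# pairs. The values are invented for illustration.
user_features = [(0, [0.6, 0.3]), (1, [0.2, 0.8]), (2, [0.9, 0.1])]

expected_num_users, expected_rank = 3, 2
# Correct number of records, and correct dimension for every record.
assert len(user_features) == expected_num_users
assert all(len(f) == expected_rank for _, f in user_features)
```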

@mdagost (Author) commented Oct 21, 2014

Whoops. Forgot the tests :) I'll work on those today.

@mdagost (Author) commented Oct 21, 2014

@mengxr Unit tests are added. I get some unrelated test failures locally (everything in recommendation.py, including the new stuff, passes).

@mengxr (Contributor) commented Oct 21, 2014

this is ok to test

@mengxr (Contributor) commented Oct 21, 2014

test this please

@SparkQA commented Oct 21, 2014

QA tests have started for PR 2636 at commit c98f9e2.

  • This patch merges cleanly.

@SparkQA commented Oct 21, 2014

QA tests have finished for PR 2636 at commit c98f9e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21994/

@asfgit closed this in 1a623b2 Oct 21, 2014
7 participants