SPARK-3770: Make userFeatures accessible from python #2636
Can one of the admins verify this patch? |
@mdagost @mengxr We use Pyrolite to convert Java objects into Python objects; you can find the type mapping here: https://github.com/irmen/Pyrolite So if we convert |
I'm totally new to Spark, so sorry if these are all dumb questions. Are you suggesting that I convert the userFeatures Also, I'm sure this is dumb, but what exact type of |
We still need this wrapper, but RDD[Array[Object]] is only used by the Python API, so it's better to put it in PythonMLLibAPI. It could also be more general, like fromTupleRDD, which would convert any RDD[Tuple[,]] into RDD[Array[Any]] (Any is similar to Java's Object). |
Ideally we would want to expose the actual RDD[(Int, Array)] on the PySpark side, in case the features are really large; they can then be collected if need be. Can we make use of the existing |
I've been having trouble getting either |
@MLnick It doesn't look like

    def userFeatures(self):
        juf = self._java_model.userFeatures()
        juf = sc._jvm.SerDeUtil.pairRDDToPython(juf, 1)
        return juf

but what comes out when I try to print the first element of the RDD is just "[[B@176fa1a5" rather than any kind of nicely formatted Python object. |
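One possible reading of the "[[B@176fa1a5" output, offered as an assumption rather than something confirmed in this thread: "[B" is the JVM's type descriptor for a byte array, so the elements may still be serialized bytes that nothing on the Python side has unpickled. The sketch below uses only the standard pickle module (no Spark) and a made-up record to show the difference between a serialized payload and the object recovered from it.

```python
import pickle

# Hypothetical record shaped like one userFeatures entry: (user id, latent factors).
record = (1, [0.1, 0.2, 0.3])

# Serialized form: an opaque byte string, comparable to what a byte[] holds on the JVM side.
payload = pickle.dumps(record, protocol=2)

# Deserialized form: the original key-value pair again.
restored = pickle.loads(payload)

print(type(payload).__name__)
print(restored)
```

If this reading is right, the returned Java RDD would need to be wrapped so that PySpark deserializes its contents before handing them to the caller.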
@davies Your idea of adding something like |
…d it to expose the MF userFeatures and productFeatures in python.
… no longer needed now that we added the fromTuple2RDD function.
@mdagost Thanks for working on the SerDe! I tested it locally and it works correctly, but the unit tests for the added methods are missing. Do you mind adding them? You can follow Basically, we want to verify that userFeatures/productFeatures returns an RDD of key-value pairs with the correct number of records, and that for each record the feature dimension is correct. |
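The checks requested above could look roughly like the following. This is a minimal sketch, not the test that was added in this PR: it runs on a plain Python list standing in for the collected RDD, and the names check_features, expected_count and rank are illustrative assumptions.

```python
def check_features(records, expected_count, rank):
    """Return True if records are (id, features) pairs with the expected count and dimension."""
    if len(records) != expected_count:
        return False
    for record in records:
        # Each record must be a key-value pair.
        if len(record) != 2:
            return False
        _key, features = record
        # Each feature vector must have one entry per latent factor.
        if len(features) != rank:
            return False
    return True


# Hypothetical collected output for 2 users and rank 3.
sample = [(0, [0.1, 0.2, 0.3]), (1, [0.4, 0.5, 0.6])]
print(check_features(sample, expected_count=2, rank=3))
print(check_features(sample, expected_count=2, rank=4))
```

In an actual test, the list would presumably come from collecting the RDD returned by userFeatures or productFeatures on a small trained model.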
Whoops. Forgot the tests :) I'll work on those today. |
@mengxr Unit tests are added. I get some unrelated test failures on my local (everything in |
this is ok to test |
test this please |
QA tests have started for PR 2636 at commit |
QA tests have finished for PR 2636 at commit |
Test PASSed. |
https://issues.apache.org/jira/browse/SPARK-3770
We need access to the underlying latent user features from Python. However, the userFeatures RDD of the MatrixFactorizationModel isn't accessible from the Python bindings. I've added a method to the underlying Scala class that turns the RDD[(Int, Array[Double])] into an RDD[String], which is then accessed from the Python recommendation.py