[SPARK-6827] [mllib] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API #5614
Test build #30678 has started for PR 5614 at commit
Test build #30678 has finished for PR 5614 at commit
Test FAILed.
@yanboliang Sent you a PR at yanboliang#2 for using namedtuples. I'm also thinking about pickling the items into byte strings on the Python side before training. Then on the JVM side, the items are all strings and we don't need to worry about the compatibility of SerDe. When we map the frequent itemsets back, we can unpickle the byte strings. Maybe we can try this in another PR.
Test build #30723 has started for PR 5614 at commit
@mengxr Thank you for your comments and help, I have merged your PR to this PR.
Test build #30723 has finished for PR 5614 at commit
Test FAILed.
test this please
Test build #30726 has started for PR 5614 at commit
Test build #30726 has finished for PR 5614 at commit
Test PASSed.
LGTM. Merged into master. Thanks!
That is different. In FPGrowth, we don't really care about the item type as long as the items are serializable. So it is not necessary to map Python objects into their equivalent JVM objects through SerDe. Instead, we can pickle the items on the Python side and treat all items as strings on the JVM side. I'm not sure whether this optimization is worth doing. Maybe we should first wait and see whether there are issues with the current implementation.
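The pickling idea can be illustrated without Spark. The following is a minimal pure-Python sketch, not the PR's implementation: the function names are hypothetical, and a brute-force counter stands in for the JVM-side FP-growth so that the only point shown is that the mining side needs nothing but opaque byte strings. It assumes that equal items pickle to identical bytes, which holds for the simple strings, ints, and tuples used here but is not guaranteed for every Python type.

```python
import pickle
from collections import Counter
from itertools import combinations


def encode_transactions(transactions):
    """Pickle every item so the mining side only ever sees byte strings."""
    return [[pickle.dumps(item, protocol=2) for item in t] for t in transactions]


def count_itemsets(encoded, min_count):
    """Stand-in for the JVM side (brute force, NOT FP-growth).

    It compares and counts byte strings only; it never inspects the
    original Python types.
    """
    counts = Counter()
    for t in encoded:
        unique = sorted(set(t))
        for size in range(1, len(unique) + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_count}


def decode_itemsets(frequent):
    """Unpickle the byte strings when mapping the frequent itemsets back."""
    return [([pickle.loads(b) for b in combo], n) for combo, n in frequent.items()]


# Items of mixed types: a string, an int, and a tuple.
transactions = [["a", 1, (2, "x")], ["a", 1], ["a", (2, "x")]]
encoded = encode_transactions(transactions)
result = decode_itemsets(count_itemsets(encoded, min_count=2))
```

Each entry of `result` is a pair of a list of the original Python items and a count. The trade-off is the one raised above: this avoids per-type SerDe work, but it relies on pickle producing stable bytes for equal items.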
…istent with Java API
Make PySpark `FPGrowthModel.freqItemsets` consistent with the Java/Scala API, like `MatrixFactorizationModel.userFeatures`. It returns an RDD in which each tuple is composed of an array and a long value. I think it is difficult to implement namedtuples to wrap the output, because the items of freqItemsets can be of any type and the itemsets can have arbitrary length, which makes the corresponding SerDe function tedious to implement.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes apache#5614 from yanboliang/spark-6827 and squashes the following commits:
da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
Make PySpark `FPGrowthModel.freqItemsets` consistent with the Java/Scala API, like `MatrixFactorizationModel.userFeatures`. It returns an RDD in which each tuple is composed of an array and a long value.
I think it is difficult to implement namedtuples to wrap the output, because the items of freqItemsets can be of any type and the itemsets can have arbitrary length, which makes the corresponding SerDe function tedious to implement.
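For reference, a namedtuple by itself places no constraint on item type or itemset length, so wrapping an (array, long) pair on the Python side can be sketched as below. This is only an illustration: the name `FreqItemset`, the field names `items` and `freq`, and the helper `wrap_itemsets` are assumptions, and plain lists stand in for the RDD.

```python
from collections import namedtuple

# Hypothetical wrapper type; the field names are assumptions for illustration.
FreqItemset = namedtuple("FreqItemset", ["items", "freq"])


def wrap_itemsets(raw_pairs):
    """Turn plain (items, count) pairs into named records."""
    return [FreqItemset(list(items), int(freq)) for items, freq in raw_pairs]


# Itemsets of different lengths holding items of different types.
raw = [(["a"], 3), (["a", 1], 2), ([(2, "x"), "a"], 2)]
wrapped = wrap_itemsets(raw)

# Fields are reachable by name, and the record still unpacks like a tuple.
first_items, first_freq = wrapped[0]
```

With real data the same mapping would be applied per element of the returned RDD, so callers could write `fi.items` and `fi.freq` instead of indexing into a bare tuple.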