
[SPARK-6827] [mllib] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API #5614

Closed
wants to merge 2 commits into apache:master from yanboliang:spark-6827

Conversation

yanboliang
Contributor

Make PySpark `FPGrowthModel.freqItemsets` consistent with the Java/Scala API, like `MatrixFactorizationModel.userFeatures`.
It returns an RDD in which each element is a tuple composed of an array of items and a long frequency count.
I think it is difficult to wrap the output in namedtuples, because the items of freqItemsets can be of any type and arbitrary length, which makes it tedious to implement the corresponding SerDe functions.
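
A minimal PySpark sketch of the behavior described above (the transaction data and thresholds here are illustrative, not from the PR):

```python
from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext(appName="FPGrowthFreqItemsetsExample")

# Each element of the input RDD is one transaction (a list of items).
transactions = sc.parallelize([
    ["a", "b", "c"],
    ["a", "b"],
    ["b", "c"],
])

model = FPGrowth.train(transactions, minSupport=0.6, numPartitions=2)

# freqItemsets() returns an RDD with one record per frequent itemset:
# the items together with their frequency (count), matching the
# Java/Scala FPGrowthModel.freqItemsets.
for itemset in model.freqItemsets().collect():
    print(itemset)

sc.stop()
```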

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30678 has started for PR 5614 at commit 5532e78.

@SparkQA

SparkQA commented Apr 21, 2015

Test build #30678 has finished for PR 5614 at commit 5532e78.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30678/
Test FAILed.

@mengxr
Contributor

mengxr commented Apr 21, 2015

@yanboliang Sent you a PR at yanboliang#2 for using namedtuples.

I'm also thinking about pickling the items into byte strings on the Python side before training. Then on the JVM side, the items are all strings and we don't need to worry about the compatibility of SerDe. When we map the frequent itemsets back, we can unpickle the byte strings. Maybe we can try this in another PR.
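
For reference, the namedtuple wrapping suggested here is essentially what the final patch ends up doing (per the test output further down); the snippet below is a simplified sketch rather than the actual diff, and `_wrap_freq_itemsets` is a hypothetical helper name:

```python
from collections import namedtuple


class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
    """Represents an (items, freq) pair produced by FPGrowthModel.freqItemsets()."""


def _wrap_freq_itemsets(raw_rdd):
    # raw_rdd holds plain (items, freq) tuples coming back from the JVM;
    # mapping them into FreqItemset gives callers named field access
    # (x.items, x.freq) instead of positional indexing.
    return raw_rdd.map(lambda x: FreqItemset(x[0], x[1]))
```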

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30723 has started for PR 5614 at commit da8c404.

@yanboliang
Contributor Author

@mengxr Thank you for your comments and help; I have merged your PR into this one.
I have investigated the pickle/unpickle question and found that the existing code (`_py2java` and `_java2py`) already does what you described: in `_py2java`, an object is converted to a bytearray unless its type is one of int, long, float, bool, bytes, or unicode. Have I understood you correctly?
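
A simplified sketch of the dispatch being described (illustrative only, not the actual `_py2java` source; the JVM hand-off is elided and `_py2java_sketch` is a hypothetical name):

```python
from pyspark.serializers import PickleSerializer

# Python 2-era basic types that can cross the Py4J gateway unchanged.
_BASIC_TYPES = (int, long, float, bool, bytes, unicode)  # noqa: F821


def _py2java_sketch(obj):
    if isinstance(obj, _BASIC_TYPES):
        # Basic values are passed through as-is.
        return obj
    # Everything else is pickled into a bytearray, which the JVM-side
    # SerDe layer later unpickles (that step is omitted here).
    return bytearray(PickleSerializer().dumps(obj))
```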

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30723 has finished for PR 5614 at commit da8c404.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
  • This patch does not change any dependencies.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30723/
Test FAILed.

@mengxr
Contributor

mengxr commented Apr 22, 2015

test this please

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30726 has started for PR 5614 at commit da8c404.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30726 has finished for PR 5614 at commit da8c404.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30726/
Test PASSed.

@mengxr
Contributor

mengxr commented Apr 23, 2015

LGTM. Merged into master. Thanks!

asfgit closed this in f4f3998 on Apr 23, 2015
@mengxr
Contributor

mengxr commented Apr 23, 2015

That is different. In FPGrowth, we don't really care about the item type as long as they are serializable. So it is not necessary to map Python objects into their equivalent JVM objects through SerDes. Instead, we can pickle the items on Python side and treat all items as strings on the JVM side. I'm not sure whether it is worth doing this optimization. Maybe we should wait and see whether there are issues with the current implementation first.
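
To illustrate the optimization being discussed (a sketch only; it was not adopted in this PR, and both helper names are hypothetical):

```python
import pickle


def _encode_transactions(transactions_rdd):
    # Pickle every item so the JVM only ever sees opaque byte strings and
    # no per-type SerDe mapping is needed during training.
    return transactions_rdd.map(
        lambda basket: [pickle.dumps(item) for item in basket])


def _decode_freq_itemsets(freq_itemsets_rdd):
    # Unpickle the byte strings when the frequent itemsets are mapped back.
    return freq_itemsets_rdd.map(
        lambda fi: ([pickle.loads(item) for item in fi.items], fi.freq))
```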

yanboliang deleted the spark-6827 branch on April 24, 2015 10:02
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015
[SPARK-6827] [mllib] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API

Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
It returns an RDD in which each element is a tuple composed of an array of items and a long frequency count.
I think it is difficult to wrap the output in namedtuples, because the items of freqItemsets can be of any type and arbitrary length, which makes it tedious to implement the corresponding SerDe functions.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#5614 from yanboliang/spark-6827 and squashes the following commits:

da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
[SPARK-6827] [mllib] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API

Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
It returns an RDD in which each element is a tuple composed of an array of items and a long frequency count.
I think it is difficult to wrap the output in namedtuples, because the items of freqItemsets can be of any type and arbitrary length, which makes it tedious to implement the corresponding SerDe functions.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#5614 from yanboliang/spark-6827 and squashes the following commits:

da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API