[SPARK-6827] [mllib] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API #5614

yanboliang · 2015-04-21T15:26:02Z

Make PySpark FPGrowthModel.freqItemsets consistent with the Java/Scala API, like MatrixFactorizationModel.userFeatures.
It returns an RDD in which each tuple is composed of an array and a long value.
I think it's difficult to implement namedtuples to wrap the output, because the items of freqItemsets can be of any type and arbitrary length, which makes the corresponding SerDe function tedious to implement.
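
To make the return shape concrete, here is a minimal sketch that runs without Spark. A plain list stands in for the RDD, and the sample rows are made up. The `FreqItemset` name and fields follow the class reported later in this thread by SparkQA; everything else is an assumption for illustration.

```python
from collections import namedtuple


# Same name and fields as the class listed in the SparkQA report for commit da8c404.
class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
    """One frequent itemset: the items and how often they occur together."""


# Made-up (items, freq) pairs standing in for rows of the returned RDD.
# Items may be of any type, and each itemset may have any length.
raw_rows = [(["a"], 4), (["a", "b"], 3), ([1, 2, 3], 2)]

# Wrapping each plain tuple gives named access without changing the data.
wrapped = [FreqItemset(items, freq) for items, freq in raw_rows]

for itemset in wrapped:
    print(itemset.items, itemset.freq)
```

Because a namedtuple is still a tuple, code that unpacks `(items, freq)` positionally keeps working after the wrap.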

SparkQA · 2015-04-21T15:28:36Z

Test build #30678 has started for PR 5614 at commit 5532e78.

SparkQA · 2015-04-21T16:35:58Z

Test build #30678 has finished for PR 5614 at commit 5532e78.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

AmplabJenkins · 2015-04-21T16:36:03Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30678/
Test FAILed.

mengxr · 2015-04-21T18:05:31Z

@yanboliang Sent you a PR at yanboliang#2 for using namedtuples.

I'm also thinking about pickling the items into byte strings on the Python side before training. Then on the JVM side, the items are all strings and we don't need to worry about the compatibility of SerDe. When we map the frequent itemsets back, we can unpickle the byte strings. Maybe we can try this in another PR.
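
A rough sketch of that idea, using only the standard `pickle` module. The helper names `encode_transactions` and `decode_itemset` are hypothetical and not part of PySpark; the JVM round trip is left out, so this only shows the encode and decode steps on the Python side.

```python
import pickle


def encode_transactions(transactions):
    """Pickle every item so the other side only ever sees opaque byte strings."""
    return [[pickle.dumps(item, protocol=2) for item in t] for t in transactions]


def decode_itemset(encoded_items):
    """Turn the byte strings of one frequent itemset back into Python objects."""
    return [pickle.loads(b) for b in encoded_items]


# Made-up transactions with mixed item types.
transactions = [[("user", 1), "milk"], ["milk", 3.5]]

encoded = encode_transactions(transactions)
decoded = [decode_itemset(t) for t in encoded]
print(decoded)
```

One caveat with this approach: counting only works if equal items produce identical bytes. That holds for simple values like the ones above, but pickle does not guarantee a canonical encoding for every object.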

SparkQA · 2015-04-22T03:18:37Z

Test build #30723 has started for PR 5614 at commit da8c404.

yanboliang · 2015-04-22T03:21:25Z

@mengxr Thank you for your comments and help. I have merged your PR into this PR.
I have investigated the pickle/unpickle problem and found that the existing code (_py2java and _java2py) already does what you described. In the function _py2java, the object is transformed to a bytearray unless its type is one of int, long, float, bool, bytes, unicode. Have I understood you correctly?
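
The rule described in this comment can be sketched as follows. This is not the actual `_py2java` code; `to_jvm_payload` is a hypothetical helper written only from the description above, and the type list uses Python 3 stand-ins (`int` for int/long, `str` for unicode).

```python
import pickle

# Python 3 stand-ins for the types named in the comment:
# int, long, float, bool, bytes, unicode.
PASS_THROUGH_TYPES = (int, float, bool, bytes, str)


def to_jvm_payload(obj):
    """Hypothetical helper: pass simple values through, pickle everything else."""
    if type(obj) in PASS_THROUGH_TYPES:
        return obj
    # Anything else is serialized and handed over as a bytearray.
    return bytearray(pickle.dumps(obj, protocol=2))


print(type(to_jvm_payload(3)).__name__)
print(type(to_jvm_payload(["a", "b"])).__name__)
```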

SparkQA · 2015-04-22T04:24:56Z

Test build #30723 has finished for PR 5614 at commit da8c404.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
This patch does not change any dependencies.

AmplabJenkins · 2015-04-22T04:25:01Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30723/
Test FAILed.

mengxr · 2015-04-22T04:28:18Z

test this please

SparkQA · 2015-04-22T04:33:40Z

Test build #30726 has started for PR 5614 at commit da8c404.

SparkQA · 2015-04-22T06:11:26Z

Test build #30726 has finished for PR 5614 at commit da8c404.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class FreqItemset(namedtuple("FreqItemset", ["items", "freq"])):
This patch does not change any dependencies.

AmplabJenkins · 2015-04-22T06:11:31Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/30726/
Test PASSed.

mengxr · 2015-04-23T00:23:17Z

LGTM. Merged into master. Thanks!

mengxr · 2015-04-23T00:28:27Z

That is different. In FPGrowth, we don't really care about the item type as long as they are serializable. So it is not necessary to map Python objects into their equivalent JVM objects through SerDes. Instead, we can pickle the items on Python side and treat all items as strings on the JVM side. I'm not sure whether it is worth doing this optimization. Maybe we should wait and see whether there are issues with the current implementation first.

…istent with Java API

Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
It returns an RDD in which each tuple is composed of an array and a long value.
I think it's difficult to implement namedtuples to wrap the output, because the items of freqItemsets can be of any type and arbitrary length, which makes the corresponding SerDe function tedious to implement.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#5614 from yanboliang/spark-6827 and squashes the following commits:

da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API

Wrap FPGrowthModel.freqItemsets and make it consistent with Java API

5532e78

use namedtuple

da8c404

asfgit closed this in f4f3998 Apr 23, 2015

yanboliang deleted the spark-6827 branch April 24, 2015 10:02
