
[SPARK-5565] [ML] LDA wrapper for Pipelines API #9513

Closed
jkbradley wants to merge 11 commits into master from jkbradley:lda-pipelines

Conversation

jkbradley
Member

This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:

  • I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
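
    For example, a minimal sketch (not part of this PR) of how a caller could attach an ID column themselves before fitting, if they need one:

      import org.apache.spark.sql.functions.monotonically_increasing_id

      // Assumes `docs` is a DataFrame with a "features" column of term-count vectors.
      val docsWithId = docs.withColumn("docId", monotonically_increasing_id())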

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: @hhbyyh @feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: @mengxr

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45182 has finished for PR 9513 at commit 583e173.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

@hhbyyh
Contributor

hhbyyh commented Nov 6, 2015

@jkbradley Thanks for looping me in and welcome back. I'll try to send some feedback this weekend.

@jkbradley jkbradley changed the title [WIP] [SPARK-5565] [ML] LDA wrapper for Pipelines API [SPARK-5565] [ML] LDA wrapper for Pipelines API Nov 6, 2015
@jkbradley
Member Author

OK, I believe that's it, so this is ready for review. Thanks!

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45243 has finished for PR 9513 at commit ffb68c5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45245 has finished for PR 9513 at commit 9589e01.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

* @group getParam
*/
@Since("1.6.0")
def getAlpha: Array[Double] = getDocConcentration
Contributor

-1 on having both alpha and docConcentration. Can we take the opportunity here to choose one of the names and stick with it? It's confusing to have both alpha and docConcentration throughout the code.

Contributor

I'm neutral on this.

Member Author

Hm, yeah, I neglected to mention this in the design doc. I'm OK with removing alpha and beta from everything but the doc. I'll go ahead and take those out.
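
For reference, a rough sketch of what keeping only docConcentration (no alpha alias) could look like inside the LDA params trait; the exact doc string is an assumption, not the final code:

/**
 * Concentration parameter (commonly named "alpha") for the prior placed on documents'
 * distributions over topics ("theta").
 * @group param
 */
@Since("1.6.0")
final val docConcentration = new DoubleArrayParam(this, "docConcentration",
  "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
    " distributions over topics (\"theta\").")

/** @group getParam */
@Since("1.6.0")
def getDocConcentration: Array[Double] = $(docConcentration)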

@feynmanliang
Contributor

Made a pass

* @group param
*/
@Since("1.6.0")
final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
Contributor

Number of topics to infer?

Member Author

ok
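
The updated definition could read roughly as follows (the exact phrasing is an assumption):

/**
 * Param for the number of topics (clusters) to infer. Must be > 1.
 * @group param
 */
@Since("1.6.0")
final val k = new IntParam(this, "k", "number of topics (clusters) to infer",
  ParamValidators.gt(1))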

@jkbradley
Member Author

@feynmanliang I think that's the last fix.

Thinking more about it, I'm on board with changing LDAModel to be abstract, as long as it's a minor change. I'll see about making it in a follow-up PR.

@jkbradley
Member Author

One more possible change: How do y'all feel about renaming tau0 and kappa to follow the new sklearn LDA API?

  • kappa -> learningDecay
  • tau0 -> learningOffset

[http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation]
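
If we go with the sklearn-style names, the params might end up looking roughly like this; the doc strings and validators below are assumptions that mirror the existing tau0/kappa definitions:

/**
 * Learning rate, set as an exponential decay rate (the "kappa" of Online LDA).
 * This should be between (0.5, 1.0] to guarantee asymptotic convergence.
 * @group expertParam
 */
@Since("1.6.0")
final val learningDecay = new DoubleParam(this, "learningDecay",
  "Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to" +
    " guarantee asymptotic convergence.", ParamValidators.gt(0))

/**
 * A (positive) learning parameter that downweights early iterations (the "tau0" of
 * Online LDA). Larger values make early iterations count less.
 * @group expertParam
 */
@Since("1.6.0")
final val learningOffset = new DoubleParam(this, "learningOffset",
  "A (positive) learning parameter that downweights early iterations. Larger values make" +
    " early iterations count less.", ParamValidators.gt(0))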

@jkbradley
Member Author

I'm going to go ahead and change the tau0, kappa names.

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45452 has finished for PR 9513 at commit a55de6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45463 has finished for PR 9513 at commit 8eaa596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 10, 2015

Test build #2025 has finished for PR 9513 at commit 16a061c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@feynmanliang
Contributor

+1 on the renames


@jkbradley
Member Author

Got OK from @mengxr offline, so merging with master and branch-1.6

asfgit pushed a commit that referenced this pull request Nov 11, 2015
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@jkbradley
Member Author

I'll see about sending a follow-up with the subclassing. Let me know if there's anything else I'm forgetting.

Thanks @feynmanliang and @hhbyyh for reviewing!

@asfgit asfgit closed this in e281b87 Nov 11, 2015
@jkbradley jkbradley deleted the lda-pipelines branch November 11, 2015 00:22
@jkbradley
Member Author

Actually, @feynmanliang, I realized as I was trying to rewrite this that using a lazy val for DistributedLDAModel.oldLocalModel prevents us from making one important optimization in DistributedLDAModel.copy, which is called every time we call model.transform:

  • Currently: We copy the local model only if it has already been instantiated (instantiation involves collecting the topicsMatrix to the driver).
  • With a lazy val: I don't see a good way to ensure the collect only happens once.

Given that this could mean considerable overhead, including several copies of topicsMatrix on the driver, I'd prefer to keep the current class structure.
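
To make the concern concrete, here is a minimal sketch (simplified stand-in names, not the actual spark.ml classes) of the copy-only-if-already-collected pattern that a lazy val would make hard to preserve:

// Sketch only: simplified stand-ins for the real model classes.
class OldLocalModel
class OldDistributedModel {
  def toLocal: OldLocalModel = new OldLocalModel  // expensive: collects topicsMatrix to the driver
}

class DistributedModelWrapper(oldDistributedModel: OldDistributedModel) {
  // An explicit Option lets copy() ask whether the local model was ever materialized.
  private var oldLocalModelOption: Option[OldLocalModel] = None

  def oldLocalModel: OldLocalModel = oldLocalModelOption.getOrElse {
    val m = oldDistributedModel.toLocal  // collect happens at most once for this instance
    oldLocalModelOption = Some(m)
    m
  }

  def copy(): DistributedModelWrapper = {
    val c = new DistributedModelWrapper(oldDistributedModel)
    // Carry the local model over only if it was already computed. With a lazy val there is
    // no way to ask "has this been forced yet?", so each copy could trigger its own collect
    // and keep its own copy of topicsMatrix on the driver.
    c.oldLocalModelOption = oldLocalModelOption
    c
  }
}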

@feynmanliang
Contributor

@jkbradley Not sure I understand: if lazy val oldModel = *something*.collect(), then collect() will only be called once, on the first reference to oldModel, and every subsequent reference to oldModel will use the Array[...] materialized by collect().
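
To spell out the lazy val behavior being described, a tiny standalone sketch (names are illustrative):

object LazyValDemo {
  lazy val oldModel: Array[Int] = {
    println("collect() runs")  // printed only on the first reference
    Array(1, 2, 3)             // stands in for the collected model data
  }

  def main(args: Array[String]): Unit = {
    oldModel  // first reference: triggers the one-time evaluation
    oldModel  // later references reuse the cached Array; nothing is recomputed
  }
}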

@feynmanliang
Contributor

Oh wait I see what you're saying

@feynmanliang
Contributor

I still think it's wrong for a LocalLDAModel to optionally have an OldLocalLDAModel when all that class does is wrap functionality in OldLocalLDAModel. Forking the inheritance structure could avoid that by making the Option[OldLocalLDAModel] localized to DistributedLDAModel (and we can still have the copy-iff-already-collected semantics) while also removing the case Some(...) => ... case None => /* should never happen */ boilerplate.

What's happening right now is that functionality in a subclass (DistributedLDAModel's copy-if-collected semantics) is polluting the design of the superclass (LocalLDAModel having an Option[OldLocalLDAModel]).
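
Roughly, the proposed split could look like this (a sketch only; the wrapped MLlib classes are simplified stand-ins):

// Simplified stand-ins for the wrapped spark.mllib model classes.
class OldLocalLDAModel
class OldDistributedLDAModel {
  def toLocal: OldLocalLDAModel = new OldLocalLDAModel  // expensive collect
}

// Abstract parent: shared params and transform() logic would live here.
abstract class LDAModel {
  protected def oldLocalModel: OldLocalLDAModel
}

// Local model: always wraps an OldLocalLDAModel, so no Option is needed.
class LocalLDAModel(override protected val oldLocalModel: OldLocalLDAModel)
  extends LDAModel

// Distributed model: the Option lives only here, keeping the copy-iff-collected semantics.
class DistributedLDAModel(oldDistributedModel: OldDistributedLDAModel) extends LDAModel {
  private var oldLocalModelOption: Option[OldLocalLDAModel] = None

  override protected def oldLocalModel: OldLocalLDAModel = oldLocalModelOption.getOrElse {
    val m = oldDistributedModel.toLocal
    oldLocalModelOption = Some(m)
    m
  }
}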

@jkbradley
Member Author

Hm, good point. OK I'll try that & ping you on the PR.

asfgit pushed a commit that referenced this pull request Nov 13, 2015
Per discussion in the initial Pipelines LDA PR [#9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.

CC feynmanliang mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9678 from jkbradley/lda-pipelines-2.

(cherry picked from commit dcb896f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 13, 2015
dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015