
[SPARK-5565] [ML] LDA wrapper for Pipelines API #9513

Closed
jkbradley wants to merge 11 commits into master from jkbradley:lda-pipelines

Conversation

jkbradley
Member

This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:

  • I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
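
    For example, a minimal sketch (not part of this PR) of how a caller could attach an ID column themselves before fitting, if they need one:

      import org.apache.spark.sql.functions.monotonically_increasing_id

      // Assumes `docs` is a DataFrame with a "features" column of term-count vectors.
      val docsWithId = docs.withColumn("docId", monotonically_increasing_id())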

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: @hhbyyh @feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: @mengxr

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45182 has finished for PR 9513 at commit 583e173.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

@hhbyyh
Contributor

hhbyyh commented Nov 6, 2015

@jkbradley Thanks for looping me in and welcome back. I'll try to send some feedback this weekend.

@jkbradley jkbradley changed the title [WIP] [SPARK-5565] [ML] LDA wrapper for Pipelines API [SPARK-5565] [ML] LDA wrapper for Pipelines API Nov 6, 2015
@jkbradley
Member Author

OK, I believe that's it, so this is ready for review. Thanks!

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45243 has finished for PR 9513 at commit ffb68c5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 6, 2015

Test build #45245 has finished for PR 9513 at commit 9589e01.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (
      * sealed abstract class LDAOptimizer extends Params
      * class EMLDAOptimizer @Since("1.6.0") (
      * class OnlineLDAOptimizer @Since("1.6.0") (

* @group getParam
*/
@Since("1.6.0")
def getAlpha: Array[Double] = getDocConcentration
Contributor

-1 on having both alpha and docConcentration. Can we take the opportunity here to choose one of the names and stick with it? It's confusing to have both alpha and docConcentration throughout the code.

Contributor

I'm neutral on this.

Member Author

Hm, yeah, I neglected to mention this in the design doc. I'm OK with removing alpha and beta from everything but the doc. I'll go ahead and take those out.
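
For reference, a rough sketch of what keeping only docConcentration (no alpha alias) could look like inside the LDA params trait; the exact doc string is an assumption, not the final code:

/**
 * Concentration parameter (commonly named "alpha") for the prior placed on documents'
 * distributions over topics ("theta").
 * @group param
 */
@Since("1.6.0")
final val docConcentration = new DoubleArrayParam(this, "docConcentration",
  "Concentration parameter (commonly named \"alpha\") for the prior placed on documents'" +
    " distributions over topics (\"theta\").")

/** @group getParam */
@Since("1.6.0")
def getDocConcentration: Array[Double] = $(docConcentration)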

@feynmanliang
Contributor

Made a pass

* @group param
*/
@Since("1.6.0")
final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
Contributor

Number of topics to infer?

Member Author

ok
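
The updated definition could read roughly as follows (the exact phrasing is an assumption):

/**
 * Param for the number of topics (clusters) to infer. Must be > 1.
 * @group param
 */
@Since("1.6.0")
final val k = new IntParam(this, "k", "number of topics (clusters) to infer",
  ParamValidators.gt(1))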

@jkbradley
Member Author

@feynmanliang I think that's the last fix.

Thinking more about it, I'm on board with changing LDAModel to be abstract, as long as it's a minor change. I'll see about making it in a follow-up PR.

@jkbradley
Member Author

One more possible change: How do y'all feel about renaming tau0 and kappa to follow the new sklearn LDA API?

  • kappa -> learningDecay
  • tau0 -> learningOffset

[http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation]
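
If we go with the sklearn-style names, the params might end up looking roughly like this; the doc strings and validators below are assumptions that mirror the existing tau0/kappa definitions:

/**
 * Learning rate, set as an exponential decay rate (the "kappa" of Online LDA).
 * This should be between (0.5, 1.0] to guarantee asymptotic convergence.
 * @group expertParam
 */
@Since("1.6.0")
final val learningDecay = new DoubleParam(this, "learningDecay",
  "Learning rate, set as an exponential decay rate. This should be between (0.5, 1.0] to" +
    " guarantee asymptotic convergence.", ParamValidators.gt(0))

/**
 * A (positive) learning parameter that downweights early iterations (the "tau0" of
 * Online LDA). Larger values make early iterations count less.
 * @group expertParam
 */
@Since("1.6.0")
final val learningOffset = new DoubleParam(this, "learningOffset",
  "A (positive) learning parameter that downweights early iterations. Larger values make" +
    " early iterations count less.", ParamValidators.gt(0))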

@jkbradley
Member Author

I'm going to go ahead and change the tau0, kappa names.

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45452 has finished for PR 9513 at commit a55de6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 10, 2015

Test build #45463 has finished for PR 9513 at commit 8eaa596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@SparkQA

SparkQA commented Nov 10, 2015

Test build #2025 has finished for PR 9513 at commit 16a061c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      * class LDA @Since("1.6.0") (

@feynmanliang
Contributor

+1 on the renames


@jkbradley
Member Author

Got OK from @mengxr offline, so merging with master and branch-1.6

asfgit pushed a commit that referenced this pull request Nov 11, 2015
This adds LDA to spark.ml, the Pipelines API.  It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs.  These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.

CC: hhbyyh feynmanliang  If you have a chance to make a pass, that'd be really helpful--thanks!  Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.

(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@jkbradley
Member Author

I'll see about sending a follow-up with the subclassing. Let me know if there's anything else I'm forgetting.

Thanks @feynmanliang and @hhbyyh for reviewing!

@asfgit asfgit closed this in e281b87 Nov 11, 2015
@jkbradley jkbradley deleted the lda-pipelines branch November 11, 2015 00:22
@jkbradley
Member Author

Actually, @feynmanliang, I realized as I was trying to rewrite this that using a lazy val for DistributedLDAModel.oldLocalModel prevents us from making one important optimization in DistributedLDAModel.copy, which is called every time we call model.transform:

  • Currently: We copy the local model only if it has already been instantiated (instantiation involves collecting the topicsMatrix to the driver).
  • With a lazy val: I don't see a good way to ensure the collect only happens once.

Given that this could mean considerable overhead, including several copies of topicsMatrix on the driver, I'd prefer to keep the current class structure.
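
To make the concern concrete, here is a minimal sketch (simplified stand-in names, not the actual spark.ml classes) of the copy-only-if-already-collected pattern that a lazy val would make hard to preserve:

// Sketch only: simplified stand-ins for the real model classes.
class OldLocalModel
class OldDistributedModel {
  def toLocal: OldLocalModel = new OldLocalModel  // expensive: collects topicsMatrix to the driver
}

class DistributedModelWrapper(oldDistributedModel: OldDistributedModel) {
  // An explicit Option lets copy() ask whether the local model was ever materialized.
  private var oldLocalModelOption: Option[OldLocalModel] = None

  def oldLocalModel: OldLocalModel = oldLocalModelOption.getOrElse {
    val m = oldDistributedModel.toLocal  // collect happens at most once for this instance
    oldLocalModelOption = Some(m)
    m
  }

  def copy(): DistributedModelWrapper = {
    val c = new DistributedModelWrapper(oldDistributedModel)
    // Carry the local model over only if it was already computed. With a lazy val there is
    // no way to ask "has this been forced yet?", so each copy could trigger its own collect
    // and keep its own copy of topicsMatrix on the driver.
    c.oldLocalModelOption = oldLocalModelOption
    c
  }
}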

@feynmanliang
Contributor

@jkbradley Not sure I understand: if lazy val oldModel = *something*.collect(), then collect() will only be called once, on the first reference to oldModel, and every subsequent reference to oldModel will use the Array[...] materialized by collect().
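
To spell out the lazy val behavior being described, a tiny standalone sketch (names are illustrative):

object LazyValDemo {
  lazy val oldModel: Array[Int] = {
    println("collect() runs")  // printed only on the first reference
    Array(1, 2, 3)             // stands in for the collected model data
  }

  def main(args: Array[String]): Unit = {
    oldModel  // first reference: triggers the one-time evaluation
    oldModel  // later references reuse the cached Array; nothing is recomputed
  }
}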

@feynmanliang
Contributor

Oh wait I see what you're saying

@feynmanliang
Contributor

I still think it's wrong for a LocalLDAModel to optionally have an OldLocalLDAModel when all that class does is wrap functionality in OldLocalLDAModel. Forking the inheritance structure could avoid that by making the Option[OldLocalLDAModel] localized to DistributedLDAModel (and we can still have the copy-iff-already-collected semantics) while also removing the case Some(...) => ... case None => /* should never happen */ boilerplate.

What's happening right now is that functionality in a subclass (DistributedLDAModel's copy-if-collected semantics) is polluting the design of the superclass (LocalLDAModel having an Option[OldLocalLDAModel]).
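
Roughly, the proposed split could look like this (a sketch only; the wrapped MLlib classes are simplified stand-ins):

// Simplified stand-ins for the wrapped spark.mllib model classes.
class OldLocalLDAModel
class OldDistributedLDAModel {
  def toLocal: OldLocalLDAModel = new OldLocalLDAModel  // expensive collect
}

// Abstract parent: shared params and transform() logic would live here.
abstract class LDAModel {
  protected def oldLocalModel: OldLocalLDAModel
}

// Local model: always wraps an OldLocalLDAModel, so no Option is needed.
class LocalLDAModel(override protected val oldLocalModel: OldLocalLDAModel)
  extends LDAModel

// Distributed model: the Option lives only here, keeping the copy-iff-collected semantics.
class DistributedLDAModel(oldDistributedModel: OldDistributedLDAModel) extends LDAModel {
  private var oldLocalModelOption: Option[OldLocalLDAModel] = None

  override protected def oldLocalModel: OldLocalLDAModel = oldLocalModelOption.getOrElse {
    val m = oldDistributedModel.toLocal
    oldLocalModelOption = Some(m)
    m
  }
}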

@jkbradley
Member Author

Hm, good point. OK I'll try that & ping you on the PR.

asfgit pushed a commit that referenced this pull request Nov 13, 2015
Per discussion in the initial Pipelines LDA PR [#9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.

CC feynmanliang mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9678 from jkbradley/lda-pipelines-2.

(cherry picked from commit dcb896f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 13, 2015
dskrvk pushed a commit to dskrvk/spark that referenced this pull request Nov 13, 2015
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015