[SPARK-5565] [ML] LDA wrapper for Pipelines API #9513
Test build #45182 has finished for PR 9513 at commit
@jkbradley Thanks for looping me in and welcome back. I'll try to send some feedback this weekend.
OK, I believe that's it, so this is ready for review. Thanks!
Test build #45243 has finished for PR 9513 at commit
Test build #45245 has finished for PR 9513 at commit
   * @group getParam
   */
  @Since("1.6.0")
  def getAlpha: Array[Double] = getDocConcentration
-1 on having both `alpha` and `docConcentration`. Can we take the opportunity here to choose one of the names and stick with it? It's confusing to have both `alpha` and `docConcentration` throughout the code.
I'm neutral on this.
Hm, yeah, I neglected to mention this in the design doc. I'm OK with removing alpha and beta from everything but the doc. I'll go ahead and take those out.
Made a pass
   * @group param
   */
  @Since("1.6.0")
  final val k = new IntParam(this, "k", "number of clusters to create", ParamValidators.gt(1))
"Number of topics to infer"?
ok
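To make the `ParamValidators.gt(1)` check in the diff above concrete, here is a minimal, self-contained sketch of how a "strictly greater than 1" validator behaves for `k`. `IntParamSketch` and `ValidatorsSketch` are hypothetical stand-ins written for illustration only; they are not Spark's `IntParam` or `ParamValidators` classes, and the doc string uses the wording suggested in the review.

```scala
object KParamSketch {
  // Hypothetical stand-in for a validated integer parameter (NOT Spark's IntParam).
  final case class IntParamSketch(name: String, doc: String, isValid: Int => Boolean) {
    // Returns the value unchanged when valid, otherwise throws IllegalArgumentException.
    def validate(value: Int): Int = {
      require(isValid(value), s"Parameter $name given invalid value $value")
      value
    }
  }

  // Hypothetical stand-in mirroring the gt(lowerBound) validator used in the diff.
  object ValidatorsSketch {
    def gt(lowerBound: Int): Int => Boolean = value => value > lowerBound
  }

  // Doc string follows the reviewer's suggestion: topics, not clusters.
  val k: IntParamSketch =
    IntParamSketch("k", "number of topics to infer", ValidatorsSketch.gt(1))

  def main(args: Array[String]): Unit = {
    println(k.validate(10))                          // accepted: 10 > 1
    println(scala.util.Try(k.validate(1)).isFailure) // rejected: gt(1) is strict
  }
}
```

Because the bound is strict, `k = 1` is rejected under this sketch, which matches the intent that a topic model needs at least two topics.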
…e unset by default
@feynmanliang I think that's the last fix. Thinking more about it, I'm on board with changing LDAModel to be abstract, as long as it's a minor change. I'll see about making it in a follow-up PR.
One more possible change: How do y'all feel about renaming tau0 and kappa to follow the new sklearn LDA API?
[http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn.decomposition.LatentDirichletAllocation]
I'm going to go ahead and change the tau0, kappa names.
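For readers unfamiliar with what tau0 and kappa control: in online variational LDA the step size at iteration t is (tau0 + t)^(-kappa). The sketch below assumes the renamed parameters follow scikit-learn's `learning_offset` (tau0) and `learning_decay` (kappa), written here in camelCase as `learningOffset` and `learningDecay`. The object name, the 0-based iteration convention, and the example values are illustrative assumptions, not code from this PR.

```scala
object LearningRateSketch {
  /**
   * Step size for one online LDA iteration: (learningOffset + iteration)^(-learningDecay).
   * learningOffset plays the role of tau0 and learningDecay the role of kappa.
   * The iteration index is assumed to start at 0 for this sketch.
   */
  def stepSize(learningOffset: Double, learningDecay: Double, iteration: Int): Double = {
    require(learningOffset > 0.0, "learningOffset (tau0) must be positive")
    require(learningDecay > 0.0, "learningDecay (kappa) must be positive")
    require(iteration >= 0, "iteration must be non-negative")
    math.pow(learningOffset + iteration, -learningDecay)
  }

  def main(args: Array[String]): Unit = {
    // Example values chosen for illustration only.
    val learningOffset = 1024.0
    val learningDecay = 0.51
    Seq(0, 10, 100, 1000).foreach { t =>
      println(f"iteration $t%4d -> step size ${stepSize(learningOffset, learningDecay, t)}%.5f")
    }
  }
}
```

A larger offset down-weights early iterations, and a larger decay makes the step size shrink faster, so the descriptive names say more about each parameter's role than the Greek letters do.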
Test build #45452 has finished for PR 9513 at commit
Test build #45463 has finished for PR 9513 at commit
Test build #2025 has finished for PR 9513 at commit
+1 on the renames
Got OK from @mengxr offline, so merging with master and branch-1.6
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.
CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #9513 from jkbradley/lda-pipelines.
(cherry picked from commit e281b87)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
I'll see about sending a follow-up with the subclassing. Let me know if there's anything else I'm forgetting. Thanks @feynmanliang and @hhbyyh for reviewing!
Actually, @feynmanliang I realized as I was trying to rewrite this that using a lazy val for DistributedLDAModel.oldLocalModel prevents us from one important optimization in DistributedLDAModel.copy, which is called every time we call model.transform:
Given that this could mean considerable overhead, including several copies of topicsMatrix on the driver, I'd prefer to keep the current class structure.
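The trade-off described here can be illustrated with a small stand-alone sketch. It is a hypothetical illustration, not the code from this PR: `LocalModelSketch`, `LazyModelSketch`, `CachedModelSketch`, and the `conversions` counter are invented names, and `toLocal()` stands in for whatever expensive step collects the topics onto the driver. The point is that a `lazy val` is recomputed in every copy, whereas an explicit `Option` field can be handed to the copy so that an already-computed local model is reused.

```scala
object CopyCacheSketch {
  // Counts how many times the (pretend) expensive conversion runs.
  var conversions: Int = 0

  final case class LocalModelSketch(topics: Vector[Double])

  // Stand-in for an expensive distributed-to-local conversion.
  def toLocal(): LocalModelSketch = {
    conversions += 1
    LocalModelSketch(Vector(0.5, 0.5))
  }

  // Variant 1: lazy val. Each copy is a fresh instance with its own
  // uninitialized lazy val, so the conversion runs again per copy.
  class LazyModelSketch {
    lazy val local: LocalModelSketch = toLocal()
    def copy(): LazyModelSketch = new LazyModelSketch
  }

  // Variant 2: explicit Option cache that copy() passes along,
  // so copies made after the first access reuse the cached value.
  class CachedModelSketch(private var localOption: Option[LocalModelSketch] = None) {
    def local: LocalModelSketch = {
      if (localOption.isEmpty) localOption = Some(toLocal())
      localOption.get
    }
    def copy(): CachedModelSketch = new CachedModelSketch(localOption)
  }

  def main(args: Array[String]): Unit = {
    val lazyModel = new LazyModelSketch
    lazyModel.local
    lazyModel.copy().local
    println(s"lazy val variant conversions: $conversions")

    conversions = 0
    val cachedModel = new CachedModelSketch()
    cachedModel.local
    cachedModel.copy().local
    println(s"Option-cache variant conversions: $conversions")
  }
}
```

If `copy` runs on every `transform` call, as stated above, the second variant avoids repeating the conversion once the cache has been filled.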
@jkbradley Not sure I understand, if
Oh wait, I see what you're saying
I still think it's wrong for a What's happening right now is that functionality in a subclass (
Hm, good point. OK, I'll try that & ping you on the PR.
Per discussion in the initial Pipelines LDA PR [#9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.
CC feynmanliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #9678 from jkbradley/lda-pipelines-2.
(cherry picked from commit dcb896f)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
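The class structure proposed in this follow-up can be sketched roughly as below: an abstract base holding shared behavior, with a local and a distributed subclass. All names ending in `Sketch`, the `collectTopics` function, and the `describeTopics` body are illustrative assumptions for a plain-Scala example; they are not the actual spark.ml classes or signatures.

```scala
object LdaHierarchySketch {
  // Abstract base: shared logic lives here, storage details in subclasses.
  sealed abstract class LdaModelSketch(val vocabSize: Int) {
    def isDistributed: Boolean

    // Topic-term weights available on the driver; one inner array per topic.
    protected def localTopics: Array[Array[Double]]

    // For each topic, the indices of its highest-weighted terms.
    def describeTopics(maxTerms: Int): Seq[Seq[Int]] =
      localTopics.toSeq.map { weights =>
        weights.zipWithIndex.sortBy(-_._1).take(maxTerms).map(_._2).toSeq
      }
  }

  // Local model: topics are already held in driver memory.
  final class LocalLdaModelSketch(vocab: Int, topics: Array[Array[Double]])
      extends LdaModelSketch(vocab) {
    def isDistributed: Boolean = false
    protected def localTopics: Array[Array[Double]] = topics
  }

  // Distributed model: collectTopics stands in for gathering data from a cluster.
  final class DistributedLdaModelSketch(vocab: Int, collectTopics: () => Array[Array[Double]])
      extends LdaModelSketch(vocab) {
    def isDistributed: Boolean = true
    protected def localTopics: Array[Array[Double]] = collectTopics()
    def toLocal: LocalLdaModelSketch = new LocalLdaModelSketch(vocabSize, collectTopics())
  }

  def main(args: Array[String]): Unit = {
    val topics = Array(Array(0.1, 0.7, 0.2), Array(0.6, 0.3, 0.1))
    val distributed = new DistributedLdaModelSketch(3, () => topics)
    val local = distributed.toLocal
    println(local.isDistributed)
    println(local.describeTopics(2))
  }
}
```

With an abstract base, code that only needs the shared behavior can accept the base type, while operations that make sense for one storage mode (such as `toLocal` here) stay on the subclass that supports them.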
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
Note: This will conflict with [#9484], but I'll try to merge [#9484] first and then rebase this PR.
CC: @hhbyyh @feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.
CC: @mengxr