-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML #15913
Conversation
Test build #68743 has finished for PR 15913 at commit
|
Test build #68757 has started for PR 15913 at commit |
Test build #68780 has finished for PR 15913 at commit
|
Test build #68829 has finished for PR 15913 at commit
|
Test build #68849 has finished for PR 15913 at commit
|
Test build #68892 has finished for PR 15913 at commit
|
Test build #68893 has finished for PR 15913 at commit
|
Jenkins, test this please. |
Test build #68900 has finished for PR 15913 at commit
|
/** | ||
* Number of trees in ensemble | ||
*/ | ||
val getNumTrees: Int = trees.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason this was moved? It seems unrelated to the cleanup described in the title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a headache. Before, getNumTrees
on gbts/rfs would dispatch here. Now, we removed numTrees
from RF (it was deprecated) and so now we can have rf inherit from RandomForestParams
which (now) has a param numTrees
and its associated getter. So if we hadn't removed this, rf would have two getNumTrees
methods and one would take precedence. We removed it from here to avoid that, then we had to add them back in the individual GBT subclasses. I think I've got that right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sethah You explained this problem very clearly and totally right. Thanks.
def setStepSize(value: Double): this.type = set(stepSize, value) | ||
|
||
override def validateParams(): Unit = { | ||
def setStepSize(value: Double): this.type = { | ||
require(ParamValidators.inRange(0, 1, lowerInclusive = false, upperInclusive = true)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is where we want to do this? The deprecation warning said to move checks from here into transformSchema
- but its possible the deprecation warning was incomplete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to fail at set time rather than when fit
is called. This was also not used before, seems we missed it when we deprecated validateParams()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original validateParams()
was only used to check for interactions between parameters. Parameter value checks which do not depend on other parameters are handled by ````Param.validate()``` at the definition, i.e.
final val k = new IntParam(this, "k", "The number of clusters to create. " +
"Must be > 1.", ParamValidators.gt(1))
However, stepSize
was a trait and inherited by lots of sub-classes which have different constraints for this parameter, so I add the check at setter.
Yeah, I think the original place of this check(putting in validateParams
) was a bug, since it does not involves interactions between parameters. We missed it when we deprecated validateParams()
, since there is no test case to validate this param before. I added corresponding test in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes more sense :) +1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation should really be in the Param itself. It's not since it's a sharedParam, but perhaps we should just copy the shared Param code here. The problem with putting validation in the setter is that Params can get set in other ways too (such as gbt.set(stepSize, value)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good overall, a few small comments.
@@ -176,8 +176,12 @@ private[classification] trait LogisticRegressionParams extends ProbabilisticClas | |||
} | |||
} | |||
|
|||
override def validateParams(): Unit = { | |||
override protected def validateAndTransformSchema( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this actually necessary? Before, validateParams()
was never used. Seems like we could just remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a bug that validateParams
was never used. It should validate params interaction before fitting(if necessary), this is why we deprecate validateParams
and move what it does to transformSchema
. We do not have corresponding test cases before, so no test was broken when we deprecated validateParams
. I added test cases in this PR.
def setStepSize(value: Double): this.type = set(stepSize, value) | ||
|
||
override def validateParams(): Unit = { | ||
def setStepSize(value: Double): this.type = { | ||
require(ParamValidators.inRange(0, 1, lowerInclusive = false, upperInclusive = true)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to fail at set time rather than when fit
is called. This was also not used before, seems we missed it when we deprecated validateParams()
?
/** | ||
* Number of trees in ensemble | ||
*/ | ||
val getNumTrees: Int = trees.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a headache. Before, getNumTrees
on gbts/rfs would dispatch here. Now, we removed numTrees
from RF (it was deprecated) and so now we can have rf inherit from RandomForestParams
which (now) has a param numTrees
and its associated getter. So if we hadn't removed this, rf would have two getNumTrees
methods and one would take precedence. We removed it from here to avoid that, then we had to add them back in the individual GBT subclasses. I think I've got that right.
/** | ||
* Number of trees in ensemble | ||
*/ | ||
val getNumTrees: Int = trees.length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this have an @Since
tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
* (default = 20) | ||
* @group param | ||
*/ | ||
final val numTrees: IntParam = new IntParam(this, "numTrees", "Number of trees to train (>= 1)", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: the reason that we cannot add this to both GBT and RF (i.e. in TreeEnsembleParams
) is because the param maxIter
controls how many trees a GBT has. The semantics in the algos are a bit different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion, added.
"""Sets the SQL context to use for saving.""" | ||
""" | ||
Sets the SQL context to use for saving. | ||
.. note:: Deprecated in 2.1, use session instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we remove it from Scala in 2.2, then we have to remove it here in 2.2 right? I guess we should note it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
*/ | ||
@Since("1.6.0") | ||
@deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0") | ||
def setLabelCol(value: String): this.type = set(labelCol, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to remove it from Python as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All python ML models don't support setter/getter params currently, so no need to remove it.
Test build #69111 has started for PR 15913 at commit |
Test build #69108 has finished for PR 15913 at commit
|
Jenkins, test this please. |
Test build #69130 has finished for PR 15913 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Adding a couple of comments
* Validates parameter values stored internally. | ||
* Raise an exception if any parameter value is invalid. | ||
* | ||
* This only needs to check for interactions between parameters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add the relevant text from here to transformSchema?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
def setStepSize(value: Double): this.type = set(stepSize, value) | ||
|
||
override def validateParams(): Unit = { | ||
def setStepSize(value: Double): this.type = { | ||
require(ParamValidators.inRange(0, 1, lowerInclusive = false, upperInclusive = true)( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This validation should really be in the Param itself. It's not since it's a sharedParam, but perhaps we should just copy the shared Param code here. The problem with putting validation in the setter is that Params can get set in other ways too (such as gbt.set(stepSize, value)
getStepSize), "GBT parameter stepSize should be in interval (0, 1], " + | ||
s"but it given invalid value $getStepSize.") | ||
value), "GBT parameter stepSize should be in interval (0, 1], " + | ||
s"but it given invalid value $value.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"it given invalid value" -> "it was given an invalid value"
setDefault(numTrees -> 20) | ||
|
||
/** @group setParam */ | ||
def setNumTrees(value: Int): this.type = set(numTrees, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these setter methods in traits Java-compatible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we already have setNumTrees
which calls super.setNumTrees
in RandomForestClassifier
and RandomForestRegressor
.
One more thing: This breaks several public APIs, although in mostly minor ways and with good reasons. Could you please update the JIRA to list the public API changes? |
Test build #69158 has finished for PR 15913 at commit
|
Test build #69159 has finished for PR 15913 at commit
|
Test build #69160 has finished for PR 15913 at commit
|
Test build #69167 has finished for PR 15913 at commit
|
/** @group getParam */ | ||
final def getStepSize: Double = $(stepSize) | ||
|
||
/** @group setParam */ | ||
def setStepSize(value: Double): this.type = set(stepSize, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I asked about this being Java-friendly because Param setter methods in traits used to have the wrong type in Java. I wonder if this is no longer true. Still, it might make sense to remove the setter method from the trait since it does not make sense to have it in the Model classes. We could put the setter method in each subclass and then deprecate the method in the Model classes.
This issue also shows up in MimaExcludes. If setFeatureSubsetStrategy were put in the concrete classes from the outset, then you would not need to include it in MimaExcludes now.
I'm Ok with doing the change now or in a follow-up PR.
I believe I was the one who incorrectly put the setters in the traits...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I understand what you mean. If we would like to correct the setter methods in traits, we involve changes to lots of traits which include DecisionTreeParams
, TreeClassifierParams
, TreeRegressorParams
, RandomForestParams
, GBTParams
, etc. So i will merge this firstly after it passed Jenkins and address this issue in a separate follow-up PR. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbradley I have sent #16017 to fix this issue, please feel free to comment that. Thanks.
Jenkins, test this please. |
Test build #69183 has finished for PR 15913 at commit
|
Merged into master and branch-2.1. Thanks for all reviewing. |
## What changes were proposed in this pull request? Remove deprecated methods for ML. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15913 from yanboliang/spark-18481. (cherry picked from commit c4a7eef) Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
## What changes were proposed in this pull request? Mainly two changes: * Move DT/RF/GBT Param setter methods to subclasses. * Deprecate corresponding setter methods in the model classes. See discussion here #15913 (comment). ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16017 from yanboliang/spark-18592. (cherry picked from commit 95f7985) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
## What changes were proposed in this pull request? Mainly two changes: * Move DT/RF/GBT Param setter methods to subclasses. * Deprecate corresponding setter methods in the model classes. See discussion here #15913 (comment). ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16017 from yanboliang/spark-18592.
## What changes were proposed in this pull request? Remove deprecated methods for ML. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#15913 from yanboliang/spark-18481.
## What changes were proposed in this pull request? Mainly two changes: * Move DT/RF/GBT Param setter methods to subclasses. * Deprecate corresponding setter methods in the model classes. See discussion here apache#15913 (comment). ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16017 from yanboliang/spark-18592.
## What changes were proposed in this pull request? Remove deprecated methods for ML. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#15913 from yanboliang/spark-18481.
## What changes were proposed in this pull request? Mainly two changes: * Move DT/RF/GBT Param setter methods to subclasses. * Deprecate corresponding setter methods in the model classes. See discussion here apache#15913 (comment). ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes apache#16017 from yanboliang/spark-18592.
What changes were proposed in this pull request?
Remove deprecated methods for ML.
How was this patch tested?
Existing tests.