-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set #19993
Conversation
…lti-column params are both set
Test build #84970 has finished for PR 19993 at commit
|
Jenkins, retest this please |
Test build #84978 has finished for PR 19993 at commit
|
if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) || | ||
isSet(inputCol) && isSet(outputCols)) { | ||
throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " + | ||
"only supports setting either `inputCol` or `inputCols`.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is better to add one more check:
if single column, only set splits
param.
if multiple column, only set splitsArray
param. and check:
inputCols.length == outputCols.length == splitsArray.length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks. I will add these checks while doing the changes requested by @hhbyyh. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91, Thanks for the PR.
Since we have several classes that need the improvement and more classes will support HasInputCols, I feel like we should develop some code that can be shared by other classes.
We also need common unit test methods for the case.
@hhbyyh thanks for the review. I see that for some classes there are ongoing PRs. Thus I cannot change them now in order to have a common place and a common test. Should I wait for those PRs to be merged then? Or were you suggesting something different? |
I would suggest to develop the common infrastructure and unit test first, then other PR can take it or we can send follow-up fix. cc @MLnick for advice. |
@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |||
* by `inputCol`. A warning will be printed if both are set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should modify the document above too.
@hhbyyh I created a common infrastructure. Please let me know if I have to modify it. Meanwhile, I'd like to discuss where to put the common UTs: do you have any specific idea about the right place? Thanks. |
Test build #85123 has finished for PR 19993 at commit
|
Test build #85124 has finished for PR 19993 at commit
|
Test build #85128 has finished for PR 19993 at commit
|
Test build #85130 has finished for PR 19993 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update.
I like the current implementation that can be shared by other classes. For the test case, I think it makes sense to add something in checkParams
.
} else { | ||
false | ||
ParamValidators.assertColOrCols(this) | ||
if (isSet(inputCol) && isSet(splitsArray)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed isBucketizeMultipleColumns
is invoked in many places and maybe we can put the checks in other places like transformSchema. It also makes the code consistent with function name.
model match { | ||
case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) => | ||
raiseIncompatibleParamsException("inputCols", "inputCol") | ||
case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not necessarily be an error for some classes, but we can keep it for now.
* this is not true, an `IllegalArgumentException` is raised. | ||
* @param model | ||
*/ | ||
def assertColOrCols(model: Params): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private[spark]
} | ||
} | ||
|
||
def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private[spark]
} | ||
|
||
def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = { | ||
throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error message can be more straight forward. e.g. $paramName1
and $paramName2
cannot be set simultaneously.
Test build #85133 has finished for PR 19993 at commit
|
@hhbyyh thanks for the comments. I already fixed all your comments, but I am waiting to push for the UT. Honestly I think that |
To make it available for other classes, we need to support checking for both |
thanks @hhbyyh, I updated the PR according to your suggestion and previous comments. |
Test build #85209 has finished for PR 19993 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Add some suggestions for function names and test utility.
* this is not true, an `IllegalArgumentException` is raised. | ||
* @param model | ||
*/ | ||
private[spark] def assertColOrCols(model: Params): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion for function name:
assertColOrCols --> checkMultiColumnParams
* @param paramsClass The Class to be checked | ||
* @param spark A `SparkSession` instance to use | ||
*/ | ||
def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion for function name:
checkMultiColumnParams --> testMultiColumnParams
// create fake input Dataset | ||
val feature1 = Array(-1.0, 0.0, 1.0) | ||
val feature2 = Array(1.0, 0.0, -1.0) | ||
val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the DataFrame here can be used for other transformers like StringIndexer.
How about add a Dataset as function parameter? And is it possible to use an instance obj: Params
rather than paramsClass: Class[_ <: Params]
as parameter, just to be more flexible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I created the dataframe inside the method was to control the names of the columns it has. Otherwise we can't ensure that those columns exist. I think that the type check is performed later, thus it is not a problem here. What do you think?
I preferred to use paramsClass: Class[_ <: Params]
because I need a clean instance for each of the two checks: if an instance is passed I cannot enforce that it is clean, ie. some parameters weren't already set and I would need to copy it to create new instances as well, since otherwise the second check would be influenced by the first one. What do you think?
Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can send column names as parameter if necessary. We need to ensure the test utility can be used by most transformers with multiple column support.
I think that the type check is performed later, thus it is not a problem here.
I don't quite get it, in either transform
or fit
the data type will be checked and they will trigger exceptions.
I preferred to use paramsClass: Class[_ <: Params]
I'm only thinking about the case that default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.
Test build #85253 has finished for PR 19993 at commit
|
Test build #86282 has finished for PR 19993 at commit
|
Since RC1 for 2.3 failed, it'd be great to get this into 2.3. @mgaido91 do you mind if I send my comments along with a PR to update this PR of yours? I'm rushing because of the time pressure to get this into 2.3 (to avoid a behavior change between 2.3 and 2.4). Thanks in advance! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just leaving early comments to note the main issue I see, especially since they are relevant to #20146 . I'll send a PR soon, later today.
@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String | |||
|
|||
@Since("1.4.0") | |||
override def transformSchema(schema: StructType): StructType = { | |||
if (isBucketizeMultipleColumns()) { | |||
ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking to make sure that exactly one (not just <= 1) of each pair is available, plus that all of the single-column OR all of the multi-column Params are available. (The same issue occurs in #20146 ) It will also be hard to check these items and account for defaults.
I'd argue that it's not worth trying to use generic checking functions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my initial implementation (with @hhbyyh's comments) was more generic and checked what you said. After, @MLnick and @viirya asked to switch to a more generic approach which is the current you see. I'm fine with either of those, but I think we need to choose one way and go in that direction, otherwise we just loose time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I'll see if I can come up with something which is generic but handles these other checks.
@@ -166,6 +167,8 @@ private[ml] object Param { | |||
@DeveloperApi | |||
object ParamValidators { | |||
|
|||
private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's switch this to use the Logging trait, to match other MLlib patterns.
@jkbradley sure no problem, let me know how I can help. |
OK, sent: mgaido91#1 |
strengthened requirements about exclusive Params for single and multicolumn support
Test build #86416 has finished for PR 19993 at commit
|
Test build #86437 has finished for PR 19993 at commit
|
Test build #86439 has finished for PR 19993 at commit
|
("inputCols", Array("feature1", "feature2"))) | ||
ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"), | ||
("outputCols", Array("result1", "result2"))) | ||
ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only comment I have is that I believe this line is not testing what you may think.
As I read the checkSingleVsMultiColumnParams method, in this test case it will throw the error, not because both splits
and splitsArray
are set, but rather because both inputCol
& inputCols
are unset.
Actually it applies to the line above too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MLnick actually it will fail for both reasons. We can add more test cases to check each of these two cases if you think it is needed.
Overall looks good with @jkbradley's changes. I just left a comment on the param test cases as I think they're not quite complete |
Well yes it would - but the method checks inputCols/inputCol first so will
always fail for that reason here, ie we aren’t actually testing the full
code path
…On Mon, 22 Jan 2018 at 16:43, Marco Gaido ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala
<#19993 (comment)>:
> - test("Both inputCol and inputCols are set") {
- val bucket = new Bucketizer()
- .setInputCol("feature1")
- .setOutputCol("result")
- .setSplits(Array(-0.5, 0.0, 0.5))
- .setInputCols(Array("feature1", "feature2"))
-
- // When both are set, we ignore `inputCols` and just map the column specified by `inputCol`.
- assert(bucket.isBucketizeMultipleColumns() == false)
+ test("assert exception is thrown if both multi-column and single-column params are set") {
+ val df = Seq((0.5, 0.3), (0.5, -0.4)).toDF("feature1", "feature2")
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("inputCol", "feature1"),
+ ("inputCols", Array("feature1", "feature2")))
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
+ ("outputCols", Array("result1", "result2")))
+ ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),
@MLnick <https://github.com/mlnick> actually it will fail for both
reasons. We can add more test cases to check each of these two cases if you
think it is needed.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#19993 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AA_SB1zkYZ5V4SOlLliOtxQ_6CCvoBm4ks5tNJ6egaJpZM4RD1b4>
.
|
Test build #86525 has finished for PR 19993 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple minor comments - but that can be left for a clean up in a follow up PR if need be.
I'd prefer to merge this to branch-2.3.
import org.apache.spark.ml.linalg.{Vector, Vectors} | ||
import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these are used any longer?
ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"), | ||
("splits", Array(-0.5, 0.0, 0.5))) | ||
|
||
// the following should fail because not all the params are set |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically here we should probably also test the inputCols
+ outputCols
case (i.e. that not setting splitsArray
also throws an exception).
Test build #86589 has finished for PR 19993 at commit
|
+1 merge this to 2.3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…lti-column params are both set ## What changes were proposed in this pull request? Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception. The PR throws an exception in `Bucketizer`, instead of logging a warning. ## How was this patch tested? modified UT Author: Marco Gaido <marcogaido91@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #19993 from mgaido91/SPARK-22799. (cherry picked from commit cd3956d) Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
Merged to master / branch-2.3 |
Thanks @mgaido91 and @jkbradley for working on this and others for review |
What changes were proposed in this pull request?
Currently there is a mixed situation when both single- and multi-column are supported. In some cases exceptions are thrown, in others only a warning log is emitted. In this discussion https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.
The PR throws an exception in
Bucketizer
, instead of logging a warning.How was this patch tested?
modified UT