
[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set #19993

Closed · 19 commits

Conversation

@mgaido91 (Contributor)

What changes were proposed in this pull request?

Currently the behavior is inconsistent when both single- and multi-column parameters are supported: in some cases exceptions are thrown, in others only a warning is logged. In the discussion at https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.

This PR makes `Bucketizer` throw an exception instead of logging a warning.
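As a minimal sketch of the new behavior (the column names here are hypothetical):

    import org.apache.spark.ml.feature.Bucketizer

    val bucketizer = new Bucketizer()
      .setInputCol("feature")            // single-column param
      .setInputCols(Array("f1", "f2"))   // multi-column param
    // After this PR, transform/transformSchema throws IllegalArgumentException
    // for this configuration instead of only logging a warning.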

How was this patch tested?

Modified unit tests.

@SparkQA commented Dec 15, 2017

Test build #84970 has finished for PR 19993 at commit 8f3581c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor, Author)

Jenkins, retest this please

@SparkQA commented Dec 15, 2017

Test build #84978 has finished for PR 19993 at commit 8f3581c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

    if (isSet(inputCols) && isSet(inputCol) || isSet(inputCols) && isSet(outputCol) ||
        isSet(inputCol) && isSet(outputCols)) {
      throw new IllegalArgumentException("Both `inputCol` and `inputCols` are set, `Bucketizer` " +
        "only supports setting either `inputCol` or `inputCols`.")

@WeichenXu123 (Contributor) commented Dec 18, 2017

Here it is better to add one more check:
- if single-column, only the splits param should be set;
- if multi-column, only the splitsArray param should be set, and additionally check:
  inputCols.length == outputCols.length == splitsArray.length
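A sketch of the suggested checks, assuming it runs inside Bucketizer so the params and getters below are in scope (the placement and messages are assumptions):

    // Sketch only: hypothetical validation inside Bucketizer.
    if (isSet(inputCols)) {
      // multi-column mode: only splitsArray may be set, and lengths must agree
      require(!isSet(splits),
        "`splits` cannot be set when using multiple columns; use `splitsArray`.")
      require(getInputCols.length == getOutputCols.length &&
        getInputCols.length == getSplitsArray.length,
        "inputCols, outputCols and splitsArray must have the same length.")
    } else {
      // single-column mode: only splits may be set
      require(!isSet(splitsArray),
        "`splitsArray` cannot be set when using a single column; use `splits`.")
    }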

@mgaido91 (Contributor, Author)

Thanks. I will add these checks while making the changes requested by @hhbyyh.

@hhbyyh (Contributor) left a comment

@mgaido91, Thanks for the PR.

Since several classes need this improvement, and more classes will support HasInputCols, I feel we should develop code that can be shared by the other classes.
We also need common unit-test methods for this case.

@mgaido91 (Contributor, Author)

@hhbyyh thanks for the review. I see that some of those classes have ongoing PRs, so I cannot change them now to create a common place and a common test. Should I wait for those PRs to be merged, or were you suggesting something different?
Thanks.

@hhbyyh (Contributor) commented Dec 18, 2017

I would suggest developing the common infrastructure and unit tests first; then other PRs can adopt it, or we can send follow-up fixes.

cc @MLnick for advice.

@@ -140,10 +140,10 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String
     * by `inputCol`. A warning will be printed if both are set.

A Member commented:

We should update the doc comment above too.

@mgaido91 (Contributor, Author)

@hhbyyh I created a common infrastructure. Please let me know if I should modify it. Meanwhile, I'd like to discuss where to put the common unit tests: do you have any specific idea about the right place? Thanks.

@SparkQA commented Dec 19, 2017

Test build #85123 has finished for PR 19993 at commit bb0c0d2.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 19, 2017

Test build #85124 has finished for PR 19993 at commit 9f56800.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 19, 2017

Test build #85128 has finished for PR 19993 at commit f593f5b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 19, 2017

Test build #85130 has finished for PR 19993 at commit 2ecdc73.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh (Contributor) left a comment

Thanks for the update.

I like that the current implementation can be shared by other classes. For the test case, I think it makes sense to add something to checkParams.

cc @MLnick @jkbradley @srowen

    } else {
      false
    ParamValidators.assertColOrCols(this)
    if (isSet(inputCol) && isSet(splitsArray)) {

A Contributor commented:

I noticed isBucketizeMultipleColumns is invoked in many places; maybe we can move the checks to somewhere like transformSchema instead. That would also keep the code consistent with the function's name.

    model match {
      case m: HasInputCols with HasInputCol if m.isSet(m.inputCols) && m.isSet(m.inputCol) =>
        raiseIncompatibleParamsException("inputCols", "inputCol")
      case m: HasOutputCols with HasInputCol if m.isSet(m.outputCols) && m.isSet(m.inputCol) =>

A Contributor commented:

This may not necessarily be an error for some classes, but we can keep it for now.

     * this is not true, an `IllegalArgumentException` is raised.
     * @param model
     */
    def assertColOrCols(model: Params): Unit = {

A Contributor commented:

private[spark]

      }
    }

    def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {

A Contributor commented:

private[spark]

    }

    def raiseIncompatibleParamsException(paramName1: String, paramName2: String): Unit = {
      throw new IllegalArgumentException(s"Both `$paramName1` and `$paramName2` are set.")

A Contributor commented:

The error message could be more straightforward, e.g.: `$paramName1 and $paramName2 cannot be set simultaneously.`
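For instance, a direct rendering of the suggested wording:

    throw new IllegalArgumentException(
      s"`$paramName1` and `$paramName2` cannot be set simultaneously.")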

@SparkQA commented Dec 19, 2017

Test build #85133 has finished for PR 19993 at commit 26fe05e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91 (Contributor, Author)

@hhbyyh thanks for the comments. I have already addressed all of them, but I am waiting to push until the unit tests are sorted out. Honestly, I don't think checkParams is the best place: checking the exception requires setting the parameters and invoking transform, while checkParams is very generic and at the moment sets no parameters at all. What do you think?

@hhbyyh (Contributor) commented Dec 20, 2017

To make it available to other classes, we need to support checking both fit and transform, which means we also need a sample input Dataset; so we may have to add an explicit test in each test suite. But we can still create some infrastructure function for the explicit tests to invoke.
E.g. we could create a function in object ParamsSuite or elsewhere:
checkMultiColumnParams(obj: Params, sampleData: Dataset[_])
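A minimal sketch of such a helper (names follow the suggestion above; it uses only public Params APIs plus ScalaTest's intercept, so it is an illustration rather than the PR's final code):

    import org.apache.spark.ml.{Estimator, Transformer}
    import org.apache.spark.ml.param.{ParamMap, Params}
    import org.apache.spark.sql.Dataset
    import org.scalatest.Assertions.intercept

    // Set the given (param name, value) pairs on a copy of `obj`,
    // then verify that fit/transform rejects the configuration.
    def checkMultiColumnParams(
        obj: Params,
        sampleData: Dataset[_],
        paramsAndValues: (String, Any)*): Unit = {
      val extra = ParamMap.empty
      paramsAndValues.foreach { case (name, value) =>
        extra.put(obj.getParam(name), value)
      }
      val model = obj.copy(extra)  // copies obj with the extra params set
      intercept[IllegalArgumentException] {
        model match {
          case t: Transformer => t.transform(sampleData)
          case e: Estimator[_] => e.fit(sampleData)
        }
      }
    }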

@mgaido91 (Contributor, Author)

thanks @hhbyyh, I updated the PR according to your suggestion and previous comments.

@SparkQA commented Dec 20, 2017

Test build #85209 has finished for PR 19993 at commit 9872bfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh (Contributor) left a comment

Thanks. Adding some suggestions for function names and the test utility.

     * this is not true, an `IllegalArgumentException` is raised.
     * @param model
     */
    private[spark] def assertColOrCols(model: Params): Unit = {

A Contributor commented:

suggestion for function name:
assertColOrCols --> checkMultiColumnParams

     * @param paramsClass The Class to be checked
     * @param spark A `SparkSession` instance to use
     */
    def checkMultiColumnParams(paramsClass: Class[_ <: Params], spark: SparkSession): Unit = {

@hhbyyh (Contributor) commented Dec 20, 2017

suggestion for function name:
checkMultiColumnParams --> testMultiColumnParams

    // create fake input Dataset
    val feature1 = Array(-1.0, 0.0, 1.0)
    val feature2 = Array(1.0, 0.0, -1.0)
    val df = feature1.zip(feature2).toSeq.toDF("feature1", "feature2")

A Contributor commented:

I don't think the DataFrame here can be used for other transformers like StringIndexer.

How about adding a Dataset as a function parameter? And is it possible to use an instance (obj: Params) rather than paramsClass: Class[_ <: Params] as the parameter, just to be more flexible?

@mgaido91 (Contributor, Author)

The reason why I created the DataFrame inside the method was to control the names of its columns; otherwise we can't ensure that those columns exist. I think the type check is performed later, thus it is not a problem here. What do you think?

I preferred to use paramsClass: Class[_ <: Params] because I need a clean instance for each of the two checks: if an instance is passed, I cannot enforce that it is clean, i.e. that no parameters were already set, and I would need to copy it to create new instances anyway, since otherwise the second check would be influenced by the first one. What do you think?

Thanks.

@hhbyyh (Contributor) commented Dec 21, 2017

We can send column names as a parameter if necessary. We need to ensure the test utility can be used by most transformers with multi-column support.

> I think that the type check is performed later, thus it is not a problem here.

I don't quite get it: in either transform or fit the data type will be checked, and that will trigger exceptions.

> I preferred to use paramsClass: Class[_ <: Params]

I'm only thinking about the case where the default constructor is not sufficient to create a working Estimator/Transformer. If that's not a concern, then reflection is OK.

@SparkQA commented Dec 21, 2017

Test build #85253 has finished for PR 19993 at commit d0b8d06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2018

Test build #86282 has finished for PR 19993 at commit 25b9bd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley (Member)

Since RC1 for 2.3 failed, it'd be great to get this into 2.3. @mgaido91, do you mind if I send my comments along with a PR to update this PR of yours? I'm rushing because of the time pressure (to avoid a behavior change between 2.3 and 2.4). Thanks in advance!

@jkbradley (Member) left a comment

I'm just leaving early comments to note the main issue I see, especially since it is relevant to #20146. I'll send a PR later today.

@@ -201,9 +184,13 @@ final class Bucketizer @Since("1.4.0") (@Since("1.4.0") override val uid: String

@Since("1.4.0")
override def transformSchema(schema: StructType): StructType = {
if (isBucketizeMultipleColumns()) {
ParamValidators.checkExclusiveParams(this, "inputCol", "inputCols")

A Member commented:

The problem with trying to use a general method like this is that it's hard to capture model-specific requirements. This currently misses checking that exactly one (not just at most one) of each pair is set, and that all of the single-column OR all of the multi-column Params are provided. (The same issue occurs in #20146.) It will also be hard to check these items while accounting for defaults.

I'd argue that it's not worth trying to use generic checking functions here.
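For concreteness, a sketch of the stricter "exactly one" check being described (an illustration only, not this PR's code; as noted, defaults would need handling beyond isSet):

    import org.apache.spark.ml.param.Params

    // Hypothetical helper: require that exactly one param of a
    // single-/multi-column pair is set on the model.
    def checkExactlyOneSet(model: Params, single: String, multi: String): Unit = {
      val singleSet = model.isSet(model.getParam(single))
      val multiSet = model.isSet(model.getParam(multi))
      if (singleSet == multiSet) {  // both set, or neither set
        throw new IllegalArgumentException(
          s"Exactly one of `$single` and `$multi` must be set.")
      }
    }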

@mgaido91 (Contributor, Author)

My initial implementation (incorporating @hhbyyh's comments) checked what you describe. Afterwards, @MLnick and @viirya asked to switch to the more generic approach that you see now. I'm fine with either of those, but I think we need to choose one way and go in that direction; otherwise we just lose time.


A Member commented:

I see. I'll see if I can come up with something which is generic but handles these other checks.

@@ -166,6 +167,8 @@ private[ml] object Param {
    @DeveloperApi
    object ParamValidators {

      private val LOGGER = LoggerFactory.getLogger(ParamValidators.getClass)

A Member commented:

Let's switch this to use the Logging trait, to match other MLlib patterns.
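A sketch of the suggested pattern (org.apache.spark.internal.Logging is the trait used across Spark's code base; shown here only as an illustration):

    import org.apache.spark.internal.Logging

    @DeveloperApi
    object ParamValidators extends Logging {
      // The Logging trait supplies logInfo/logWarning/logError, so the
      // explicit LoggerFactory-based LOGGER field above is no longer needed.
    }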

@mgaido91 (Contributor, Author)

@jkbradley sure no problem, let me know how I can help.

@jkbradley (Member)

OK, sent: mgaido91#1

strengthened requirements about exclusive Params for single and multicolumn support
@SparkQA commented Jan 20, 2018

Test build #86416 has finished for PR 19993 at commit d9d25b0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2018

Test build #86437 has finished for PR 19993 at commit 8c162a3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 21, 2018

Test build #86439 has finished for PR 19993 at commit 7894609.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

("inputCols", Array("feature1", "feature2")))
ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "result1"),
("outputCols", Array("result1", "result2")))
ParamsSuite.testExclusiveParams(new Bucketizer, df, ("splits", Array(-0.5, 0.0, 0.5)),

A Contributor commented:

The only comment I have is that I believe this line is not testing what you may think.

As I read the checkSingleVsMultiColumnParams method, in this test case it will throw the error not because both splits and splitsArray are set, but rather because both inputCol and inputCols are unset.

Actually, this applies to the line above too.

@jkbradley

@mgaido91 (Contributor, Author)

@MLnick actually it will fail for both reasons. We can add more test cases to check each of the two conditions separately if you think it's needed.

@MLnick (Contributor) commented Jan 22, 2018

Overall this looks good with @jkbradley's changes. I just left a comment on the param test cases, as I think they're not quite complete.

@MLnick (Contributor) commented Jan 22, 2018 via email

@SparkQA commented Jan 23, 2018

Test build #86525 has finished for PR 19993 at commit ebc6d16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick (Contributor) left a comment

A couple of minor comments, but those can be left for a clean-up in a follow-up PR if need be.

I'd prefer to merge this to branch-2.3.

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}

A Contributor commented:

I don't think these are used any longer?

    ParamsSuite.testExclusiveParams(new Bucketizer, df, ("outputCol", "feature1"),
      ("splits", Array(-0.5, 0.0, 0.5)))

    // the following should fail because not all the params are set

A Contributor commented:

Technically here we should probably also test the inputCols + outputCols case (i.e. that not setting splitsArray also throws an exception).
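A sketch of the suggested extra case, reusing the column names from the test above:

    // inputCols + outputCols set but splitsArray missing: should also throw
    ParamsSuite.testExclusiveParams(new Bucketizer, df,
      ("inputCols", Array("feature1", "feature2")),
      ("outputCols", Array("result1", "result2")))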

@SparkQA commented Jan 24, 2018

Test build #86589 has finished for PR 19993 at commit 2bc5cb4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WeichenXu123 (Contributor)

+1 merge this to 2.3

@MLnick (Contributor) left a comment

LGTM

asfgit pushed a commit that referenced this pull request Jan 26, 2018
[SPARK-22799][ML] Bucketizer should throw exception if single- and multi-column params are both set

## What changes were proposed in this pull request?

Currently the behavior is inconsistent when both single- and multi-column parameters are supported: in some cases exceptions are thrown, in others only a warning is logged. In the discussion at https://issues.apache.org/jira/browse/SPARK-8418?focusedCommentId=16275049&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16275049, the decision was to throw an exception.

This PR makes `Bucketizer` throw an exception instead of logging a warning.

## How was this patch tested?

Modified unit tests.

Author: Marco Gaido <marcogaido91@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #19993 from mgaido91/SPARK-22799.

(cherry picked from commit cd3956d)
Signed-off-by: Nick Pentreath <nickp@za.ibm.com>
@MLnick (Contributor) commented Jan 26, 2018

Merged to master / branch-2.3

@MLnick (Contributor) commented Jan 26, 2018

Thanks @mgaido91 and @jkbradley for working on this, and thanks to the others for the reviews.

@asfgit closed this in cd3956d Jan 26, 2018