Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test #14597

Closed
wants to merge 27 commits into from

Conversation

mpjlu
Copy link

@mpjlu mpjlu commented Aug 11, 2016

What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

How was this patch tested?

Add Scala ut

@mpjlu mpjlu changed the title Fpr chi square [SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 11, 2016
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
new ChiSqSelectorModel(indices)
}
}

/**
* Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any link to document what this means? I had to double-check it means what I think it means and could only find indirect references. We might very briefly explain it as a bound on the likelihood that the feature only by chance appears to be predictive, given the data, and even give an indicative common value like 0.05.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @srowen , http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/, this is a link to introduce p-value for feature selection.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, that's worth putting in the comment.

@srowen
Copy link
Member

srowen commented Aug 11, 2016

This would also need to modify the implementation in .ml to somehow accommodate the new params.

@mpjlu
Copy link
Author

mpjlu commented Aug 11, 2016

Hi, @srowen , I can modify the implementation in .ml to accommodate the new params. Thanks.

@avulanov
Copy link
Contributor

@srowen I've checked our thread with @mengxr on that feature #1484.

@mpjlu It seems that FPR feature selection should not modify the code of existing ChiSqSelector, because FPR feature selection works on top of a scoring function rather than on top of another selector. Scoring function is a parameter, and it might be Chi-square. For example, please refer to Sklearn's FPR implementation mentioned. It uses ANOVA as a default scoring function http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html#sklearn.feature_selection.SelectFpr.

@srowen
Copy link
Member

srowen commented Aug 12, 2016

OK, makes sense @avulanov though I'm not sure why the model can't sort the indices if it requires this as an internal detail. No big deal. After this change it may not matter. Conceptually though, the output of chi-squared feature selection does certainly imply an ordering because it computes a p-value for each feature. It's useful info and there's a use for it here now.

@avulanov
Copy link
Contributor

Yes, it seems that index sort can be done inside the model. With regards to the sort by p-value, I have taken a brief look at chi-squared feature selection in sci-kit and Weka, and they don't seem to sort the output.

@mpjlu
Copy link
Author

mpjlu commented Aug 15, 2016

Hi @avulanov . In general, FPR feature selection should not modify the code of existing ChiSqSelector, as we have implemented in this PR. But if we need to reuse the ChiSqTestResult (Statistics.chiSqTest(data)), it is better to modify the code of ChiSqSelector.

In Scikit-learn, for each SelectKBest, SelectFpr, SelectPercentile and so on, create an object for it, as we implemented in this PR. The good point of this method is it is consistent across the LIB, all use the same Estimator/Model style. The disadvantage is it cannot reuse the results of score function. @srowen

@mpjlu
Copy link
Author

mpjlu commented Aug 18, 2016

Hi @srowen, I have added the parameter to control the feature selection type.
The usage is like this:
var selector = new ChiSqSelector()
var model = selector.fit(df) // by default, the selector is selection numTopFeatures (50)
var newModel = selector.selectKBest(10), or var newModel = selector.selectPercentile(5), or,,

You can fit the DataFrame one time, and generate the model multi times.

And the indices is sort in the model internally as we have discussed.

For pass the p-value to the model function, this update does not include it. Because for the KBest and Percentile selection, the fit function uses ChiSqTestResult.statics to generate the model. For Fpr, the fit function uses ChiSqTestResult.p-value. So it maybe better to pass ChiSqTestResult to the model and expose to the caller. And I think it is better to submit another PR for "pass value to model and expose to the caller" problem. Because much codes will be changed for this problem, includes which data should be passed to the model, how to save the model, how to test the model.

@mpjlu mpjlu reopened this Aug 18, 2016
@mpjlu mpjlu changed the title [SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test [WIP][SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 18, 2016
@mpjlu mpjlu changed the title [WIP][SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 18, 2016
def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
@Since("2.1.0")
def setNumTopFeatures(value: Int): this.type = {
set(selectorType, "KBest")
Copy link
Member

@srowen srowen Aug 18, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, let's make these values like "KBest" some fields in a private object Edit: oh, this was done below. They can be used here? and ideally they are hidden as an implementation detail.

@srowen
Copy link
Member

srowen commented Aug 18, 2016

I think this will require a little update to the Python API to match. Not sure about SparkR

@mpjlu
Copy link
Author

mpjlu commented Aug 18, 2016

Hi @srowen , I will update the Python API to match this changes. Now, the current Python API is not conflict with the changes.

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65357 has finished for PR 14597 at commit 6398f4c.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 14, 2016

Test build #65359 has finished for PR 14597 at commit 6cc4c92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last thing to finish this off is to add a bit of documentation -- a brief summary of the three possible usages in the class doc. I don't think we need additional examples. That would at least help make sure people see the functionality.

@@ -68,8 +99,23 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
def this() = this(Identifiable.randomUID("chiSqSelector"))

/** @group setParam */
@Since("1.6.0")
def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
@Since("2.1.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just noticed this: it should still say "since 1.6.0"

@Since("1.3.0")
class ChiSqSelector @Since("1.3.0") (
@Since("1.3.0") val numTopFeatures: Int) extends Serializable {
@Since("2.1.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise the class is still "since 1.3.0"


.. versionadded:: 1.4.0
"""
def __init__(self, numTopFeatures):
def __init__(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, does this actually remove the existing constructor or am I missing it? it should be possible to use the existing constructor that sets numTopFeatures, still.

Also I'm not sure what the relationship between versionadded and @since is, but seems like you're following the convention here.

var alpha: Double = 0.05
var selectorType = ChiSqSelectorType.KBest

@Since("1.3.0")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing constructor should still have javadoc maybe pointing to the setNumTopFeatures method to say that's the effect it has

@mpjlu
Copy link
Author

mpjlu commented Sep 15, 2016

Thanks very much, I am in holiday now, will update the code this Sunday.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65559 has finished for PR 14597 at commit 1d2f67f.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2016

Test build #65563 has finished for PR 14597 at commit 6220dd5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public static final class UnsignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class UnsignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • public static final class SignedPrefixComparatorNullsLast extends RadixSortSupport
    • public static final class SignedPrefixComparatorDescNullsFirst extends RadixSortSupport
    • abstract sealed class NullOrdering
    • case class SortOrder(
    • case class Length(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • trait Command extends LeafNode
    • trait RunnableCommand extends logical.Command
    • case class CreateTable(
    • case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
    • class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]]

@@ -271,28 +271,75 @@ def transform(self, vector):
return JavaVectorTransformer.transform(self, vector)


class ChiSqSelectorType:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MLnick do you have an opinion on the Python style?

}

/**
* Chi-Squared feature selection, which selects categorical features to use for predicting a
* categorical label.
* The selector supports three selection methods: KBest, Percentile and FPR.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good start but I think we could say some more. I suggest something like ...

The selector supports three selection methods: KBest, Percentile, and FPR. KBest chooses the k top features according to a chi-squared test. Percentile is similar but chooses a fraction of all features instead of a fixed number. FPR chooses all features whose false positive rate meets some threshold.

Should this doc be applied to Python too?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @srowen, I have updated the javadoc.

@SparkQA
Copy link

SparkQA commented Sep 19, 2016

Test build #65585 has finished for PR 14597 at commit 88d2143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 20, 2016

This is ready to go ... except that it needs a rebase now, sorry.

@mpjlu
Copy link
Author

mpjlu commented Sep 20, 2016

No problem. thanks very much @srowen .

@SparkQA
Copy link

SparkQA commented Sep 20, 2016

Test build #65658 has finished for PR 14597 at commit 24f26f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Sep 21, 2016

Nice one. I like this change. Merged to master.

@asfgit asfgit closed this in b366f18 Sep 21, 2016
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 26, 2016
…hon API.

## What changes were proposed in this pull request?
apache#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI apache#11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15214 from yanboliang/spark-17017.
ghost pushed a commit to dbtsai/spark that referenced this pull request Sep 28, 2016
…ion docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow up for apache#14597 to update feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated html docs. It can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes apache#15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants