[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test #14597

mpjlu · 2016-08-11T07:21:14Z

What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests. Selection by False Positive Rate (FPR), which keeps the features whose test p-value falls below a threshold, is a popular univariate method. This PR adds a chi-squared selector based on the FPR test, similar to the one implemented in scikit-learn:
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

How was this patch tested?

Added Scala unit tests.

srowen · 2016-08-11T09:16:46Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
    new ChiSqSelectorModel(indices)
  }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.


Is there any link to document what this means? I had to double-check it means what I think it means and could only find indirect references. We might very briefly explain it as a bound on the likelihood that the feature only by chance appears to be predictive, given the data, and even give an indicative common value like 0.05.

Hi @srowen, this link introduces the use of p-values for feature selection: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/

Great, that's worth putting in the comment.
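To make the threshold concrete, here is a minimal Python sketch of FPR-style selection, assuming the per-feature p-values have already been computed. The function name `fpr_filter` and the sample p-values are illustrative and are not taken from this PR.

```python
def fpr_filter(p_values, alpha=0.05):
    """Return indices of features whose p-value is strictly below alpha."""
    return [i for i, p in enumerate(p_values) if p < alpha]


# Illustrative p-values for four features (made-up numbers).
p_values = [0.001, 0.20, 0.04, 0.65]
selected = fpr_filter(p_values, alpha=0.05)
print(selected)
```

With alpha = 0.05, only the features at indices 0 and 2 pass; tightening alpha to 0.01 would keep only index 0.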

srowen · 2016-08-11T09:50:03Z

This would also need to modify the implementation in .ml to somehow accommodate the new params.

mpjlu · 2016-08-11T10:13:48Z

Hi, @srowen , I can modify the implementation in .ml to accommodate the new params. Thanks.

avulanov · 2016-08-12T10:19:50Z

@srowen I've checked our thread with @mengxr on that feature #1484.

We preserve the order of the indices so that features can be selected in a single loop (i.e. in linear time). Here is the code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala#L74. The logic of the feature selector, which selects the top N features, does not imply that it sorts the features by their chi-square value. A parameter must be introduced if some use case requires that.
We were planning to include chi-square values in the model later if needed; see the comment on #1484, "[MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection".
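The single-pass selection described above relies on both index lists being sorted in ascending order. The following Python sketch shows that idea for a sparse vector. `compress_sparse` and its arguments are hypothetical names, and this is an illustration rather than the code at the linked line.

```python
def compress_sparse(indices, values, selected):
    """Keep only the entries of a sparse vector whose index is in `selected`.

    `indices` (the non-zero positions) and `selected` (the chosen feature
    indices) must both be sorted ascending, so one merge-style pass suffices.
    Returns (new_size, new_indices, new_values) for the reduced vector.
    """
    new_indices, new_values = [], []
    i = j = 0
    while i < len(indices) and j < len(selected):
        if indices[i] == selected[j]:
            new_indices.append(j)        # position inside the reduced vector
            new_values.append(values[i])
            i += 1
            j += 1
        elif indices[i] > selected[j]:
            j += 1                       # selected feature is zero in this vector
        else:
            i += 1                       # non-zero entry was not selected
    return len(selected), new_indices, new_values


# Sparse vector with non-zeros at 0, 2 and 5; features 2, 3 and 5 are kept.
result = compress_sparse([0, 2, 5], [1.0, 3.0, 7.0], [2, 3, 5])
print(result)
```

If `selected` were unsorted, the two-pointer walk would skip matches, which is why the order of the indices matters for this approach.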

@mpjlu It seems that FPR feature selection should not modify the code of the existing ChiSqSelector, because FPR feature selection works on top of a scoring function rather than on top of another selector. The scoring function is a parameter, and it might be chi-square. For example, please refer to the scikit-learn FPR implementation mentioned above, which uses ANOVA as its default scoring function: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFpr.html#sklearn.feature_selection.SelectFpr.
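The design point here, FPR selection layered on a pluggable scoring function, can be sketched in plain Python. Everything named below (`chi2_binary`, `select_fpr_with`, the toy data) is hypothetical. The scorer is a simplified Pearson chi-squared test that is only valid for a binary feature against a binary label (one degree of freedom). It is not Spark's or scikit-learn's implementation.

```python
import math
from collections import Counter


def chi2_binary(feature, labels):
    """Pearson chi-squared test for a binary feature vs. a binary label.

    Returns (statistic, p_value). The p-value formula assumes exactly one
    degree of freedom, i.e. a 2x2 contingency table.
    """
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            expected = f_count * l_count / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def select_fpr_with(score_fn, columns, labels, alpha=0.05):
    """Keep the columns whose p-value under `score_fn` is below alpha."""
    return [i for i, col in enumerate(columns) if score_fn(col, labels)[1] < alpha]


labels = [0] * 10 + [1] * 10
columns = [
    list(labels),   # perfectly predictive feature
    [0, 1] * 10,    # feature independent of the label
]
selected = select_fpr_with(chi2_binary, columns, labels)
print(selected)
```

Because the scorer is passed in as an argument, swapping in a different test (for example an ANOVA F-test) would not require touching `select_fpr_with`.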

srowen · 2016-08-12T10:24:21Z

OK, makes sense @avulanov though I'm not sure why the model can't sort the indices if it requires this as an internal detail. No big deal. After this change it may not matter. Conceptually though, the output of chi-squared feature selection does certainly imply an ordering because it computes a p-value for each feature. It's useful info and there's a use for it here now.

avulanov · 2016-08-12T15:44:40Z

Yes, it seems that the index sort can be done inside the model. With regard to sorting by p-value, I have taken a brief look at chi-squared feature selection in scikit-learn and Weka, and they don't seem to sort the output.

mpjlu · 2016-08-15T04:56:38Z

Hi @avulanov. In general, FPR feature selection should not modify the existing ChiSqSelector code, which is how this PR implements it. But if we want to reuse the ChiSqTestResult (Statistics.chiSqTest(data)), it is better to modify ChiSqSelector.

In scikit-learn, SelectKBest, SelectFpr, SelectPercentile and so on are each a separate object, as we implemented in this PR. The advantage of this approach is consistency across the library: everything uses the same Estimator/Model style. The disadvantage is that the results of the score function cannot be reused. @srowen

mpjlu · 2016-08-18T05:43:21Z

Hi @srowen, I have added a parameter to control the feature selection type.
The usage is like this:
var selector = new ChiSqSelector()
var model = selector.fit(df) // by default, the selector selects numTopFeatures (50) features
var newModel = selector.selectKBest(10) // or: var newModel = selector.selectPercentile(5), and so on
You can fit the DataFrame once and generate the model multiple times.

The indices are sorted internally in the model, as we discussed.

This update does not pass the p-values to the model. For KBest and Percentile selection, the fit function uses ChiSqTestResult.statistic to generate the model; for FPR, it uses ChiSqTestResult.pValue. So it may be better to pass the ChiSqTestResult to the model and expose it to the caller. I think that is better done in a separate PR, because it requires many code changes, including which data should be passed to the model, how to save the model, and how to test the model.
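The comment above says that KBest and Percentile rank features by the test statistic while FPR filters on the p-value, all from one set of test results. A rough Python sketch of that split follows. `TestResult`, the three function names, the sample numbers, and the choice to express the percentile as a fraction between 0 and 1 are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for one per-feature chi-squared test result.
TestResult = namedtuple("TestResult", ["statistic", "p_value"])


def select_k_best(results, k):
    """Indices of the k features with the largest statistic, sorted ascending."""
    ranked = sorted(range(len(results)), key=lambda i: -results[i].statistic)
    return sorted(ranked[:k])


def select_percentile(results, fraction):
    """Like select_k_best, but keeps a fraction of all features."""
    return select_k_best(results, int(len(results) * fraction))


def select_fpr(results, alpha):
    """Indices of the features whose p-value is below alpha, sorted ascending."""
    return sorted(i for i, r in enumerate(results) if r.p_value < alpha)


# Made-up results for four features; computed once, reused three times.
results = [
    TestResult(12.0, 0.0005),
    TestResult(0.3, 0.58),
    TestResult(5.1, 0.024),
    TestResult(8.2, 0.004),
]
k_best = select_k_best(results, 2)
half = select_percentile(results, 0.5)
by_fpr = select_fpr(results, 0.05)
print(k_best, half, by_fpr)
```

Each function returns its indices in ascending order, matching the point that the indices are sorted before they reach the model.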

srowen · 2016-08-18T10:36:42Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  @Since("2.1.0")
+  def setNumTopFeatures(value: Int): this.type = {
+    set(selectorType, "KBest")


Ideally, let's make values like "KBest" fields in a private object. Edit: oh, this was done below. Can they be used here? Ideally they are hidden as an implementation detail.
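The suggestion to keep the type names in one private place, while each setter switches the selector type as in the diff above, might look like the Python sketch below. The class names `_SelectorType` and `SelectorParams`, the default values other than 50 and 0.05, and the string values other than "KBest" are assumptions for illustration and do not reproduce the PR's Scala code.

```python
class _SelectorType:
    """Private holder for the selector type names (hypothetical)."""
    K_BEST = "KBest"
    PERCENTILE = "Percentile"
    FPR = "FPR"


class SelectorParams:
    """Toy parameter holder: each setter also switches the selector type."""

    def __init__(self):
        self.selector_type = _SelectorType.K_BEST
        self.num_top_features = 50
        self.percentile = 0.1   # assumed default, expressed as a fraction
        self.alpha = 0.05

    def set_num_top_features(self, value):
        self.selector_type = _SelectorType.K_BEST
        self.num_top_features = value
        return self

    def set_percentile(self, value):
        self.selector_type = _SelectorType.PERCENTILE
        self.percentile = value
        return self

    def set_alpha(self, value):
        self.selector_type = _SelectorType.FPR
        self.alpha = value
        return self


params = SelectorParams().set_num_top_features(10).set_alpha(0.01)
print(params.selector_type)
```

Callers never spell out the type strings themselves, so the names stay an implementation detail. The chained call also shows the order dependence: whichever setter runs last decides the selector type.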

srowen · 2016-08-18T11:01:09Z

I think this will require a little update to the Python API to match. Not sure about SparkR

mpjlu · 2016-08-18T11:15:32Z

Hi @srowen, I will update the Python API to match these changes. For now, the current Python API does not conflict with the changes.

SparkQA · 2016-09-14T09:27:26Z

Test build #65357 has finished for PR 14597 at commit 6398f4c.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-09-14T09:58:07Z

Test build #65359 has finished for PR 14597 at commit 6cc4c92.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

The last thing to finish this off is to add a bit of documentation -- a brief summary of the three possible usages in the class doc. I don't think we need additional examples. That would at least help make sure people see the functionality.

srowen · 2016-09-15T08:17:36Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

@@ -68,8 +99,23 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
  def this() = this(Identifiable.randomUID("chiSqSelector"))

  /** @group setParam */
-  @Since("1.6.0")
-  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
+  @Since("2.1.0")


Oh, I just noticed this: it should still say "since 1.6.0"

srowen · 2016-09-15T08:18:06Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

-@Since("1.3.0")
-class ChiSqSelector @Since("1.3.0") (
-  @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
+@Since("2.1.0")


Likewise the class is still "since 1.3.0"

srowen · 2016-09-15T08:21:38Z

python/pyspark/mllib/feature.py


    .. versionadded:: 1.4.0
    """
-    def __init__(self, numTopFeatures):
+    def __init__(self):


Hm, does this actually remove the existing constructor, or am I missing it? It should still be possible to use the existing constructor that sets numTopFeatures.

Also, I'm not sure what the relationship between versionadded and @since is, but it seems like you're following the convention here.

srowen · 2016-09-15T08:23:12Z

mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala

+  var alpha: Double = 0.05
+  var selectorType = ChiSqSelectorType.KBest
+
+  @Since("1.3.0")


The existing constructor should still have javadoc, maybe pointing to the setNumTopFeatures method to say that's the effect it has.

mpjlu · 2016-09-15T09:51:44Z

Thanks very much. I am on holiday now and will update the code this Sunday.

SparkQA · 2016-09-18T09:49:16Z

Test build #65559 has finished for PR 14597 at commit 1d2f67f.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2016-09-18T10:23:59Z

Test build #65563 has finished for PR 14597 at commit 6220dd5.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- public static final class UnsignedPrefixComparatorNullsLast extends RadixSortSupport
- public static final class UnsignedPrefixComparatorDescNullsFirst extends RadixSortSupport
- public static final class SignedPrefixComparatorNullsLast extends RadixSortSupport
- public static final class SignedPrefixComparatorDescNullsFirst extends RadixSortSupport
- abstract sealed class NullOrdering
- case class SortOrder(
- case class Length(child: Expression) extends UnaryExpression with ImplicitCastInputTypes
- trait Command extends LeafNode
- trait RunnableCommand extends logical.Command
- case class CreateTable(
- case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
- class SetAccumulator[T] extends AccumulatorV2[T, java.util.Set[T]]

srowen · 2016-09-18T18:08:44Z

python/pyspark/mllib/feature.py

@@ -271,28 +271,75 @@ def transform(self, vector):
        return JavaVectorTransformer.transform(self, vector)


+class ChiSqSelectorType:


@MLnick do you have an opinion on the Python style?

srowen · 2016-09-18T18:11:00Z

mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala

 }

 /**
 * Chi-Squared feature selection, which selects categorical features to use for predicting a
 * categorical label.
+ * The selector supports three selection methods: KBest, Percentile and FPR.


This is a good start but I think we could say some more. I suggest something like ...

The selector supports three selection methods: KBest, Percentile, and FPR. KBest chooses the k top features according to a chi-squared test. Percentile is similar but chooses a fraction of all features instead of a fixed number. FPR chooses all features whose false positive rate meets some threshold.

Should this doc be applied to Python too?

Thanks @srowen, I have updated the javadoc.

SparkQA · 2016-09-19T03:55:22Z

Test build #65585 has finished for PR 14597 at commit 88d2143.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-20T13:02:03Z

This is ready to go ... except that it needs a rebase now, sorry.

mpjlu · 2016-09-20T14:52:15Z

No problem. thanks very much @srowen .

SparkQA · 2016-09-20T17:00:55Z

Test build #65658 has finished for PR 14597 at commit 24f26f2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-09-21T09:18:17Z

Nice one. I like this change. Merged to master.

…hon API.

## What changes were proposed in this pull request?

apache#14597 modified `ChiSqSelector` to support the `fpr` selector type, but it left some issues that need to be addressed:

* We should allow users to set the selector type explicitly rather than switching it implicitly through different setter functions, because the result then depends on the order of the calls. For example, if users set both `numTopFeatures` and `percentile`, a `kbest` or `percentile` model is trained depending on which was set last. This is confusing, so users should be able to set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as in `GeneralizedLinearRegression` and `LogisticRegression`.
* If more than one parameter besides `alpha` can be set for the `fpr` model, the existing framework cannot handle it elegantly. The same holds for the `kbest` and `percentile` models. Setting the selector type explicitly also solves this issue.
* If users can set the selector type explicitly, we should handle parameter interactions: for example, if users set `selectorType = percentile` and `alpha = 0.1`, we should notify them that `alpha` will have no effect. Complex parameter interaction checks should be handled in `transformSchema`. (FYI apache#11620)
* We should use lower-case selector type names to follow MLlib convention.
* Add the ML Python API.

## How was this patch tested?

Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15214 from yanboliang/spark-17017.
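The parameter-interaction check described in this follow-up can be sketched in Python as below. The helper `ignored_params` and its mapping are hypothetical. Only the parameter names (`numTopFeatures`, `percentile`, `alpha`) and the lower-case type names come from the text above, and this is not the logic of Spark's `transformSchema`.

```python
# Which parameter each explicitly chosen selector type actually reads.
_PARAM_FOR_TYPE = {
    "kbest": "numTopFeatures",
    "percentile": "percentile",
    "fpr": "alpha",
}


def ignored_params(selector_type, explicitly_set):
    """Return the explicitly set parameters that the chosen type will ignore."""
    if selector_type not in _PARAM_FOR_TYPE:
        raise ValueError("unknown selector type: %r" % (selector_type,))
    used = _PARAM_FOR_TYPE[selector_type]
    known = set(_PARAM_FOR_TYPE.values())
    return sorted(p for p in explicitly_set if p in known and p != used)


# The example from the text: selectorType = percentile and alpha = 0.1.
ignored = ignored_params("percentile", {"percentile", "alpha"})
for name in ignored:
    print("parameter %s will have no effect for this selector type" % name)
```

With an explicit selector type, the order in which the other parameters were set no longer matters; the check only has to report which of them are unused.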

…ion docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow-up for apache#14597 to update the feature selection docs about ChiSqSelector.

## How was this patch tested?

Generated HTML docs. They can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes apache#15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.

Peng, Meng added 2 commits August 10, 2016 13:40

add a chiSquare Selector based on False Positive Rate (FPR) test

2adebe8

Merge remote-tracking branch 'origin/master' into fprChiSquare

04053ca

mpjlu changed the title ~~Fpr chi square~~ [SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 11, 2016

srowen reviewed Aug 11, 2016

srowen mentioned this pull request Aug 14, 2016

[SPARK-16843][MLLIB] add the percentage ChiSquareSelector feature #14449

Closed

Peng, Meng added 3 commits August 16, 2016 21:36

Configure the ChiSqSelector to reuse ChiSqTestResult by numTopFeatures, Percentile, and Fpr selector

7623563

Config the ChiSqSelector to reuse the ChiSqTestResult by KBest, Percentile and FPR selector

3d6aecb

Merge branch 'master' into fprChiSquare2

026ac85

mpjlu closed this Aug 17, 2016

mpjlu force-pushed the fprChiSquare branch from 04053ca to e28a8c5 Compare August 17, 2016 02:46

add Since annotation

5305709

mpjlu reopened this Aug 18, 2016

mpjlu changed the title ~~[SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test~~ [WIP][SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 18, 2016

mpjlu changed the title ~~[WIP][SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test~~ [SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test Aug 18, 2016

srowen reviewed Aug 18, 2016

Not reuse the ChiSqTestResult to be consistent with other methods

1e8d83a

srowen requested changes Sep 15, 2016


Peng, Meng added 2 commits September 18, 2016 15:43

add javadoc

1d2f67f

fix mima conflict

6220dd5

srowen requested changes Sep 18, 2016


Peng, Meng added 2 commits September 19, 2016 09:34

Merge remote-tracking branch 'origin/master' into fprChiSquare

ce3f8fb

change javadoc

88d2143

fix mima conflict

24f26f2

asfgit closed this in b366f18 Sep 21, 2016

yanboliang mentioned this pull request Sep 23, 2016

[SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector and add ML Python API. #15214

Closed

lins05 mentioned this pull request Sep 25, 2016

[SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector #15236

Closed

yanboliang mentioned this pull request Sep 29, 2016

[SPARK-17704][ML][MLlib] ChiSqSelector performance improvement. #15277

Closed
