[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test #14597
Conversation
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
    new ChiSqSelectorModel(indices)
  }
}

/**
 * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
Is there any link to document what this means? I had to double-check it means what I think it means and could only find indirect references. We might very briefly explain it as a bound on the likelihood that the feature only by chance appears to be predictive, given the data, and even give an indicative common value like 0.05.
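The bound described above can be illustrated with a small sketch (pure Python with made-up p-values; not Spark's API): a feature is kept only when its chi-squared p-value falls below the chosen alpha, commonly 0.05.

```python
# Hypothetical p-values for four features. In FPR selection, alpha bounds
# the chance that a kept feature only appears predictive by accident.
ALPHA = 0.05
p_values = {0: 0.001, 1: 0.20, 2: 0.03, 3: 0.77}

# Keep exactly the features whose p-value is below alpha.
selected = sorted(i for i, p in p_values.items() if p < ALPHA)
print(selected)  # features 0 and 2 pass the 0.05 threshold
```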
Hi @srowen, here is a link that introduces p-values for feature selection: http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/
Great, that's worth putting in the comment.
This would also need to modify the implementation in .ml.
Hi @srowen, I can modify the implementation in .ml to accommodate the new params. Thanks.
@srowen I've checked our thread with @mengxr on that feature #1484.
@mpjlu It seems that FPR feature selection should not modify the code of the existing ChiSqSelector.
OK, makes sense @avulanov, though I'm not sure why the model can't sort the indices if it requires this as an internal detail. No big deal; after this change it may not matter. Conceptually, though, the output of chi-squared feature selection certainly implies an ordering, because it computes a p-value for each feature. It's useful info and there's a use for it here now.
Yes, it seems that the index sort can be done inside the model. With regards to the sort by p-value, I have taken a brief look at chi-squared feature selection in scikit-learn and Weka, and they don't seem to sort the output.
Hi @avulanov. In general, FPR feature selection should not modify the code of the existing ChiSqSelector, as we have implemented in this PR. But if we need to reuse the ChiSqTestResult (Statistics.chiSqTest(data)), it is better to modify the code of ChiSqSelector. In scikit-learn, each of SelectKBest, SelectFpr, SelectPercentile, and so on is a separate object, as we implemented in this PR. The advantage of this approach is consistency across the library: everything uses the same Estimator/Model style. The disadvantage is that it cannot reuse the results of the score function. @srowen
…s, Percentile, and Fpr selector
…ntile and FPR selector
Hi @srowen, I have added the parameter to control the feature selection type, and the indices are sorted in the model internally as we discussed. As for passing the p-value to the model, this update does not include it, because for the KBest and Percentile selections the fit function uses the ChiSqTestResult statistic to generate the model, while for Fpr the fit function uses the ChiSqTestResult p-value. So it may be better to pass the whole ChiSqTestResult to the model and expose it to the caller. I think it is better to submit another PR for the "pass value to model and expose to the caller" problem, because much code would change for it: which data should be passed to the model, how to save the model, and how to test the model.
def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
@Since("2.1.0")
def setNumTopFeatures(value: Int): this.type = {
  set(selectorType, "KBest")
Ideally, let's make these values like "KBest" some fields in a private object
Edit: oh, this was done below. They can be used here? and ideally they are hidden as an implementation detail.
I think this will require a little update to the Python API to match. Not sure about SparkR.
Hi @srowen, I will update the Python API to match these changes. For now, the current Python API does not conflict with the changes.
Test build #65357 has finished for PR 14597 at commit
Test build #65359 has finished for PR 14597 at commit
The last thing to finish this off is to add a bit of documentation -- a brief summary of the three possible usages in the class doc. I don't think we need additional examples. That would at least help make sure people see the functionality.
@@ -68,8 +99,23 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
  def this() = this(Identifiable.randomUID("chiSqSelector"))

  /** @group setParam */
  @Since("1.6.0")
  def setNumTopFeatures(value: Int): this.type = set(numTopFeatures, value)
  @Since("2.1.0")
Oh, I just noticed this: it should still say "since 1.6.0"
@Since("1.3.0")
class ChiSqSelector @Since("1.3.0") (
    @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
  @Since("2.1.0")
Likewise the class is still "since 1.3.0"
.. versionadded:: 1.4.0
"""
def __init__(self, numTopFeatures):
def __init__(self):
Hm, does this actually remove the existing constructor or am I missing it? it should be possible to use the existing constructor that sets numTopFeatures, still.
Also, I'm not sure what the relationship between versionadded and @since is, but it seems like you're following the convention here.
var alpha: Double = 0.05
var selectorType = ChiSqSelectorType.KBest

@Since("1.3.0")
The existing constructor should still have javadoc maybe pointing to the setNumTopFeatures method to say that's the effect it has
Thanks very much. I am on holiday now and will update the code this Sunday.
Test build #65559 has finished for PR 14597 at commit
Test build #65563 has finished for PR 14597 at commit
@@ -271,28 +271,75 @@ def transform(self, vector):
        return JavaVectorTransformer.transform(self, vector)

class ChiSqSelectorType:
@MLnick do you have an opinion on the Python style?
}

/**
 * Chi-Squared feature selection, which selects categorical features to use for predicting a
 * categorical label.
 * The selector supports three selection methods: KBest, Percentile and FPR.
This is a good start but I think we could say some more. I suggest something like:

The selector supports three selection methods: `KBest`, `Percentile`, and `FPR`. `KBest` chooses the k top features according to a chi-squared test. `Percentile` is similar but chooses a fraction of all features instead of a fixed number. `FPR` chooses all features whose false positive rate meets some threshold.

Should this doc be applied to Python too?
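As a rough illustration of the three selection methods described in that doc, here is a minimal pure-Python sketch (not Spark's implementation; the function name, parameter defaults, and score values are all invented) that dispatches on the selector type over `(index, (statistic, p_value))` pairs:

```python
# Sketch of KBest / Percentile / FPR selection over precomputed
# chi-squared scores. Each entry is (feature_index, (statistic, p_value)).
def select(scores, method, num_top=2, percentile=0.5, alpha=0.05):
    if method == "KBest":
        # Top num_top features by chi-squared statistic.
        chosen = sorted(scores, key=lambda t: -t[1][0])[:num_top]
    elif method == "Percentile":
        # Same ranking, but keep a fraction of all features.
        chosen = sorted(scores, key=lambda t: -t[1][0])[:int(len(scores) * percentile)]
    elif method == "FPR":
        # Keep every feature whose p-value beats the alpha threshold.
        chosen = [t for t in scores if t[1][1] < alpha]
    else:
        raise ValueError("unknown selector type: " + method)
    # The model sorts the selected indices internally, as discussed above.
    return sorted(i for i, _ in chosen)

scores = [(0, (10.0, 0.001)), (1, (2.0, 0.30)), (2, (6.0, 0.01)), (3, (0.5, 0.70))]
print(select(scores, "KBest"))       # [0, 2]
print(select(scores, "Percentile"))  # [0, 2]
print(select(scores, "FPR"))         # [0, 2]
```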
Thanks @srowen, I have updated the javadoc.
Test build #65585 has finished for PR 14597 at commit
This is ready to go ... except that it needs a rebase now, sorry.
No problem, thanks very much @srowen.
Test build #65658 has finished for PR 14597 at commit
Nice one. I like this change. Merged to master.
…hon API.

## What changes were proposed in this pull request?

apache#14597 modified `ChiSqSelector` to support the `fpr` type selector; however, it left some issues to be addressed:

* We should allow users to set the selector type explicitly rather than switching it by calling different setter functions, since the setting order can cause unexpected behavior. For example, if a user sets both `numTopFeatures` and `percentile`, it will train a `kbest` or `percentile` model depending on the order of the calls (the later setting wins). This confuses users, so we should allow them to set the selector type explicitly. We handle similar issues elsewhere in the ML code base, such as `GeneralizedLinearRegression` and `LogisticRegression`.
* Meanwhile, if more than one parameter besides `alpha` can be set for the `fpr` model, we cannot handle it elegantly in the existing framework, and similarly for the `kbest` and `percentile` models. Setting the selector type explicitly solves this issue as well.
* If users are allowed to set the selector type explicitly, we should handle parameter interaction: for example, if a user sets `selectorType = percentile` and `alpha = 0.1`, we should notify them that the parameter `alpha` will take no effect. We should handle complex parameter interaction checks in `transformSchema`. (FYI apache#11620)
* We should use lower-case selector type names to follow MLlib convention.
* Add the ML Python API.

## How was this patch tested?

Unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes apache#15214 from yanboliang/spark-17017.
…ion docs for ChiSqSelector

## What changes were proposed in this pull request?

A follow-up for apache#14597 to update the feature selection docs for ChiSqSelector.

## How was this patch tested?

Generated HTML docs. They can be previewed at:

* ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
* mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

Author: Shuai Lin <linshuai2012@gmail.com>

Closes apache#15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. In this PR we add a chi-squared selector based on the False Positive Rate (FPR) test, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
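For background, the chi-squared statistic that such a selector scores each feature with can be computed by hand from a contingency table of feature value versus label. A small pure-Python sketch with illustrative counts (not Spark code; the table values are made up):

```python
# 2x2 contingency table: rows are values of one binary feature,
# columns are the two label classes. Counts are hypothetical.
observed = [[20, 30],
            [30, 20]]

row = [sum(r) for r in observed]           # row totals: [50, 50]
col = [sum(c) for c in zip(*observed)]     # column totals: [50, 50]
total = sum(row)                           # grand total: 100

# Chi-squared statistic: sum over cells of (O - E)^2 / E,
# where E is the expected count under independence.
chi2 = sum(
    (observed[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
    for i in range(2) for j in range(2)
)
print(chi2)  # 4.0: every expected count is 25, every deviation is 5
```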
How was this patch tested?
Added Scala unit tests.