[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 #17706
Conversation
Test build #75990 has finished for PR 17706 at commit
@WeichenXu123 Thanks for the PR. Is there a JIRA? Why is testing "not applicable"? Seems you are correct on this, but could you please provide a good reference?
Test build #76018 has started for PR 17706 at commit
Jenkins test this please
Test build #76021 has finished for PR 17706 at commit
LGTM, thanks for catching this. cc @dbtsai
LGTM. Thanks!
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple equivalent solutions, so we need to center the coefficients to obtain a unique, reproducible result. However, the current implementation centers `coefficientMatrix` by the global mean of all coefficients. In fact, `coefficientMatrix` should be centered on each feature index separately, because, from the MLOR probability distribution function, it is easy to prove that if `{ w0, w1, ..., w(K-1) }` make up the `coefficientMatrix`, then `{ w0 + c, w1 + c, ..., w(K-1) + c }` is an equivalent solution, where `c` is an arbitrary vector of dimension `numFeatures`. Reference: https://core.ac.uk/download/pdf/6287975.pdf

So we need to center `coefficientMatrix` on each feature dimension separately. **We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always produces coefficients whose sum over classes is zero for every feature dimension.**

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17706 from WeichenXu123/mlor_center.

(cherry picked from commit eb00378)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
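For reference, the invariance claimed above can be written out explicitly. The derivation below is not part of the PR text; the notation is generic, with `x` a feature vector, `w_k` the class-`k` row of the coefficient matrix, and `b_k` the class-`k` intercept:

```latex
% Multinomial logistic regression (softmax) class probabilities:
%   P(y = k | x) = exp(w_k^T x + b_k) / sum_j exp(w_j^T x + b_j)
% Adding the same vector c to every w_k (and the same scalar d to every b_k)
% leaves all class probabilities unchanged, because the common factor cancels:
\[
  \frac{\exp\!\big((w_k + c)^\top x + (b_k + d)\big)}
       {\sum_{j=0}^{K-1} \exp\!\big((w_j + c)^\top x + (b_j + d)\big)}
  = \frac{\exp(c^\top x + d)\,\exp(w_k^\top x + b_k)}
         {\exp(c^\top x + d)\,\sum_{j=0}^{K-1}\exp(w_j^\top x + b_j)}
  = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=0}^{K-1}\exp(w_j^\top x + b_j)}.
\]
% Since c can differ per feature, the ambiguity is per feature, i.e. per column
% of the numClasses x numFeatures coefficientMatrix, so subtracting each
% column's mean (rather than the global mean) is what pins down a unique solution.
```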
@@ -1204,6 +1207,9 @@ class LogisticRegressionSuite
-0.3180040, 0.9679074, -0.2252219, -0.4319914,
0.2452411, -0.6046524, 0.1050710, 0.1180180), isTransposed = true)
model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
Previously, we tested that the coefficients have zero mean using:
assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)
We should replace every instance of that test with this new one.
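To make the difference between the two assertions concrete, here is a minimal sketch in plain Scala (illustrative values and names only, not the actual Spark test code) in which the old global-sum check passes while the new per-column check correctly fails:

```scala
// Toy 2-class x 2-feature coefficient "matrix": rows are classes, columns are features.
object CenteringCheckSketch {
  def main(args: Array[String]): Unit = {
    val eps = 1e-12
    val coefficients = Array(
      Array(1.0, -2.0), // class 0
      Array(1.0,  0.0)  // class 1
    )
    val numClasses  = coefficients.length
    val numFeatures = coefficients.head.length

    // Old check: the sum of *all* entries is ~0. Here it is 0.0, so it passes.
    val globalSum = coefficients.flatten.sum
    println(s"global sum = $globalSum (old check passes: ${math.abs(globalSum) < eps})")

    // New check: the sum over classes must be ~0 for *every* feature column.
    // Here the column sums are 2.0 and -2.0, so it correctly fails.
    val columnSums = (0 until numFeatures).map(j => coefficients.map(_(j)).sum)
    println(s"column sums = ${columnSums.mkString(", ")} " +
      s"(new check passes: ${columnSums.forall(s => math.abs(s) < eps)})")

    // The fix amounts to subtracting each column's mean, after which every
    // feature column sums to zero.
    val centered = coefficients.map(_.zipWithIndex.map { case (v, j) =>
      v - columnSums(j) / numClasses
    })
    centered.foreach(row => println(row.mkString("[", ", ", "]")))
  }
}
```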
The fix looks right to me. Let's add a test that fails with the original implementation.
LBFGS seems to automatically find a solution where the coefficients for each feature index sum to zero, so I'm not sure of a way to find a case where this does not happen, TBH.
Interesting. In theory, LBFGS cannot see this, since all LBFGS knows is the objective function. I think if we feed it different solutions obtained by adjusting with a constant vector, LBFGS will stop there as well. Maybe it's related to how we set up the initial condition, which leads to the solution we get.
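One possible explanation, offered here as a conjecture rather than something established in this thread: with reg == 0, the per-class gradients of the MLOR negative log-likelihood cancel when summed over classes, so if the initial coefficients are column-centered (e.g. all zeros) and every search direction is built from gradients, the iterates stay column-centered:

```latex
% Gradient of the negative log-likelihood with respect to the class-k weights:
%   \nabla_{w_k} L = \sum_i ( P(y_i = k | x_i) - 1[y_i = k] ) x_i
% Summing over the K classes, the probabilities sum to 1 and the indicator
% sums to 1 for every example, so the class-wise gradients cancel:
\[
  \sum_{k=0}^{K-1} \nabla_{w_k} L
  = \sum_i \Big( \sum_{k=0}^{K-1} P(y_i = k \mid x_i)
               - \sum_{k=0}^{K-1} \mathbf{1}[y_i = k] \Big)\, x_i
  = \sum_i (1 - 1)\, x_i = 0 .
\]
% If every L-BFGS direction lies in the span of such gradients and the starting
% point has zero column sums, each iterate keeps zero column sums, which would
% explain why the unpenalized solutions in the tests already appear centered.
```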
Merged into master and 2.2 branch. Thanks.