[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 #17706
Conversation
Test build #75990 has finished for PR 17706 at commit
@WeichenXu123 Thanks for the PR. Is there a JIRA? Why is testing "not applicable"? Seems you are correct on this, but could you please provide a good reference?
Test build #76018 has started for PR 17706 at commit
Jenkins test this please
Test build #76021 has finished for PR 17706 at commit
LGTM, thanks for catching this. cc @dbtsai
LGTM. Thanks!
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple equivalent solutions, so we need to center the coefficients to obtain a unique, reproducible result. However, the current implementation centers `coefficientMatrix` by the global mean of all coefficients. In fact, `coefficientMatrix` should be centered on each feature index separately, because, from the MLOR probability distribution function, it is easy to prove that if `{ w0, w1, ..., w(K-1) }` make up the `coefficientMatrix`, then `{ w0 + c, w1 + c, ..., w(K-1) + c }` is an equivalent solution, where `c` is an arbitrary vector of dimension `numFeatures`. Reference: https://core.ac.uk/download/pdf/6287975.pdf

So we need to center `coefficientMatrix` on each feature dimension separately. **We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always produces coefficients whose sum over classes is zero for every feature dimension.**

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17706 from WeichenXu123/mlor_center.

(cherry picked from commit eb00378)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
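For reference, the invariance claimed above can be written out explicitly. The derivation below is not part of the PR text; the notation is generic, with `x` a feature vector, `w_k` the class-`k` row of the coefficient matrix, and `b_k` the class-`k` intercept:

```latex
% Multinomial logistic regression (softmax) class probabilities:
%   P(y = k | x) = exp(w_k^T x + b_k) / sum_j exp(w_j^T x + b_j)
% Adding the same vector c to every w_k (and the same scalar d to every b_k)
% leaves all class probabilities unchanged, because the common factor cancels:
\[
  \frac{\exp\!\big((w_k + c)^\top x + (b_k + d)\big)}
       {\sum_{j=0}^{K-1} \exp\!\big((w_j + c)^\top x + (b_j + d)\big)}
  = \frac{\exp(c^\top x + d)\,\exp(w_k^\top x + b_k)}
         {\exp(c^\top x + d)\,\sum_{j=0}^{K-1}\exp(w_j^\top x + b_j)}
  = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=0}^{K-1}\exp(w_j^\top x + b_j)}.
\]
% Since c can differ per feature, the ambiguity is per feature, i.e. per column
% of the numClasses x numFeatures coefficientMatrix, so subtracting each
% column's mean (rather than the global mean) is what pins down a unique solution.
```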
@@ -1204,6 +1207,9 @@ class LogisticRegressionSuite
-0.3180040, 0.9679074, -0.2252219, -0.4319914,
0.2452411, -0.6046524, 0.1050710, 0.1180180), isTransposed = true)
model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
Previously, we tested that the coefficients have zero mean using:
assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)
We should replace every instance of that test with this new one.
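To make the difference between the two assertions concrete, here is a minimal sketch in plain Scala (illustrative values and names only, not the actual Spark test code) in which the old global-sum check passes while the new per-column check correctly fails:

```scala
// Toy 2-class x 2-feature coefficient "matrix": rows are classes, columns are features.
object CenteringCheckSketch {
  def main(args: Array[String]): Unit = {
    val eps = 1e-12
    val coefficients = Array(
      Array(1.0, -2.0), // class 0
      Array(1.0,  0.0)  // class 1
    )
    val numClasses  = coefficients.length
    val numFeatures = coefficients.head.length

    // Old check: the sum of *all* entries is ~0. Here it is 0.0, so it passes.
    val globalSum = coefficients.flatten.sum
    println(s"global sum = $globalSum (old check passes: ${math.abs(globalSum) < eps})")

    // New check: the sum over classes must be ~0 for *every* feature column.
    // Here the column sums are 2.0 and -2.0, so it correctly fails.
    val columnSums = (0 until numFeatures).map(j => coefficients.map(_(j)).sum)
    println(s"column sums = ${columnSums.mkString(", ")} " +
      s"(new check passes: ${columnSums.forall(s => math.abs(s) < eps)})")

    // The fix amounts to subtracting each column's mean, after which every
    // feature column sums to zero.
    val centered = coefficients.map(_.zipWithIndex.map { case (v, j) =>
      v - columnSums(j) / numClasses
    })
    centered.foreach(row => println(row.mkString("[", ", ", "]")))
  }
}
```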
The fix looks right to me. Let's add a test that fails with the original implementation.
LBFGS seems to automatically find a solution where the coefficients for each feature index sum to zero, so I'm not sure of a way to find a case where this does not happen, TBH.
Interesting. In theory, LBFGS cannot see this, since all LBFGS knows is the objective function. I think if we feed it different solutions obtained by adjusting with a constant vector, LBFGS will stop there as well. Maybe it's related to how we set up the initial condition, which leads to the solution we get.
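One possible explanation, offered here as a conjecture rather than something established in this thread: with reg == 0, the per-class gradients of the MLOR negative log-likelihood cancel when summed over classes, so if the initial coefficients are column-centered (e.g. all zeros) and every search direction is built from gradients, the iterates stay column-centered:

```latex
% Gradient of the negative log-likelihood with respect to the class-k weights:
%   \nabla_{w_k} L = \sum_i ( P(y_i = k | x_i) - 1[y_i = k] ) x_i
% Summing over the K classes, the probabilities sum to 1 and the indicator
% sums to 1 for every example, so the class-wise gradients cancel:
\[
  \sum_{k=0}^{K-1} \nabla_{w_k} L
  = \sum_i \Big( \sum_{k=0}^{K-1} P(y_i = k \mid x_i)
               - \sum_{k=0}^{K-1} \mathbf{1}[y_i = k] \Big)\, x_i
  = \sum_i (1 - 1)\, x_i = 0 .
\]
% If every L-BFGS direction lies in the span of such gradients and the starting
% point has zero column sums, each iterate keeps zero column sums, which would
% explain why the unpenalized solutions in the tests already appear centered.
```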
Merged into master and 2.2 branch. Thanks.