Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 #17706

Closed
wants to merge 2 commits into from

Conversation

WeichenXu123
Copy link
Contributor

@WeichenXu123 WeichenXu123 commented Apr 20, 2017

What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the coefficientMatrix by the global coeffs means.

In fact the coefficientMatrix should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose { w0, w1, .. w(K-1) } make up the coefficientMatrix,
then { w0 + c, w1 + c, ... w(K - 1) + c} will also be the equivalent solution.
c is an arbitrary vector of numFeatures dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the coefficientMatrix on each feature dimension separately.

We can also confirm this through R library glmnet, that MLOR in glmnet always generate coefficients result that the sum of each dimension is all zero, when reg == 0.

How was this patch tested?

Tests added.

@WeichenXu123 WeichenXu123 changed the title fix MLOR coeffs centering when reg == 0 [ML] fix MLOR coeffs centering when reg == 0 Apr 20, 2017
@SparkQA
Copy link

SparkQA commented Apr 20, 2017

Test build #75990 has finished for PR 17706 at commit 1d0fb87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Apr 20, 2017

@WeichenXu123 Thanks for the pr. Is there a JIRA? Why is testing "not applicable"? Seems you are correct on this, but could you please provide a good reference?

@WeichenXu123 WeichenXu123 changed the title [ML] fix MLOR coeffs centering when reg == 0 [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 Apr 21, 2017
@SparkQA
Copy link

SparkQA commented Apr 21, 2017

Test build #76018 has started for PR 17706 at commit a694d13.

@WeichenXu123
Copy link
Contributor Author

Jenkins test this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2017

Test build #76021 has finished for PR 17706 at commit a694d13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

LGTM, thanks for catching this. cc @dbtsai

@dbtsai
Copy link
Member

dbtsai commented Apr 21, 2017

LGTM. Thanks!

asfgit pushed a commit that referenced this pull request Apr 21, 2017
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the `coefficientMatrix` by the global coeffs means.

In fact the `coefficientMatrix` should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K - 1) + c}` will also be the equivalent solution.
`c` is an arbitrary vector of `numFeatures` dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17706 from WeichenXu123/mlor_center.

(cherry picked from commit eb00378)
Signed-off-by: DB Tsai <dbtsai@dbtsai.com>
@asfgit asfgit closed this in eb00378 Apr 21, 2017
@@ -1204,6 +1207,9 @@ class LogisticRegressionSuite
-0.3180040, 0.9679074, -0.2252219, -0.4319914,
0.2452411, -0.6046524, 0.1050710, 0.1180180), isTransposed = true)

model1.coefficientMatrix.colIter.foreach(v => assert(v.toArray.sum ~== 0.0 absTol eps))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we tested that the coefficients have zero mean using:

assert(model1.coefficientMatrix.toArray.sum ~== 0.0 absTol eps)

We should replace every instance of that test with this new one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks right to me. Let's add a test that failing the original implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LBFGS seems to automatically find a solution where the coefficients for each feature index sum to zero, so I'm not sure of a way to find a case where this does not happen, TBH.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. In theory, LBFGS can not see this since all LBFGS knows is the objective function. I think if we feed it with different solutions by justing a constant vector, LBFGS will stop as well. I think maybe it's related to how we setup the initial condition leading to the solution we get.

@dbtsai
Copy link
Member

dbtsai commented Apr 21, 2017

Merged into master and 2.2 branch. Thanks.

@WeichenXu123 WeichenXu123 deleted the mlor_center branch April 22, 2017 01:44
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

When reg == 0, MLOR has multiple solutions and we need to centralize the coeffs to get identical result.
BUT current implementation centralize the `coefficientMatrix` by the global coeffs means.

In fact the `coefficientMatrix` should be centralized on each feature index itself.
Because, according to the MLOR probability distribution function, it can be proven easily that:
suppose `{ w0, w1, .. w(K-1) }` make up the `coefficientMatrix`,
then `{ w0 + c, w1 + c, ... w(K - 1) + c}` will also be the equivalent solution.
`c` is an arbitrary vector of `numFeatures` dimension.
reference
https://core.ac.uk/download/pdf/6287975.pdf

So that we need to centralize the `coefficientMatrix` on each feature dimension separately.

**We can also confirm this through R library `glmnet`, that MLOR in `glmnet` always generate coefficients result that the sum of each dimension is all `zero`, when reg == 0.**

## How was this patch tested?

Tests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes apache#17706 from WeichenXu123/mlor_center.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants