Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-18901][ML]:Require in LR LogisticAggregator is redundant #17478

Closed
wants to merge 1 commit into from

Conversation

wangmiao1981
Copy link
Contributor

What changes were proposed in this pull request?

In MultivariateOnlineSummarizer,

add and merge have check for weights and feature sizes. The checks in LR are redundant, which are removed from this PR.

How was this patch tested?

Existing tests.

@SparkQA
Copy link

SparkQA commented Mar 30, 2017

Test build #75382 has finished for PR 17478 at commit 44305bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Mar 30, 2017

I guess I don't see the harm in keeping these checks. Yes, in this case we always call LogisticAggregator after we have gone through the same data with MultivariateOnlineSummarizer, but it may not always be the case. These checks are very small computation in comparison with the add and merge operations. I can imagine future developers finding that these are unchecked in the aggregator and trying to re-implement this logic.

If this were a performance fix, I'd support it. I don't think it is, and so I think the code is more readable and clear if we keep the checks.

@wangmiao1981
Copy link
Contributor Author

@sethah Thanks for your reply! Your suggestion makes sense to me. My intention was to close the JIRA by simple fix. How about we add a test for these checks and close the original JIRA? or you think just mark that JIRA as WON'T Fix? Thanks!

@sethah
Copy link
Contributor

sethah commented Mar 30, 2017

Yeah, I would support adding a unit test to the logistic aggregator (well, all aggregators) for these types of things. I do think it's better to keep them and add a couple tests, but I don't feel too strongly about it. Thoughts? Also, ping @hhbyyh

@hhbyyh
Copy link
Contributor

hhbyyh commented Mar 30, 2017

Thanks for @wangmiao1981 for the PR and @sethah for the comments. Maybe I should be more clear when I created the jira.

I would prefer to remove the require here permanently.

The relevant operation (especially add) could be invoked maybe billions of times (10M recoreds * 100 iterations) during a training. The check for feature dimension and weight should only be carried out once. But here they are conducted in each iteration.

And for the general code practice, I would not keep unnecessary code unless there's clear plan for it. That's just personal preference. cc @jkbradley

@sethah
Copy link
Contributor

sethah commented Mar 30, 2017

Checking the size is a constant time operation, but in add we do also do a linear time dot product. I do not think this affects performance. I don't exactly mind removing it, but not checking something because it's only used in conjunction with some other part of the code where it is checked does not seem a robust solution - at best it's quite unclear. I don't think it's a huge deal either way, but I'm in favor of keeping the checks.

@yanboliang
Copy link
Contributor

I agree with @hhbyyh in this case. The check for feature dimension should only be carried out once rather than running in each iteration, and actually we have checked this before iterations.

@yanboliang
Copy link
Contributor

Merged into master and branch-2.2. Thanks for all.

asfgit pushed a commit that referenced this pull request Apr 24, 2017
## What changes were proposed in this pull request?

In MultivariateOnlineSummarizer,

`add` and `merge` have check for weights and feature sizes. The checks in LR are redundant, which are removed from this PR.

## How was this patch tested?

Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #17478 from wangmiao1981/logit.

(cherry picked from commit 90264ac)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
@yanboliang
Copy link
Contributor

@wangmiao1981 Would you mind to send a follow-up PR to address the same issue for LinearRegression and LinearSVC? Thanks.

@asfgit asfgit closed this in 90264ac Apr 24, 2017
@wangmiao1981
Copy link
Contributor Author

@yanboliang I will do it. Thanks!

ghost pushed a commit to dbtsai/spark that referenced this pull request Apr 25, 2017
…dant

## What changes were proposed in this pull request?

This is a follow-up PR of apache#17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 <wm624@hotmail.com>

Closes apache#17754 from wangmiao1981/followup.
asfgit pushed a commit that referenced this pull request Apr 25, 2017
…dant

## What changes were proposed in this pull request?

This is a follow-up PR of #17478.

## How was this patch tested?

Existing tests

Author: wangmiao1981 <wm624@hotmail.com>

Closes #17754 from wangmiao1981/followup.

(cherry picked from commit 387565c)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
## What changes were proposed in this pull request?

In MultivariateOnlineSummarizer,

`add` and `merge` have check for weights and feature sizes. The checks in LR are redundant, which are removed from this PR.

## How was this patch tested?

Existing tests.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes apache#17478 from wangmiao1981/logit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants