[SPARK-30398][ML] PCA/RegressionMetrics/RowMatrix avoid unnecessary computation #27059

Closed
wants to merge 3 commits into from

zhengruifeng
Contributor

What changes were proposed in this pull request?

Use .ml.Summarizer instead of .mllib.MultivariateOnlineSummarizer, so that only the metrics each caller actually needs are computed.

Why are the changes needed?

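To avoid computing unused metrics: MultivariateOnlineSummarizer always updates every statistic it supports, whereas Summarizer computes only the metrics that are requested.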
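
The idea can be illustrated with a small, Spark-free sketch. Everything in it (SelectiveSummarySketch, Buffer, the Metric cases) is hypothetical and invented for illustration; it is not the SummarizerBuffer code touched by this PR. It only shows how an aggregator that knows which metrics were requested can skip allocating and updating state for the others:

```scala
// Hypothetical, simplified illustration; NOT Spark's SummarizerBuffer.
object SelectiveSummarySketch {
  sealed trait Metric
  case object Mean extends Metric
  case object Max extends Metric

  // Aggregates fixed-length vectors, tracking only the requested metrics.
  final class Buffer(requested: Set[Metric], dim: Int) {
    private var count = 0L

    // State is allocated only for metrics that were asked for.
    private val sums: Array[Double] =
      if (requested.contains(Mean)) new Array[Double](dim) else Array.empty[Double]
    private val maxima: Array[Double] =
      if (requested.contains(Max)) Array.fill(dim)(Double.NegativeInfinity)
      else Array.empty[Double]

    def add(v: Array[Double]): this.type = {
      require(v.length == dim, s"expected $dim values, got ${v.length}")
      count += 1
      // Each update loop runs only if its metric was requested.
      if (requested.contains(Mean)) {
        var i = 0
        while (i < dim) {
          sums(i) += v(i)
          i += 1
        }
      }
      if (requested.contains(Max)) {
        var i = 0
        while (i < dim) {
          if (v(i) > maxima(i)) maxima(i) = v(i)
          i += 1
        }
      }
      this
    }

    def mean: Array[Double] = {
      require(requested.contains(Mean), "mean was not requested")
      require(count > 0, "no values added")
      sums.map(_ / count)
    }

    def max: Array[Double] = {
      require(requested.contains(Max), "max was not requested")
      require(count > 0, "no values added")
      maxima.clone()
    }
  }
}
```

A caller that needs only the mean (for example, to center data) would construct the buffer with just Mean and never pay for the max bookkeeping; asking such a buffer for max fails fast instead of returning a stale value.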

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

@SparkQA
SparkQA commented Dec 31, 2019

Test build #115988 has finished for PR 27059 at commit e34c4fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -21,6 +21,7 @@ import scala.annotation.varargs

import org.apache.spark.annotation.Since
import org.apache.spark.api.java.{JavaDoubleRDD, JavaRDD}
import org.apache.spark.ml.stat.SummaryBuilderImpl._
Member

Design-wise, can we minimize the places where mllib calls ml? And is it possible to expose this without reaching into the "Impl" class? It looks a little funny.

Contributor Author

That is only because the current aggregator is an inner class, org.apache.spark.ml.stat.SummaryBuilderImpl.SummarizerBuffer.
I guess I need to move it outside of SummaryBuilderImpl.

init

init

init

init
minimize import of ml
@SparkQA
SparkQA commented Jan 2, 2020

Test build #116005 has finished for PR 27059 at commit a675881.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
SparkQA commented Jan 2, 2020

Test build #116011 has finished for PR 27059 at commit 359a911.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Most of the change here is breaking out the inner class, right?

@srowen
Member

srowen commented Jan 4, 2020

Merged to master

@srowen srowen closed this in c42fbc7 Jan 4, 2020
@zhengruifeng zhengruifeng deleted the pac_summarizer branch January 6, 2020 02:03