
[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code #17094

Closed
wants to merge 18 commits into from

Conversation

sethah (Contributor) commented Feb 28, 2017

What changes were proposed in this pull request?

JIRA: SPARK-19762

The larger changes in this patch are:

  • Adds a DifferentiableLossAggregator trait which is intended to be a common parent trait for all Spark ML aggregator classes. It factors out the common members merge, gradient, loss, and weight from the aggregator subclasses. (A condensed sketch of these pieces follows this list.)
  • Adds an RDDLossFunction which is intended to be the only implementation of Breeze's DiffFunction needed in Spark ML, usable by all other algorithms. It takes the aggregator type as a type parameter and maps the aggregator over an RDD. It also takes an optional regularization loss function for applying the differentiable part of the regularization.
  • Factors the regularization out of the data part of the cost function and treats regularization as a separate, independent cost function which can be evaluated and added to the data cost function.
  • Changes LinearRegression to use this new hierarchy as a proof of concept.
  • Adds the new namespaces o.a.s.ml.optim.loss and o.a.s.ml.optim.aggregator.
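To make the shape of the hierarchy concrete, here is a condensed, dependency-free sketch of the two pieces described above. It is illustrative only: the real classes use Spark's Vector, Broadcast, RDD.treeAggregate, and Breeze's DiffFunction, and their exact signatures may differ. The plain Scala types, the LocalLossFunction name, and the newAggregator/regularization parameters below are assumptions made for the example.

// Condensed sketch (plain Scala, no Spark/Breeze imports) of the proposed hierarchy.
// Types and names here are illustrative assumptions, not the PR's exact API.
trait DifferentiableLossAggregator[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]]
  extends Serializable {
  self: Agg =>  // a concrete subclass must bind Agg to itself

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected val dim: Int
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  /** Add a single data point to this aggregator (algorithm-specific). */
  def add(instance: Datum): Agg

  /** Merge another aggregator; `this` is modified in place and returned. */
  def merge(other: Agg): Agg = {
    require(dim == other.dim,
      s"Dimensions mismatch when merging with another ${getClass.getSimpleName}. " +
        s"Expecting $dim but got ${other.dim}.")
    weightSum += other.weightSum
    lossSum += other.lossSum
    var j = 0
    while (j < dim) {
      gradientSumArray(j) += other.gradientSumArray(j)
      j += 1
    }
    this
  }

  /** Weighted count of instances seen so far. */
  def weight: Double = weightSum

  /** Weighted average loss over the instances seen so far. */
  def loss: Double = lossSum / weightSum

  /** Weighted average gradient over the instances seen so far. */
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}

// The loss function folds an aggregator over the data, then adds the optional
// differentiable regularization term to both the loss and the gradient. The real
// RDDLossFunction does the same thing with RDD.treeAggregate and broadcast coefficients.
class LocalLossFunction[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]](
    data: Seq[Datum],
    newAggregator: Array[Double] => Agg,
    regularization: Option[Array[Double] => (Double, Array[Double])]) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val agg = data.foldLeft(newAggregator(coefficients)) { (a, x) => a.add(x) }
    val (regLoss, regGrad) = regularization.map(_(coefficients))
      .getOrElse((0.0, Array.fill(coefficients.length)(0.0)))
    val totalGrad = agg.gradient.zip(regGrad).map { case (g, r) => g + r }
    (agg.loss + regLoss, totalGrad)
  }
}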

Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.

NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.

BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.

How was this patch tested?

Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.

  • DifferentiableLossAggregatorSuite
  • LeastSquaresAggregatorSuite
  • RDDLossFunctionSuite
  • DifferentiableRegularizationSuite

Below are some performance testing numbers. Tests were run on a 6-node virtual cluster with 44 cores and ~110 GB of RAM; the dataset size is about 37 GB. These are not "large-scale" tests, but the goal is simply to make sure the iteration times don't increase with this patch. Notably, we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think necessary, but I think the patch is ready for review now.

Note: timings are best of 3 runs.

numFeatures  numPoints  maxIter  regParam  elasticNetParam  SPARK-19762 (sec)  master (sec)
5000         1e+06      30       0         0                129.594            131.153
5000         1e+06      30       0.1       0                135.54             136.327
5000         1e+06      30       0.01      0.5              135.148            129.771
50000        100000     30       0         0                145.764            144.096

Follow ups

If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.

@sethah sethah changed the title [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code [SPARK-19762][ML][WIP] Hierarchy for consolidating ML aggregator/loss code Feb 28, 2017
Instance(1.5, 0.2, Vectors.dense(3.0, 0.2))
)

def assertEqual[T, Agg <: DifferentiableLossAggregator[T, Agg]](
sethah (Contributor Author) commented:
make private

extends DiffFunction[BDV[Double]] {

override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
val bcCoefficients = instances.context.broadcast(Vectors.dense(coefficients.data))
sethah (Contributor Author) commented on Feb 28, 2017:
use fromBreeze
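That is, presumably something like the following (assuming the private[spark] helper Vectors.fromBreeze is accessible at this location):

val bcCoefficients = instances.context.broadcast(Vectors.fromBreeze(coefficients))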

/**
* Dummy aggregator that represents least squares cost with no intercept.
*/
class TestAggregator(numFeatures: Int)(coefficients: Vector)
sethah (Contributor Author) commented:
move it into a companion object

sethah (Contributor Author) commented Feb 28, 2017

ping @MLnick @jkbradley

sethah (Contributor Author) commented Feb 28, 2017

Jenkins test this please.

SparkQA commented Feb 28, 2017

Test build #73555 has finished for PR 17094 at commit 9a04d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 28, 2017

Test build #73557 has finished for PR 17094 at commit 9a04d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah (Contributor Author) commented Feb 28, 2017

also cc @hhbyyh

/** Merge two aggregators. The `this` object will be modified in place and returned. */
def merge(other: Agg): Agg = {
require(dim == other.dim, s"Dimensions mismatch when merging with another " +
s"LeastSquaresAggregator. Expecting $dim but got ${other.dim}.")
sethah (Contributor Author) commented:
change this to use getClass.getName
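Presumably the intent is for the message to name the concrete aggregator class rather than hard-coding LeastSquaresAggregator, e.g.:

require(dim == other.dim, s"Dimensions mismatch when merging with another " +
  s"${getClass.getName}. Expecting $dim but got ${other.dim}.")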

@sethah sethah changed the title [SPARK-19762][ML][WIP] Hierarchy for consolidating ML aggregator/loss code [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code Mar 3, 2017
SparkQA commented Mar 3, 2017

Test build #73819 has started for PR 17094 at commit f7e9169.

SparkQA commented Mar 3, 2017

Test build #73820 has started for PR 17094 at commit 46630d1.

sethah force-pushed the ml_aggregators branch 2 times, most recently from 76eda69 to d7dceeb, on March 3, 2017 at 07:22
SparkQA commented Mar 3, 2017

Test build #73821 has started for PR 17094 at commit 76eda69.

sethah (Contributor Author) commented Mar 3, 2017

Removed WIP, think it's ready now :)

SparkQA commented Mar 3, 2017

Test build #73823 has started for PR 17094 at commit d7dceeb.

*
* @tparam T The type of the coefficients being regularized.
*/
trait DifferentiableRegularization[T] extends DiffFunction[T] {
sethah (Contributor Author) commented:
make these private
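For context, a minimal concrete implementation of the trait might look like the following (illustrative sketch; the class name L2Regularization, the regParam parameter, and the Array[Double] coefficient type are assumptions, and the PR's actual version may differ):

class L2Regularization(val regParam: Double)
  extends DifferentiableRegularization[Array[Double]] {

  // Returns (0.5 * regParam * ||coefficients||^2, regParam * coefficients).
  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    var sum = 0.0
    val gradient = new Array[Double](coefficients.length)
    var j = 0
    while (j < coefficients.length) {
      val coef = coefficients(j)
      sum += coef * coef
      gradient(j) = regParam * coef
      j += 1
    }
    (0.5 * regParam * sum, gradient)
  }
}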

sethah (Contributor Author) commented Mar 5, 2017

Jenkins test this please.

SparkQA commented Mar 5, 2017

Test build #73914 has finished for PR 17094 at commit d7dceeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah (Contributor Author) commented May 17, 2017

ping! @MLnick @jkbradley @yanboliang @hhbyyh

Is there any interest in this? I actually think this cleanup will be a precursor to several different improvements (adding more optimized aggregators, adding an optimization library) and that it will be very useful. IMO it's an important change. Otherwise, we keep slapping layers on the current implementation and the code length and complexity keep growing.

I'm happy to take suggestions, make changes, or discuss it further. Thoughts?

SparkQA commented May 17, 2017

Test build #77033 has finished for PR 17094 at commit b55b7fe.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 18, 2017

Test build #77034 has finished for PR 17094 at commit 9461c45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MLnick (Contributor) commented May 18, 2017

In terms of the high-level intention of this, I agree we definitely need it and it should clean things up substantially. I will start taking a look through ASAP. Thanks!

sethah (Contributor Author) commented May 18, 2017

Thanks @MLnick! I am happy to discuss splitting this into smaller bits as well, if it can make things easier.

sethah (Contributor Author) commented May 25, 2017

cc @srowen also

SparkQA commented May 25, 2017

Test build #77360 has finished for PR 17094 at commit f8b84a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:
Looks OK to me. This cuts down duplication and adds tests, and as I understand it, paves the way for some further improvements.

bcFeaturesMean: Broadcast[Array[Double]])(bcCoefficients: Broadcast[Vector])
extends DifferentiableLossAggregator[Instance, LeastSquaresAggregator] {
require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard" +
s"deviation to be positive.")
srowen (Member) commented:
Add a space before 'deviation' or at the end of the previous line
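Without the fix, the interpolated message reads "...requires the label standarddeviation to be positive."; the corrected concatenation would be:

require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard " +
  s"deviation to be positive.")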

Datum,
Agg <: DifferentiableLossAggregator[Datum, Agg]] extends Serializable {

self: Agg =>
srowen (Member) commented:
You could add a brief comment explaining what this self type does
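For reference, the self type `self: Agg =>` requires that any concrete subclass bind Agg to itself, which is what lets add and merge return the concrete aggregator type. A tiny standalone illustration (names hypothetical):

trait Combinable[A <: Combinable[A]] { self: A =>
  // Inside the trait, `this` is statically known to be an A, so we can return it as A.
  def merge(other: A): A = this
}

class CountAgg extends Combinable[CountAgg]   // compiles: CountAgg binds A to itself
// class Wrong extends Combinable[CountAgg]   // rejected: Wrong does not conform to the self type CountAgg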

0.0
}
case None =>
sum += coefficients(j) * coefficients(j)
srowen (Member) commented:
Not sure if this is performance-critical, but in a few blocks like this an array element is dereferenced several times; it could be saved off in a local, if it mattered, to optimize a bit.
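Concretely, the suggestion is to hoist the repeated lookup into a local, e.g. (the neighboring gradient update line is assumed here for illustration):

val coef = coefficients(j)      // dereference once per loop iteration
sum += coef * coef
gradient(j) = regParam * coef   // assumed neighboring update reusing the same value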

SparkQA commented May 25, 2017

Test build #77376 has finished for PR 17094 at commit 29052d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:
I reviewed this with Seth too and it looks pretty good to me. CCing @MLnick @jkbradley @yanboliang @hhbyyh for final comments

srowen (Member) commented May 30, 2017

Merging tomorrow if there are no objections.

MLnick (Contributor) commented May 30, 2017

Overall looks good to me. I think it's a good step to clean up the codebase and reduce the duplicated code.

I think the impl is pretty well thought through. A few comments (which should probably be handled as follow-ups):

  1. I'd like to think about ways the regularizer(s) could be made more like "function composition" since the loss and reg are both just DiffFunctions
  2. I think there should be scope for factoring out the loss functions into some sort of traits to make things cleaner

The one thing that doesn't feel quite right is the fact that the std scaling finds its way into L2Regularization - it sort of feels like the abstraction is leaking there. Not quite sure how to address it (perhaps we could look at something in line with the way OWLQN does its L1 reg?).

sethah (Contributor Author) commented May 30, 2017

@MLnick I completely agree about the leaky regularization abstraction. In fact, I think the function composition feature would make it easy to get rid of that problem. Consider:

In the standardized-features case we want to compute dL/dB_j_std where B_j_std = B_j / sigma_j. Define g(x) = sum_j(x_j^2) (L2 reg) and f(B)_j = B_j / sigma_j (standardization). Then we could do val h = g.compose(f) to get a loss function that provides the derivative of h as dg/df * df/dB. I hope that made some sense. Right now it seems like an over-engineered solution to me, since we only use this function in a few places. It's definitely more elegant and more general, but I'd prefer to do that as a follow-up, or once we decide to keep building new algorithms on these abstractions.
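In code, that composition could look something like the following REPL-style sketch (SimpleDiffFun and composeWithScaling are illustrative names, not classes in this PR; Breeze's DiffFunction would play the role of SimpleDiffFun):

// A differentiable function of the coefficients: returns value and gradient.
case class SimpleDiffFun(
    value: Array[Double] => Double,
    grad: Array[Double] => Array[Double])

// Plain L2 penalty: g(x) = sum_j x_j^2, dg/dx_j = 2 * x_j.
val l2 = SimpleDiffFun(
  value = x => x.map(v => v * v).sum,
  grad = x => x.map(v => 2.0 * v))

// Compose g with the standardization f(B)_j = B_j / sigma_j.
// Chain rule: dh/dB_j = (dg/dx_j at x = B / sigma) * (1 / sigma_j).
def composeWithScaling(g: SimpleDiffFun, sigma: Array[Double]): SimpleDiffFun = SimpleDiffFun(
  value = beta => g.value(beta.zip(sigma).map { case (b, s) => b / s }),
  grad = beta => {
    val scaled = beta.zip(sigma).map { case (b, s) => b / s }
    g.grad(scaled).zip(sigma).map { case (dg, s) => dg / s }
  })

val h = composeWithScaling(l2, sigma = Array(2.0, 0.5))  // h(B) = sum_j (B_j / sigma_j)^2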

Can you expand on your second point? Thanks!

MLnick (Contributor) commented May 31, 2017

Sure, makes sense. We can always consider it later. Or even an alternate version of it, with L2 and a subclass StandardizedL2 or whatever (that's more relevant if we were to start thinking about exposing the building blocks to external algorithm developers).

For point (2), it's just that each loss function ("squared loss", "logistic", etc.) could implement a Loss trait, similar to the old org.apache.spark.mllib.optimization.Gradient approach. The Loss would then be an arg of the Aggregator, I suppose, and the add method could be further consolidated. Not sure if it adds that much value here because of the funky standardization stuff we do in LiR and LoR...
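A rough illustration of that idea (the names Loss, SquaredLoss, and LogisticLoss below are hypothetical, loosely echoing the old mllib Gradient classes): each per-example loss exposes its value and its derivative with respect to the margin, and the aggregator's add method would delegate to it.

trait Loss extends Serializable {
  /** Loss value and derivative w.r.t. the margin for one (label, margin) pair. */
  def lossAndDerivative(label: Double, margin: Double): (Double, Double)
}

object SquaredLoss extends Loss {
  override def lossAndDerivative(label: Double, margin: Double): (Double, Double) = {
    val error = margin - label
    (0.5 * error * error, error)
  }
}

object LogisticLoss extends Loss {
  // label is assumed to be in {0, 1}; margin is the linear predictor.
  override def lossAndDerivative(label: Double, margin: Double): (Double, Double) = {
    val prob = 1.0 / (1.0 + math.exp(-margin))
    val loss = -label * math.log(prob) - (1.0 - label) * math.log(1.0 - prob)
    (loss, prob - label)
  }
}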

MLnick (Contributor) commented May 31, 2017

But even the standard scaling seems like it could be expressed generically too, with respect to scaling the coefficients and gradient during the computation. Again, something perhaps for later.

sethah (Contributor Author) commented May 31, 2017

OK, yes, all good points. Since these are all private APIs, we have room for future changes. For now, I think we can get rid of a lot of code duplication and fill in some testing gaps with this change. I will be happy to drive the roadmap on this front as we move forward.

srowen (Member) commented Jun 3, 2017

@sethah @MLnick am I reading right that this can be merged as a step forward?

sethah (Contributor Author) commented Jun 3, 2017

@srowen Speaking for myself, I think the other concerns can be addressed as follow-ups, yes.

srowen (Member) commented Jun 5, 2017

Merged to master

MLnick (Contributor) commented Jun 5, 2017 via email
