
[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code #17094

Closed
wants to merge 18 commits into from

Conversation

sethah (Contributor) commented Feb 28, 2017

What changes were proposed in this pull request?

JIRA: SPARK-19762

The larger changes in this patch are:

  • Adds a DifferentiableLossAggregator trait which is intended to be a common parent trait for all Spark ML aggregator classes. It factors out the common members merge, gradient, loss, and weight from the aggregator subclasses. (A condensed sketch of these pieces follows this list.)
  • Adds an RDDLossFunction which is intended to be the only implementation of Breeze's DiffFunction needed in Spark ML, usable by all other algorithms. It takes the aggregator type as a type parameter and maps the aggregator over an RDD. It also takes an optional regularization loss function for applying the differentiable part of the regularization.
  • Factors the regularization out of the data part of the cost function and treats regularization as a separate, independent cost function which can be evaluated and added to the data cost function.
  • Changes LinearRegression to use this new hierarchy as a proof of concept.
  • Adds the new namespaces o.a.s.ml.optim.loss and o.a.s.ml.optim.aggregator.
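To make the shape of the hierarchy concrete, here is a condensed, dependency-free sketch of the two pieces described above. It is illustrative only: the real classes use Spark's Vector, Broadcast, RDD.treeAggregate, and Breeze's DiffFunction, and their exact signatures may differ. The plain Scala types, the LocalLossFunction name, and the newAggregator/regularization parameters below are assumptions made for the example.

// Condensed sketch (plain Scala, no Spark/Breeze imports) of the proposed hierarchy.
// Types and names here are illustrative assumptions, not the PR's exact API.
trait DifferentiableLossAggregator[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]]
  extends Serializable {
  self: Agg =>  // a concrete subclass must bind Agg to itself

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected val dim: Int
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  /** Add a single data point to this aggregator (algorithm-specific). */
  def add(instance: Datum): Agg

  /** Merge another aggregator; `this` is modified in place and returned. */
  def merge(other: Agg): Agg = {
    require(dim == other.dim,
      s"Dimensions mismatch when merging with another ${getClass.getSimpleName}. " +
        s"Expecting $dim but got ${other.dim}.")
    weightSum += other.weightSum
    lossSum += other.lossSum
    var j = 0
    while (j < dim) {
      gradientSumArray(j) += other.gradientSumArray(j)
      j += 1
    }
    this
  }

  /** Weighted count of instances seen so far. */
  def weight: Double = weightSum

  /** Weighted average loss over the instances seen so far. */
  def loss: Double = lossSum / weightSum

  /** Weighted average gradient over the instances seen so far. */
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}

// The loss function folds an aggregator over the data, then adds the optional
// differentiable regularization term to both the loss and the gradient. The real
// RDDLossFunction does the same thing with RDD.treeAggregate and broadcast coefficients.
class LocalLossFunction[Datum, Agg <: DifferentiableLossAggregator[Datum, Agg]](
    data: Seq[Datum],
    newAggregator: Array[Double] => Agg,
    regularization: Option[Array[Double] => (Double, Array[Double])]) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val agg = data.foldLeft(newAggregator(coefficients)) { (a, x) => a.add(x) }
    val (regLoss, regGrad) = regularization.map(_(coefficients))
      .getOrElse((0.0, Array.fill(coefficients.length)(0.0)))
    val totalGrad = agg.gradient.zip(regGrad).map { case (g, r) => g + r }
    (agg.loss + regLoss, totalGrad)
  }
}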

Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.

NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.

BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.

How was this patch tested?

Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.

  • DifferentiableLossAggregatorSuite
  • LeastSquaresAggregatorSuite
  • RDDLossFunctionSuite
  • DifferentiableRegularizationSuite

Below are some performance testing numbers. Tests were run on a 6-node virtual cluster with 44 cores and ~110 GB of RAM; the dataset size is about 37 GB. These are not "large-scale" tests, but the goal is simply to make sure the iteration times don't increase with this patch. Notably, we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think necessary, but I think the patch is ready for review now.

Note: timings are best of 3 runs.

numFeatures  numPoints  maxIter  regParam  elasticNetParam  SPARK-19762 (sec)  master (sec)
5000         1e+06      30       0         0                129.594            131.153
5000         1e+06      30       0.1       0                135.54             136.327
5000         1e+06      30       0.01      0.5              135.148            129.771
50000        100000     30       0         0                145.764            144.096

Follow ups

If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow up PRs.

@sethah sethah changed the title [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code [SPARK-19762][ML][WIP] Hierarchy for consolidating ML aggregator/loss code Feb 28, 2017
Instance(1.5, 0.2, Vectors.dense(3.0, 0.2))
)

def assertEqual[T, Agg <: DifferentiableLossAggregator[T, Agg]](
sethah (Contributor Author) commented:
make private

extends DiffFunction[BDV[Double]] {

override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
val bcCoefficients = instances.context.broadcast(Vectors.dense(coefficients.data))
sethah (Contributor Author) commented on Feb 28, 2017:
use fromBreeze
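That is, presumably something like the following (assuming the private[spark] helper Vectors.fromBreeze is accessible at this location):

val bcCoefficients = instances.context.broadcast(Vectors.fromBreeze(coefficients))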

/**
* Dummy aggregator that represents least squares cost with no intercept.
*/
class TestAggregator(numFeatures: Int)(coefficients: Vector)
sethah (Contributor Author) commented:
move it into a companion object

sethah (Contributor Author) commented Feb 28, 2017

ping @MLnick @jkbradley

sethah (Contributor Author) commented Feb 28, 2017

Jenkins test this please.

SparkQA commented Feb 28, 2017

Test build #73555 has finished for PR 17094 at commit 9a04d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Feb 28, 2017

Test build #73557 has finished for PR 17094 at commit 9a04d0b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah (Contributor Author) commented Feb 28, 2017

also cc @hhbyyh

/** Merge two aggregators. The `this` object will be modified in place and returned. */
def merge(other: Agg): Agg = {
require(dim == other.dim, s"Dimensions mismatch when merging with another " +
s"LeastSquaresAggregator. Expecting $dim but got ${other.dim}.")
sethah (Contributor Author) commented:
change this to use getClass.getName
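Presumably the intent is for the message to name the concrete aggregator class rather than hard-coding LeastSquaresAggregator, e.g.:

require(dim == other.dim, s"Dimensions mismatch when merging with another " +
  s"${getClass.getName}. Expecting $dim but got ${other.dim}.")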

@sethah sethah changed the title [SPARK-19762][ML][WIP] Hierarchy for consolidating ML aggregator/loss code [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code Mar 3, 2017
SparkQA commented Mar 3, 2017

Test build #73819 has started for PR 17094 at commit f7e9169.

SparkQA commented Mar 3, 2017

Test build #73820 has started for PR 17094 at commit 46630d1.

sethah force-pushed the ml_aggregators branch 2 times, most recently from 76eda69 to d7dceeb, on March 3, 2017 at 07:22
SparkQA commented Mar 3, 2017

Test build #73821 has started for PR 17094 at commit 76eda69.

sethah (Contributor Author) commented Mar 3, 2017

Removed WIP, think it's ready now :)

SparkQA commented Mar 3, 2017

Test build #73823 has started for PR 17094 at commit d7dceeb.

*
* @tparam T The type of the coefficients being regularized.
*/
trait DifferentiableRegularization[T] extends DiffFunction[T] {
sethah (Contributor Author) commented:
make these private
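For context, a minimal concrete implementation of the trait might look like the following (illustrative sketch; the class name L2Regularization, the regParam parameter, and the Array[Double] coefficient type are assumptions, and the PR's actual version may differ):

class L2Regularization(val regParam: Double)
  extends DifferentiableRegularization[Array[Double]] {

  // Returns (0.5 * regParam * ||coefficients||^2, regParam * coefficients).
  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    var sum = 0.0
    val gradient = new Array[Double](coefficients.length)
    var j = 0
    while (j < coefficients.length) {
      val coef = coefficients(j)
      sum += coef * coef
      gradient(j) = regParam * coef
      j += 1
    }
    (0.5 * regParam * sum, gradient)
  }
}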

sethah (Contributor Author) commented Mar 5, 2017

Jenkins test this please.

SparkQA commented Mar 5, 2017

Test build #73914 has finished for PR 17094 at commit d7dceeb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

sethah (Contributor Author) commented May 17, 2017

ping! @MLnick @jkbradley @yanboliang @hhbyyh

Is there any interest in this? I actually think this cleanup will be a precursor to several different improvements (adding more optimized aggregators, adding an optimization library) and that it will be very useful. IMO it's an important change. Otherwise, we keep slapping layers on the current implementation and the code length and complexity keep growing.

I'm happy to take suggestions, make changes, or discuss it further. Thoughts?

SparkQA commented May 17, 2017

Test build #77033 has finished for PR 17094 at commit b55b7fe.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 18, 2017

Test build #77034 has finished for PR 17094 at commit 9461c45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

MLnick (Contributor) commented May 18, 2017

In terms of the high-level intention of this, I agree we definitely need it and it should clean things up substantially. I will start taking a look through ASAP. Thanks!

sethah (Contributor Author) commented May 18, 2017

Thanks @MLnick! I am happy to discuss splitting this into smaller bits as well, if it can make things easier.

sethah (Contributor Author) commented May 25, 2017

cc @srowen also

SparkQA commented May 25, 2017

Test build #77360 has finished for PR 17094 at commit f8b84a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:
Looks OK to me. This cuts down duplication and adds tests, and as I understand it, paves the way for some further improvements.

bcFeaturesMean: Broadcast[Array[Double]])(bcCoefficients: Broadcast[Vector])
extends DifferentiableLossAggregator[Instance, LeastSquaresAggregator] {
require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard" +
s"deviation to be positive.")
srowen (Member) commented:
Add a space before 'deviation' or at the end of the previous line
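Without the fix, the interpolated message reads "...requires the label standarddeviation to be positive."; the corrected concatenation would be:

require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard " +
  s"deviation to be positive.")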

Datum,
Agg <: DifferentiableLossAggregator[Datum, Agg]] extends Serializable {

self: Agg =>
srowen (Member) commented:
You could add a brief comment explaining what this self type does
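For reference, the self type `self: Agg =>` requires that any concrete subclass bind Agg to itself, which is what lets add and merge return the concrete aggregator type. A tiny standalone illustration (names hypothetical):

trait Combinable[A <: Combinable[A]] { self: A =>
  // Inside the trait, `this` is statically known to be an A, so we can return it as A.
  def merge(other: A): A = this
}

class CountAgg extends Combinable[CountAgg]   // compiles: CountAgg binds A to itself
// class Wrong extends Combinable[CountAgg]   // rejected: Wrong does not conform to the self type CountAgg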

0.0
}
case None =>
sum += coefficients(j) * coefficients(j)
srowen (Member) commented:
Not sure if this is performance-critical, but in a few blocks like this an array element is dereferenced several times; it could be saved off in a local, if it mattered, to optimize a bit.
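Concretely, the suggestion is to hoist the repeated lookup into a local, e.g. (the neighboring gradient update line is assumed here for illustration):

val coef = coefficients(j)      // dereference once per loop iteration
sum += coef * coef
gradient(j) = regParam * coef   // assumed neighboring update reusing the same value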

SparkQA commented May 25, 2017

Test build #77376 has finished for PR 17094 at commit 29052d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

srowen (Member) left a comment:
I reviewed this with Seth too and it looks pretty good to me. CCing @MLnick @jkbradley @yanboliang @hhbyyh for final comments

srowen (Member) commented May 30, 2017

Merging tomorrow if there are no objections.

MLnick (Contributor) commented May 30, 2017

Overall looks good to me. I think it's a good step to clean up the codebase and reduce the duplicated code.

I think the impl is pretty well thought through. A few comments (which should probably be handled as follow-ups):

  1. I'd like to think about ways the regularizer(s) could be made more like "function composition" since the loss and reg are both just DiffFunctions
  2. I think there should be scope for factoring out the loss functions into some sort of traits to make things cleaner

The one thing that doesn't feel quite right is the fact that the std scaling finds its way into L2Regularization - it sort of feels like the abstraction is leaking there. Not quite sure how to address it (perhaps we could look at something in line with the way OWLQN does its L1 reg?).

sethah (Contributor Author) commented May 30, 2017

@MLnick I completely agree about the leaky regularization abstraction. In fact, I think the function composition feature would make it easy to get rid of that problem. Consider:

In the standardized-features case we want to compute dL/dB_j_std where B_j_std = B_j / sigma_j. Define g(x) = sum_j(x_j^2) (L2 reg) and f(B)_j = B_j / sigma_j (standardization). Then we could do val h = g.compose(f) to get a loss function that provides the derivative of h as dg/df * df/dB. I hope that made some sense. Right now it seems like an over-engineered solution to me, since we only use this function in a few places. It's definitely more elegant and more general, but I'd prefer to do that as a follow-up, or once we decide to keep building new algorithms on these abstractions.
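In code, that composition could look something like the following REPL-style sketch (SimpleDiffFun and composeWithScaling are illustrative names, not classes in this PR; Breeze's DiffFunction would play the role of SimpleDiffFun):

// A differentiable function of the coefficients: returns value and gradient.
case class SimpleDiffFun(
    value: Array[Double] => Double,
    grad: Array[Double] => Array[Double])

// Plain L2 penalty: g(x) = sum_j x_j^2, dg/dx_j = 2 * x_j.
val l2 = SimpleDiffFun(
  value = x => x.map(v => v * v).sum,
  grad = x => x.map(v => 2.0 * v))

// Compose g with the standardization f(B)_j = B_j / sigma_j.
// Chain rule: dh/dB_j = (dg/dx_j at x = B / sigma) * (1 / sigma_j).
def composeWithScaling(g: SimpleDiffFun, sigma: Array[Double]): SimpleDiffFun = SimpleDiffFun(
  value = beta => g.value(beta.zip(sigma).map { case (b, s) => b / s }),
  grad = beta => {
    val scaled = beta.zip(sigma).map { case (b, s) => b / s }
    g.grad(scaled).zip(sigma).map { case (dg, s) => dg / s }
  })

val h = composeWithScaling(l2, sigma = Array(2.0, 0.5))  // h(B) = sum_j (B_j / sigma_j)^2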

Can you expand on your second point? Thanks!

MLnick (Contributor) commented May 31, 2017

Sure, makes sense. We can always consider it later. Or even an alternate version of it, with L2 and a subclass StandardizedL2 or whatever (that's more relevant if we were to start thinking about exposing the building blocks to external algorithm developers).

For point (2), it's just that each loss function ("squared loss", "logistic", etc.) could implement a Loss trait, similar to the old org.apache.spark.mllib.optimization.Gradient approach. The Loss would then be an arg of the Aggregator, I suppose, and the add method could be further consolidated. Not sure if it adds that much value here because of the funky standardization stuff we do in LiR and LoR...
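A rough illustration of that idea (the names Loss, SquaredLoss, and LogisticLoss below are hypothetical, loosely echoing the old mllib Gradient classes): each per-example loss exposes its value and its derivative with respect to the margin, and the aggregator's add method would delegate to it.

trait Loss extends Serializable {
  /** Loss value and derivative w.r.t. the margin for one (label, margin) pair. */
  def lossAndDerivative(label: Double, margin: Double): (Double, Double)
}

object SquaredLoss extends Loss {
  override def lossAndDerivative(label: Double, margin: Double): (Double, Double) = {
    val error = margin - label
    (0.5 * error * error, error)
  }
}

object LogisticLoss extends Loss {
  // label is assumed to be in {0, 1}; margin is the linear predictor.
  override def lossAndDerivative(label: Double, margin: Double): (Double, Double) = {
    val prob = 1.0 / (1.0 + math.exp(-margin))
    val loss = -label * math.log(prob) - (1.0 - label) * math.log(1.0 - prob)
    (loss, prob - label)
  }
}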

MLnick (Contributor) commented May 31, 2017

But even the standard scaling seems like it could be expressed generically too, with respect to scaling the coefficients and gradient during the computation. Again, something perhaps for later.

sethah (Contributor Author) commented May 31, 2017

OK, yes, all good points. Since these are all private APIs, we have room for future changes. For now, I think we can get rid of a lot of code duplication and fill in some testing gaps with this change. I will be happy to drive the roadmap on this front as we move forward.

srowen (Member) commented Jun 3, 2017

@sethah @MLnick am I reading right that this can be merged as a step forward?

sethah (Contributor Author) commented Jun 3, 2017

@srowen Speaking for myself, I think the other concerns can be addressed as follow-ups, yes.

srowen (Member) commented Jun 5, 2017

Merged to master

MLnick (Contributor) commented Jun 5, 2017 via email
