[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code #17094
Conversation
Instance(1.5, 0.2, Vectors.dense(3.0, 0.2))
)

def assertEqual[T, Agg <: DifferentiableLossAggregator[T, Agg]](
make private
extends DiffFunction[BDV[Double]] {

  override def calculate(coefficients: BDV[Double]): (Double, BDV[Double]) = {
    val bcCoefficients = instances.context.broadcast(Vectors.dense(coefficients.data))
use fromBreeze
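That is, something like the following sketch (this assumes the private[spark] helper Vectors.fromBreeze in org.apache.spark.ml.linalg is visible from this package):

```scala
// Convert the Breeze vector directly instead of copying its backing array;
// fromBreeze also handles a non-trivial offset/stride in the underlying
// array, which reading .data alone would not.
val bcCoefficients = instances.context.broadcast(Vectors.fromBreeze(coefficients))
```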
/**
 * Dummy aggregator that represents least squares cost with no intercept.
 */
class TestAggregator(numFeatures: Int)(coefficients: Vector)
move it into a companion object
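A hypothetical, minimal illustration of that pattern (names invented here; the real suite would nest TestAggregator the same way):

```scala
// Nesting test-only helpers in the suite's companion object keeps them out
// of the enclosing package namespace.
class ExampleSuite {
  import ExampleSuite._
  val helper = new TestHelper(numFeatures = 2)
}

object ExampleSuite {
  class TestHelper(numFeatures: Int) {
    val dim: Int = numFeatures
  }
}
```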
ping @MLnick @jkbradley
Jenkins test this please.
Test build #73555 has finished for PR 17094 at commit
Test build #73557 has finished for PR 17094 at commit
also cc @hhbyyh
/** Merge two aggregators. The `this` object will be modified in place and returned. */
def merge(other: Agg): Agg = {
  require(dim == other.dim, s"Dimensions mismatch when merging with another " +
    s"LeastSquaresAggregator. Expecting $dim but got ${other.dim}.")
change this to use getClass.getName
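That is, roughly:

```scala
// Derive the class name at runtime instead of hard-coding one subclass's
// name in the shared trait.
require(dim == other.dim, s"Dimensions mismatch when merging with another " +
  s"${getClass.getName}. Expecting $dim but got ${other.dim}.")
```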
Test build #73819 has started for PR 17094 at commit
Test build #73820 has started for PR 17094 at commit
Force-pushed from 76eda69 to d7dceeb
Test build #73821 has started for PR 17094 at commit
Removed WIP, think it's ready now :)
Test build #73823 has started for PR 17094 at commit
 *
 * @tparam T The type of the coefficients being regularized.
 */
trait DifferentiableRegularization[T] extends DiffFunction[T] {
make these private
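For context, a minimal sketch of what one implementation of this trait could look like; the class name, the Array[Double] type parameter, and the regParam handling here are illustrative assumptions, not the PR's actual code:

```scala
import breeze.optimize.DiffFunction

trait DifferentiableRegularization[T] extends DiffFunction[T]

// Hypothetical L2 penalty: loss = 0.5 * regParam * sum_j(beta_j^2),
// gradient_j = regParam * beta_j.
class L2Regularization(regParam: Double)
    extends DifferentiableRegularization[Array[Double]] {

  override def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    var sum = 0.0
    val gradient = new Array[Double](coefficients.length)
    var j = 0
    while (j < coefficients.length) {
      val coef = coefficients(j)
      sum += coef * coef
      gradient(j) = regParam * coef
      j += 1
    }
    (0.5 * regParam * sum, gradient)
  }
}
```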
Jenkins test this please.
Test build #73914 has finished for PR 17094 at commit
ping! @MLnick @jkbradley @yanboliang @hhbyyh Is there any interest in this? I actually think this cleanup will be a precursor to several different improvements (adding more optimized aggregators, adding an optimization library) and that it will be very useful. IMO it's an important change. Otherwise, we keep slapping layers on the current implementation and the code length and complexity keep growing. I'm happy to take suggestions, make changes, or discuss it further. Thoughts?
Test build #77033 has finished for PR 17094 at commit
Test build #77034 has finished for PR 17094 at commit
In terms of the high-level intention of this, agree we definitely need it and it should clean things up substantially. I will start taking a look through ASAP. Thanks!
Thanks @MLnick! I am happy to discuss splitting this into smaller bits as well, if it can make things easier.
cc @srowen also
Test build #77360 has finished for PR 17094 at commit
Looks OK to me. This cuts down duplication and adds tests, and as I understand it, paves the way for some further improvements.
    bcFeaturesMean: Broadcast[Array[Double]])(bcCoefficients: Broadcast[Vector])
  extends DifferentiableLossAggregator[Instance, LeastSquaresAggregator] {

  require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard" +
    s"deviation to be positive.")
Add a space before 'deviation' or at the end of the previous line
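i.e., the fixed message would be:

```scala
require(labelStd > 0.0, s"${this.getClass.getName} requires the label standard " +
  s"deviation to be positive.")
```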
    Datum,
    Agg <: DifferentiableLossAggregator[Datum, Agg]] extends Serializable {

  self: Agg =>
You could add a brief comment explaining what this self type does
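For example, a comment along these lines (my wording, not the PR's) would do it:

```scala
// `self: Agg =>` is a self-type annotation: any concrete subclass must mix
// this trait in as itself, e.g.
//   class LeastSquaresAggregator
//     extends DifferentiableLossAggregator[Instance, LeastSquaresAggregator]
// which lets methods like `merge(other: Agg): Agg` accept and return the
// concrete subclass type rather than the base trait.
self: Agg =>
```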
          0.0
        }
      case None =>
        sum += coefficients(j) * coefficients(j)
Not sure if this is performance-critical, but in a few blocks like this an array index is dereferenced many times; it could be saved off to a local variable to optimize a bit, if it mattered.
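A standalone illustration of that micro-optimization (not the PR's code):

```scala
val coefficients = Array(1.0, 2.0, 3.0)
var sum = 0.0
var j = 0
while (j < coefficients.length) {
  val coef = coefficients(j)  // dereference the array once per iteration
  sum += coef * coef
  j += 1
}
```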
Test build #77376 has finished for PR 17094 at commit
I reviewed this with Seth too and it looks pretty good to me. CCing @MLnick @jkbradley @yanboliang @hhbyyh for final comments
Merging tomorrow if there are no objections.
Overall looks good to me. I think it's a good step to clean up the codebase and reduce the duplicated code. I think the impl is pretty well thought through. A few comments (that probably should be part of a follow-up):
The one thing that doesn't feel quite right is the fact that the std scaling finds its way into …
@MLnick I completely agree about the leaky regularization abstraction. In fact, I think the function composition feature would make it easy to get rid of that problem. Consider: in the standardized-features case we want to compute dL/dBj_std where Bj_std = Bj / sigma_j. Define g(x) = sum_j(x_j^2) (L2 reg) and f(y)_j = y_j / sigma_j (standardization). Then we could compose them, g(f(B)), and the regularization never has to know about the scaling. Can you expand on your second point? Thanks!
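A rough Breeze sketch of that composition idea (illustrative only, not code from this PR; the method name and parameters are invented):

```scala
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.DiffFunction

// L2 regularization applied in the standardized space: g(f(beta)) where
// f(beta)_j = beta_j / sigma_j and g(x) = 0.5 * regParam * sum_j(x_j^2).
// Chain rule: d/dbeta_j = regParam * (beta_j / sigma_j) / sigma_j.
def l2InStandardizedSpace(
    regParam: Double,
    sigma: Array[Double]): DiffFunction[BDV[Double]] = new DiffFunction[BDV[Double]] {
  override def calculate(beta: BDV[Double]): (Double, BDV[Double]) = {
    var loss = 0.0
    val grad = BDV.zeros[Double](beta.length)
    var j = 0
    while (j < beta.length) {
      val scaled = beta(j) / sigma(j)           // f: standardize
      loss += 0.5 * regParam * scaled * scaled  // g: L2 penalty
      grad(j) = regParam * scaled / sigma(j)    // g' composed with f'
      j += 1
    }
    (loss, grad)
  }
}
```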
Sure, makes sense. We can always consider it later. Or even an alternate version of it to have …. For point (2), it's just that each loss function ("squared loss", "logistic", etc.) can implement a …
But even for the standard-scaling, it seems that could be expressed generically too, with respect to scaling the coeff and gradient during the computation. Again, something perhaps for later.
Ok, yes, all good points. I think since these are all private APIs it gives us room for future changes. For now, I think we can get rid of a lot of code duplication and fill in some testing gaps with this change. I will be happy to drive the roadmap on this front as we move forward.
@srowen Speaking for myself, I think the other concerns can be issued as follow-ups, yes.
Merged to master
Sorry for not replying; yeah, agree, let's create a set of follow-ups to this for potential improvements, and of course the migration of other models to the framework.
Thanks @sethah!
What changes were proposed in this pull request?
JIRA: SPARK-19762
The larger changes in this patch are:

- Adds a DifferentiableLossAggregator trait which is intended to be used as a common parent trait to all Spark ML aggregator classes. It factors out the common methods merge, gradient, loss, and weight from the aggregator subclasses.
- Adds an RDDLossFunction, which is intended to be the only implementation of Breeze's DiffFunction necessary in Spark ML and can be used by all other algorithms. It takes the aggregator type as a type parameter, and maps the aggregator over an RDD. It additionally takes in an optional regularization loss function for applying the differentiable part of regularization.
- Converts LinearRegression to use this new hierarchy as a proof of concept.
- Adds the new packages o.a.s.ml.optim.loss and o.a.s.ml.optim.aggregator.

Also note that none of these are public-facing changes. All of these classes are internal to Spark ML and remain that way.
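In outline, the two new abstractions have roughly these shapes (simplified and partly assumed here; the merged code differs in detail):

```scala
import breeze.linalg.{DenseVector => BDV}
import breeze.optimize.DiffFunction
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD

// Common parent for ML aggregators; factors out merge/gradient/loss/weight.
trait DifferentiableLossAggregator[
    Datum,
    Agg <: DifferentiableLossAggregator[Datum, Agg]] extends Serializable {
  self: Agg =>
  def add(instance: Datum): Agg
  def merge(other: Agg): Agg
  def loss: Double
  def gradient: Vector
  def weight: Double
}

// Adapts an aggregator to the (loss, gradient) pair Breeze's optimizers
// expect. Left abstract here as a sketch; the real class implements
// calculate by aggregating over the RDD and adding the optional
// regularization term.
abstract class RDDLossFunction[
    Datum,
    Agg <: DifferentiableLossAggregator[Datum, Agg]](
    instances: RDD[Datum],
    getAggregator: Broadcast[Vector] => Agg,
    regularization: Option[DiffFunction[Array[Double]]])
  extends DiffFunction[BDV[Double]]
```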
NOTE: The large majority of the "lines added" and "lines deleted" are simply code moving around or unit tests.
BTW, I also converted LinearSVC to this framework as a way to prove that this new hierarchy is flexible enough for the other algorithms, but I backed those changes out because the PR is large enough as is.
How was this patch tested?
Test suites are added for the new components, and some test suites are also added to provide coverage where there wasn't any before.
Below are some performance testing numbers, run on a 6-node virtual cluster with 44 cores and ~110G RAM; the dataset size is about 37G. These are not "large-scale" tests, but we really want to just make sure the iteration times don't increase with this patch. Notably, we are doing the regularization a bit differently than before, but that should cost very little. I think there's very little risk otherwise, and these numbers don't show a difference. Of course I'm happy to add more tests as we think it's necessary, but I think the patch is ready for review now.
Note: timings are best of 3 runs.
Follow-ups
If this design is accepted, we will convert the other ML algorithms that use this aggregator pattern to this new hierarchy in follow-up PRs.