
[SPARK-9612][ML] Add instance weight support for GBTs #25926

Closed
wants to merge 9 commits

Conversation

@zhengruifeng (Contributor) commented Sep 25, 2019

What changes were proposed in this pull request?

Add instance weight support to GBTs by sampling the data before passing it to the trees, and then passing the instance weights through to the trees.

In summary:
1. Add setters for minWeightFractionPerNode and weightCol.
2. Update the input type of several private methods from RDD[LabeledPoint] to RDD[Instance]:
DecisionTreeRegressor.train, GradientBoostedTrees.run, GradientBoostedTrees.runWithValidation, GradientBoostedTrees.computeInitialPredictionAndError, GradientBoostedTrees.computeError,
GradientBoostedTrees.evaluateEachIteration, GradientBoostedTrees.boost, GradientBoostedTrees.updatePredictionError.
3. Add a new private method GradientBoostedTrees.computeError(data, predError) to compute the weighted average error, since the original predError.values.mean() does not take weights into account (see the sketch below).
4. Add new tests.
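
A minimal sketch of the weighted-mean error from point 3, assuming the Instance(label, weight, features) case class from org.apache.spark.ml.feature; names and signatures are illustrative, not the exact merged code:

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.rdd.RDD

// predError pairs up with data element-wise as (prediction, per-instance error).
def computeWeightedMeanError(
    data: RDD[Instance],
    predError: RDD[(Double, Double)]): Double = {
  val (errSum, weightSum) = data.zip(predError).map {
    case (Instance(_, weight, _), (_, error)) =>
      (error * weight, weight) // weight each instance's error
  }.treeReduce { case ((err1, w1), (err2, w2)) =>
    (err1 + err2, w1 + w2)
  }
  errSum / weightSum // weighted average, unlike predError.values.mean()
}
```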

Why are the changes needed?

GBTs should support sample weights like other algorithms.

Does this PR introduce any user-facing change?

Yes, new setters are added; see the usage sketch below.
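
For illustration, a minimal usage sketch of the new setters (the DataFrame and column names are hypothetical):

```scala
import org.apache.spark.ml.regression.GBTRegressor

// trainDF is assumed to have "label", "features", and a "weight" column.
val gbt = new GBTRegressor()
  .setWeightCol("weight")            // new in this PR
  .setMinWeightFractionPerNode(0.05) // new in this PR
val model = gbt.fit(trainDF)
```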

How was this patch tested?

Existing and newly added test suites.

@dongjoon-hyun (Member)

Wow, @zhengruifeng. This is a really long-standing JIRA. 👍

@SparkQA commented Sep 25, 2019

Test build #111345 has finished for PR 25926 at commit e1b3aa2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 26, 2019

Test build #111380 has finished for PR 25926 at commit 9ea6e00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


MLTestingUtils.testArbitrarilyScaledWeights[GBTRegressionModel,
GBTRegressor](df.as[LabeledPoint], estimator,
MLTestingUtils.modelPredictionEquals(df, _ ~= _ relTol 0.1, 0.95))
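
For context, this helper asserts weight-scaling invariance: multiplying every instance weight by the same positive constant should leave the fitted model's predictions essentially unchanged. A rough sketch of the property being checked, with hypothetical names:

```scala
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.sql.functions.col

// Train with the original weights and with all weights scaled by a constant;
// modelPredictionEquals then requires 95% of rows to agree within relTol 0.1.
val m1 = new GBTRegressor().setWeightCol("weight").fit(df)
val m2 = new GBTRegressor().setWeightCol("weight")
  .fit(df.withColumn("weight", col("weight") * 1000.0))
```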
Contributor Author

Compared to DecisionTreeRegressorSuite, I had to limit the number of trees and loosen the tolerance eps (0.99 -> 0.95) to make the cases pass.
I wonder if this is due to errors accumulating across the trees.

Contributor

interesting, will need to take a closer look...

@zhengruifeng (Contributor Author)

cc @srowen @imatiach-msft

/**
* Sets the value of param [[weightCol]].
* If this is not set or empty, we treat all instance weights as 1.0.
* Default is not set, so all instances have weight one.
Contributor

nit: update comment to

By default the weightCol is not set, so all instances have weight 1.0.
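
In context, the setter with the suggested wording might look like this (a sketch following the standard Param setter pattern, not necessarily the exact diff):

```scala
/**
 * Sets the value of param [[weightCol]].
 * If this is not set or empty, we treat all instance weights as 1.0.
 * By default the weightCol is not set, so all instances have weight 1.0.
 */
def setWeightCol(value: String): this.type = set(weightCol, value)
```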


val withValidation = isDefined(validationIndicatorCol) && $(validationIndicatorCol).nonEmpty

// We copy and modify this from Classifier.extractLabeledPoints since GBT only supports
// 2 classes now. This lets us provide a more precise error message.
val convert2LabeledPoint = (dataset: Dataset[_]) => {
Contributor

the error message here was much nicer:

GBTClassifier currently only supports binary classification.

than the new one in extractInstances. Perhaps it would be nicer to keep this custom error message, or pass some part of the message to the extractInstances method.
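
One way to keep the message, sketched with a hypothetical validateLabel parameter (extractInstances has no such parameter; this only illustrates the suggestion):

```scala
// Hypothetical: pass the precise message into the extraction step.
val instances = extractInstances(dataset, validateLabel = (label: Double) =>
  require(label == 0 || label == 1,
    s"GBTClassifier was given dataset with invalid label $label. " +
      "GBTClassifier currently only supports binary classification."))
```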

/**
* Sets the value of param [[weightCol]].
* If this is not set or empty, we treat all instance weights as 1.0.
* Default is not set, so all instances have weight one.
Contributor

nit: update comment, same as above

@@ -68,7 +68,7 @@ class GradientBoostedTrees private[spark] (
def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
val algo = boostingStrategy.treeStrategy.algo
val (trees, treeWeights) = NewGBT.run(input.map { point =>
NewLabeledPoint(point.label, point.features.asML)
NewLabeledPoint(point.label, point.features.asML).toInstance
Contributor

can we create an Instance directly from the label and features? It seems a bit wasteful to create a temporary LabeledPoint object that is immediately converted and discarded.

Contributor Author

Yes, it is better to create the Instance directly.
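
A sketch of the direct construction (Instance takes label, weight, features; the weight here defaults to 1.0):

```scala
import org.apache.spark.ml.feature.Instance

// No temporary LabeledPoint: build the Instance in one step.
val instances = input.map { point =>
  Instance(point.label, 1.0, point.features.asML)
}
```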

@imatiach-msft (Contributor) left a comment

added some initial comments; will need to take a closer look at how the weights are used in the boosting

override protected def train(
dataset: Dataset[_]): GBTClassificationModel = instrumented { instr =>
val categoricalFeatures: Map[Int, Int] =
MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
val categoricalFeatures = MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
Contributor

nit: can this line be moved above where it is used:

val boostingStrategy = super.getOldBoostingStrategy(categoricalFeatures, OldAlgo.Classification)

Contributor

probably good to move boostingStrategy below as well

val error = loss.computeError(pred, lp.label)
data.map { case Instance(label, _, features) =>
val pred = updatePrediction(features, 0.0, initTree, initTreeWeight)
val error = loss.computeError(pred, label)
Contributor

hmm shouldn't the loss be weighted by the weight column value here? seems a bit strange to ignore the weight column here

Contributor

oh, reading some of the other code this looks like unweighted error. That seems very confusing. I think we could improve this code structure a bit more.

Contributor

what would be the problem with this returning weighted error and getting rid of the computeError function?

val newPred = updatePrediction(lp.features, pred, tree, treeWeight)
val newError = loss.computeError(newPred, lp.label)
data.zip(predictionAndError).map {
case (Instance(label, _, features), (pred, _)) =>
Contributor

same thing here - it seems like we are ignoring the weight column but intuitively it seems like it should be included, could you explain the reasoning behind it being excluded here?

Contributor

oh, reading some of the other code this looks like unweighted error. That seems very confusing. I think we could improve this code structure a bit more.

loss.computeError(predicted, lp.label)
}.mean()
(loss.computeError(predicted, label) * weight, weight)
}.treeReduce{ case ((err1, weight1), (err2, weight2)) =>
Contributor

nit: spacing of { after treeReduce
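
The reformatted call, continuing the quoted snippet (the reduce body is a sketch of the obvious combine step):

```scala
}.treeReduce { case ((err1, weight1), (err2, weight2)) =>
  (err1 + err2, weight1 + weight2)
}
```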

prediction * localTreeWeights(idx)
val numTrees = trees.length

val (errSum, weightSum) = remappedData.mapPartitions { iter =>
Contributor

trying to understand this code - why are the trees broadcast here but the treeWeights are not?

Contributor

just to be clear the previous code is doing this as well, I just don't understand why the treeWeights aren't broadcast either

Contributor Author

Here I just followed the previous implementation. I am neutral on it.
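
For what it's worth, treeWeights is a small Array[Double], so capturing it in the task closure is cheap; the trees can be large, which is presumably why only they are broadcast. Broadcasting both would be a small change (sketch):

```scala
val bcTrees = sc.broadcast(trees)             // potentially large tree models
val bcTreeWeights = sc.broadcast(treeWeights) // small; closure capture is usually fine
```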

Contributor

sounds good

@@ -299,26 +317,25 @@ private[spark] object GradientBoostedTrees extends Logging {
baseLearners(0) = firstTreeModel
baseLearnerWeights(0) = firstTreeWeight

var predError: RDD[(Double, Double)] =
computeInitialPredictionAndError(input, firstTreeWeight, firstTreeModel, loss)
var predError = computeInitialPredictionAndError(input, firstTreeWeight, firstTreeModel, loss)
predErrorCheckpointer.update(predError)
Contributor

it would be nice if we could checkpoint the weighted instead of unweighted prediction error, which ties into the earlier comment on why methods like computeInitialPredictionAndError can't return the weighted prediction error

@zhengruifeng (Contributor Author)

@imatiach-msft Thanks for reviewing!
On the points about weighted prediction error:
Following the earlier discussions, we should sample the data without weights and pass the weights into the base model (the decision tree).
So the input passed to a decision tree should contain the label (unweighted prediction error) and the instance weights (which are also used in minWeightFractionPerNode). This way, I think we do not need to cache the weighted error.
Moreover, predError.values.mean() over a weighted predError is not equal to the average weighted error computed in this PR.

PS: If I recall correctly, XGBoost passes weighted gradients and hessians into the base learner. It uses a minimum hessian (min_child_weight) to limit tree growth, which is quite different from MLlib.
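
To make the point about mean() concrete, a toy example (values are hypothetical):

```scala
val pairs = Seq((2.0, 0.5), (1.0, 1.0)) // (weight, error)
val meanOfWeighted = pairs.map { case (w, e) => w * e }.sum / pairs.size
// (1.0 + 1.0) / 2 = 1.0
val weightedMean = pairs.map { case (w, e) => w * e }.sum / pairs.map(_._1).sum
// 2.0 / 3.0 ≈ 0.667; the two agree only when the weights average to 1.0
```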

@SparkQA commented Oct 8, 2019

Test build #111878 has finished for PR 25926 at commit f0d890a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 8, 2019

Test build #111887 has finished for PR 25926 at commit 3c07243.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @param predError Prediction and error.
* @return Measure of model error on data
*/
def computeError(
Contributor

maybe change the name to computeWeightedError to make that clear, since the above methods are also computing error but unweighted

@@ -299,26 +317,25 @@ private[spark] object GradientBoostedTrees extends Logging {
baseLearners(0) = firstTreeModel
baseLearnerWeights(0) = firstTreeWeight

var predError: RDD[(Double, Double)] =
computeInitialPredictionAndError(input, firstTreeWeight, firstTreeModel, loss)
var predError = computeInitialPredictionAndError(input, firstTreeWeight, firstTreeModel, loss)
predErrorCheckpointer.update(predError)
Contributor

if we are going to keep the checkpointing of unweighted error as opposed to weighted error, then it would be nice to specify that in the name of the checkpointer:
predUnweightedErrorCheckpointer
or alternatively add a comment to make that clear:

    // Note: this is checkpointing the unweighted error
    predErrorCheckpointer.update(predError) 

@imatiach-msft (Contributor)

@zhengruifeng I see. It seems a bit confusing to have many references to error, both weighted and unweighted, inside the same function. For example, I would prefer to checkpoint only the weighted error, and nothing in the name of the computeError function suggests that it specifically computes the weighted error. But as long as we have good documentation and variable names in the code to help distinguish which variable is which, I think it should be fine.

@imatiach-msft (Contributor) left a comment

great feature! LGTM!

@SparkQA commented Oct 16, 2019

Test build #112147 has finished for PR 25926 at commit 02457a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor Author)

@imatiach-msft Thanks for reviewing, and for your previous work on sample-weight support in decision trees!

@srowen (Member) commented Oct 16, 2019

@zhengruifeng @imatiach-msft did you have any other changes to make? There may still be some open comments; I'm not sure whether they were all addressed.

@zhengruifeng (Contributor Author)

@srowen I think the only place that may need more discussion is that I had to loosen the tolerance in the test suites (compared with the DecisionTree suites).

@imatiach-msft What do you think?

@zhengruifeng (Contributor Author)

I manually tested this PR in the REPL over the past few days, with some datasets from /data/mllib, setting the relevant params to normal ranges (for example, weights in [1.0, 10.0] rather than the extreme values (0.01, 1000) used in the test suites), and the results looked fine.

I think the error accumulates across the decision trees, which is why I had to loosen the tolerance in the test suites (compared with the DecisionTree suites).

I will merge this PR this week if there are no more comments.

@SparkQA commented Oct 23, 2019

Test build #112539 has finished for PR 25926 at commit 3000397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor Author)

Merged to master, thanks @imatiach-msft @srowen for reviewing!

@zhengruifeng deleted the gbt_add_weight branch October 25, 2019 05:51
srowen pushed a commit that referenced this pull request Dec 9, 2019
### What changes were proposed in this pull request?
add ```setWeightCol``` and ```setMinWeightFractionPerNode``` to the Python side of ```GBTClassifier``` and ```GBTRegressor```

### Why are the changes needed?
#25926 added ```setWeightCol``` and ```setMinWeightFractionPerNode``` to GBTs on the Scala side. This PR adds the same setters on the Python side.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
doc tests

Closes #26774 from huaxingao/spark-30146.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
zhengruifeng added a commit that referenced this pull request Jan 6, 2020
### What changes were proposed in this pull request?
1. Fix `BaggedPoint.convertToBaggedRDD` when `subsamplingRate < 1.0`.
2. Reorganize `RandomForest.runWithMetadata` along the way.

### Why are the changes needed?
In GBT, instance weights are discarded if `subsamplingRate < 1`:

1. `baggedPoint: BaggedPoint[TreePoint]` is used during tree growth to find the best split.
2. `BaggedPoint[TreePoint]` contains two weights:
```scala
class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
```
3. Only the `sampleWeight` in `BaggedPoint` is used; the `weight` in `TreePoint` is never used when finding splits.
4. The method `BaggedPoint.convertToBaggedRDD` was changed in #21632; it was only used by decision trees at the time, so only the following code path was changed:
```
if (numSubsamples == 1 && subsamplingRate == 1.0) {
  convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
}
```
5. In #25926 I made GBT support weights, but only tested it with the default `subsamplingRate == 1`.
GBT with `subsamplingRate < 1` converts treePoints to baggedPoints via
```scala
convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
```
in which the original weights from `weightCol` are discarded and every `sampleWeight` is assigned the default 1.0; a sketch of the fix follows below.
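
A sketch of the fix (the sampling helper here is hypothetical; the real change carries the extracted weight into each `BaggedPoint` in the sampling path instead of defaulting it to 1.0):

```scala
// Before: the sampling path ignored extractSampleWeight, so sampleWeight was 1.0.
// After (sketch): keep the instance weight alongside the subsample counts.
input.map { datum =>
  val subsampleCounts = drawSubsampleCounts(datum) // hypothetical helper
  new BaggedPoint(datum, subsampleCounts, extractSampleWeight(datum))
}
```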

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
updated test suites

Closes #27070 from zhengruifeng/gbt_sampling.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>