[SPARK-19591][ML][MLlib] Add sample weights to decision trees #21632
Conversation
Jenkins retest this pretty please :)
Jenkins retest this please
Looks like a random failure.
Looks like a random test failure in the Hive client suite (not related to this PR); I'll try updating to latest master and rebuilding...
@holdenk @sethah @HyukjinKwon I have a successful build. I need to look into 2-3 wacky test results that changed since @sethah opened his PR (see comments in my PR). In the meantime, would anyone be able to review the PR? Are there any comments from the previous PR that were still not resolved and need to be addressed?
cc also @jkbradley |
I still had a question about `modelPredictionEquals` and one other minor thing here, but this is quite close now.
Some more minor comments. It's up to your judgment whether to add a new overload to DecisionTreeMetadata to simplify the test code; it seems fine to me either way.
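For illustration, a hypothetical sketch of the kind of overload under discussion (the signature, defaults, and helper object are assumptions, not actual Spark code, and DecisionTreeMetadata is package-private in Spark, so this is only a sketch):

```scala
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.tree.impl.DecisionTreeMetadata
import org.apache.spark.mllib.tree.configuration.{Strategy => OldStrategy}
import org.apache.spark.rdd.RDD

object TreeTestUtils {
  // Hypothetical convenience overload: supply single-tree defaults so
  // individual tests need not repeat numTrees and featureSubsetStrategy.
  def buildMetadata(input: RDD[Instance], strategy: OldStrategy): DecisionTreeMetadata =
    DecisionTreeMetadata.buildMetadata(input, strategy, numTrees = 1,
      featureSubsetStrategy = "all")
}
```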
"up to your judgment on whether to add a new overload to DecisionTreeMetadata to simplify the test code" |
Merged to master
@srowen thank you for the merge and the thorough review. I have some doubts about the tolerance we decided on for zero values: https://github.com/apache/spark/pull/21632/files#diff-1fd1bc8d3fc9306c83cd65fbf3ca4bbeR1054 For a large number of unweighted samples, I am worried that it might be too high. Note EPSILON = 2.2E-16. I am wondering if I should change the tolerance to:
Is there a good reason to scale it by the square of the number of samples? If not, yeah, worth a follow-up. If there is a good reason, then is there a case in the tests here where the tolerance becomes really large, like the same order of magnitude as the expected values? I don't think the tests have ~1e8 samples. Up to your judgment.
I think I made a mistake and it should actually be:
or perhaps a larger threshold:
but I will need to verify by adding some debug output to ensure that no zero features slip through in the sample tests; otherwise that tolerance would still be too low and the factor would need to be increased. My worry is that, by using the square of the number of samples, the tolerance becomes too high for a very large number of samples, and then some values would be counted as zero feature values, which we don't want.
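To make the scaling concern concrete, here is an illustrative Scala sketch (the function names are hypothetical; EPSILON mirrors Spark's `Utils.EPSILON`):

```scala
// Illustrative only: compare how the two candidate tolerances grow
// with the unweighted sample count.
val EPSILON = 2.220446049250313e-16 // machine epsilon, as in Spark's Utils.EPSILON

def quadraticTol(numSamples: Long): Double =
  EPSILON * numSamples.toDouble * numSamples.toDouble

def linearTol(numSamples: Long): Double =
  EPSILON * numSamples.toDouble

// At 1e8 samples the quadratic form is no longer tiny, so real zero-valued
// features could be absorbed into the tolerance; the linear form stays small.
println(quadraticTol(100000000L)) // ~2.22
println(linearTol(100000000L))    // ~2.22e-8
```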
That's fine @imatiach-msft, just open another PR for the same JIRA. We usually put …
…es - fix tolerance

This is a follow-up to PR: #21632

## What changes were proposed in this pull request?

This PR tunes the tolerance used to decide whether to add zero feature values to a value-count map (where the key is the feature value and the value is the weighted count of those feature values). In the previous PR, the tolerance scaled with the square of the unweighted number of samples, which is too aggressive for a large number of unweighted samples. Unfortunately, using just `Utils.EPSILON * unweightedNumSamples` is not enough either, so I multiplied that by a factor tuned via the testing procedure below.

## How was this patch tested?

This involved manually running the sample weight tests for the decision tree regressor to see whether the tolerance was large enough to exclude zero feature values, e.g. in SBT:

```
./build/sbt
> project mllib
> testOnly *DecisionTreeRegressorSuite -- -z "training with sample weights"
```

For validation, I added a print inside the `if` in the code below and validated that the tolerance was large enough that we would not include zero features (which don't exist in that test):

```
val valueCountMap = if (weightedNumSamples - partNumSamples > tolerance) {
  print("should not print this")
  partValueCountMap + (0.0 -> (weightedNumSamples - partNumSamples))
} else {
  partValueCountMap
}
```

Closes #23682 from imatiach-msft/ilmat/sample-weights-tol.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
This updates PR apache#16722 to the latest master.

## What changes were proposed in this pull request?

This patch adds support for sample weights to DecisionTreeRegressor and DecisionTreeClassifier.

Note: this patch does not add support for sample weights to RandomForest. As discussed in the JIRA, we would like to add sample weights into the bagging process. This patch is large enough as is, and there are some additional considerations to be made for random forests. Since the machinery introduced here needs to be present regardless, I have opted to leave random forests for a follow-up PR.

## How was this patch tested?

The algorithms are tested to ensure that:
1. Arbitrary scaling of constant weights has no effect
2. Outliers with small weights do not affect the learned model
3. Oversampling and weighting are equivalent

Unit tests are also added to test other smaller components.

## Summary of changes

- Impurity aggregators now store weighted sufficient statistics. They also hold the raw (unweighted) count, since it is needed to enforce minInstancesPerNode.
- This patch maintains the meaning of minInstancesPerNode, in that the parameter still corresponds to raw, unweighted counts. It also adds a new parameter, minWeightFractionPerNode, which requires that nodes contain at least minWeightFractionPerNode * weightedNumExamples total weight.
- findSplitsForContinuousFeatures is modified to use weighted sums. Unit tests are added.
- TreePoint is modified to hold a sample weight.
- BaggedPoint is modified from:

```scala
private[spark] class BaggedPoint[Datum](val datum: Datum, val subsampleWeights: Array[Double]) extends Serializable
```

to

```scala
private[spark] class BaggedPoint[Datum](
    val datum: Datum,
    val subsampleCounts: Array[Int],
    val sampleWeight: Double) extends Serializable
```

We do not simply multiply the counts by the weight and store that, because we need both the raw counts and the weight in order to use both minInstancesPerNode and minWeightFractionPerNode.

**Note**: many of the changed files differ only because they use Instance instead of LabeledPoint.

Closes apache#21632 from imatiach-msft/ilmat/sample-weights.

Authored-by: Ilya Matiach <ilmat@microsoft.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
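To illustrate why BaggedPoint keeps both fields, here is a small self-contained Scala sketch (the helper methods are hypothetical illustrations, not methods from the patch):

```scala
// Sketch: a point's contribution to tree i combines the integer bagging
// count with the continuous sample weight. Raw counts remain available
// separately so that minInstancesPerNode can be checked on unweighted counts.
class BaggedPoint[Datum](
    val datum: Datum,
    val subsampleCounts: Array[Int],
    val sampleWeight: Double) extends Serializable {

  // Hypothetical helper: weighted contribution of this point to tree i,
  // used for impurity statistics and minWeightFractionPerNode.
  def weightedCount(treeIndex: Int): Double =
    subsampleCounts(treeIndex) * sampleWeight

  // Hypothetical helper: raw count used for minInstancesPerNode.
  def rawCount(treeIndex: Int): Int = subsampleCounts(treeIndex)
}
```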
```diff
@@ -37,7 +37,7 @@ import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
 private[ml] trait DecisionTreeParams extends PredictorParams
-  with HasCheckpointInterval with HasSeed {
+  with HasCheckpointInterval with HasSeed with HasWeightCol {
```
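Once `DecisionTreeParams` mixes in `HasWeightCol`, a tree estimator can be configured with per-instance weights. A minimal usage sketch against the post-merge API (the column names and the 0.05 value are illustrative):

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// "weight" is an illustrative column holding per-instance sample weights.
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setWeightCol("weight")
  .setMinWeightFractionPerNode(0.05) // each node must hold >= 5% of total weight
```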
@imatiach-msft @srowen Here the params `weightCol` and `minWeightFractionPerNode` are introduced into `DecisionTreeParams` and are thus also exposed to RF and GBT. But RF and GBT do not support sample weighting for now. Is there any plan to support it? Or should we put these params into `DecisionTreeRegressorParams` and `DecisionTreeClassifierParams` instead?
"Is there any plan to support it?"
yes, definitely, we should support it eventually, will look into adding it when I get a chance. There's already a JIRA ticket for that as well.
(tagging @zhengruifeng)
Thanks @imatiach-msft! I just read the corresponding tickets.
My concern is that we need to support weighting in RF & GBT in 3.0.0; otherwise two unused params will have been added.
Could someone link to the JIRA ticket for class weights in RF / GBT? I've been trying to track it down.
I found https://issues.apache.org/jira/browse/SPARK-9478, but it is marked resolved, and I do not see where the RF / GBT feature is tracked or implemented. Any pointers are appreciated.
You can see it was resolved as a duplicate of https://issues.apache.org/jira/browse/SPARK-19591, which is targeted for 3.0.0.
### What changes were proposed in this pull request?

1. Fix `BaggedPoint.convertToBaggedRDD` when `subsamplingRate < 1.0`.
2. Also reorganize `RandomForest.runWithMetadata`.

### Why are the changes needed?

In GBT, instance weights will be discarded if subsamplingRate < 1:

1. `baggedPoint: BaggedPoint[TreePoint]` is used in the tree growth to find the best split;
2. `BaggedPoint[TreePoint]` contains two weights:

```scala
class BaggedPoint[Datum](val datum: Datum, val subsampleCounts: Array[Int], val sampleWeight: Double = 1.0)
class TreePoint(val label: Double, val binnedFeatures: Array[Int], val weight: Double)
```

3. Only the var `sampleWeight` in `BaggedPoint` is used; the var `weight` in `TreePoint` is never used in finding splits.
4. The method `BaggedPoint.convertToBaggedRDD` was changed in #21632; it was only for decision trees, so only the following code path was changed:

```
if (numSubsamples == 1 && subsamplingRate == 1.0) {
  convertToBaggedRDDWithoutSampling(input, extractSampleWeight)
}
```

5. In #25926, I made GBT support weights, but only tested it with the default `subsamplingRate == 1`. GBT with `subsamplingRate < 1` will convert treePoints to baggedPoints via

```scala
convertToBaggedRDDSamplingWithoutReplacement(input, subsamplingRate, numSubsamples, seed)
```

in which the original weights from `weightCol` will be discarded and all `sampleWeight` are assigned the default 1.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Updated test suites.

Closes #27070 from zhengruifeng/gbt_sampling.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
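A minimal sketch of the fix's intent (the class shapes mirror the commit message, but the sampling logic here is a simplified assumption, not the actual Spark implementation):

```scala
import scala.util.Random

// Simplified stand-ins for the classes named in the commit message.
case class TreePoint(label: Double, binnedFeatures: Array[Int], weight: Double)
case class BaggedPoint[Datum](
    datum: Datum,
    subsampleCounts: Array[Int],
    sampleWeight: Double = 1.0)

// Sketch: when subsampling without replacement, carry the original
// instance weight into BaggedPoint instead of leaving the default 1.0,
// which is what the bug effectively did.
def toBaggedSketch(
    input: Seq[TreePoint],
    subsamplingRate: Double,
    numSubsamples: Int,
    seed: Long): Seq[BaggedPoint[TreePoint]] = {
  val rng = new Random(seed)
  input.map { tp =>
    val counts = Array.fill(numSubsamples) {
      if (rng.nextDouble() < subsamplingRate) 1 else 0
    }
    BaggedPoint(tp, counts, sampleWeight = tp.weight)
  }
}
```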