[SPARK-9612][ML][FOLLOWUP] fix GBT support weights if subsamplingRate<1 #27070
ping @imatiach-msft
Test build #116016 has finished for PR 27070 at commit
Test build #116025 has finished for PR 27070 at commit
Looks reasonable to me
val retaggedInput = input.retag(classOf[Instance])
DecisionTreeMetadata.buildMetadata(retaggedInput, strategy, numTrees, featureSubsetStrategy)
}

/**
 * Train a random forest.
Update the doc, e.g. "Train a random forest with metadata."
Also add a description for the metadata param.
@@ -91,8 +91,7 @@ class BaggedPointSuite extends SparkFunSuite with MLlibTestSparkContext {
 baggedRDD.map(_.subsampleCounts.map(_.toDouble)).collect()
 EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, expectedMean,
 expectedStddev, epsilon = 0.01)
-// should ignore weight function for now
-assert(baggedRDD.collect().forall(_.sampleWeight === 1.0))
+assert(baggedRDD.collect().forall(_.sampleWeight === 2.0))
just trying to understand, why did the sample weight change in this test?
Because this test suite meets the conditions withReplacement=false and numSubsamples!=1, it calls the modified convertToBaggedRDDSamplingWithoutReplacement. The extractSampleWeight here is (_: LabeledPoint) => 2.0, so the output baggedPoints have sampleWeight==2.0.
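The behavior described in this reply can be sketched without Spark. The snippet below is a simplified illustration, not the code from this PR: `LabeledPoint`, `BaggedPoint`, and `bagWithoutReplacement` are hypothetical stand-ins that work on a plain `Seq` instead of an RDD. It only shows that when the weight is taken from `extractSampleWeight`, a constant function returning 2.0 gives every bagged point `sampleWeight == 2.0`.

```scala
import scala.util.Random

// Hypothetical, simplified stand-ins for Spark's internal classes.
final case class LabeledPoint(label: Double, features: Array[Double])
final case class BaggedPoint[T](datum: T, subsampleCounts: Array[Int], sampleWeight: Double)

object BaggingSketch {
  // Sampling without replacement: each subsample keeps a point at most once,
  // and the point's weight comes from extractSampleWeight instead of a fixed 1.0.
  def bagWithoutReplacement[T](
      input: Seq[T],
      subsamplingRate: Double,
      numSubsamples: Int,
      extractSampleWeight: T => Double,
      seed: Long): Seq[BaggedPoint[T]] = {
    val rng = new Random(seed)
    input.map { datum =>
      // One Bernoulli draw per subsample decides whether the point is included.
      val counts = Array.fill(numSubsamples)(if (rng.nextDouble() < subsamplingRate) 1 else 0)
      BaggedPoint(datum, counts, extractSampleWeight(datum))
    }
  }

  // Same shape as the test: a constant weight function that returns 2.0.
  val points: Seq[LabeledPoint] =
    Seq.tabulate(100)(i => LabeledPoint((i % 2).toDouble, Array(i.toDouble)))
  val bagged: Seq[BaggedPoint[LabeledPoint]] =
    bagWithoutReplacement(points, 0.5, 3, (_: LabeledPoint) => 2.0, 42L)
}
```

With these assumptions, `BaggingSketch.bagged.forall(_.sampleWeight == 2.0)` holds regardless of which points each subsample happens to include.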
Great work!
@@ -577,7 +592,7 @@ private[spark] object RandomForest extends Logging with Serializable {

 // transform nodeStatsAggregators array to (nodeIndex, nodeAggregateStats) pairs,
 // which can be combined with other partition using `reduceByKey`
-nodeStatsAggregators.view.zipWithIndex.map(_.swap).iterator
+nodeStatsAggregators.iterator.zipWithIndex.map(_.swap)
The IDEA editor always shows warnings on these two lines, so I changed them to avoid the warnings.
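As a quick check that the two forms are interchangeable, here is a minimal sketch using a small array of strings in place of the real aggregators (the element values and the `SwapSketch` name are made up for illustration). Both expressions yield the same (index, element) pairs.

```scala
object SwapSketch {
  // Placeholder values standing in for the per-node aggregators.
  val aggregators: Array[String] = Array("statsForNode0", "statsForNode1", "statsForNode2")

  // Old form: lazy view, zip with index, swap, then convert to an iterator.
  val viaView: List[(Int, String)] =
    aggregators.view.zipWithIndex.map(_.swap).iterator.toList

  // New form: start from the iterator directly.
  val viaIterator: List[(Int, String)] =
    aggregators.iterator.zipWithIndex.map(_.swap).toList
}
```

Starting from `iterator` skips the intermediate view, and the resulting pairs are the same.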
Test build #116063 has finished for PR 27070 at commit
Test build #116072 has finished for PR 27070 at commit
@imatiach-msft Thanks very much for your reviewing!
Merged to master! Thanks all for reviewing!
What changes were proposed in this pull request?
1, fix BaggedPoint.convertToBaggedRDD when subsamplingRate < 1.0
2, reorg RandomForest.runWithMetadata btw
Why are the changes needed?
In GBT, instance weights are discarded if subsamplingRate<1:
1, baggedPoint: BaggedPoint[TreePoint] is used in the tree growth to find the best split;
2, BaggedPoint[TreePoint] contains two weights: sampleWeight in BaggedPoint and weight in TreePoint;
3, only the var sampleWeight in BaggedPoint is used; the var weight in TreePoint is never used in finding splits;
4, the method BaggedPoint.convertToBaggedRDD was changed in #21632, but only for decision trees, so only the code path they use was changed;
5, in #25926, I made GBT support weights, but only tested it with the default subsamplingRate==1.
GBT with subsamplingRate<1 converts treePoints to baggedPoints via a different code path, in which the original weights from weightCol are discarded and every sampleWeight is assigned the default 1.0.
Does this PR introduce any user-facing change?
No
How was this patch tested?
updated testsuites
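To make the effect of the discarded weights concrete, here is a toy sketch. It is not Spark's split or impurity computation: it only compares a weighted label mean computed with the original weights against the same mean after every weight has been reset to 1.0. The `WeightSketch` object and its data are assumptions made up for illustration.

```scala
object WeightSketch {
  // (label, weight) pairs; the weights stand in for values read from weightCol.
  val instances: Seq[(Double, Double)] = Seq((1.0, 4.0), (0.0, 1.0), (0.0, 1.0))

  // Weighted mean of the labels.
  def weightedMean(data: Seq[(Double, Double)]): Double =
    data.map { case (label, w) => label * w }.sum / data.map(_._2).sum

  // Original weights preserved: 4.0 / 6.0.
  val withWeights: Double = weightedMean(instances)

  // Every weight overwritten with the default 1.0: 1.0 / 3.0.
  val withoutWeights: Double =
    weightedMean(instances.map { case (label, _) => (label, 1.0) })
}
```

The heavily weighted instance dominates the first result but counts the same as any other instance in the second, which is the kind of silent change the fix is meant to prevent when subsamplingRate<1.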