[SPARK-16957][MLlib] Use midpoints for split values. #17556
Conversation
@@ -1009,10 +1009,24 @@ private[spark] object RandomForest extends Logging {
// sort distinct values
val valueCounts = valueCountMap.toSeq.sortBy(_._1).toArray

def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
Nit: cru -> cur? or current?
fixed.
@@ -996,7 +996,7 @@ private[spark] object RandomForest extends Logging {
require(metadata.isContinuous(featureIndex),
  "findSplitsForContinuousFeature can only be used to find splits for a continuous feature.")

- val splits = if (featureSamples.isEmpty) {
+ val splits: Array[Double] = if (featureSamples.isEmpty) {
Was this needed?
The code block is long and has four exits. Making the type explicit perhaps makes it easier to understand, even though the type of splits is implied by the return type.
def weightedMean(pre: (Double, Int), cru: (Double, Int)): Double = {
  val (preValue, preCount) = pre
  val (curValue, curCount) = cru
  (preValue * preCount + curValue * curCount) / (preCount + curCount)
I'm probably over-thinking this, but do we have a possible overflow issue in the denominator? Like if both are near Int.MaxValue. One could be converted with .toDouble just to make sure.
I agree with you. fixed.
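For illustration, a minimal sketch of what the overflow-safe version might look like; the exact placement of the .toDouble conversion in the merged code may differ.

```scala
// Sketch only: weighted mean of two adjacent (value, count) pairs.
// Converting one count to Double first keeps the denominator in floating
// point, so it cannot overflow Int even if both counts are near Int.MaxValue.
def weightedMean(pre: (Double, Int), cur: (Double, Int)): Double = {
  val (preValue, preCount) = pre
  val (curValue, curCount) = cur
  (preValue * preCount + curValue * curCount) / (preCount.toDouble + curCount)
}
```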
} else if (possibleSplits <= numSplits) {
  valueCounts
    .sliding(2)
    .map{x => weightedMean(x(0), x(1))}
Nit: use () instead of {}
There are more efficient ways of writing this but not as compact. I think it's OK unless someone suggests this is performance critical here
fixed.
Do you mean use scanLeft? It's a little complicated and obscure.
No, not scanLeft; just manually building the result array and iterating, because it's already known ahead of time how big it is.
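As a rough illustration of that alternative (valueCounts and weightedMean are the names from the diff above; this is a sketch, not the code that was merged):

```scala
// The number of midpoints is known up front: one per adjacent pair of values.
val midpoints = new Array[Double](valueCounts.length - 1)
var i = 0
while (i < midpoints.length) {
  midpoints(i) = weightedMean(valueCounts(i), valueCounts(i + 1))
  i += 1
}
```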
It seems OK to me, but @sethah or @jkbradley might be good as a second set of eyes. It does slightly alter behavior, but it does seem like something that should work better in general.
Test build #3652 has started for PR 17556 at commit
Is there something wrong with Spark CI?
Does anyone know what this is?
Just a flaky test. Can't be related.
Test build #3654 has finished for PR 17556 at commit
Test build #3655 has finished for PR 17556 at commit
@srowen Hi, I forgot the unit tests in Python and R. Where can I find documentation about setting up a development environment? Thanks.
I have run all of MLlib's unit test cases in Python. However, I am not familiar with R, and I don't want to waste too much time setting up an R environment. Could CI retest the PR? We can check whether any unit tests are still broken. Thanks.
Test build #3662 has finished for PR 17556 at commit
Many thanks, @srowen
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
- assert(splits === Array(1.0, 2.0))
+ assert(splits === Array(1.8, 2.2))
It's clearer IMO to do:
assert(splits === Array((2.0 * 8 + 1 * 2) / (8 + 2), (2.0 * 8 + 3 * 2) / (8 + 2)))
done.
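For reference, the counts implied by the reviewer's formula are 2, 8, and 2 for the values 1, 2, and 3, so the expected splits work out as follows (a restatement of the suggestion above, not necessarily the exact test code):

```scala
// (1*2 + 2*8) / 10 = 1.8 and (2*8 + 3*2) / 10 = 2.2
assert(splits === Array((2.0 * 8 + 1 * 2) / (8 + 2), (2.0 * 8 + 3 * 2) / (8 + 2)))
```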
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
- assert(splits === Array(2.0, 3.0))
+ assert(splits === Array(2.0625, 3.5))
ditto
done.
If we are attempting to match R GBM, it would be great to show, at least on the PR, that we get the same results.
)
val featureSamples = Array(0, 1, 0, 0, 1, 0, 1, 1).map(_.toDouble)
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
assert(splits === Array(0.5))
In this block, would you mind adding another test that exercises the possibleSplits > numSplits code path? It actually does get called below, but those tests are for other things and I think it's better to make it explicit what we are testing.
Added a new case.
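A sketch of what such a test might look like; the feature values and the numSplits setting in fakeMetadata are assumptions here, and the case actually added to the suite may differ.

```scala
// Five distinct values give four candidate splits; if fakeMetadata allows
// only numSplits = 3, the quantile-based (possibleSplits > numSplits) path runs.
val featureSamples = Array(0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4).map(_.toDouble)
val splits = RandomForest.findSplitsForContinuousFeature(featureSamples, fakeMetadata, 0)
assert(splits.length <= 3)
assert(splits.forall(s => s > featureSamples.min && s < featureSamples.max))
```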
Seems like a reasonable change. Just left some minor comments.
I don't mind the weighted midpoints. However, if for a continuous feature we find that many points have the exact same value, we are assuming we may find data points in the test set that are close to, but not at, these same values. But since our training data was clustered at these particular values, perhaps that's not a good assumption. I could live with either method, but I have a slight preference to match the other libraries.
@sethah what's the issue there ... train/test ought to be from the same distribution, in theory. The empirical distribution of the test data will of course be a little different, but what is the issue with that w.r.t. this change? From a theoretical perspective, picking the midpoint seems more justified than picking an endpoint, and a weighted mean more so than a midpoint.
Ah OK, I should think about this more first. Say you have a continuous predictor x and binary output y. Say the optimal split is found to be between 0.1 and 0.2, with 1 observation of 0.1 and 99 of 0.2. Right now the algorithm would pick a split value of 0.2; it certainly can't be > 0.2 or < 0.1, but it's highly unlikely that 0.1 or 0.2 are the actual optimal split value. A weighted mean says the best split is at 0.199, really. It makes sense if you're attempting to make sure that P(0.1 <= x < 0.199) ~= P(0.199 <= x <= 0.2) -- about half the cases in this critical range fall above and below the split. But really the goal is to find x such that P(y=1 | x) is about 0.5. It's not the same thing, but it's also not knowable from the training data. But 0.15 isn't obviously better either. It would mean that, probably, almost all test values in this critical range are classified as positive, not about half.
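A tiny worked version of the numbers above (the variable names are illustrative, not from the patch):

```scala
val (preValue, preCount) = (0.1, 1)    // 1 observation at 0.1
val (curValue, curCount) = (0.2, 99)   // 99 observations at 0.2
val simpleMidpoint = (preValue + curValue) / 2.0                       // 0.15
val weightedMidpoint =
  (preValue * preCount + curValue * curCount) / (preCount + curCount)  // ~0.199
```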
Take a (training) sample of a continuous feature, say {x0, x1, x2, x3, ..., x100}. Spark currently selects quantiles as split points. Suppose 10-quantiles are used, with x2 as the 1st quantile and x10 as the 2nd quantile. The intent is that P(x < x2) ~= P(x2 < x < x10). However, x2 is not perfect: since the data is continuous, there exists some point z that actually satisfies P(x < z) == P(z < x < x10). So a weighted midpoint between x2 and x3 seems more appropriate, in my opinion.
By the way, it's safe to use the plain mean value, since that matches the other libraries. If requested, I'd be happy to modify the PR.
The bucketing is not trying to bucket into buckets of equal P(x); it's a condition on P(y | x). That said, the right point isn't knowable from the training data, and splitting to balance P(x) on either side of the split within the bucket is perhaps the next-most principled thing to do. To reach a conclusion, though: if we have slightly more net preference for a simple average, we could merge that change for now and decide later to make it weighted.
OK, the weighting has been removed from the calculation.
// if possible splits is not enough or just enough, just return all possible splits
// perhaps weighted mean is better in the future, see SPARK-16957 and Github PR 17556.
def mean(pre: (Double, Int), cur: (Double, Int)): Double = {
  val (preValue, preCount) = pre
Is it worth factoring out a method for this? You could just write (preValue, _) = here, but just dereferencing ._1 isn't so bad, and then I wonder whether it saves much to make a method.
Yeah, we should get rid of this method.
removed.
The change looks OK even as-is
// if possible splits is not enough or just enough, just return all possible splits
val splits = for {
  i <- 0 until possibleSplits
} yield (valueCounts(i)._1 + valueCounts(i + 1)._1) / 2
@srowen Is it more efficient than sliding?
Good idea. Maybe even just (0 until possibleSplits).map(...).toArray, which is probably about the same thing anyway. You might write / 2.0 to be clear it's floating-point division.
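A sketch of the suggested form (valueCounts and possibleSplits are the names from the diff above; this is an illustration, not the merged code):

```scala
val splits = (0 until possibleSplits)
  .map(i => (valueCounts(i)._1 + valueCounts(i + 1)._1) / 2.0)
  .toArray
```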
val pre = valueCounts(index - 1)
val cur = valueCounts(index)
// perhaps weighted mean will be used later, see SPARK-16957 and Github PR 17556.
splitsBuilder += (pre._1 + cur._1) / 2
Meh, could likewise be one line like above. No big deal.
Nice! revised.
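For reference, the one-liner suggested above might look like this (splitsBuilder, valueCounts, and index are the names from the diff; the merged code may differ slightly):

```scala
splitsBuilder += (valueCounts(index - 1)._1 + valueCounts(index)._1) / 2.0
```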
How about testing the PR, @SparkQA?
Test build #3682 has finished for PR 17556 at commit
One small nit. Otherwise, LGTM. Thanks @facaiy!
@@ -1037,7 +1042,8 @@ private[spark] object RandomForest extends Logging {
// makes the gap between currentCount and targetCount smaller,
// previous value is a split threshold.
if (previousGap < currentGap) {
- splitsBuilder += valueCounts(index - 1)._1
+ // perhaps weighted mean will be used later, see SPARK-16957 and Github PR 17556.
Comments like these tend to just get left around and sit there forever. Unless we file a new JIRA that intends to decide on future behavior, I would like to remove this comment altogether. I'd prefer to just remove it and not create a follow up.
Merged to master. I'm not against putting it into 2.2, but I'm conscious we already even had an RC.
Thanks @srowen!
What changes were proposed in this pull request?
Use midpoints for split values for now; the midpoint may be made weighted later.
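A minimal before/after illustration of the change (the values are made up):

```scala
// Two adjacent distinct values of a continuous feature around the chosen split:
val previousValue = 8.0
val currentValue = 9.0
val oldThreshold = previousValue                          // before: 8.0
val newThreshold = (previousValue + currentValue) / 2.0   // after: 8.5
```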
How was this patch tested?
Existing unit tests in RandomForestSuite were updated for the new split values and a new case was added; the MLlib Python unit tests were also run locally.