[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively #2780

chouqin · 2014-10-13T06:55:26Z

DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:

1. Sort the sample values for this feature.
2. Count the number of occurrences of each distinct value.
3. Iterate over the value-count array computed in step 2 to choose splits.

After the splits are found, numSplits and numBins in the metadata are updated.
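
For readers following along, the three steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this patch: the names `SplitSketch` and `chooseSplits` are hypothetical, the metadata update is left out, and the rule "emit the previous value as a threshold once adding the next count moves the running count further from the target" is an assumption about how step 3 could be done.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper object, for illustration only; not part of MLlib.
object SplitSketch {

  /** Choose at most `numSplits` distinct thresholds from `samples`. */
  def chooseSplits(samples: Array[Double], numSplits: Int): Array[Double] = {
    require(numSplits > 0, "numSplits must be positive")
    if (samples.isEmpty) return Array.empty[Double]

    // Step 1: sort a copy of the sample values.
    val sorted = samples.clone()
    java.util.Arrays.sort(sorted)

    // Step 2: count occurrences of each distinct value (run-length encoding
    // of the sorted array), giving (value, count) pairs in ascending order.
    val valueCounts = ArrayBuffer.empty[(Double, Int)]
    for (v <- sorted) {
      if (valueCounts.nonEmpty && valueCounts.last._1 == v) {
        valueCounts(valueCounts.length - 1) = (v, valueCounts.last._2 + 1)
      } else {
        valueCounts += ((v, 1))
      }
    }

    // Step 3: walk the value counts and pick thresholds.
    if (valueCounts.length <= numSplits) {
      // Too few distinct values: every distinct value is a candidate.
      valueCounts.map(_._1).toArray
    } else {
      // Ideal number of samples between consecutive thresholds.
      val stride = sorted.length.toDouble / (numSplits + 1)
      val splits = ArrayBuffer.empty[Double]
      var currentCount = valueCounts(0)._2 // samples covered so far
      var targetCount = stride             // coverage wanted at the next threshold
      var index = 1
      while (index < valueCounts.length) {
        val previousCount = currentCount
        currentCount += valueCounts(index)._2
        // If including this value overshoots the target more than the
        // previous count undershot it, the previous value is a threshold.
        if (math.abs(previousCount - targetCount) < math.abs(currentCount - targetCount)) {
          splits += valueCounts(index - 1)._1
          targetCount += stride
        }
        index += 1
      }
      splits.toArray
    }
  }
}
```

Because each distinct value is visited once and is appended at most once, the thresholds returned by this sketch cannot contain duplicates. For example, `SplitSketch.chooseSplits(Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0), 1)` yields a single threshold, 2.0.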

CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.


AmplabJenkins · 2014-10-13T06:57:06Z

Can one of the admins verify this patch?

SparkQA · 2014-10-13T06:59:46Z

QA tests have started for PR 2780 at commit 092efcb.

This patch merges cleanly.

SparkQA · 2014-10-13T07:49:42Z

QA tests have finished for PR 2780 at commit 092efcb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-13T07:49:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21680/
Test FAILed.

chouqin · 2014-10-13T09:12:43Z

Jenkins, retest this please.

SparkQA · 2014-10-13T09:14:40Z

QA tests have started for PR 2780 at commit 9e64699.

This patch merges cleanly.

SparkQA · 2014-10-13T09:19:51Z

QA tests have started for PR 2780 at commit 9e64699.

This patch merges cleanly.

chouqin · 2014-10-13T09:20:51Z

@jkbradley, RandomForestSuite fails because the original splits happen to fit the training data better (for example, 899.5 is a split threshold, which is close to 900). I think this PR's method of choosing splits is more reasonable than the original method, because the first threshold found by the original method is the average of the first two featureSamples.

For example, if featureSamples is Array(0, 1, 2, 3, 4, 5), finding a split point with the original method returns 0.5, while this PR's method returns 2.

SparkQA · 2014-10-13T10:11:52Z

QA tests have finished for PR 2780 at commit 9e64699.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-13T10:11:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21684/
Test FAILed.

SparkQA · 2014-10-13T10:16:59Z

QA tests have finished for PR 2780 at commit 9e64699.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-10-13T10:17:02Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21685/
Test FAILed.

SparkQA · 2014-10-13T11:04:38Z

QA tests have started for PR 2780 at commit d353596.

This patch merges cleanly.

SparkQA · 2014-10-13T12:13:18Z

QA tests have finished for PR 2780 at commit d353596.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2014-10-18T00:00:34Z

@chouqin Sorry for the slow response!

About the RandomForestSuite failure: The change to fix the failure (maxBins) is OK with me. It is a somewhat brittle test. Good point about the first threshold being wasted.

About the histogram method’s speed: I would guess that the extra computation will not be that bad. Even if maxBins grows, I would expect the runtime of the whole algorithm to slow down as well, and the number of samples is capped at 10000. I will run some tests though to make sure.

About the histogram method’s references: The PLANET paper uses “equidepth” histograms, citing the paper below. Looking at that paper, “equidepth” means the same method which @manishamde implemented previously. I will look into this a little more to see if I find a match for the method you implemented.

PLANET paper: “PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce”
Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In International Conference on ACM Special Interest Group on Management of Data (SIGMOD), pages 251–262, 1999.

I’ll make a pass now and add comments.

jkbradley · 2014-10-18T08:21:05Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

@@ -36,6 +36,7 @@ import org.apache.spark.mllib.tree.model._
 import org.apache.spark.rdd.RDD
 import org.apache.spark.util.random.XORShiftRandom
 import org.apache.spark.SparkContext._
+import scala.collection.mutable.ArrayBuffer


Organize imports

jkbradley · 2014-10-18T08:22:56Z

@chouqin Thanks for this PR! This method should be a real improvement. I added some small comments inline.

My main concern right now is testing edge cases, like @manishamde suggested. Could you try to add some please? Thanks!

chouqin · 2014-10-19T02:25:36Z

@manishamde @jkbradley thanks for your comments, I have changed my code now. Do you have any more suggestions?

chouqin · 2014-10-19T02:35:57Z

Jenkins, test this please.

jkbradley · 2014-10-19T19:27:28Z

@chouqin Thanks for the updates! The updates look good.

One more small comment: Could you please add explicit checks in the unit tests to make sure the returned splits are distinct? I should have thought of that earlier.

I'll try some timing tests to make sure the sampling does not take too long.

manishamde · 2014-10-19T20:43:15Z

@chouqin LGTM. 👍

manishamde · 2014-10-19T20:53:49Z

@jkbradley I read the paper by Manku et al. and other papers, but they required a custom implementation. The sort method has worked OK so far, but I was hoping somebody would implement a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I think such methods exist in other libraries such as Algebird and Tdigest. We should also check whether BlinkDB has attempted to tackle this problem.

chouqin · 2014-10-20T02:06:57Z

@jkbradley I updated the unit test to check that the splits returned by findSplitsForContinuousFeature are distinct. I ran the unit test and it passed.

@mengxr It seems that Jenkins doesn't work, could you please tell it to do the test?
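
For reference, an explicit distinctness check of the kind requested above could look roughly like this. It is only a sketch: `DistinctSplitsCheck` and `allDistinct` are hypothetical names, and the actual test in the patch may be written differently.

```scala
// Hypothetical helper, for illustration only; not part of the test suite.
object DistinctSplitsCheck {
  // True when no threshold appears more than once in `splits`.
  def allDistinct(splits: Array[Double]): Boolean =
    splits.distinct.length == splits.length
}
```

A test would then assert `DistinctSplitsCheck.allDistinct(splits)` on the array of thresholds it gets back.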

SparkQA · 2014-10-20T18:30:07Z

QA tests have started for PR 2780 at commit 18d0301.

This patch merges cleanly.

jkbradley · 2014-10-20T18:32:49Z

@chouqin Thanks for the update! LGTM once the tests pass.

@manishamde At some point, I hope the histogram functionality can be part of mllib/statistics/ especially once it gets fancy. More to-do items!

SparkQA · 2014-10-20T19:36:42Z

QA tests have finished for PR 2780 at commit 18d0301.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-10-20T20:13:27Z

Merged into master. Thanks!

chouqin added 13 commits October 9, 2014 19:47

Choose splits for continuous features in DecisionTree more adaptively

af7cb79

fix bug

3652823

fix bug

0cd744a

Merge branch 'master' of https://github.com/apache/spark into dt-find…

1b25a35

…splits

Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into…

9e7138e

… dt-findsplits Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

add comments and unit test

8f46af6

fix style

369f812

fix bug

c339a61

fix bug

2a8267a

fix bug

af6dc97

fix bug

ab303a4

fix bug

f69f47f

fix bug

092efcb

chouqin added 2 commits October 13, 2014 17:06

fix random forest unit test

3c72913

fix random forest unit test

9e64699

fix pyspark doc test

d353596

jkbradley reviewed Oct 18, 2014

chouqin added 3 commits October 18, 2014 23:17

Merge branch 'master' of https://github.com/apache/spark into dt-find…

9857039

…splits

adjust code based on comments and add more test cases

ffc920f

remove blank lines

8dc28ab

check explicitly findsplits return distinct splits

18d0301

asfgit closed this in eadc4c5 Oct 20, 2014
