[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively #2780

Closed
chouqin wants to merge 20 commits

chouqin
Contributor

@chouqin chouqin commented Oct 13, 2014

DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:

  1. Sort the sample values for this feature.
  2. Count the number of occurrences of each distinct value.
  3. Iterate over the value-count array computed in step 2 to choose splits.

After the splits are found, numSplits and numBins in the metadata are updated.
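
A minimal sketch of the three steps above, assuming a hypothetical standalone helper `findSplitsSketch` that takes the sampled values and a target number of splits. It is not the code in this PR, and it leaves out the metadata update (numSplits, numBins). The rule used in step 3 (emit the previous distinct value once its cumulative count is closer to the next target count than the current one) is an assumption made for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  /** Hypothetical helper: choose up to `numSplits` distinct thresholds from `samples`. */
  def findSplitsSketch(samples: Array[Double], numSplits: Int): Array[Double] = {
    require(numSplits > 0, "numSplits must be positive")
    if (samples.isEmpty) return Array.empty[Double]

    // Steps 1 and 2: count occurrences of each distinct value,
    // then order the distinct values ascending.
    val valueCounts: Array[(Double, Int)] =
      samples.groupBy(identity).map { case (v, vs) => (v, vs.length) }.toArray.sortBy(_._1)

    if (valueCounts.length <= numSplits) {
      // Not more distinct values than requested splits: use every distinct value.
      valueCounts.map(_._1)
    } else {
      // Step 3: walk the counts, aiming for roughly equal-sized bins.
      val stride = samples.length.toDouble / (numSplits + 1)
      val splits = ArrayBuffer.empty[Double]
      var currentCount = valueCounts(0)._2 // samples covered so far
      var targetCount = stride             // cumulative count the next split aims for
      var index = 1
      while (index < valueCounts.length) {
        val previousCount = currentCount
        currentCount += valueCounts(index)._2
        val previousGap = math.abs(previousCount - targetCount)
        val currentGap = math.abs(currentCount - targetCount)
        // If including the current value moves us further from the target,
        // the previous distinct value becomes a threshold.
        if (previousGap < currentGap) {
          splits += valueCounts(index - 1)._1
          targetCount += stride
        }
        index += 1
      }
      splits.toArray
    }
  }
}
```

With `Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)` and one split, this sketch returns `Array(2.0)`. With `Array(1.0, 1.0, 1.0, 1.0, 2.0, 3.0)` and two splits, it returns `Array(1.0, 2.0)`, so no threshold is repeated. Each threshold comes from a different distinct value, so the result cannot contain duplicates.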

CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.

@AmplabJenkins

Can one of the admins verify this patch?

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have started for PR 2780 at commit 092efcb.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have finished for PR 2780 at commit 092efcb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21680/
Test FAILed.

@chouqin
Contributor Author

chouqin commented Oct 13, 2014

Jenkins, retest this please.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have started for PR 2780 at commit 9e64699.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have started for PR 2780 at commit 9e64699.

  • This patch merges cleanly.

@chouqin
Contributor Author

chouqin commented Oct 13, 2014

@jkbradley, RandomForestSuite fails because the original splits fit the training data better (for example, 899.5 is a split threshold, which is close to 900). I think this PR's method of choosing splits is more reasonable than the original method, because the first threshold found by the original method will be the average of the first two featureSamples.

For example, if featureSamples is Array(0, 1, 2, 3, 4, 5), finding a split point with the original method returns 0.5, while this PR's method returns 2.
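
For comparison, here is a rough reconstruction of the fixed-stride behaviour described above. It assumes each threshold is placed halfway between the sample at a stride boundary and its successor. The name `fixedStrideSplits` and the integer stride are assumptions for illustration, not the original code.

```scala
object OriginalSplitSketch {
  /** Hypothetical reconstruction: fixed-stride thresholds over already-sorted samples. */
  def fixedStrideSplits(sortedSamples: Array[Double], numSplits: Int): Array[Double] = {
    require(numSplits > 0, "numSplits must be positive")
    require(sortedSamples.length >= 2, "need at least two samples")
    val numBins = numSplits + 1
    // Integer stride between the sample indices used for thresholds (assumed).
    val stride = sortedSamples.length / numBins
    Array.tabulate(numSplits) { i =>
      val idx = i * stride
      // Threshold halfway between two neighbouring samples; the first one
      // always uses samples 0 and 1, regardless of how the data is distributed.
      (sortedSamples(idx) + sortedSamples(idx + 1)) / 2.0
    }
  }
}
```

On `Array(0.0, 1.0, 2.0, 3.0, 4.0, 5.0)` with one split this yields `Array(0.5)`. On `Array(1.0, 1.0, 1.0, 1.0, 2.0, 3.0)` with two splits it yields `Array(1.0, 1.0)`, the kind of duplicated split this PR is meant to avoid.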

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have finished for PR 2780 at commit 9e64699.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21684/
Test FAILed.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have finished for PR 2780 at commit 9e64699.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21685/
Test FAILed.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have started for PR 2780 at commit d353596.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 13, 2014

QA tests have finished for PR 2780 at commit d353596.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

@chouqin Sorry for the slow response!

About the RandomForestSuite failure: The change to fix the failure (maxBins) is OK with me. It is a somewhat brittle test. Good point about the first threshold being wasted.

About the histogram method’s speed: I would guess that the extra computation will not be that bad. Even if maxBins grows, I would expect the runtime of the whole algorithm to slow down as well, and the number of samples is capped at 10000. I will run some tests though to make sure.

About the histogram method’s references: The PLANET paper uses “equidepth” histograms, citing the paper below. Looking at that paper, “equidepth” means the same method which @manishamde implemented previously. I will look into this a little more to see if I find a match for the method you implemented.

  • PLANET paper: “PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce”
  • Paper they cite for histograms: G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In International Conference on ACM Special Interest Group on Management of Data (SIGMOD), pages 251–262, 1999.

I’ll make a pass now and add comments.

@@ -36,6 +36,7 @@ import org.apache.spark.mllib.tree.model._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.random.XORShiftRandom
import org.apache.spark.SparkContext._
import scala.collection.mutable.ArrayBuffer
Member

Organize imports

@jkbradley
Member

@chouqin Thanks for this PR! This method should be a real improvement. I added some small comments inline.

My main concern right now is testing edge cases, like @manishamde suggested. Could you try to add some please? Thanks!

@chouqin
Contributor Author

chouqin commented Oct 19, 2014

@manishamde @jkbradley Thanks for your comments. I have updated the code. Do you have any more suggestions?

@chouqin
Contributor Author

chouqin commented Oct 19, 2014

Jenkins, test this please.

@jkbradley
Member

@chouqin Thanks for the updates! The updates look good.

One more small comment: Could you please add explicit checks in the unit tests to make sure the returned splits are distinct? I should have thought of that earlier.
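
A check of this kind could look roughly like the sketch below. `allDistinct` is a hypothetical helper named here for illustration, not something from the test suite in this PR.

```scala
object DistinctCheckSketch {
  /** Hypothetical helper: true when no threshold appears more than once. */
  def allDistinct(splits: Array[Double]): Boolean =
    splits.distinct.length == splits.length
}
```

In a unit test, this would be wrapped in an assertion on the array of returned splits, for example `assert(DistinctCheckSketch.allDistinct(splits))`.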

I'll try some timing tests to make sure the sampling does not take too long.

@manishamde
Contributor

@chouqin LGTM. 👍

@manishamde
Contributor

@jkbradley I read the paper by Manku et al. and other papers, but they required a custom implementation. The sort method has worked OK so far, but I was hoping somebody would implement a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I think such methods exist in other libraries such as Algebird and t-digest. We should also check whether BlinkDB has attempted to tackle this problem.

@chouqin
Contributor Author

chouqin commented Oct 20, 2014

@jkbradley I updated the unit test to check that the splits returned by findSplitsForContinuousFeature are distinct. I ran the unit test and it passed.

@mengxr It seems that Jenkins isn't responding. Could you please ask it to run the tests?

@SparkQA

SparkQA commented Oct 20, 2014

QA tests have started for PR 2780 at commit 18d0301.

  • This patch merges cleanly.

@jkbradley
Member

@chouqin Thanks for the update! LGTM once the tests pass.

@manishamde At some point, I hope the histogram functionality can be part of mllib/statistics/ especially once it gets fancy. More to-do items!

@SparkQA

SparkQA commented Oct 20, 2014

QA tests have finished for PR 2780 at commit 18d0301.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Oct 20, 2014

Merged into master. Thanks!

@asfgit asfgit closed this in eadc4c5 Oct 20, 2014