-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively #2780
Conversation
… dt-findsplits Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
Can one of the admins verify this patch? |
QA tests have started for PR 2780 at commit
|
QA tests have finished for PR 2780 at commit
|
Test FAILed. |
Jekins, retest this please. |
QA tests have started for PR 2780 at commit
|
QA tests have started for PR 2780 at commit
|
@jkbradley, RandomForestSuite fails because original splits are better fit for the training data(for example, 899.5 is a split threshold, which is close to 900.) I think this PR's method to choose splits is more reasonable than the original method in that the first threshold found by the original method will be the average value of the first two For example, if |
QA tests have finished for PR 2780 at commit
|
Test FAILed. |
QA tests have finished for PR 2780 at commit
|
Test FAILed. |
QA tests have started for PR 2780 at commit
|
QA tests have finished for PR 2780 at commit
|
@chouqin Sorry for the slow response! About the RandomForestSuite failure: The change to fix the failure (maxBins) is OK with me. It is a somewhat brittle test. Good point about the first threshold being wasted. About the histogram method’s speed: I would guess that the extra computation will not be that bad. Even if maxBins grows, I would expect the runtime of the whole algorithm to slow down as well, and the number of samples is capped at 10000. I will run some tests though to make sure. About the histogram method’s references: The PLANET paper uses “equidepth” histograms, citing the paper below. Looking at that paper, “equidepth” means the same method which @manishamde implemented previously. I will look into this a little more to see if I find a match for the method you implemented.
I’ll make a pass now and add comments. |
@@ -36,6 +36,7 @@ import org.apache.spark.mllib.tree.model._ | |||
import org.apache.spark.rdd.RDD | |||
import org.apache.spark.util.random.XORShiftRandom | |||
import org.apache.spark.SparkContext._ | |||
import scala.collection.mutable.ArrayBuffer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Organize imports
@chouqin Thanks for this PR! This method should be a real improvement. I added some small comments inline. My main concern right now is testing edge cases, like @manishamde suggested. Could you try to add some please? Thanks! |
@manishamde @jkbradley thanks for your comments, I have changed my code now. Do you have any more suggestions? |
Jekins, test this please. |
@chouqin Thanks for the updates! The updates look good. One more small comment: Could you please add explicit checks in the unit tests to make sure the returned splits are distinct? I should have thought of that earlier. I'll try some timing tests to make sure the sampling does not take too long. |
@chouqin LGTM. 👍 |
@jkbradley I read the paper by Sanku et al and other papers but they required a custom implementation. The sort method has worked OK so far but I was hoping somebody would implement a generic quantile approximation algorithm for Spark that is O(n) and requires limited memory. I think such methods exist in other libraries such as Algebird and Tdigest. We should also look whether BlinkDB has attempted to tackle this problem. |
@jkbradley I updated unit test to check splits returned by @mengxr It seems that Jenkins doesn't work, could you please tell it to do the test? |
QA tests have started for PR 2780 at commit
|
@chouqin Thanks for the update! LGTM once the tests pass. @manishamde At some point, I hope the histogram functionality can be part of mllib/statistics/ especially once it gets fancy. More to-do items! |
QA tests have finished for PR 2780 at commit
|
Merged into master. Thanks! |
DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:
After find splits,
numSplits
andnumBins
in metadata will be updated.CC: @mengxr @manishamde @jkbradley, please help me review this, thanks.