[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix #1950

jkbradley · 2014-08-14T20:12:21Z

DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.
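
As a rough illustration of the idea (not the actual TreePoint code), the sketch below bins each feature of a point exactly once and stores the resulting bin indices. The names RawPoint, BinnedPoint, Binning, findBin and toBinned are hypothetical stand-ins, and the sketch assumes continuous features with sorted split thresholds; the real implementation also handles categorical features and operates on an RDD.

```scala
// Hypothetical stand-ins for MLlib's internal types; the names are illustrative only.
final case class RawPoint(label: Double, features: Array[Double])
final case class BinnedPoint(label: Double, binnedFeatures: Array[Int])

object Binning {
  /** Index of the bin containing `value`, given sorted upper thresholds for one feature.
    * Bin i covers (thresholds(i - 1), thresholds(i)]; the last bin is unbounded above.
    */
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    var lo = 0
    var hi = thresholds.length // hi == thresholds.length means "last bin"
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (value <= thresholds(mid)) hi = mid else lo = mid + 1
    }
    lo
  }

  /** Bin every feature of a point once, so later passes can reuse the indices. */
  def toBinned(p: RawPoint, thresholds: Array[Array[Double]]): BinnedPoint = {
    val arr = new Array[Int](p.features.length)
    var featureIndex = 0
    while (featureIndex < arr.length) {
      arr(featureIndex) = findBin(p.features(featureIndex), thresholds(featureIndex))
      featureIndex += 1
    }
    BinnedPoint(p.label, arr)
  }
}
```

Each level of the tree can then read binnedFeatures directly instead of repeating the binary search for every node and feature.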

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if indexed by feature values, rather than by subsets of values/categories.

* Exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.
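
To make the distinction concrete, here is a minimal sketch of why a category value is not a valid bin index for an unordered feature. A feature with M categories has 2^(M-1) - 1 distinct two-way splits, each identified by a subset of categories. The object UnorderedSplits and the bit-mask encoding are assumptions made for illustration; they are not necessarily how DecisionTree encodes subsets.

```scala
object UnorderedSplits {
  /** Number of distinct two-way partitions of `arity` categories with both sides non-empty.
    * Assumes 1 <= arity <= 31 so the shift does not overflow an Int.
    */
  def numSplits(arity: Int): Int = (1 << (arity - 1)) - 1

  /** Categories on the "left" side of split `splitIndex` (0-based),
    * read off the set bits of splitIndex + 1. The highest category never
    * appears on the left, which avoids counting a partition twice.
    */
  def leftCategories(splitIndex: Int, arity: Int): Set[Int] = {
    val mask = splitIndex + 1
    (0 until arity).filter(c => ((mask >> c) & 1) == 1).toSet
  }
}
```

For arity 3 this gives three splits with left sides {0}, {1} and {0, 1}, so the split index ranges over subsets, not over the three category values.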

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.
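
For readers unfamiliar with this kind of helper, a label-based timer can be sketched as follows. This is a hypothetical SimpleTimer written for illustration, not the TimeTracker class added in this PR; only the start(label) call style and reporting in seconds are taken from the discussion here.

```scala
import scala.collection.mutable

/** Hypothetical label-based timer; not the TimeTracker from this PR. */
class SimpleTimer extends Serializable {
  private val starts = mutable.HashMap.empty[String, Long]
  private val totals = mutable.HashMap.empty[String, Long]

  /** Begin timing the given label. */
  def start(label: String): Unit = {
    require(!starts.contains(label), s"timer '$label' is already running")
    starts(label) = System.nanoTime()
  }

  /** Stop timing the label, add to its running total, and return the elapsed seconds. */
  def stop(label: String): Double = {
    val begin = starts.remove(label).getOrElse(
      throw new IllegalStateException(s"timer '$label' was never started"))
    val elapsed = System.nanoTime() - begin
    totals(label) = totals.getOrElse(label, 0L) + elapsed
    elapsed / 1e9
  }

  /** One line per label with its accumulated time in seconds. */
  override def toString: String =
    totals.toSeq.sortBy(_._1)
      .map { case (label, nanos) => f"$label: ${nanos / 1e9}%.3f s" }
      .mkString("\n")
}
```

A caller would wrap a phase in start("init") and stop("init") and pass the timer's toString to a debug logger.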

CC: @mengxr @manishamde @chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043


Optimization: Added TreePoint representation so we only call findBin once for each (example, feature) pair. Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.

BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if indexed by feature values, rather than by subsets of values/categories.
* Exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.

Also: some commented-out debugging println calls in DecisionTree, to be removed later


jkbradley · 2014-08-14T20:24:48Z

Jenkins test this please

mengxr · 2014-08-14T22:27:09Z

Jenkins, test this please.

SparkQA · 2014-08-14T22:30:22Z

QA tests have started for PR 1950. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18573/consoleFull

SparkQA · 2014-08-14T23:15:09Z

QA tests have started for PR 1950. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18577/consoleFull

SparkQA · 2014-08-14T23:23:58Z

QA results for PR 1950:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class TimeTracker extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18573/consoleFull

manishamde · 2014-08-14T23:24:24Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

        val shift = 1 + numFeatures * nodeIndex
        if (!sampleValid) {
          // Mark one bin as -1 is sufficient.
          arr(shift) = InvalidBinIndex
        } else {
          var featureIndex = 0
+          // TODO: Vectorize this


TODO can be removed.

SparkQA · 2014-08-15T00:06:03Z

QA results for PR 1950:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class TimeTracker extends Serializable {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18577/consoleFull

manishamde · 2014-08-15T00:23:13Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

+          val numFeatureCategories = strategy.categoricalFeaturesInfo(featureIndex)
+          val isSpaceSufficientForAllCategoricalSplits =
+            numBins > math.pow(2, numFeatureCategories.toInt - 1) - 1
+          val isUnorderedFeature =


I guess we can bundle this calculation as well in your future commit. Let's create a map from feature id to feature type which can be re-used by all internal methods.

Yep, I'll make a private metadata class for storing that and pass it around internally.
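
One possible shape for such a metadata class is sketched below. The class name FeatureMetadata and its fields are hypothetical. The rule that a feature is unordered only for multiclass problems with enough bins for all subset splits is an assumption pieced together from the snippet above and the PR description, not the code that was eventually committed.

```scala
/** Hypothetical per-tree metadata, computed once and passed to internal methods. */
final class FeatureMetadata(
    val numBins: Int,
    val categoricalFeaturesInfo: Map[Int, Int], // feature index -> number of categories
    val isMulticlass: Boolean) {

  /** Categorical features whose subset splits all fit in numBins (assumed rule). */
  val unorderedFeatures: Set[Int] =
    if (!isMulticlass) Set.empty[Int]
    else categoricalFeaturesInfo.collect {
      case (featureIndex, arity) if numBins > math.pow(2, arity - 1) - 1 => featureIndex
    }.toSet

  def isCategorical(featureIndex: Int): Boolean =
    categoricalFeaturesInfo.contains(featureIndex)

  def isUnordered(featureIndex: Int): Boolean =
    unorderedFeatures.contains(featureIndex)
}
```

Computing the set once up front means inner loops only do a set lookup per feature instead of repeating the math.pow comparison.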

manishamde · 2014-08-15T02:44:52Z

👍

LGTM! I have some minor comments. I look forward to the performance improvements based upon these changes. The multiclass bug fix is a great catch. Finally, I see the timer functionality being added soon to other parts of Spark. :-)

mengxr · 2014-08-15T04:23:04Z

mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

    logDebug("numBins = " + numBins)

+    timer.start("init")
+    val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache()


.cache() -> .persist(StorageLevel.MEMORY_AND_DISK)? There is a computation/storage trade-off here, maybe worth testing.

I'll try testing this.

Wow, that can help a lot. I've only tested on my laptop, but there, spark-perf tests using 500K examples and 500 features with tree depth 5 took 292 sec using cache() and 112 sec using mem + disk. I'll make this update.
On a smaller example (100K examples, 500 features, depth 5), the difference was 80 sec vs. 21 sec.

mengxr · 2014-08-15T04:24:29Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impl/TreePoint.scala

+    val numFeatures = labeledPoint.features.size
+    val numBins = bins(0).size
+    val arr = new Array[Int](numFeatures)
+    var featureIndex = 0 // offset by 1 for label


Where does offset by 1 happen?

That was an old comment; I removed it, thanks!

jkbradley · 2014-08-15T16:29:48Z

@chouqin @manishamde @mengxr Thank you for the comments! I'll make those fixes and get the other PRs done ASAP.

…disk, not just memory.

Details:

DecisionTree
* Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK)
** This gave major performance improvements on small tests. E.g., 500K examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112 when using disk as well.
* Change for to while loops
* Small cleanups

TimeTracker
* Removed useless timing in DecisionTree

TreePoint
* Renamed features to binnedFeatures

jkbradley · 2014-08-15T19:31:09Z

@chouqin @manishamde @mengxr Hopefully it is good now. Thanks again! By the way, I tested this on spark-perf on small tests and got small speedups (2x), but that was before the persist() change. I am running larger tests now. Also, I compared learning with scikit-learn on a few real datasets to make sure the learning algorithms behave similarly; they get similar accuracy and tree structures.

SparkQA · 2014-08-15T19:35:20Z

QA tests have started for PR 1950 at commit 6b5651e.

This patch merges cleanly.

SparkQA · 2014-08-15T19:36:06Z

QA tests have finished for PR 1950 at commit 6b5651e.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- exec "$FWDIR"/bin/spark-submit --class $CLASS "$
- trait TaskCompletionListener extends EventListener
- class AvroWrapperToJavaConverter extends Converter[Any, Any]
- exec "$FWDIR"/bin/spark-submit --class $CLASS "$

SparkQA · 2014-08-15T19:50:04Z

QA tests have started for PR 1950 at commit 5f2dec2.

This patch merges cleanly.

manishamde · 2014-08-15T20:25:02Z

@jkbradley Sounds good. 2x speedups are non-trivial. :-) Looking forward to large-scale test results. Also, thanks for verifying with scikit-learn.

SparkQA · 2014-08-15T20:42:51Z

QA tests have finished for PR 1950 at commit 5f2dec2.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-15T21:50:56Z

LGTM. I'm merging this into master and branch-1.1. Thanks @chouqin and @manishamde for reviewing the code!

…dered feature bug fix

DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: it treated the bins as if indexed by feature values, rather than by subsets of values/categories.
* Exhibited for unordered features (multi-class classification with categorical features of low arity).
* Fix: Index bins correctly for unordered categorical features.

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: mengxr manishamde chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:

5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serializable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals

(cherry picked from commit c703229)
Signed-off-by: Xiangrui Meng <meng@databricks.com>


jkbradley added 10 commits August 5, 2014 11:17

timing for DecisionTree internals

a95bc22

Merge remote-tracking branch 'upstream/master' into dt-timing

511ec85

Merge remote-tracking branch 'upstream/master' into dt-timing

bcf874a

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala

Merge remote-tracking branch 'upstream/master' into dt-timing

f61e9d2

Optimizing DecisionTree

3211f02

* Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)

Merge remote-tracking branch 'upstream/master' into dt-opt1

a87e08f

Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. R…

8464a6e

…emoved debugging println calls from DecisionTree. Made TreePoint extend Serializable

TreePoint

e66f1b1

* Updated doc * Made some methods private Changed timer to report time in seconds.

Print timing info to logDebug.

d036089

mengxr mentioned this pull request Aug 14, 2014

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

Closed

Added more debug info on binning error. Added some docs.

430d782

manishamde reviewed Aug 14, 2014

manishamde reviewed Aug 15, 2014

mengxr reviewed Aug 15, 2014

jkbradley added 2 commits August 15, 2014 09:46

Merge remote-tracking branch 'upstream/master' into dt-opt1

2d2aaaf

Fixed scalastyle issue in TreePoint

5f2dec2

asfgit closed this in c703229 Aug 15, 2014

jkbradley deleted the dt-opt1 branch August 26, 2014 17:41
