-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix #1950
Conversation
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
* Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
Optimization: Added TreePoint representation so we only call findBin once for each example, feature. Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones. BUG FIX: isSampleValid * isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories. * exhibited for unordered features (multi-class classification with categorical features of low arity) * Fix: Index bins correctly for unordered categorical features. Also: some commented-out debugging println calls in DecisionTree, to be removed later
…emoved debugging println calls from DecisionTree. Made TreePoint extend Serialiable
Jenkins test this please |
Jenkins, test this please. |
QA tests have started for PR 1950. This patch merges cleanly. |
QA tests have started for PR 1950. This patch merges cleanly. |
QA results for PR 1950: |
val shift = 1 + numFeatures * nodeIndex | ||
if (!sampleValid) { | ||
// Mark one bin as -1 is sufficient. | ||
arr(shift) = InvalidBinIndex | ||
} else { | ||
var featureIndex = 0 | ||
// TODO: Vectorize this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO can be removed.
QA results for PR 1950: |
val numFeatureCategories = strategy.categoricalFeaturesInfo(featureIndex) | ||
val isSpaceSufficientForAllCategoricalSplits = | ||
numBins > math.pow(2, numFeatureCategories.toInt - 1) - 1 | ||
val isUnorderedFeature = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can bundle this calculation as well in the your future commit. Let's create a map for feature id to features types which can be re-used by all internal methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I make a private metadata class for storing that and pass it around internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
👍 LGTM! I have some minor comments. I look forward to the performance improvements based upon these changes. The multiclass bug fix is a great catch. Finally, I see the timer functionality being added soon to other parts of Spark. :-) |
logDebug("numBins = " + numBins) | ||
|
||
timer.start("init") | ||
val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.cache()
-> .persist(StorageLevel.MEMORY_AND_DISK)
? There is a computation/storage trade-off here, maybe worth testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try testing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that can help a lot. I've only tested on my laptop, but there, spark-perf tests using 500K examples and 500 features with tree depth 5 took 292 sec using cache() and 112 sec using mem + disk. I'll make this update.
On a smaller example (100K examples, 500 features, depth 5), the difference was 80 sec vs. 21 sec.
val numFeatures = labeledPoint.features.size | ||
val numBins = bins(0).size | ||
val arr = new Array[Int](numFeatures) | ||
var featureIndex = 0 // offset by 1 for label |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does offset by 1
happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was an old comment; I removed it, thanks!
@chouqin @manishamde @mengxr Thank you for the comments! I'll make those fixes and get the other PRs done ASAP. |
…disk, not just memory. Details: DecisionTree * Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK) ** This gave major performance improvements on small tests. E.g., 500K examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112 when using disk as well. * Change for to while loops * Small cleanups TimeTracker * Removed useless timing in DecisionTree TreePoint * Renamed features to binnedFeatures
@chouqin @manishamde @mengxr Hopefully is good now. Thanks again! By the way, I tested this on spark-perf on small tests and got small speedups (2x), but that was before the persist() change. Am running larger tests now. Also, I compared learning with scikit-learn on a few real datasets to make sure the learning algs behave similarly; they get similar accuracy and tree structures. |
QA tests have started for PR 1950 at commit
|
QA tests have finished for PR 1950 at commit
|
QA tests have started for PR 1950 at commit
|
@jkbradley Sounds good. 2x speedups are non-trivial. :-) Looking forward to large-scale test results. Also, thanks for verifying with scikit-learn. |
QA tests have finished for PR 1950 at commit
|
LGTM. I'm merging this into master and branch-1.1. Thanks @chouqin and @manishamde for reviewing the code! |
…dered feature bug fix DecisionTree improvements: (1) TreePoint representation to avoid binning multiple times (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features (3) Timing for DecisionTree internals Details: (1) TreePoint representation to avoid binning multiple times [https://issues.apache.org/jira/browse/SPARK-3022] Added private[tree] TreePoint class for representing binned feature values. The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times. (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features [https://issues.apache.org/jira/browse/SPARK-3041] isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories. * exhibited for unordered features (multi-class classification with categorical features of low arity) * Fix: Index bins correctly for unordered categorical features. (3) Timing for DecisionTree internals Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code. Prints timing info via logDebug. CC: mengxr manishamde chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043 Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes #1950 from jkbradley/dt-opt1 and squashes the following commits: 5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint 6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory. 2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs. d036089 [Joseph K. Bradley] Print timing info to logDebug. e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private 8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serialiable a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree 3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging) f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing 511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing a95bc22 [Joseph K. Bradley] timing for DecisionTree internals (cherry picked from commit c703229) Signed-off-by: Xiangrui Meng <meng@databricks.com>
…dered feature bug fix DecisionTree improvements: (1) TreePoint representation to avoid binning multiple times (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features (3) Timing for DecisionTree internals Details: (1) TreePoint representation to avoid binning multiple times [https://issues.apache.org/jira/browse/SPARK-3022] Added private[tree] TreePoint class for representing binned feature values. The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times. (2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features [https://issues.apache.org/jira/browse/SPARK-3041] isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories. * exhibited for unordered features (multi-class classification with categorical features of low arity) * Fix: Index bins correctly for unordered categorical features. (3) Timing for DecisionTree internals Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code. Prints timing info via logDebug. CC: mengxr manishamde chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043 Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com> Closes apache#1950 from jkbradley/dt-opt1 and squashes the following commits: 5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint 6b5651e [Joseph K. Bradley] Updates based on code review. 1 major change: persisting to memory + disk, not just memory. 2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 430d782 [Joseph K. Bradley] Added more debug info on binning error. Added some docs. d036089 [Joseph K. Bradley] Print timing info to logDebug. e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private 8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up. Removed debugging println calls from DecisionTree. Made TreePoint extend Serialiable a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1 0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree 3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging) f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing 511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals
Details:
(1) TreePoint representation to avoid binning multiple times
[https://issues.apache.org/jira/browse/SPARK-3022]
Added private[tree] TreePoint class for representing binned feature values.
The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
[https://issues.apache.org/jira/browse/SPARK-3041]
isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.
(3) Timing for DecisionTree internals
Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.
CC: @mengxr @manishamde @chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043