Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix #1950

Closed
wants to merge 14 commits into from

Conversation

jkbradley
Copy link
Member

DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached. This avoids the previous problem of re-computing bins multiple times.

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.

  • exhibited for unordered features (multi-class classification with categorical features of low arity)
  • Fix: Index bins correctly for unordered categorical features.

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: @mengxr @manishamde @chouqin Very similar update, with one bug fix. Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
* Added TreePoint representation to avoid calling findBin multiple times.
* (not working yet, but debugging)
Optimization: Added TreePoint representation so we only call findBin once for each example, feature.

Also, calculateGainsForAllNodeSplits now only searches over actual splits, not empty/unused ones.

BUG FIX: isSampleValid
* isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.
* exhibited for unordered features (multi-class classification with categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.

Also: some commented-out debugging println calls in DecisionTree, to be removed later
…emoved debugging println calls from DecisionTree. Made TreePoint extend Serialiable
* Updated doc
* Made some methods private

Changed timer to report time in seconds.
@jkbradley
Copy link
Member Author

Jenkins test this please

@mengxr
Copy link
Contributor

mengxr commented Aug 14, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1950. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18573/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA tests have started for PR 1950. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18577/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 14, 2014

QA results for PR 1950:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class TimeTracker extends Serializable {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18573/consoleFull

val shift = 1 + numFeatures * nodeIndex
if (!sampleValid) {
// Mark one bin as -1 is sufficient.
arr(shift) = InvalidBinIndex
} else {
var featureIndex = 0
// TODO: Vectorize this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO can be removed.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA results for PR 1950:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class TimeTracker extends Serializable {

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18577/consoleFull

val numFeatureCategories = strategy.categoricalFeaturesInfo(featureIndex)
val isSpaceSufficientForAllCategoricalSplits =
numBins > math.pow(2, numFeatureCategories.toInt - 1) - 1
val isUnorderedFeature =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can bundle this calculation as well in the your future commit. Let's create a map for feature id to features types which can be re-used by all internal methods.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I make a private metadata class for storing that and pass it around internally.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@manishamde
Copy link
Contributor

👍

LGTM! I have some minor comments. I look forward to the performance improvements based upon these changes. The multiclass bug fix is a great catch. Finally, I see the timer functionality being added soon to other parts of Spark. :-)

logDebug("numBins = " + numBins)

timer.start("init")
val treeInput = TreePoint.convertToTreeRDD(retaggedInput, strategy, bins).cache()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.cache() -> .persist(StorageLevel.MEMORY_AND_DISK)? There is a computation/storage trade-off here, maybe worth testing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try testing this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that can help a lot. I've only tested on my laptop, but there, spark-perf tests using 500K examples and 500 features with tree depth 5 took 292 sec using cache() and 112 sec using mem + disk. I'll make this update.
On a smaller example (100K examples, 500 features, depth 5), the difference was 80 sec vs. 21 sec.

val numFeatures = labeledPoint.features.size
val numBins = bins(0).size
val arr = new Array[Int](numFeatures)
var featureIndex = 0 // offset by 1 for label
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does offset by 1 happen?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was an old comment; I removed it, thanks!

@jkbradley
Copy link
Member Author

@chouqin @manishamde @mengxr Thank you for the comments! I'll make those fixes and get the other PRs done ASAP.

…disk, not just memory.

Details:

DecisionTree
* Changed: .cache() -> .persist(StorageLevel.MEMORY_AND_DISK)
** This gave major performance improvements on small tests.  E.g., 500K examples, 500 features, depth 5, on MacBook, took 292 sec with cache() and 112 when using disk as well.
* Change for to while loops
* Small cleanups

TimeTracker
* Removed useless timing in DecisionTree

TreePoint
* Renamed features to binnedFeatures
@jkbradley
Copy link
Member Author

@chouqin @manishamde @mengxr Hopefully is good now. Thanks again! By the way, I tested this on spark-perf on small tests and got small speedups (2x), but that was before the persist() change. Am running larger tests now. Also, I compared learning with scikit-learn on a few real datasets to make sure the learning algs behave similarly; they get similar accuracy and tree structures.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have started for PR 1950 at commit 6b5651e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have finished for PR 1950 at commit 6b5651e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • exec "$FWDIR"/bin/spark-submit --class $CLASS "$
    • trait TaskCompletionListener extends EventListener
    • class AvroWrapperToJavaConverter extends Converter[Any, Any]
    • exec "$FWDIR"/bin/spark-submit --class $CLASS "$

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have started for PR 1950 at commit 5f2dec2.

  • This patch merges cleanly.

@manishamde
Copy link
Contributor

@jkbradley Sounds good. 2x speedups are non-trivial. :-) Looking forward to large-scale test results. Also, thanks for verifying with scikit-learn.

@SparkQA
Copy link

SparkQA commented Aug 15, 2014

QA tests have finished for PR 1950 at commit 5f2dec2.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Aug 15, 2014

LGTM. I'm merging this into master and branch-1.1. Thanks @chouqin and @manishamde for reviewing the code!

@asfgit asfgit closed this in c703229 Aug 15, 2014
asfgit pushed a commit that referenced this pull request Aug 15, 2014
…dered feature bug fix

DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.
* exhibited for unordered features (multi-class classification with categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #1950 from jkbradley/dt-opt1 and squashes the following commits:

5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals

(cherry picked from commit c703229)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@jkbradley jkbradley deleted the dt-opt1 branch August 26, 2014 17:41
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…dered feature bug fix

DecisionTree improvements:
(1) TreePoint representation to avoid binning multiple times
(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features
(3) Timing for DecisionTree internals

Details:

(1) TreePoint representation to avoid binning multiple times

[https://issues.apache.org/jira/browse/SPARK-3022]

Added private[tree] TreePoint class for representing binned feature values.

The input RDD of LabeledPoint is converted to the TreePoint representation initially and then cached.  This avoids the previous problem of re-computing bins multiple times.

(2) Bug fix: isSampleValid indexed bins incorrectly for unordered categorical features

[https://issues.apache.org/jira/browse/SPARK-3041]

isSampleValid used to treat unordered categorical features incorrectly: It treated the bins as if indexed by featured values, rather than by subsets of values/categories.
* exhibited for unordered features (multi-class classification with categorical features of low arity)
* Fix: Index bins correctly for unordered categorical features.

(3) Timing for DecisionTree internals

Added tree/impl/TimeTracker.scala class which is private[tree] for now, for timing key parts of DT code.
Prints timing info via logDebug.

CC: mengxr manishamde chouqin  Very similar update, with one bug fix.  Many apologies for the conflicting update, but I hope that a few more optimizations I have on the way (which depend on this update) will prove valuable to you: SPARK-3042 and SPARK-3043

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes apache#1950 from jkbradley/dt-opt1 and squashes the following commits:

5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
d036089 [Joseph K. Bradley] Print timing info to logDebug.
e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serialiable
a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants