[SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix. #2341
Conversation
* No longer pre-allocate parentImpurities array in main train() method.
  * parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
* No longer using Node.build since tree structure is constructed on-the-fly.
  * Did not eliminate it since it is a public (Developer) API.
* Updated DecisionTreeSuite test "Second level node building with vs. without groups"
  * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
* Nodes are constructed and added to the tree structure as needed during training.
  * Moved tree construction from main train() method into findBestSplitsPerGroup(), since there is no need to keep the (split, gain) array for an entire level of nodes. Only one element of that array is needed at a time, so we do not need the array.
* findBestSplits() now returns 2 items:
  * rootNode (newly created root node on first iteration, same root node on later iterations)
  * doneTraining (indicating if all nodes at that level were leafs)
* Also:
  * Added Node.deepCopy (private[tree]), used for test suite
  * Updated test suite (same functionality)
I will wait until [https://github.com//pull/2332] is merged, and then will update this with the merge.
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
Failure is unrelated to this PR.
Conflicts:
  mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
  mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
  mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
@jkbradley Thanks for your nice work. I have read your code and just have one question: can we allocate a root node before the loop? Also, I think that after choosing a best split and allocating the left and right child nodes, we can set the impurity of the left and right children, which avoids recomputing impurity later. This is just a suggestion; ignore it if you don't think it helps :).
@chouqin Thanks for looking at the PR! I wanted to allocate a root node beforehand, but the problem is that not all of the member data in Node is mutable. Let me know, though, if you see a way around it. Caching the impurity sounds good; I'll try to incorporate that.
Can we change the fields from val to var?
I hesitate to change a public API, but I agree it makes more sense. Since it's just a Developer API, I'll make that change.
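To illustrate the trade-off discussed above, here is a minimal, hypothetical sketch (the class and field names are illustrative, not the actual Node API): with val fields a node's statistics can only be set at construction time, so the node must be rebuilt to "change", while var fields allow a pre-allocated node to be filled in later during training.

```scala
// Hypothetical sketch of the val-vs-var question; not the real MLlib Node class.
class ImmutableNode(val id: Int, val impurity: Double)  // must be replaced to "update"
class MutableNode(val id: Int, var impurity: Double)    // can be updated in place

object NodeMutabilityDemo {
  def main(args: Array[String]): Unit = {
    // Pre-allocate the root with a placeholder impurity...
    val root = new MutableNode(id = 1, impurity = Double.NaN)
    // ...and fill it in once the best split for this node is known.
    root.impurity = 0.25
    println(root.impurity)
  }
}
```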
Actually, trying to treat all levels equally sounds like it might fit well with this JIRA [https://issues.apache.org/jira/browse/SPARK-3158], so I might delay until then. It might also make sense to cache the impurity in the nodes allocated for the next level. I will update that JIRA with these to-do items and postpone these updates. Currently, I would like to prioritize random forests [https://issues.apache.org/jira/browse/SPARK-1545] and follow up with these optimizations later on. Does that sound reasonable?
Sounds reasonable to me, go ahead with random forests first please.
…o fixed bug with over-allocating space in DTStatsAggregator for unordered features.
I just pushed 2 small (but important) bug fixes onto this PR.
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
// Calculate level for single group construction

// Max memory usage for aggregates
- val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
+ val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
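The one-character diff above matters because `maxMemoryInMB * 1024 * 1024` is evaluated entirely in Int arithmetic and silently overflows once the product reaches 2^31; writing the literals as `1024L` promotes the whole expression to Long. A standalone sketch (the value 4096 is just an illustrative setting, not a Spark default):

```scala
// Demonstrates the Int-overflow bug the one-character fix above avoids.
object MaxMemoryOverflowDemo {
  def main(args: Array[String]): Unit = {
    val maxMemoryInMB: Int = 4096 // illustrative value, large enough to overflow

    // Int arithmetic: 4096 * 1024 * 1024 = 2^32, which wraps around to 0.
    val intBytes: Int = maxMemoryInMB * 1024 * 1024

    // Long arithmetic: the 1024L literals promote the whole product to Long.
    val longBytes: Long = maxMemoryInMB * 1024L * 1024L

    println(s"Int math:  $intBytes bytes")  // 0 (wrapped around)
    println(s"Long math: $longBytes bytes") // 4294967296
  }
}
```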
It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC problems on the driver.
I'll go for 10GB since 1GB memory is not that large.
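A minimal sketch of the upper bound being suggested, assuming a hypothetical helper rather than the actual patch (`cappedMemory` and its cap constant are illustrative names):

```scala
// Hypothetical sketch: cap the aggregate memory budget at 10 GB so a very
// large maxMemoryInMB setting cannot cause memory/GC problems on the driver.
object MemoryCapDemo {
  val MaxAllowedBytes: Long = 10L * 1024 * 1024 * 1024 // 10 GB upper bound

  def cappedMemory(maxMemoryInMB: Int): Long =
    math.min(maxMemoryInMB * 1024L * 1024L, MaxAllowedBytes)

  def main(args: Array[String]): Unit = {
    println(cappedMemory(256))   // 268435456: well under the cap, unchanged
    println(cappedMemory(20480)) // 20 GB requested, clamped to 10737418240
  }
}
```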
} else {
  level += 1
  if (doneTraining) {
    break = true
Shall we remove break and only use doneTraining?
There still needs to be a temp value, since I can't write:

    var topNode
    var doneTraining
    (topNode, doneTraining) = findBestSplits(...)

I believe the LHS of the above line needs to be newly declared vals. Is there a way around that?
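For context, a runnable sketch of this Scala limitation (`findBestSplits` here is a stand-in returning dummy values): a tuple pattern on the left-hand side only works in a val/var *definition*, not as re-assignment to existing vars, so the results must be bound to fresh names first and then copied over.

```scala
// Sketch of the Scala tuple-assignment limitation discussed above.
object TuplePatternDemo {
  // Stand-in for the real method; returns a (root node, done flag) pair.
  def findBestSplits(): (String, Boolean) = ("rootNode", true)

  def main(args: Array[String]): Unit = {
    var topNode: String = ""
    var doneTraining: Boolean = false

    // (topNode, doneTraining) = findBestSplits() // does not compile:
    // pattern assignment is only legal in a val/var definition.

    // Workaround: bind the tuple to newly declared vals, then copy.
    val (newTopNode, newDoneTraining) = findBestSplits()
    topNode = newTopNode
    doneTraining = newDoneTraining

    println(s"$topNode $doneTraining")
  }
}
```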
LGTM except for minor inline comments. I'm merging this in; could you make the changes with your next update? Thanks!
This PR includes some code simplifications and re-organization which will be helpful for implementing random forests. The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.
Also added 2 bug fixes:
* Fixed maxMemoryUsage computation to use Long arithmetic, avoiding Int overflow.
* Fixed over-allocation of space in DTStatsAggregator for unordered features.
Relation to RFs:
Details:
* No longer pre-allocate parentImpurities array in main train() method.
* No longer using Node.build since tree structure is constructed on-the-fly.
* Eliminated pre-allocated nodes array in main train() method.
* findBestSplits() now returns 2 items:
  * rootNode (newly created root node on first iteration, same root node on later iterations)
  * doneTraining (indicating if all nodes at that level were leafs)
* Updated DecisionTreeSuite. Notes:
  * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
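As a rough, hypothetical illustration of the "constructed on-the-fly" design described above (simplified types, not the actual MLlib Node class): children are allocated only when their parent is actually split, and each node carries its own impurity, so no array sized for the full tree is ever needed.

```scala
// Simplified, illustrative sketch of on-the-fly tree construction.
case class SimpleNode(id: Int,
                      var impurity: Double = Double.NaN,
                      var left: Option[SimpleNode] = None,
                      var right: Option[SimpleNode] = None)

object OnTheFlyTreeDemo {
  // Stand-in for finding a node's best split; returns the child impurities.
  def bestSplit(node: SimpleNode): (Double, Double) = (0.1, 0.2)

  def main(args: Array[String]): Unit = {
    val root = SimpleNode(id = 1, impurity = 0.5)
    var frontier = List(root)
    var depth = 0
    while (depth < 2 && frontier.nonEmpty) {
      frontier = frontier.flatMap { node =>
        val (leftImp, rightImp) = bestSplit(node)
        // Children are created only when their parent is split, and the
        // impurity lives on the node itself (no parentImpurities array).
        val l = SimpleNode(id = 2 * node.id, impurity = leftImp)
        val r = SimpleNode(id = 2 * node.id + 1, impurity = rightImp)
        node.left = Some(l)
        node.right = Some(r)
        List(l, r)
      }
      depth += 1
    }
    println(frontier.size) // 4 frontier nodes after two levels of splitting
  }
}
```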
CC: @mengxr