
[SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix. #2341

Closed
wants to merge 10 commits

Conversation

jkbradley
Member

This PR includes some code simplifications and re-organization which will be helpful for implementing random forests. The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.

Also added 2 bug fixes:

  • maxMemoryUsage calculation
  • over-allocation of space for bins in DTStatsAggregator for unordered features.

Relation to RFs:

  • Since RF trees will be deeper and therefore more likely to be sparse (not full trees), avoiding pre-allocation of a full tree could be a cost savings.
  • The associated re-organization also reduces bookkeeping, which will make RFs easier to implement.
  • The return code doneTraining may be generalized to include cases such as nodes ready for local training.

Details:

No longer pre-allocate parentImpurities array in main train() method.

  • parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
  • These were not really needed: they were used in calculateGainForSplit(), but they can be calculated anyway using parentNodeAgg.
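For context, a hedged sketch (not the PR's code; names are illustrative) of how a parent's impurity can be derived on the fly from its aggregated class counts, rather than read from a pre-allocated parentImpurities array. Gini impurity is shown:

```scala
// Gini impurity from per-class counts: 1 - sum_i p_i^2.
// `classCounts` stands in for the sufficient statistics that
// parentNodeAgg would provide; this name is hypothetical.
def giniImpurity(classCounts: Array[Double]): Double = {
  val total = classCounts.sum
  if (total == 0.0) 0.0
  else 1.0 - classCounts.map { c =>
    val p = c / total
    p * p
  }.sum
}

println(giniImpurity(Array(5.0, 5.0)))  // maximally mixed 2-class node: 0.5
println(giniImpurity(Array(10.0, 0.0))) // pure node: 0.0
```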

No longer using Node.build since tree structure is constructed on-the-fly.

  • Did not eliminate since it is public (Developer) API. Marked as deprecated.

Eliminated pre-allocated nodes array in main train() method.

  • Nodes are constructed and added to the tree structure as needed during training.
  • Moved tree construction from the main train() method into findBestSplitsPerGroup(), since there is no need to keep the (split, gain) array for an entire level of nodes. Only one element of that array is needed at a time, so we do not need the array.

findBestSplits() now returns 2 items:

  • rootNode (newly created root node on first iteration, same root node on later iterations)
  • doneTraining (indicating whether all nodes at that level were leaves)
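A minimal sketch of the loop shape this two-value return enables (all names are hypothetical stand-ins, not the actual MLlib signatures): the root is created on the first iteration, threaded through later iterations, and the loop stops when doneTraining is true.

```scala
// Hypothetical stand-in for the real MLlib Node.
final class Node(val id: Int)

// Returns the (possibly newly created) root plus a done flag;
// here "training" is faked to finish once level reaches 2.
def findBestSplits(maybeRoot: Option[Node], level: Int): (Node, Boolean) = {
  val root = maybeRoot.getOrElse(new Node(1))
  (root, level >= 2)
}

var rootNode: Option[Node] = None
var doneTraining = false
var level = 0
while (!doneTraining) {
  val (root, done) = findBestSplits(rootNode, level)
  rootNode = Some(root) // same root instance on later iterations
  doneTraining = done
  level += 1
}
```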

Updated DecisionTreeSuite. Notes:

  • Improved test "Second level node building with vs. without groups"
    ** generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
  • Related update: Added Node.deepCopy (private[tree]), used for test suite

CC: @mengxr

@jkbradley
Member Author

I will wait until PR #2332 [https://github.com//pull/2332] is merged, and then will update this PR with the merge.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have started for PR 2341 at commit 306120f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 10, 2014

QA tests have finished for PR 2341 at commit 306120f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member Author

Failure is unrelated to this PR.

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
	mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
	mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2341 at commit 306120f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have started for PR 2341 at commit 5c4ac33.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2341 at commit 306120f.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2341 at commit 5c4ac33.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@chouqin
Contributor

chouqin commented Sep 11, 2014

@jkbradley Thanks for your nice work! I have read your code and have just one question:

Could we allocate a root node before the loop in train(), and allocate the left and right child nodes for the next level after choosing a split? Then DecisionTree.findBestSplits could just return doneTraining. The split-choosing step, which iterates over all the nodes, would then only set fields of the current node and never allocate a new node. This seems easier to understand to me, because it handles all levels in the same way.

What's more, I think that after choosing the best split and allocating the left and right child nodes, we could set the impurity of each child, which would avoid recomputing the impurity in calculateGainForSplit. (This saving may be negligible when (number of splits * number of features * number of classes) is small, though.)

This is just a suggestion; ignore it if you don't think it helps :).
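A hedged sketch of the caching idea described above (field and method names are hypothetical): record each child's impurity once when the split is chosen, so the next level's gain computation can read it instead of recomputing from the aggregates.

```scala
// Hypothetical node with a cached impurity slot (NaN = not yet computed).
final class TreeNode {
  var impurity: Double = Double.NaN
}

// When the best split is chosen, store the child impurities once...
def recordChildImpurities(left: TreeNode, right: TreeNode,
                          leftImpurity: Double, rightImpurity: Double): Unit = {
  left.impurity = leftImpurity
  right.impurity = rightImpurity
}

// ...and later, prefer the cached value over recomputing.
// `recompute` is by-name, so it is only evaluated on a cache miss.
def impurityOrElse(node: TreeNode, recompute: => Double): Double =
  if (node.impurity.isNaN) recompute else node.impurity

val left = new TreeNode
val right = new TreeNode
recordChildImpurities(left, right, 0.3, 0.4)
```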

@jkbradley
Member Author

@chouqin Thanks for looking at the PR! I wanted to allocate a root node beforehand, but the problem is that the member data in Node is not all mutable. Let me know, though, if you see a way around it.

Caching the impurity sounds good; I'll try to incorporate that.

@chouqin
Contributor

chouqin commented Sep 11, 2014

Can we change the fields from val to var? leftNode and rightNode are already vars; I wonder if we can change the other fields too.

@jkbradley
Member Author

I hesitate to change a public API, but I agree it makes more sense. I'll make that change since it's just a Developer API.

@jkbradley
Member Author

Actually, trying to treat all levels equally sounds like it might fit well with this JIRA [https://issues.apache.org/jira/browse/SPARK-3158], so I might delay until then. It also might make sense to cache the impurity in the nodes allocated for the next level. I will update that JIRA with these to-do items and postpone these updates. Currently, I would like to prioritize random forests [https://issues.apache.org/jira/browse/SPARK-1545], and later on follow up with these optimizations. Does that sound reasonable?

@chouqin
Contributor

chouqin commented Sep 12, 2014

Sounds reasonable to me; please go ahead with random forests first.

@jkbradley jkbradley changed the title [SPARK-3160] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays Sep 12, 2014
@jkbradley jkbradley changed the title [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays [SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix. Sep 12, 2014
@jkbradley
Member Author

I just pushed 2 small (but important) bug fixes onto this PR.

@SparkQA

SparkQA commented Sep 12, 2014

QA tests have started for PR 2341 at commit 07dd1ee.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 12, 2014

QA tests have finished for PR 2341 at commit 07dd1ee.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class CreateTableAsSelect(
    • case class CreateTableAsSelect(


// Calculate level for single group construction

// Max memory usage for aggregates
- val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
+ val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
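To illustrate why this fix matters (the value below is hypothetical, not from the PR): with plain Int arithmetic the product maxMemoryInMB * 1024 * 1024 overflows once maxMemoryInMB reaches 2048, whereas the 1024L literals promote the whole product to Long.

```scala
// A request of 4096 MB = 2^32 bytes, which does not fit in an Int.
val maxMemoryInMB = 4096

// Int arithmetic wraps around: 4096 * 1024 * 1024 = 2^32 overflows to 0.
val overflowed: Int = maxMemoryInMB * 1024 * 1024

// The fix: a Long literal widens the arithmetic to Long before multiplying.
val maxMemoryUsage: Long = maxMemoryInMB * 1024L * 1024L

println(overflowed)     // 0
println(maxMemoryUsage) // 4294967296
```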
Contributor

It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC problems on the driver.

Member Author

I'll go for 10GB since 1GB memory is not that large.

} else {
  level += 1
  if (doneTraining) {
    break = true
Contributor

Shall we remove break and only use doneTraining?

Member Author

There still needs to be a temp value, since I can't write:

var topNode
var doneTraining
(topNode, doneTraining) = findBestSplits(...)

I believe the LHS of the above line needs to be newly declared vals. Is there a way around that?
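For readers following the Scala detail here, a minimal sketch of the limitation and the temp-value workaround (findBestSplitsStub is a hypothetical stand-in with the same pair-shaped return):

```scala
// A stub returning the same (result, done) pair shape as findBestSplits.
def findBestSplitsStub(): (String, Boolean) = ("root", true)

var topNode: String = null
var doneTraining = false

// (topNode, doneTraining) = findBestSplitsStub()  // does not compile:
// tuple destructuring is only allowed in a val/var *definition*,
// not in an assignment to already-declared vars.

val (node, done) = findBestSplitsStub() // fresh vals: allowed
topNode = node
doneTraining = done
```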

@mengxr
Contributor

mengxr commented Sep 12, 2014

LGTM except for minor inline comments. I'm merging this in; could you make the changes with your next update? Thanks!

@asfgit asfgit closed this in b8634df Sep 12, 2014
@jkbradley jkbradley deleted the dt-spark-3160 branch October 8, 2014 21:21