[SPARK-3160] [SPARK-3494] [mllib] DecisionTree: eliminate pre-allocated nodes, parentImpurities arrays. Memory calc bug fix. #2341
Conversation
* No longer pre-allocate parentImpurities array in main train() method.
  * parentImpurities values are now stored in individual nodes (in Node.stats.impurity).
* No longer using Node.build since tree structure is constructed on-the-fly.
  * Did not eliminate it since it is a public (Developer) API.
* Updated DecisionTreeSuite test "Second level node building with vs. without groups"
  * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
* Nodes are constructed and added to the tree structure as needed during training.
  * Moved tree construction from main train() method into findBestSplitsPerGroup(), since there is no need to keep the (split, gain) array for an entire level of nodes. Only one element of that array is needed at a time, so we do not need the array.
* findBestSplits() now returns 2 items:
  * rootNode (newly created root node on first iteration, same root node on later iterations)
  * doneTraining (indicating if all nodes at that level were leafs)
* Also:
  * Added Node.deepCopy (private[tree]), used for test suite
  * Updated test suite (same functionality)
I will wait until [https://github.com//pull/2332] is merged, and then will update this with the merge.
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
Failure is unrelated to this PR.
Conflicts:
  mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala
  mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
  mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
@jkbradley Thanks for your nice work. I have read your code and just have one question: can we allocate a root node before the loop? Also, I think that after choosing a best split and allocating the left and right child nodes, we can set the impurity of the left and right children, which avoids recomputing impurity later. This is just a suggestion; ignore it if you don't think it helps :).
@chouqin Thanks for looking at the PR! I wanted to allocate a root node beforehand, but the problem is that not all of the member data in Node is mutable. Let me know, though, if you see a way around it. Caching the impurity sounds good; I'll try to incorporate that.
Can we change the fields from val to var?
I hesitate to change a public API, but I agree it makes more sense. Since it's just a Developer API, I'll make that change.
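To illustrate the trade-off discussed above, here is a minimal, hypothetical sketch (the class and field names are illustrative, not the actual Node API): with val fields a node's statistics can only be set at construction time, so the node must be rebuilt to "change", while var fields allow a pre-allocated node to be filled in later during training.

```scala
// Hypothetical sketch of the val-vs-var question; not the real MLlib Node class.
class ImmutableNode(val id: Int, val impurity: Double)  // must be replaced to "update"
class MutableNode(val id: Int, var impurity: Double)    // can be updated in place

object NodeMutabilityDemo {
  def main(args: Array[String]): Unit = {
    // Pre-allocate the root with a placeholder impurity...
    val root = new MutableNode(id = 1, impurity = Double.NaN)
    // ...and fill it in once the best split for this node is known.
    root.impurity = 0.25
    println(root.impurity)
  }
}
```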
Actually, trying to treat all levels equally sounds like it might fit well with this JIRA [https://issues.apache.org/jira/browse/SPARK-3158], so I might delay until then. It might also make sense to cache the impurity in the nodes allocated for the next level. I will update that JIRA with these to-do items and postpone these updates. Currently, I would like to prioritize random forests [https://issues.apache.org/jira/browse/SPARK-1545] and follow up with these optimizations later on. Does that sound reasonable?
Sounds reasonable to me, go ahead with random forests first please.
…o fixed bug with over-allocating space in DTStatsAggregator for unordered features.
I just pushed 2 small (but important) bug fixes onto this PR.
QA tests have started for PR 2341 at commit

QA tests have finished for PR 2341 at commit
// Calculate level for single group construction

// Max memory usage for aggregates
- val maxMemoryUsage = strategy.maxMemoryInMB * 1024 * 1024
+ val maxMemoryUsage = strategy.maxMemoryInMB * 1024L * 1024L
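The one-character diff above matters because `maxMemoryInMB * 1024 * 1024` is evaluated entirely in Int arithmetic and silently overflows once the product reaches 2^31; writing the literals as `1024L` promotes the whole expression to Long. A standalone sketch (the value 4096 is just an illustrative setting, not a Spark default):

```scala
// Demonstrates the Int-overflow bug the one-character fix above avoids.
object MaxMemoryOverflowDemo {
  def main(args: Array[String]): Unit = {
    val maxMemoryInMB: Int = 4096 // illustrative value, large enough to overflow

    // Int arithmetic: 4096 * 1024 * 1024 = 2^32, which wraps around to 0.
    val intBytes: Int = maxMemoryInMB * 1024 * 1024

    // Long arithmetic: the 1024L literals promote the whole product to Long.
    val longBytes: Long = maxMemoryInMB * 1024L * 1024L

    println(s"Int math:  $intBytes bytes")  // 0 (wrapped around)
    println(s"Long math: $longBytes bytes") // 4294967296
  }
}
```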
It is also useful to set an upper bound here (e.g., 1GB) to avoid memory/GC problems on the driver.
I'll go for 10GB since 1GB memory is not that large.
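A minimal sketch of the upper bound being suggested, assuming a hypothetical helper rather than the actual patch (`cappedMemory` and its cap constant are illustrative names):

```scala
// Hypothetical sketch: cap the aggregate memory budget at 10 GB so a very
// large maxMemoryInMB setting cannot cause memory/GC problems on the driver.
object MemoryCapDemo {
  val MaxAllowedBytes: Long = 10L * 1024 * 1024 * 1024 // 10 GB upper bound

  def cappedMemory(maxMemoryInMB: Int): Long =
    math.min(maxMemoryInMB * 1024L * 1024L, MaxAllowedBytes)

  def main(args: Array[String]): Unit = {
    println(cappedMemory(256))   // 268435456: well under the cap, unchanged
    println(cappedMemory(20480)) // 20 GB requested, clamped to 10737418240
  }
}
```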
} else {
  level += 1
  if (doneTraining) {
    break = true
Shall we remove break and only use doneTraining?
There still needs to be a temp value, since I can't write:

    var topNode
    var doneTraining
    (topNode, doneTraining) = findBestSplits(...)

I believe the LHS of the above line needs to be newly declared vals. Is there a way around that?
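For context, a runnable sketch of this Scala limitation (`findBestSplits` here is a stand-in returning dummy values): a tuple pattern on the left-hand side only works in a val/var *definition*, not as re-assignment to existing vars, so the results must be bound to fresh names first and then copied over.

```scala
// Sketch of the Scala tuple-assignment limitation discussed above.
object TuplePatternDemo {
  // Stand-in for the real method; returns a (root node, done flag) pair.
  def findBestSplits(): (String, Boolean) = ("rootNode", true)

  def main(args: Array[String]): Unit = {
    var topNode: String = ""
    var doneTraining: Boolean = false

    // (topNode, doneTraining) = findBestSplits() // does not compile:
    // pattern assignment is only legal in a val/var definition.

    // Workaround: bind the tuple to newly declared vals, then copy.
    val (newTopNode, newDoneTraining) = findBestSplits()
    topNode = newTopNode
    doneTraining = newDoneTraining

    println(s"$topNode $doneTraining")
  }
}
```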
LGTM except for minor inline comments. I'm merging this in; could you make the changes with your next update? Thanks!
This PR includes some code simplifications and re-organization which will be helpful for implementing random forests. The main changes are that the nodes and parentImpurities arrays are no longer pre-allocated in the main train() method.
Also added 2 bug fixes:
* Fixed maxMemoryUsage computation to use Long arithmetic, avoiding Int overflow.
* Fixed over-allocation of space in DTStatsAggregator for unordered features.
Relation to RFs:
Details:
* No longer pre-allocate parentImpurities array in main train() method.
* No longer using Node.build since tree structure is constructed on-the-fly.
* Eliminated pre-allocated nodes array in main train() method.
* findBestSplits() now returns 2 items:
  * rootNode (newly created root node on first iteration, same root node on later iterations)
  * doneTraining (indicating if all nodes at that level were leafs)
* Updated DecisionTreeSuite. Notes:
  * generateOrderedLabeledPoints() modified so that it really does require 2 levels of internal nodes.
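As a rough, hypothetical illustration of the "constructed on-the-fly" design described above (simplified types, not the actual MLlib Node class): children are allocated only when their parent is actually split, and each node carries its own impurity, so no array sized for the full tree is ever needed.

```scala
// Simplified, illustrative sketch of on-the-fly tree construction.
case class SimpleNode(id: Int,
                      var impurity: Double = Double.NaN,
                      var left: Option[SimpleNode] = None,
                      var right: Option[SimpleNode] = None)

object OnTheFlyTreeDemo {
  // Stand-in for finding a node's best split; returns the child impurities.
  def bestSplit(node: SimpleNode): (Double, Double) = (0.1, 0.2)

  def main(args: Array[String]): Unit = {
    val root = SimpleNode(id = 1, impurity = 0.5)
    var frontier = List(root)
    var depth = 0
    while (depth < 2 && frontier.nonEmpty) {
      frontier = frontier.flatMap { node =>
        val (leftImp, rightImp) = bestSplit(node)
        // Children are created only when their parent is split, and the
        // impurity lives on the node itself (no parentImpurities array).
        val l = SimpleNode(id = 2 * node.id, impurity = leftImp)
        val r = SimpleNode(id = 2 * node.id + 1, impurity = rightImp)
        node.left = Some(l)
        node.right = Some(r)
        List(l, r)
      }
      depth += 1
    }
    println(frontier.size) // 4 frontier nodes after two levels of splitting
  }
}
```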
CC: @mengxr