Adding colless_index method and tests #2266
Codecov Report
@@            Coverage Diff            @@
##             main    #2266   +/-   ##
=======================================
  Coverage   93.29%   93.29%
=======================================
  Files          27       27
  Lines       26073    26089   +16
  Branches     1167     1172    +5
=======================================
+ Hits        24325    24341   +16
  Misses       1718     1718
  Partials       30       30
I found a good resource on balance metrics here (it's really thorough, and there are lots of metrics). According to them, this metric is only defined for binary trees with a single root, and I think the simplest thing to do is raise a ValueError if we encounter a tree that isn't one of these. How about this?

import numpy as np

def colless_index(self):
    if self.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
    total = 0
    # Postorder guarantees children are processed before their parent.
    for u in self.nodes(order="postorder"):
        num_children = 0
        for v in self.children(u):
            num_leaves[u] += num_leaves[v]
            num_children += 1
        if num_children == 0:
            num_leaves[u] = 1
        elif num_children != 2:
            raise ValueError("Colless index not defined for nonbinary trees")
        else:
            total += abs(
                num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
            )
    return total

and for a definition we have

def colless_index_definition(tree):
    is_binary = all(
        tree.num_children(u) == 2 for u in tree.nodes() if tree.is_internal(u)
    )
    if tree.num_roots != 1 or not is_binary:
        raise ValueError("Colless index only defined for binary trees with one root")
    # We have to use len(list(tree.leaves(u))) here because tree.num_leaves() actually
    # returns the number of *samples*.
    return sum(
        abs(
            len(list(tree.leaves(tree.left_child(u))))
            - len(list(tree.leaves(tree.right_child(u))))
        )
        for u in tree.nodes()
        if tree.is_internal(u)
    )

It gets a bit ugly here for a few reasons:
I do think it's better to just raise an error here if there's any doubt about how the metric is defined when we have polytomies and multiple roots.
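As a small worked example of the definition discussed above, here is a dependency-free sketch that represents a tree as a plain dict mapping each internal node to its list of children. The dict layout and the name `colless_index_dict` are assumptions made for illustration; they are not part of tskit, and the single-root check is sidestepped by passing the root explicitly.

```python
def colless_index_dict(children, root):
    """Colless index of a tree given as {node: [child, ...]} (hypothetical layout)."""
    num_leaves = {}
    total = 0
    # Iterative postorder: an internal node is finalised only after its children.
    stack = [(root, False)]
    while stack:
        u, expanded = stack.pop()
        kids = children.get(u, [])
        if not kids:
            num_leaves[u] = 1
        elif len(kids) != 2:
            raise ValueError("Colless index not defined for nonbinary trees")
        elif not expanded:
            # First visit: come back to u once both children are done.
            stack.append((u, True))
            stack.extend((v, False) for v in kids)
        else:
            left, right = kids
            num_leaves[u] = num_leaves[left] + num_leaves[right]
            total += abs(num_leaves[left] - num_leaves[right])
    return total


# Balanced tree on four leaves: every internal node splits its leaves evenly.
balanced = {6: [4, 5], 4: [0, 1], 5: [2, 3]}
# Caterpillar (maximally unbalanced) tree on four leaves.
caterpillar = {6: [5, 3], 5: [4, 2], 4: [0, 1]}

print(colless_index_dict(balanced, 6))     # 0
print(colless_index_dict(caterpillar, 6))  # 3
```

For the caterpillar tree the three internal nodes contribute 0, 1 and 2 respectively, which gives the total of 3.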
Thanks @jeromekelleher, very interesting resource! I agree that we should raise an error when the tree is not binary, following this resource, unlike Shao and Sokal (1990), who defined the Colless index in that case using only the binary nodes. In the extreme case where many nodes are polytomies, the Shao and Sokal definition is not representative of the imbalance at all, because only a few of the nodes are used in the computation. I have two questions for the moment on your code:
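The concern about polytomies can be made concrete with a short sketch. The helper below (hypothetical name `colless_binary_nodes_only`, operating on a plain `{node: [children]}` dict rather than a tskit tree) sums the imbalance over binary nodes only and skips everything else, following the description of the variant given in this comment. It illustrates that description and is not a checked reimplementation of the published definition.

```python
def count_leaves(children, u):
    """Number of leaves below u in a {node: [child, ...]} tree."""
    kids = children.get(u, [])
    if not kids:
        return 1
    return sum(count_leaves(children, v) for v in kids)


def colless_binary_nodes_only(children):
    """Sum |left - right| over binary nodes, silently skipping polytomies."""
    total = 0
    for kids in children.values():
        if len(kids) == 2:
            left, right = kids
            total += abs(count_leaves(children, left) - count_leaves(children, right))
    return total


# Star tree: one root with five leaf children, so no binary node at all.
star = {5: [0, 1, 2, 3, 4]}
# Partly resolved tree: the polytomy at the root (node 9) is ignored,
# only nodes 8 and 7 contribute.
partly_resolved = {9: [8, 3, 4, 5], 8: [7, 2], 7: [0, 1]}

print(colless_binary_nodes_only(star))             # 0
print(colless_binary_nodes_only(partly_resolved))  # 1
```

In both cases the value is driven by whichever nodes happen to be binary, which is the argument for raising a ValueError instead.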
Yes, I guess that'll work. We still have to do a full traversal to get those numbers though, so it's not any more efficient.
Basically, yes. Think of it as efficiency of human understanding vs efficiency of computer implementation. Humans and computers think differently, so it's very helpful to write things down both ways. In particular, deciding what the right behaviour in corner cases should be is really helped by having the simplest mathematical definition that you can (as, ideally, the corner case behaviour falls naturally out of that). This happened in the path_length computation in #2249 where I was wondering what the semantics should be for the virtual root - by having a nice mathematical definition for path_length, this was really easy.
Let me know if/when you want a review here, otherwise I assume you both have it covered.
@jeromekelleher I compared those two functions in terms of execution time:

import time

import numpy as np
import tskit

# First method
def colless_index(self):
    if self.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
    total = 0
    for u in self.nodes(order="postorder"):
        num_children = 0
        for v in self.children(u):
            num_leaves[u] += num_leaves[v]
            num_children += 1
        if num_children == 0:
            num_leaves[u] = 1
        elif num_children != 2:
            raise ValueError("Colless index not defined for nonbinary trees")
        else:
            total += abs(
                num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
            )
    return total

# Second method
def colless_index2(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    total = 0
    for u in tree.nodes():
        if tree.is_internal(u):
            if tree.num_children(u) == 2:
                # Note: num_samples counts samples, not leaves.
                total += abs(
                    tree.num_samples(tree.right_child(u))
                    - tree.num_samples(tree.left_child(u))
                )
            else:
                raise ValueError("Colless index not defined for nonbinary trees")
    return total

# Big random tree
random = tskit.Tree.generate_random_binary(100000)

start = time.perf_counter()
print(colless_index(random))
stop = time.perf_counter()
print(f"colless_index = {stop-start:.2g}")

start = time.perf_counter()
print(colless_index2(random))
stop = time.perf_counter()
print(f"colless_index2 = {stop-start:.2g}")

The answer was:
I tried this code multiple times and the answers were about the same. So apparently both methods give the same results and So I guess
fbf92e6 to 69815f9
The relative performance here isn't so important @jeremyguez because both are written in Python, and the Python overhead will dominate for most trees. The main issue though is that we can't use The speed of the "definition" version is pretty unimportant though: unless it's catastrophically slow it won't matter for the trees we're using in the tests. The

import numba
import numpy as np

@numba.njit
def _colless_index(postorder, left_child, right_sib):
    num_leaves = np.zeros_like(left_child)
    total = 0
    for u in postorder:
        # Accumulate leaf counts over all children of u.
        v = left_child[u]
        while v != -1:
            num_leaves[u] += num_leaves[v]
            v = right_sib[v]
        v = left_child[u]
        if v == -1:
            num_leaves[u] = 1
        else:
            # NB assuming tree is binary! We'd probably check this before
            # invoking the method.
            total += abs(num_leaves[right_sib[v]] - num_leaves[v])
    return total

def colless_index_numba(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    # TODO check tree.is_binary, somehow
    # The kernel walks siblings, so it needs right_sib_array (not right_child_array).
    return _colless_index(tree.postorder(), tree.left_child_array, tree.right_sib_array)

(This is just out of interest by the way - we don't want to use numba in tskit as it's a heavy dependency. It is an awesome technology though)
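For readers unfamiliar with the left-child/right-sibling encoding used by the numba version, here is a pure-Python sketch of the same loop on hand-written arrays. The arrays are constructed by hand for a five-leaf tree and use -1 for "no node"; the function name `colless_from_arrays` is hypothetical, and the explicit child count is one possible way to address the "check binary" TODO, not necessarily what tskit does.

```python
NULL = -1


def colless_from_arrays(postorder, left_child, right_sib):
    """Colless index from left-child/right-sibling arrays (illustrative sketch)."""
    num_leaves = [0] * len(left_child)
    total = 0
    for u in postorder:
        v = left_child[u]
        if v == NULL:
            num_leaves[u] = 1
            continue
        # Walk the sibling chain, summing leaf counts and counting children.
        num_children = 0
        while v != NULL:
            num_leaves[u] += num_leaves[v]
            num_children += 1
            v = right_sib[v]
        if num_children != 2:
            raise ValueError("Colless index not defined for nonbinary trees")
        first = left_child[u]
        second = right_sib[first]
        total += abs(num_leaves[first] - num_leaves[second])
    return total


# Tree with leaves 0..4 and internal nodes:
#   5 -> (0, 1), 6 -> (5, 2), 7 -> (3, 4), 8 -> (6, 7)
left_child = [-1, -1, -1, -1, -1, 0, 5, 3, 6]
right_sib = [1, -1, -1, 4, -1, 2, 7, -1, -1]
postorder = [0, 1, 5, 2, 6, 3, 4, 7, 8]

print(colless_from_arrays(postorder, left_child, right_sib))  # 2
```

Nodes 6 and 8 each contribute 1 (leaf counts 2 vs 1 and 3 vs 2), while nodes 5 and 7 are balanced.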
@jeromekelleher
69815f9 to b360f2f
Thanks @jeremyguez - all we can do for the TestDefinion is check if the trees are binary first, and then either assert they get the same value or raise a ValueError.
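The testing pattern described here can be sketched without tskit. The two functions below are stand-ins (hypothetical names, plain `{node: [children]}` dicts listing internal nodes only) for the readable definition and the efficient implementation; the checker asserts that they agree on binary trees and that both raise ValueError otherwise.

```python
def leaves_below(children, u):
    kids = children.get(u, [])
    return 1 if not kids else sum(leaves_below(children, v) for v in kids)


def colless_definition(children):
    """Direct transcription of the definition: slow but easy to read."""
    if any(len(kids) != 2 for kids in children.values()):
        raise ValueError("Colless index only defined for binary trees")
    return sum(
        abs(leaves_below(children, a) - leaves_below(children, b))
        for a, b in children.values()
    )


def colless_fast(children, root):
    """Single pass: each call returns (num_leaves, colless) for a subtree."""
    def visit(u):
        kids = children.get(u, [])
        if not kids:
            return 1, 0
        if len(kids) != 2:
            raise ValueError("Colless index not defined for nonbinary trees")
        (n1, c1), (n2, c2) = visit(kids[0]), visit(kids[1])
        return n1 + n2, c1 + c2 + abs(n1 - n2)
    return visit(root)[1]


def check_against_definition(children, root):
    """Return True if values were compared, False if both raised ValueError."""
    if all(len(kids) == 2 for kids in children.values()):
        assert colless_fast(children, root) == colless_definition(children)
        return True
    for fn, args in ((colless_fast, (children, root)), (colless_definition, (children,))):
        try:
            fn(*args)
        except ValueError:
            continue
        raise AssertionError("expected ValueError for a nonbinary tree")
    return False


print(check_against_definition({6: [5, 3], 5: [4, 2], 4: [0, 1]}, 6))  # True
print(check_against_definition({4: [0, 1, 2, 3]}, 4))                  # False
```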
b360f2f to 99d246a
I pushed a version doing this.
LGTM, a few final tweaks and good to go
python/tskit/trees.py
Returns the Colless imbalance index for this tree.
This is defined as the sum of all differences between number of
leaves under right sub-node and left sub-node for each binary node.
Non-binary nodes are not taken in account, thus a star tree has
The docstring is a bit out of date here. Should say something like "The Colless index is undefined for non-binary trees and trees with multiple roots. This method will raise a ValueError if the tree is not singly-rooted and binary."
Thanks, I changed it in the last commit.
I hope the English definition is clear enough: "This is defined as the sum of all differences between number of leaves under right sub-node and left sub-node for each node."
99d246a to 099100f
Great stuff, thanks @jeremyguez! PS: We have a Slack workspace, tskit-dev, and it would be great if you could join up. Can you send me an email please if you'd like to join? jerome.kelleher@bdi.ox.ac.uk
Here is a first try at Colless index. @jeromekelleher