Topology indices (balance and imbalance) #2245
Replies: 10 comments 9 replies
-
Thanks for this @jeremyguez, it's very interesting. At first glance it's not obvious to me how we'd turn these into efficient incremental algorithms as (some of them at least) propagate information down the tree rather than up. We could certainly implement them efficiently as Tree methods though. For example, we could implement the Sackin index something like this: def Sackin_by_traversal(tree):
stack = [(tree.virtual_root, -1)]
total_depth = 0
while len(stack) > 0:
u, depth = stack.pop()
if tree.is_leaf(u):
total_depth += depth
else:
for v in tree.children(u):
stack.append((v, depth + 1))
return total_depth This would be straightforward to implement in C, and has a straightforward O(n) cost. |
Beta Was this translation helpful? Give feedback.
-
I think def B2_by_traversal(tree, base):
# Note that this will take into account the number of roots also, by considering
# them as children of the virtual root.
stack = [(tree.virtual_root, 1)]
total_proba = 0
while len(stack) > 0:
u, path_product = stack.pop()
if tree.is_leaf(u):
total_proba -= path_product * math.log(path_product, base)
else:
path_product *= 1 / tree.num_children(u)
for v in tree.children(u):
stack.append((v, path_product))
return total_proba |
Beta Was this translation helpful? Give feedback.
-
Thinking some more about this, I think we should implement these as tree methods, and not worry about doing incremental versions. The single-tree version will still be useful. |
Beta Was this translation helpful? Give feedback.
-
I've made a first pass at the framework for implementing and testing these metrics in #2246 @jeremyguez. Would you be interested in helping out with doing part of the implementation? Getting the definition in and going through the examples for the tests would be a great start. If you'd like to get a bit of experience with the C side of things I'd be happy to provide pointers there too. This will make the implementations many times faster. |
Beta Was this translation helpful? Give feedback.
-
Thanks @jeromekelleher for your answers and nice implementations of some of the metrics. I would gladly try to help out with this. I'll start with the definitions and tests for the four indices, then I'd be interested in experiencing the C side. By the way, I was wondering if there already is a Tree method for the path length function I defined in the B1 index section (I didn't find one there). If not, would it be useful to have one? It could be generalized for computing the number of edges between any two nodes. |
Beta Was this translation helpful? Give feedback.
-
OK @jeremyguez - I've opened a batch of issues to track the specifics here (#2249, #2250, #2251, #2252, #2253, #2254, #2255, #2256). It would be great if we could tackle these one-by-one in PRs, once #2246 is merged. |
Beta Was this translation helpful? Give feedback.
-
Great to see tskit is getting tree (im)balance indices! Just wanted to point out a fairly recent paper by Lemant et al.. Here they've come up with (quoting the title) robust and universal tree balance indices. The most interesting thing being that these are comparable between trees with different numbers of leaves. They even show the relationship with Colless' and Sackin's index. But of course, as always with these types of things. The old ones are the ones people are actually using. Cheers, |
Beta Was this translation helpful? Give feedback.
-
Very interesting, thanks @GertjanBisschop. I also found this website which gives information on the full zoo of metrics (from the preprint Tree balance indices: A comprehensive survey) If anyone wants to help out with adding some more indices then great, most seem pretty easy to implement. I guess the only think we need to look out for at this point is naming, if anyone wants to do a bit of looking ahead to make some suggestions about what our full name space for balance indices would be, that would be helpful. |
Beta Was this translation helpful? Give feedback.
-
I start here with a beginning of a name space. Many indices remain to be added, I will do some edits later for adding indices or correcting depending on your comments. I also added general topology methods and methods related to polytomies, since these are related to (im)balance indices. General methodsI add here a few general methods for tree topology analysis. I don't know if they are all worth adding, but just putting some ideas.
(Im)balance indicesBalance indices
Imbalance indicesSackin index and its derivativesTwo possibilities here: having one method
Fusco index and its derivatives
Colless index and its derivatives
Other imbalance indices
Other topology indicesPolytomies
|
Beta Was this translation helpful? Give feedback.
-
Out of curiosity, I implemented the Sackin index above using numba to see how the performance compares to the C implementation in #2258: import time
import tskit
import msprime
import numba
@numba.njit()
def _sackin_index(virtual_root, left_child, right_sib):
stack = []
root = left_child[virtual_root]
while root != -1:
stack.append((root, 0))
root = right_sib[root]
total_depth = 0
while len(stack) > 0:
u, depth = stack.pop()
v = left_child[u]
if v == -1:
total_depth += depth
else:
depth += 1
while v != -1:
stack.append((v, depth))
v = right_sib[v]
return total_depth
def sackin_index(tree):
return _sackin_index(tree.virtual_root, tree.left_child_array, tree.right_sib_array)
ts = msprime.sim_ancestry(10, random_seed=2)
tree = ts.first()
# Warmup jit and test
assert tree.sackin_index() == sackin_index(tree)
ts = tskit.load("/home/jk/work/covid-tsinfer-experiment/covid-norecomb-small-indels.trees")
print(ts)
tree = ts.first()
before = time.perf_counter()
s1 = tree.sackin_index()
d1 = time.perf_counter() - before
before = time.perf_counter()
s2 = sackin_index(tree)
d2 = time.perf_counter() - before
print(f"C = {d1:.2g}")
print(f"numba = {d2:.2g}") We get
So, on a large (130K nodes) tree we see performance is about 3X worse using numba than the C code (with a fraction of the code/boilerplate). With this in mind, I wonder if what we should aim for here is to implement a few of the key metrics in C for tskit, and perhaps see the full zoo of indices as something appropriate for the future "phylokit" package which implements phlyogenetics stuff using tskit as the underlying data structure and accelerates using numba. |
Beta Was this translation helpful? Give feedback.
-
Hello all,
During Probgen 2022 in Oxford, I talked with @benjeffery and @jeromekelleher about the idea of adding imbalance indices to tskit. I share here a few indices that I coded during my PhD using tskit. This code can probably be optimized (e.g. by taking advantage of the tree sequence structure), but @benjeffery advised to start with the simplest implementation.
Many topology indices could be interesting to add, but for the moment I show here only two common balance indices and two imbalance indices. The four implemented indices here are presented in details in Shao and Sokal (1990).
B1 and B2 indices are balance indices, meaning the higher their value, the more balanced the tree. Colless and Sackin indices are imbalance indices (the higher the value, the more imbalanced the tree).
I would take gladly any correction or advice for improvement. (About taking advantage of the tree sequence structure, I am not sure it will help in all cases, as often a change in one node is changing the computation for all nodes above it to the root. This might be a general question about the best way to compute tree topology indices in tree sequences.)
1. B1 index
This index is the sum of the values 1/m, with m computed for all n nodes (excluding the root) by taking the maximum path length between leaves under n and the root. For the exact formula see Shao and Sokal (1990).
2. B2 index
This index is computed by associating to each leaf a probability p of reaching it assuming a random walk from the root (we start from the root and choose randomly a path at each node). The index value is the Shannon entropy of all p probabilities. For the exact formula see Shao and Sokal (1990).
3. Sackin index
This index is computed by counting for each leaf the number of nodes to reach the root and summing up all values. For the exact formula see Shao and Sokal (1990).
4. Colless index
For each node, we compute the absolute value of the difference between the number of leaves under both sub-nodes (for a binary tree they are two). The index is the sum of all values. For a non-binary tree, a correction should be done (See Shao and Sokal (1990)). For the moment, the correction used is the difference between the maximum and minimum number of leaves under all sub-nodes.
Beta Was this translation helpful? Give feedback.
All reactions