Topology indices (balance and imbalance) #2245

jeremyguez · 2022-05-09T15:57:50Z

jeremyguez
May 9, 2022

Hello all,

During Probgen 2022 in Oxford, I talked with @benjeffery and @jeromekelleher about the idea of adding imbalance indices to tskit. Here are a few indices that I coded with tskit during my PhD. The code can probably be optimized (e.g. by taking advantage of the tree sequence structure), but @benjeffery advised starting with the simplest implementation.

Many topology indices could be interesting to add, but for now I show only two common balance indices and two imbalance indices. All four are presented in detail in Shao and Sokal (1990).

B1 and B2 indices are balance indices, meaning the higher their value, the more balanced the tree. Colless and Sackin indices are imbalance indices (the higher the value, the more imbalanced the tree).

I would gladly take any correction or advice for improvement. (About taking advantage of the tree sequence structure: I am not sure it will help in all cases, as a change in one node often changes the computation for all nodes above it, up to the root. This might be a general question about the best way to compute tree topology indices in tree sequences.)

1. B1 index

This index is the sum of the values 1/m over all internal nodes n (excluding the root), where m is the maximum path length between n and the leaves under it. For the exact formula see Shao and Sokal (1990).

import numpy as np
import tskit

def path_length(tree, ancestor, child):
    """
    Computes the number of edges between child and ancestor.
    I did not find this function in the Tree API so I wrote this one.
    There is probably a better way to do this.

    :param tskit.Tree tree: the tree in which the computation is done.
    :param int ancestor: the ancestor node to which the number of edges is computed.
    :param int child: the descendant node from which the number of edges is computed.
    :return: the number of edges between child and ancestor.
    :rtype: int
    """
    counter = 0
    node = child
    # Walk up the tree one edge at a time until we reach the ancestor.
    while node != ancestor:
        node = tree.parent(node)
        if node == tskit.NULL:
            # We went past a root without meeting the ancestor.
            raise ValueError("ancestor is not above child in this tree")
        counter += 1
    return counter




def B1_index(tree_seq):
    """
    Computes the balance index B1 (see Shao and Sokal 1990).

    :param tskit.TreeSequence tree_seq: the tree sequence in which to compute the B1 index.
    :return: B1 value for each tree in the tree sequence.
    :rtype: list of float
    """
    b1_list = []
    for tree in tree_seq.trees():
        max_list = []
        for n in tree.nodes():
            # Only internal nodes other than the root contribute.
            if n != tree.root and not tree.is_leaf(n):
                # Path lengths from n down to each leaf under it.
                depth_list = [path_length(tree, n, leaf) for leaf in tree.leaves(n)]
                max_list.append(max(depth_list))
        b1_list.append(np.sum([1 / m for m in max_list]))
    return b1_list
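For a single tree, the same quantity can be obtained in one postorder pass by keeping the height of each node, instead of walking up from every leaf. The following is only a sketch under stated assumptions: `b1_index`, `children` and `root` are illustrative names, and `children` is any callable returning the children of a node (an empty sequence for a leaf). The two hand-built 4-leaf trees are toy inputs, not tskit objects.

```python
def b1_index(children, root):
    """B1 of one tree; children(u) returns the child nodes of u."""
    height = {}  # maximum number of edges from a node down to a leaf
    total = 0.0
    # Iterative postorder: a node is finalised after all of its children.
    stack = [(root, False)]
    while stack:
        u, visited = stack.pop()
        kids = children(u)
        if not kids:
            height[u] = 0
        elif not visited:
            stack.append((u, True))
            stack.extend((v, False) for v in kids)
        else:
            height[u] = 1 + max(height[v] for v in kids)
            if u != root:
                total += 1 / height[u]
    return total


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
COMB = {6: (0, 5), 5: (1, 4), 4: (2, 3)}      # (0,(1,(2,3)))

print(b1_index(lambda u: BALANCED.get(u, ()), 6))  # 1/1 + 1/1
print(b1_index(lambda u: COMB.get(u, ()), 6))      # 1/2 + 1/1
```

With a single-rooted tskit tree, passing `tree.children` and `tree.root` should work the same way, since `tree.children(u)` returns an empty tuple for leaves.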

2. B2 index

This index is computed by associating to each leaf a probability p of reaching it assuming a random walk from the root (we start from the root and choose randomly a path at each node). The index value is the Shannon entropy of all p probabilities. For the exact formula see Shao and Sokal (1990).

import math

def B2_index(tree_seq, base):
    """
    Computes the balance index B2 (see Shao and Sokal 1990).

    :param tskit.TreeSequence tree_seq: the tree sequence in which to compute the B2 index.
    :param int base: the base used for the log in Shannon entropy computation.
    :return: B2 value for each tree in the tree sequence.
    :rtype: list of float
    """
    b2_list = []
    for tree in tree_seq.trees():
        prob_list = []
        for leaf in tree.leaves():
            # Probability of reaching this leaf with a random walk from the root:
            # the product of 1 / num_children over all of its ancestors.
            random_walk = []
            u = leaf
            while u != tree.root:
                u = tree.parent(u)
                random_walk.append(1 / tree.num_children(u))
            prob_list.append(np.prod(random_walk))
        # Shannon entropy of the leaf probabilities.
        b2_list.append(-np.sum([p * math.log(p, base) for p in prob_list]))
    return b2_list

3. Sackin index

This index is computed by counting, for each leaf, the number of edges between it and the root (equivalently, the number of internal nodes on the path, root included), and summing these values over all leaves. For the exact formula see Shao and Sokal (1990).

def Sackin_index(tree_seq):
    """
    Computes the Sackin imbalance index (see Shao and Sokal 1990).

    :param tskit.TreeSequence tree_seq: the tree sequence in which to compute the Sackin index.
    :return: Sackin imbalance value for each tree in the tree sequence.
    :rtype: list of int
    """
    sackin = []
    for tree in tree_seq.trees():
        # The depth of a leaf is the number of edges between it and the root.
        depth_list = [tree.depth(leaf) for leaf in tree.leaves()]
        sackin.append(np.sum(depth_list))
    return sackin

4. Colless index

For each internal node, we compute the absolute value of the difference between the numbers of leaves under its child nodes (there are two of them in a binary tree). The index is the sum of these values. For a non-binary tree a correction is needed (see Shao and Sokal (1990)). For the moment, the correction used is the difference between the maximum and minimum number of leaves under the child nodes.

def Colless(tree_seq):
    """
    Computes the Colless imbalance index (see Shao and Sokal 1990).

    :param tskit.TreeSequence tree_seq: the tree sequence in which to compute the Colless index.
    :return: Colless imbalance value for each tree in the tree sequence.
    :rtype: list of int
    """
    colless = []
    for tree in tree_seq.trees():
        diff_list = []
        for n in tree.nodes():
            if not tree.is_leaf(n):
                # Number of leaves under each child of n. num_samples equals the
                # number of leaves when the samples are exactly the leaves.
                leaves_list = [tree.num_samples(c) for c in tree.children(n)]
                if len(leaves_list) > 0:
                    diff_list.append(max(leaves_list) - min(leaves_list))
        colless.append(np.sum(diff_list))
    return colless
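To make the max-minus-min convention concrete, here is a small sketch that computes the index on hand-built trees, including one with a polytomy. It does not use tskit: `colless_index`, `children` and `root` are illustrative names, and leaf counts are accumulated in a postorder pass rather than read from `num_samples`.

```python
def colless_index(children, root):
    """Colless index of one tree, using max - min over the children's leaf counts."""
    n_leaves = {}
    total = 0
    # Iterative postorder so that child leaf counts are known before the parent.
    stack = [(root, False)]
    while stack:
        u, visited = stack.pop()
        kids = children(u)
        if not kids:
            n_leaves[u] = 1
        elif not visited:
            stack.append((u, True))
            stack.extend((v, False) for v in kids)
        else:
            counts = [n_leaves[v] for v in kids]
            n_leaves[u] = sum(counts)
            total += max(counts) - min(counts)
    return total


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
COMB = {6: (0, 5), 5: (1, 4), 4: (2, 3)}      # (0,(1,(2,3)))
POLYTOMY = {5: (0, 1, 4), 4: (2, 3)}          # (0,1,(2,3))

print(colless_index(lambda u: BALANCED.get(u, ()), 6))  # 0 + 0 + 0
print(colless_index(lambda u: COMB.get(u, ()), 6))      # 0 + 1 + 2
print(colless_index(lambda u: POLYTOMY.get(u, ()), 5))  # 0 + (2 - 1)
```

For binary trees max - min is the usual absolute difference, so this convention only changes the result at nodes with more than two children.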

jeromekelleher · 2022-05-09T16:39:04Z

jeromekelleher
May 9, 2022
Maintainer

Thanks for this @jeremyguez, it's very interesting. At first glance it's not obvious to me how we'd turn these into efficient incremental algorithms as (some of them at least) propagate information down the tree rather than up. We could certainly implement them efficiently as Tree methods though. For example, we could implement the Sackin index something like this:

def Sackin_by_traversal(tree):
    stack = [(tree.virtual_root, -1)]
    total_depth = 0
    while len(stack) > 0:
        u, depth = stack.pop()
        if tree.is_leaf(u):
            total_depth += depth
        else:
            for v in tree.children(u):
                stack.append((v, depth + 1))
    return total_depth

This would be straightforward to implement in C, and has a straightforward O(n) cost.
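As a quick check of the traversal idea on trees small enough to verify by hand, here is a sketch of the same stack-based walk over a plain mapping instead of a tskit tree. The names `sackin_index`, `children` and `roots` are illustrative; passing a list of roots plays the role of starting from the virtual root.

```python
def sackin_index(children, roots):
    """Sum of leaf depths; children(u) returns the child nodes of u."""
    # Every root starts at depth 0, as if it hung off a virtual root at depth -1.
    stack = [(r, 0) for r in roots]
    total_depth = 0
    while stack:
        u, depth = stack.pop()
        kids = children(u)
        if not kids:
            total_depth += depth
        else:
            stack.extend((v, depth + 1) for v in kids)
    return total_depth


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
COMB = {6: (0, 5), 5: (1, 4), 4: (2, 3)}      # (0,(1,(2,3)))

print(sackin_index(lambda u: BALANCED.get(u, ()), [6]))  # 2 + 2 + 2 + 2
print(sackin_index(lambda u: COMB.get(u, ()), [6]))      # 1 + 2 + 3 + 3
```

Each node is pushed and popped exactly once, which is where the O(n) cost comes from.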

0 replies

jeromekelleher · 2022-05-09T16:56:46Z

jeromekelleher
May 9, 2022
Maintainer

I think B2_index can be structured in the same way, here's a quick version (not properly tested):

import math

def B2_by_traversal(tree, base):
    # Note that this will take into account the number of roots also, by considering
    # them as children of the virtual root.
    stack = [(tree.virtual_root, 1)]
    total_proba = 0
    while len(stack) > 0:
        u, path_product = stack.pop()
        if tree.is_leaf(u):
            total_proba -= path_product * math.log(path_product, base)
        else:
            path_product *= 1 / tree.num_children(u)
            for v in tree.children(u):
                stack.append((v, path_product))
    return total_proba
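Since the version above is untested, a sanity check on trees with known answers may help. This sketch reproduces the same traversal over a plain mapping, for a single root only; `b2_index`, `children` and `root` are illustrative names. In a balanced 4-leaf tree every leaf has probability 1/4, so the base-2 entropy is 2. In a 4-leaf comb the probabilities are 1/2, 1/4, 1/8, 1/8, giving 1.75.

```python
import math


def b2_index(children, root, base=2):
    """Shannon entropy of the root-to-leaf random walk probabilities."""
    stack = [(root, 1.0)]
    total = 0.0
    while stack:
        u, path_product = stack.pop()
        kids = children(u)
        if not kids:
            total -= path_product * math.log(path_product, base)
        else:
            # Each child is chosen uniformly at random from this node.
            stack.extend((v, path_product / len(kids)) for v in kids)
    return total


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
COMB = {6: (0, 5), 5: (1, 4), 4: (2, 3)}      # (0,(1,(2,3)))

print(b2_index(lambda u: BALANCED.get(u, ()), 6))
print(b2_index(lambda u: COMB.get(u, ()), 6))
```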

0 replies

jeromekelleher · 2022-05-09T19:26:10Z

jeromekelleher
May 9, 2022
Maintainer

Thinking some more about this, I think we should implement these as tree methods, and not worry about doing incremental versions. The single-tree version will still be useful.

0 replies

jeromekelleher · 2022-05-09T20:10:31Z

jeromekelleher
May 9, 2022
Maintainer

I've made a first pass at the framework for implementing and testing these metrics in #2246 @jeremyguez. Would you be interested in helping out with doing part of the implementation? Getting the definition in and going through the examples for the tests would be a great start. If you'd like to get a bit of experience with the C side of things I'd be happy to provide pointers there too.

Moving to C will make the implementations many times faster.

0 replies

jeremyguez · 2022-05-10T09:03:42Z

jeremyguez
May 10, 2022
Author

Thanks @jeromekelleher for your answers and nice implementations of some of the metrics. I would gladly try to help out with this. I'll start with the definitions and tests for the four indices, then I'd be interested in experiencing the C side.

By the way, I was wondering if there already is a Tree method for the path length function I defined in the B1 index section (I didn't find one there). If not, would it be useful to have one? It could be generalized for computing the number of edges between any two nodes.

2 replies

jeromekelleher May 10, 2022
Maintainer

> Thanks @jeromekelleher for your answers and nice implementations of some of the metrics. I would gladly try to help out with this. I'll start with the definitions and tests for the four indices, then I'd be interested in experiencing the C side.

Great! Please have a look at #2246 - if you think this is a good framework for adding in these methods then we can go ahead and add them. I'll open some specific issues to track these.

jeromekelleher May 10, 2022
Maintainer

> By the way, I was wondering if there already is a Tree method for the path length function I defined in the B1 index section (I didn't find one there). If not, would it be useful to have one? It could be generalized for computing the number of edges between any two nodes.

No, we haven't implemented this and I think it would definitely be useful. The general case would be most useful I think. I'll open an issue to track.
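For the general case, one way to think about it is that the path between two nodes goes up from each of them to their most recent common ancestor. The sketch below illustrates this on a plain `parent` mapping (node to parent, with roots absent); the function name and the mapping are illustrative, not the eventual tskit API.

```python
def path_length(parent, u, v):
    """Number of edges on the path between u and v."""
    # Distance from u to each of its ancestors (u itself included).
    dist_from_u = {}
    node, d = u, 0
    while node is not None:
        dist_from_u[node] = d
        node = parent.get(node)
        d += 1
    # Climb from v; the first node already seen is the common ancestor.
    node, d = v, 0
    while node is not None:
        if node in dist_from_u:
            return dist_from_u[node] + d
        node = parent.get(node)
        d += 1
    raise ValueError("u and v are not in the same tree")


# Comb tree (0,(1,(2,3))) with root 6, as a child -> parent mapping.
COMB_PARENT = {0: 6, 5: 6, 1: 5, 4: 5, 2: 4, 3: 4}

print(path_length(COMB_PARENT, 2, 0))  # 2 -> 4 -> 5 -> 6 -> 0
print(path_length(COMB_PARENT, 2, 3))  # 2 -> 4 -> 3
print(path_length(COMB_PARENT, 2, 5))  # ancestor case: 2 -> 4 -> 5
```

On a tskit tree the same value should be obtainable as `tree.depth(u) + tree.depth(v) - 2 * tree.depth(tree.mrca(u, v))` when the two nodes share a root.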

jeromekelleher · 2022-05-10T10:51:18Z

jeromekelleher
May 10, 2022
Maintainer

OK @jeremyguez - I've opened a batch of issues to track the specifics here (#2249, #2250, #2251, #2252, #2253, #2254, #2255, #2256). It would be great if we could tackle these one-by-one in PRs, once #2246 is merged.

0 replies

GertjanBisschop · 2022-05-12T09:05:16Z

GertjanBisschop
May 12, 2022
Maintainer

Great to see tskit is getting tree (im)balance indices! Just wanted to point out a fairly recent paper by Lemant et al., in which they come up with (quoting the title) robust and universal tree balance indices. The most interesting thing is that these are comparable between trees with different numbers of leaves. They even show the relationship with Colless' and Sackin's indices. But of course, as always with these types of things, the old ones are the ones people are actually using.

Cheers,
gertjan

1 reply

jeremyguez May 13, 2022
Author

Thanks @GertjanBisschop for this article, very interesting! I agree that their indices should be added at some point.

jeromekelleher · 2022-05-12T11:17:18Z

jeromekelleher
May 12, 2022
Maintainer

Very interesting, thanks @GertjanBisschop. I also found this website which gives information on the full zoo of metrics (from the preprint Tree balance indices: A comprehensive survey)

If anyone wants to help out with adding some more indices then great, most seem pretty easy to implement. I guess the only thing we need to look out for at this point is naming. If anyone wants to do a bit of looking ahead and make some suggestions about what our full name space for balance indices would be, that would be helpful.

2 replies

jeremyguez May 13, 2022
Author

I can help with this, is there some place in particular where this name space should be set?

jeromekelleher May 13, 2022
Maintainer

Here is as good a place as any - ideally we'd create a list of all the metrics and the proposed method names + parameters (if any), and that would let us see whether we need adopt any particular conventions early on, so we don't have to go back and rename/deprecate things later. Thanks!

jeremyguez · 2022-05-15T00:37:13Z

jeremyguez
May 15, 2022
Author

Here is the beginning of a name space. Many indices remain to be added; I will edit this later to add indices or make corrections depending on your comments. I also added general topology methods and methods related to polytomies, since these are related to (im)balance indices.

General methods

Here are a few general methods for tree topology analysis. I don't know if they are all worth adding; I'm just putting some ideas out.

path_length(u, v): outputs path length between node u and node v. Already done and merged here.
longest_path(): outputs the longest path in the tree (maybe as a tuple of two nodes).
num_sib(u): outputs the number of siblings of node u.
distrib_leaves(u): outputs the distribution of leaves in the children of a specified node u.
rm_nodes([u_list], rm_subtree, rm_lineage): outputs a tree without the specified nodes in [u_list]. If rm_subtree is true, the subtrees under the specified nodes are also removed. If rm_subtree is false, only the specified nodes are removed, resulting in a multi-rooted tree if some of the specified nodes are internal. rm_lineage follows the same idea, but is used to specify whether the whole lineages to which the nodes pertain are removed or not.
keep_nodes([u_list], keep_subtree, keep_lineage): outputs a tree with only the nodes in [u_list]. If keep_subtree is true, the subtrees under the specified nodes are also kept. If keep_subtree is false, only the specified nodes are kept. keep_lineage follows the same idea, but is used to specify whether the whole lineages to which the nodes pertain are kept or not.
all_subtrees(): outputs all subtrees of a given tree.

(Im)balance indices

Balance indices

b1_index(): outputs the B1 index. Issue here.
b2_index(base): outputs the B2 index using the specified base for the Shannon entropy computation. Issue here.
furnas_rank(): outputs the Furnas rank.
rooted_quartet_index(): outputs the rooted quartet index.

Imbalance indices

Sackin index and its derivatives

Two possibilities here: having one method sackin_index() with a parameter for choosing which index of the Sackin family to compute, or having one method per index. I give the second solution here. This question of one method to rule them all vs. one method per index applies to all index families (e.g., the Fusco index and its derivatives...).

sackin_index(): outputs the Sackin index, which is the sum of all leaf depths. Already done and merged here.
avg_leaf_depth(): outputs the average leaf depth, which is the normalization of the Sackin index by the number of leaves. For the name, we could also have something like sackin_index_norm(). (Another possibility is to add a parameter in sackin_index() allowing this normalization, see intro of the section).
var_leaf_depth(): outputs the variance of leaf depths.
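All three quantities in this family come from the same list of leaf depths, which is one argument for computing them together. The following sketch shows this on a hand-built tree. The names are illustrative, and the variance here divides by the number of leaves (population variance), which is an assumption: the definition to adopt would need to be checked against the literature.

```python
def leaf_depths(children, root):
    """Depth (number of edges to the root) of every leaf."""
    depths = []
    stack = [(root, 0)]
    while stack:
        u, depth = stack.pop()
        kids = children(u)
        if not kids:
            depths.append(depth)
        else:
            stack.extend((v, depth + 1) for v in kids)
    return depths


def sackin_family(children, root):
    """Return (Sackin index, average leaf depth, variance of leaf depths)."""
    depths = leaf_depths(children, root)
    n = len(depths)
    sackin = sum(depths)
    mean = sackin / n
    # Population variance: assumed convention, divides by n rather than n - 1.
    variance = sum((d - mean) ** 2 for d in depths) / n
    return sackin, mean, variance


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
COMB = {6: (0, 5), 5: (1, 4), 4: (2, 3)}      # (0,(1,(2,3)))

print(sackin_family(lambda u: BALANCED.get(u, ()), 6))
print(sackin_family(lambda u: COMB.get(u, ()), 6))
```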

Fusco index and its derivatives

fusco_index(): outputs the Fusco index, also described here.
purvis_index(): outputs the Purvis index which is actually a correction of the Fusco index to make it independent of the number of leaves.

Colless index and its derivatives

colless_index(): outputs the Colless index. Ongoing work here.
i2_index(): outputs the I2 index, which is a version of the Colless index.
colless_like_index(): outputs the Colless-like index, a corrected Colless index which can be used on non-binary trees.
colless_corrected_index(): outputs the corrected Colless index.
quadratic_colless_index(): outputs the quadratic Colless index.
rodgers_index(): outputs the Rodgers J index, which counts the number of nodes that are imbalanced according to Colless's definition.

Other imbalance indices

colijn_plazzotta_rank(): outputs the Colijn-Plazzotta rank.

Other topology indices

Polytomies

is_binary(): outputs whether the tree is binary.
highest_arity(): outputs the highest arity in the tree.
distrib_arity(): outputs the distribution of arities in the tree (e.g. a dictionary like this {2: 2, 3: 1, 4: 3} would mean that the tree has two binary nodes, one ternary node and three quaternary nodes).
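The three polytomy helpers could all be derived from a single arity count. Here is a sketch of that idea on a plain mapping; `arity_distribution`, `children` and `root` are illustrative names, and "binary" is taken to mean that every internal node has exactly two children, which is an assumption about the intended definition.

```python
from collections import Counter


def arity_distribution(children, root):
    """Map each arity (number of children) to the number of internal nodes with it."""
    counts = Counter()
    stack = [root]
    while stack:
        u = stack.pop()
        kids = children(u)
        if kids:
            counts[len(kids)] += 1
            stack.extend(kids)
    return dict(counts)


# Toy trees given as node -> children mappings (leaves are absent).
BALANCED = {6: (4, 5), 4: (0, 1), 5: (2, 3)}  # ((0,1),(2,3))
POLYTOMY = {5: (0, 1, 4), 4: (2, 3)}          # (0,1,(2,3))

for name, tree, root in [("balanced", BALANCED, 6), ("polytomy", POLYTOMY, 5)]:
    dist = arity_distribution(lambda u: tree.get(u, ()), root)
    is_binary = set(dist) <= {2}          # no internal node with arity != 2
    highest_arity = max(dist, default=0)  # 0 for a tree with no internal node
    print(name, dist, is_binary, highest_arity)
```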

1 reply

jeromekelleher May 15, 2022
Maintainer

Fabulous, thanks @jeremyguez!

jeromekelleher · 2022-05-15T12:39:28Z

jeromekelleher
May 15, 2022
Maintainer

Out of curiosity, I implemented the Sackin index above using numba to see how the performance compares to the C implementation in #2258:

import time
import tskit
import msprime

import numba

@numba.njit()
def _sackin_index(virtual_root, left_child, right_sib):
    stack = []
    root = left_child[virtual_root]
    while root != -1:
        stack.append((root, 0))
        root = right_sib[root]
    total_depth = 0
    while len(stack) > 0:
        u, depth = stack.pop()
        v = left_child[u]
        if v == -1:
            total_depth += depth
        else:
            depth += 1
            while v != -1:
                stack.append((v, depth))
                v = right_sib[v]
    return total_depth


def sackin_index(tree):
    return _sackin_index(tree.virtual_root, tree.left_child_array, tree.right_sib_array)

ts = msprime.sim_ancestry(10, random_seed=2)
tree = ts.first()

# Warmup jit and test
assert tree.sackin_index() == sackin_index(tree)


ts = tskit.load("/home/jk/work/covid-tsinfer-experiment/covid-norecomb-small-indels.trees")
print(ts)
tree = ts.first()

before = time.perf_counter()
s1 = tree.sackin_index()
d1 = time.perf_counter() - before

before = time.perf_counter()
s2 = sackin_index(tree)
d2 = time.perf_counter() - before

print(f"C     = {d1:.2g}")
print(f"numba = {d2:.2g}")

We get

╔════════════════════════╗
║TreeSequence            ║
╠═══════════════╤════════╣
║Trees          │       1║
╟───────────────┼────────╢
║Sequence Length│   29904║
╟───────────────┼────────╢
║Time Units     │ unknown║
╟───────────────┼────────╢
║Sample Nodes   │  129509║
╟───────────────┼────────╢
║Total Size     │33.7 MiB║
╚═══════════════╧════════╝
╔═══════════╤══════╤══════════╤════════════╗
║Table      │Rows  │Size      │Has Metadata║
╠═══════════╪══════╪══════════╪════════════╣
║Edges      │129510│   4.0 MiB│          No║
╟───────────┼──────┼──────────┼────────────╢
║Individuals│     0│  24 Bytes│          No║
╟───────────┼──────┼──────────┼────────────╢
║Migrations │     0│   8 Bytes│          No║
╟───────────┼──────┼──────────┼────────────╢
║Mutations  │134571│   4.8 MiB│          No║
╟───────────┼──────┼──────────┼────────────╢
║Nodes      │129511│  22.8 MiB│         Yes║
╟───────────┼──────┼──────────┼────────────╢
║Populations│     0│   8 Bytes│          No║
╟───────────┼──────┼──────────┼────────────╢
║Provenances│   264│ 143.2 KiB│          No║
╟───────────┼──────┼──────────┼────────────╢
║Sites      │ 20110│1002.9 KiB│         Yes║
╚═══════════╧══════╧══════════╧════════════╝

C     = 0.0011
numba = 0.0032

So, on a large (130K nodes) tree we see that numba is about 3X slower than the C code, but needs only a fraction of the code/boilerplate.

With this in mind, I wonder if what we should aim for here is to implement a few of the key metrics in C for tskit, and perhaps see the full zoo of indices as something appropriate for the future "phylokit" package which implements phylogenetics stuff using tskit as the underlying data structure and accelerates using numba.

3 replies

benjeffery May 16, 2022
Maintainer

> With this in mind, I wonder if what we should aim for here is to implement a few of the key metrics in C for tskit, and perhaps see the full zoo of indices as something appropriate for the future "phylokit" package which implements phylogenetics stuff using tskit as the underlying data structure and accelerates using numba.

I'm almost certain this is the way to go as long as we sign-post effectively.

jeremyguez May 18, 2022
Author

So I guess we should decide which indices will be implemented in tskit and which will go to "phylokit". I would say the four well-known and widely used indices (B1, B2, Sackin and Colless) would be good in tskit, plus maybe the Colless-like index, which is adapted to non-binary trees, plus one index that is proven to be completely independent of the number of leaves (this is a very important property, so it would be good to have it in C).
What do you think?

jeromekelleher May 18, 2022
Maintainer

Yep, sounds like exactly the right plan.
