Adding colless_index method and tests #2266

jeremyguez · 2022-05-11T15:12:20Z

Here is a first try at Colless index. @jeromekelleher

python/tskit/trees.py

python/tests/test_balance_metrics.py

codecov · 2022-05-11T15:24:20Z

Codecov Report

Merging #2266 (099100f) into main (4e7a85b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #2266   +/-   ##
=======================================
  Coverage   93.29%   93.29%           
=======================================
  Files          27       27           
  Lines       26073    26089   +16     
  Branches     1167     1172    +5     
=======================================
+ Hits        24325    24341   +16     
  Misses       1718     1718           
  Partials       30       30

Flag	Coverage Δ
c-tests	`92.23% <ø> (ø)`
lwt-tests	`89.05% <ø> (ø)`
python-c-tests	`71.90% <6.25%> (-0.08%)`	⬇️
python-tests	`98.88% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.04% <100.00%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e7a85b...099100f. Read the comment docs.

jeromekelleher · 2022-05-11T19:58:22Z

I found a good resource on balance metrics here (it's really thorough, and there's lots of metrics). According to them, this metric is only defined for binary trees with a single root, but I think the simplest thing to do is raise a ValueError if we encounter a tree that isn't one of these. How about this?

def colless_index(self):
        if self.num_roots != 1:
            raise ValueError("Colless index not defined for multiroot trees")
        num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
        total = 0
        for u in self.nodes(order="postorder"):
            num_children = 0
            for v in self.children(u):
                num_leaves[u] += num_leaves[v]
                num_children += 1
            if num_children == 0:
                num_leaves[u] = 1
            elif num_children != 2:
                raise ValueError("Colless index not defined for nonbinary trees")
            else:
                total += abs(
                    num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
                )
        return total

and for a definition we have

def colless_index_definition(tree):
    is_binary = all(
        tree.num_children(u) == 2 for u in tree.nodes() if tree.is_internal(u)
    )
    if tree.num_roots != 1 or not is_binary:
        raise ValueError("Colless index only defined for binary trees with one root")

    # We have to use len(list(tree.leaves(u))) here because tree.num_leaves() actually
    # returns the number of *samples*.
    return sum(
        abs(
            len(list(tree.leaves(tree.left_child(u))))
            - len(list(tree.leaves(tree.right_child(u))))
        )
        for u in tree.nodes()
        if tree.is_internal(u)
    )

It gets a bit ugly here for a few reasons:

We don't have a short/efficient way of saying "is this a binary tree" (it was discussed in passing here. In principle we should be able to track this efficiently when building the tree, but it's work to do it.)
The num_leaves function doesn't do what it sounds like it should do (and should be removed from the API really)

I do think it's better to just raise an error here if there's any doubt about how the metric is defined when we have polytomies and multiple roots.

jeremyguez · 2022-05-13T15:26:10Z

Thanks @jeromekelleher , very interesting resource!

I agree to raise an error when the tree is not binary following this resource, unlike Shao and Sokal (1990) who defined Colless index in that case by using only the binary nodes for the computation. Because in the extreme case where many nodes are polytomies, the index ends up being not representative of the imbalance at all with the Shao and Sokal definition, as only few of its nodes are used for the computation.

I have two questions for the moment on your code:

Why isn't it possible to test for number of leaves == number of nodes - number of leaves + 1 to check if the tree is binary?
I have some trouble understanding why do we need a method and a definition, and why are they different? Is this to test that both ways of computing give the same results (the most efficient and the simplest)?

jeromekelleher · 2022-05-13T16:57:51Z

Why isn't it possible to test for number of leaves == number of nodes - number of leaves + 1 to check if the tree is binary?

Yes, I guess that'll work. We still have to do a full traversal to get those numbers though, so it's not any more efficient.

I have some trouble understanding why do we need a method and a definition, and why are they different? Is this to test that both ways of computing give the same results (the most efficient and the simplest)?

Basically, yes. Think of it as efficiency of human understanding vs efficiency of computer implementation. Humans and computers think differently, so it's very helpful to write things down both ways. In particular, deciding what the right behaviour in corner cases should be is really helped by having the simplest mathematical definition that you can (as, ideally, the corner case behaviour falls naturally out of that). This happened in the path_length computation in #2249 where I was wondering what the semantics should be for the virtual root - by having a nice mathematical definition for path_length, this was really easy.

benjeffery · 2022-05-14T21:09:05Z

Let me know if/when you want a review here, otherwise I assume you both have it covered.

jeremyguez · 2022-05-18T08:47:39Z

@jeromekelleher
Thank you for your detailed answer, very useful!

I compared those two functions in term of time of execution (colless_index() and colless_index2()):

import time
import tskit

# First method
def colless_index(self):
        if self.num_roots != 1:
            raise ValueError("Colless index not defined for multiroot trees")
        num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
        total = 0
        for u in self.nodes(order="postorder"):
            num_children = 0
            for v in self.children(u):
                num_leaves[u] += num_leaves[v]
                num_children += 1
            if num_children == 0:
                num_leaves[u] = 1
            elif num_children != 2:
                raise ValueError("Colless index not defined for nonbinary trees")
            else:
                total += abs(
                    num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
                )
        return total

# Second method
def colless_index2(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")

    total = 0

    for u in tree.nodes():
        if tree.is_internal(u):
            if tree.num_children(u) == 2 :
                total+= abs(tree.num_samples(tree.right_child(u))-tree.num_samples(tree.left_child(u)))
            else:
                raise ValueError("Colless index not defined for nonbinary trees")
    return(total)


# Big random tree
random = tskit.Tree.generate_random_binary(100000)

start = time.perf_counter()
print(colless_index(random))
stop = time.perf_counter()
print(f"colless_index = {stop-start:.2g}")

start = time.perf_counter()
print(colless_index2(random))
stop = time.perf_counter()
print(f"colless_index2 = {stop-start:.2g}")

The answer was:

74278552
colless_index = 0.43
74278552
colless_index2 = 0.37

I tried this code multiple times and the answer were about the same. So apparently both methods give the same results and colless_index2() is a bit faster. The results were the same with a comb and a balanced tree.

So I guess colless_index() is still better because it does not use tree.num_samples(), if understood correctly.

jeromekelleher · 2022-05-18T10:32:52Z

The relative performance here isn't so important @jeremyguez because both are written in Python, and the Python overhead will dominate for most trees.

The main issue though is that we can't use tree.num_samples here because samples aren't necessarily leaves, and leaves aren't necessarily samples (although the usually will be, for many of the trees that we use). We have to manually count the leaves because we don't have an efficient way of doing that. If you subsitute the num_leaves implementation above, the "definition" version will be much slower.

The speed of the "definition" version is pretty unimportant though, unless it's catestrophically slow it won't matter for the trees we're using the in tests.

The colless_index method defined above is preferable as the basis of a C implementation because it's nice and simple. It would also work well under numba acceleration. Try this version on your big tree (but make sure you run it more than once to warm up the jit):

import numba

@numba.njit
def _colless_index(postorder, left_child, right_sib):
    num_leaves = np.zeros_like(left_child)
    total = 0
    for u in postorder:
        v = left_child[u]
        while v != -1:
            num_leaves[u] += num_leaves[v]
            v = right_sib[v]
        v = left_child[u]
        if v == -1:
            num_leaves[u] = 1
        else:
            # NB assuming tree is binary! We'd probably check this before
            # invoking the method.
            total += abs(num_leaves[right_sib[v]] - num_leaves[v])
    return total


def colless_index_numba(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    # TODO check tree.is_binary, somehow
    return _colless_index(tree.postorder(), tree.left_child_array, tree.right_child_array)

(This is just out of interest by the way - we don't want to use numba in tskit as it's a heavy dependency. It is an awesome technology though)

jeremyguez · 2022-05-18T10:33:25Z

@jeromekelleher
So I pushed the methods you proposed. There is just the TestDefinitions class about which I don't know how to compare the definition with the method, as assert tree.colless_index() == colless_index_definition(tree) wouldn't work since it sometimes raise a ValueError.

jeromekelleher · 2022-05-18T10:42:12Z

Thanks @jeremyguez - all we can do for the TestDefinion is check if the trees are binary first, and then either assert they get the same value or raise a ValueError.

jeremyguez · 2022-05-18T11:02:27Z

Thanks @jeremyguez - all we can do for the TestDefinion is check if the trees are binary first, and then either assert they get the same value or raise a ValueError.

I pushed a version doing this.
Thanks for your answers @jeromekelleher.

jeromekelleher

LGTM, a few final tweaks and good to go

python/tests/test_balance_metrics.py

jeromekelleher · 2022-05-18T11:27:40Z

python/tskit/trees.py

+        Returns the Colless imbalance index for this tree.
+        This is defined as the sum of all differences between number of
+        leaves under right sub-node and left sub-node for each binary node.
+        Non-binary nodes are not taken in account, thus a star tree has


The docstring is a bit out of date here. Should say something like "The Colless index is undefined for non-binary trees and trees with multiple roots. This method will raise a ValueError if the tree is not singly-rooted and binary."

Thanks, I changed it in the last commit.
I hope the English definition is clear enough: "This is defined as the sum of all differences between number of leaves under right sub-node and left sub-node for each node."

jeromekelleher · 2022-05-18T14:55:52Z

Great stuff, thanks @jeremyguez!

ps. We have a Slack workspace tskit-dev and it would be great if you could join up. Can you send me an email please if you'd like to join? jerome.kelleher@bdi.ox.ac.uk

jeremyguez commented May 11, 2022

View reviewed changes

python/tskit/trees.py Outdated Show resolved Hide resolved

jeremyguez commented May 11, 2022

View reviewed changes

python/tests/test_balance_metrics.py Show resolved Hide resolved

jeremyguez force-pushed the colless_imbalance_index branch from fbf92e6 to 69815f9 Compare May 18, 2022 10:21

jeremyguez force-pushed the colless_imbalance_index branch from 69815f9 to b360f2f Compare May 18, 2022 10:38

jeremyguez force-pushed the colless_imbalance_index branch from b360f2f to 99d246a Compare May 18, 2022 11:00

jeromekelleher reviewed May 18, 2022

View reviewed changes

Adding colless_index method and tests

099100f

jeremyguez force-pushed the colless_imbalance_index branch from 99d246a to 099100f Compare May 18, 2022 12:31

jeromekelleher approved these changes May 18, 2022

View reviewed changes

jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 18, 2022

mergify bot merged commit 23c8a24 into tskit-dev:main May 18, 2022

mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 18, 2022

jeromekelleher mentioned this pull request May 18, 2022

C implementation of Colless index #2254

Closed

jeromekelleher mentioned this pull request May 20, 2022

Colless imbalance index (Python) #2250

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding colless_index method and tests #2266

Adding colless_index method and tests #2266

jeremyguez commented May 11, 2022

codecov bot commented May 11, 2022 •

edited

Loading

jeromekelleher commented May 11, 2022 •

edited

Loading

jeremyguez commented May 13, 2022

jeromekelleher commented May 13, 2022

benjeffery commented May 14, 2022

jeremyguez commented May 18, 2022 •

edited

Loading

jeromekelleher commented May 18, 2022 •

edited

Loading

jeremyguez commented May 18, 2022

jeromekelleher commented May 18, 2022

jeremyguez commented May 18, 2022

jeromekelleher left a comment

jeromekelleher May 18, 2022

jeremyguez May 18, 2022

jeromekelleher commented May 18, 2022

Adding colless_index method and tests #2266

Adding colless_index method and tests #2266

Conversation

jeremyguez commented May 11, 2022

codecov bot commented May 11, 2022 • edited Loading

Codecov Report

jeromekelleher commented May 11, 2022 • edited Loading

jeremyguez commented May 13, 2022

jeromekelleher commented May 13, 2022

benjeffery commented May 14, 2022

jeremyguez commented May 18, 2022 • edited Loading

jeromekelleher commented May 18, 2022 • edited Loading

jeremyguez commented May 18, 2022

jeromekelleher commented May 18, 2022

jeremyguez commented May 18, 2022

jeromekelleher left a comment

Choose a reason for hiding this comment

jeromekelleher May 18, 2022

Choose a reason for hiding this comment

jeremyguez May 18, 2022

Choose a reason for hiding this comment

jeromekelleher commented May 18, 2022

codecov bot commented May 11, 2022 •

edited

Loading

jeromekelleher commented May 11, 2022 •

edited

Loading

jeremyguez commented May 18, 2022 •

edited

Loading

jeromekelleher commented May 18, 2022 •

edited

Loading