Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding colless_index method and tests #2266

Merged
merged 1 commit into from
May 18, 2022

Conversation

jeremyguez
Copy link
Contributor

Here is a first try at Colless index. @jeromekelleher

python/tskit/trees.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented May 11, 2022

Codecov Report

Merging #2266 (099100f) into main (4e7a85b) will increase coverage by 0.00%.
The diff coverage is 100.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2266   +/-   ##
=======================================
  Coverage   93.29%   93.29%           
=======================================
  Files          27       27           
  Lines       26073    26089   +16     
  Branches     1167     1172    +5     
=======================================
+ Hits        24325    24341   +16     
  Misses       1718     1718           
  Partials       30       30           
Flag Coverage Δ
c-tests 92.23% <ø> (ø)
lwt-tests 89.05% <ø> (ø)
python-c-tests 71.90% <6.25%> (-0.08%) ⬇️
python-tests 98.88% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/tskit/trees.py 98.04% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4e7a85b...099100f. Read the comment docs.

@jeromekelleher
Copy link
Member

jeromekelleher commented May 11, 2022

I found a good resource on balance metrics here (it's really thorough, and there's lots of metrics). According to them, this metric is only defined for binary trees with a single root, but I think the simplest thing to do is raise a ValueError if we encounter a tree that isn't one of these. How about this?

def colless_index(self):
        if self.num_roots != 1:
            raise ValueError("Colless index not defined for multiroot trees")
        num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
        total = 0
        for u in self.nodes(order="postorder"):
            num_children = 0
            for v in self.children(u):
                num_leaves[u] += num_leaves[v]
                num_children += 1
            if num_children == 0:
                num_leaves[u] = 1
            elif num_children != 2:
                raise ValueError("Colless index not defined for nonbinary trees")
            else:
                total += abs(
                    num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
                )
        return total

and for a definition we have

def colless_index_definition(tree):
    is_binary = all(
        tree.num_children(u) == 2 for u in tree.nodes() if tree.is_internal(u)
    )
    if tree.num_roots != 1 or not is_binary:
        raise ValueError("Colless index only defined for binary trees with one root")

    # We have to use len(list(tree.leaves(u))) here because tree.num_leaves() actually
    # returns the number of *samples*.
    return sum(
        abs(
            len(list(tree.leaves(tree.left_child(u))))
            - len(list(tree.leaves(tree.right_child(u))))
        )
        for u in tree.nodes()
        if tree.is_internal(u)
    )

It gets a bit ugly here for a few reasons:

  1. We don't have a short/efficient way of saying "is this a binary tree" (it was discussed in passing here. In principle we should be able to track this efficiently when building the tree, but it's work to do it.)
  2. The num_leaves function doesn't do what it sounds like it should do (and should be removed from the API really)

I do think it's better to just raise an error here if there's any doubt about how the metric is defined when we have polytomies and multiple roots.

@jeremyguez
Copy link
Contributor Author

Thanks @jeromekelleher , very interesting resource!

I agree to raise an error when the tree is not binary following this resource, unlike Shao and Sokal (1990) who defined Colless index in that case by using only the binary nodes for the computation. Because in the extreme case where many nodes are polytomies, the index ends up being not representative of the imbalance at all with the Shao and Sokal definition, as only few of its nodes are used for the computation.

I have two questions for the moment on your code:

  1. Why isn't it possible to test for number of leaves == number of nodes - number of leaves + 1 to check if the tree is binary?
  2. I have some trouble understanding why do we need a method and a definition, and why are they different? Is this to test that both ways of computing give the same results (the most efficient and the simplest)?

@jeromekelleher
Copy link
Member

Why isn't it possible to test for number of leaves == number of nodes - number of leaves + 1 to check if the tree is binary?

Yes, I guess that'll work. We still have to do a full traversal to get those numbers though, so it's not any more efficient.

I have some trouble understanding why do we need a method and a definition, and why are they different? Is this to test that both ways of computing give the same results (the most efficient and the simplest)?

Basically, yes. Think of it as efficiency of human understanding vs efficiency of computer implementation. Humans and computers think differently, so it's very helpful to write things down both ways. In particular, deciding what the right behaviour in corner cases should be is really helped by having the simplest mathematical definition that you can (as, ideally, the corner case behaviour falls naturally out of that). This happened in the path_length computation in #2249 where I was wondering what the semantics should be for the virtual root - by having a nice mathematical definition for path_length, this was really easy.

@benjeffery
Copy link
Member

Let me know if/when you want a review here, otherwise I assume you both have it covered.

@jeremyguez
Copy link
Contributor Author

jeremyguez commented May 18, 2022

@jeromekelleher
Thank you for your detailed answer, very useful!

I compared those two functions in term of time of execution (colless_index() and colless_index2()):

import time
import tskit

# First method
def colless_index(self):
        if self.num_roots != 1:
            raise ValueError("Colless index not defined for multiroot trees")
        num_leaves = np.zeros(self.tree_sequence.num_nodes, dtype=np.int32)
        total = 0
        for u in self.nodes(order="postorder"):
            num_children = 0
            for v in self.children(u):
                num_leaves[u] += num_leaves[v]
                num_children += 1
            if num_children == 0:
                num_leaves[u] = 1
            elif num_children != 2:
                raise ValueError("Colless index not defined for nonbinary trees")
            else:
                total += abs(
                    num_leaves[self.right_child(u)] - num_leaves[self.left_child(u)]
                )
        return total

# Second method
def colless_index2(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")

    total = 0

    for u in tree.nodes():
        if tree.is_internal(u):
            if tree.num_children(u) == 2 :
                total+= abs(tree.num_samples(tree.right_child(u))-tree.num_samples(tree.left_child(u)))
            else:
                raise ValueError("Colless index not defined for nonbinary trees")
    return(total)


# Big random tree
random = tskit.Tree.generate_random_binary(100000)

start = time.perf_counter()
print(colless_index(random))
stop = time.perf_counter()
print(f"colless_index = {stop-start:.2g}")

start = time.perf_counter()
print(colless_index2(random))
stop = time.perf_counter()
print(f"colless_index2 = {stop-start:.2g}")

The answer was:

74278552
colless_index = 0.43
74278552
colless_index2 = 0.37

I tried this code multiple times and the answer were about the same. So apparently both methods give the same results and colless_index2() is a bit faster. The results were the same with a comb and a balanced tree.

So I guess colless_index() is still better because it does not use tree.num_samples(), if understood correctly.

@jeremyguez jeremyguez force-pushed the colless_imbalance_index branch from fbf92e6 to 69815f9 Compare May 18, 2022 10:21
@jeromekelleher
Copy link
Member

jeromekelleher commented May 18, 2022

The relative performance here isn't so important @jeremyguez because both are written in Python, and the Python overhead will dominate for most trees.

The main issue though is that we can't use tree.num_samples here because samples aren't necessarily leaves, and leaves aren't necessarily samples (although the usually will be, for many of the trees that we use). We have to manually count the leaves because we don't have an efficient way of doing that. If you subsitute the num_leaves implementation above, the "definition" version will be much slower.

The speed of the "definition" version is pretty unimportant though, unless it's catestrophically slow it won't matter for the trees we're using the in tests.

The colless_index method defined above is preferable as the basis of a C implementation because it's nice and simple. It would also work well under numba acceleration. Try this version on your big tree (but make sure you run it more than once to warm up the jit):

import numba

@numba.njit
def _colless_index(postorder, left_child, right_sib):
    num_leaves = np.zeros_like(left_child)
    total = 0
    for u in postorder:
        v = left_child[u]
        while v != -1:
            num_leaves[u] += num_leaves[v]
            v = right_sib[v]
        v = left_child[u]
        if v == -1:
            num_leaves[u] = 1
        else:
            # NB assuming tree is binary! We'd probably check this before
            # invoking the method.
            total += abs(num_leaves[right_sib[v]] - num_leaves[v])
    return total


def colless_index_numba(tree):
    if tree.num_roots != 1:
        raise ValueError("Colless index not defined for multiroot trees")
    # TODO check tree.is_binary, somehow
    return _colless_index(tree.postorder(), tree.left_child_array, tree.right_child_array)

(This is just out of interest by the way - we don't want to use numba in tskit as it's a heavy dependency. It is an awesome technology though)

@jeremyguez
Copy link
Contributor Author

@jeromekelleher
So I pushed the methods you proposed. There is just the TestDefinitions class about which I don't know how to compare the definition with the method, as assert tree.colless_index() == colless_index_definition(tree) wouldn't work since it sometimes raise a ValueError.

@jeremyguez jeremyguez force-pushed the colless_imbalance_index branch from 69815f9 to b360f2f Compare May 18, 2022 10:38
@jeromekelleher
Copy link
Member

Thanks @jeremyguez - all we can do for the TestDefinion is check if the trees are binary first, and then either assert they get the same value or raise a ValueError.

@jeremyguez jeremyguez force-pushed the colless_imbalance_index branch from b360f2f to 99d246a Compare May 18, 2022 11:00
@jeremyguez
Copy link
Contributor Author

Thanks @jeremyguez - all we can do for the TestDefinion is check if the trees are binary first, and then either assert they get the same value or raise a ValueError.

I pushed a version doing this.
Thanks for your answers @jeromekelleher.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, a few final tweaks and good to go

python/tests/test_balance_metrics.py Show resolved Hide resolved
Returns the Colless imbalance index for this tree.
This is defined as the sum of all differences between number of
leaves under right sub-node and left sub-node for each binary node.
Non-binary nodes are not taken in account, thus a star tree has
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring is a bit out of date here. Should say something like "The Colless index is undefined for non-binary trees and trees with multiple roots. This method will raise a ValueError if the tree is not singly-rooted and binary."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I changed it in the last commit.
I hope the English definition is clear enough: "This is defined as the sum of all differences between number of leaves under right sub-node and left sub-node for each node."

@jeremyguez jeremyguez force-pushed the colless_imbalance_index branch from 99d246a to 099100f Compare May 18, 2022 12:31
@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 18, 2022
@mergify mergify bot merged commit 23c8a24 into tskit-dev:main May 18, 2022
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label May 18, 2022
@jeromekelleher
Copy link
Member

Great stuff, thanks @jeremyguez!

ps. We have a Slack workspace tskit-dev and it would be great if you could join up. Can you send me an email please if you'd like to join? jerome.kelleher@bdi.ox.ac.uk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants