Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a "fake root" to the tsk_tree struct #1691

Closed
jeromekelleher opened this issue Sep 10, 2021 · 12 comments
Closed

Add a "fake root" to the tsk_tree struct #1691

jeromekelleher opened this issue Sep 10, 2021 · 12 comments
Assignees
Labels
C API Issue is about the C API enhancement New feature or request
Milestone

Comments

@jeromekelleher
Copy link
Member

When writing algorithms with the quintuply linked tree structure for trees that contain multiple roots, it's quite tedious that we need to do things like

for root in tree.roots():
     do_algorithm(root)

It would be nicer if we could do something like

do_algorithm(tree.fake_root)

(I don't like the term fake_root but can't think of anything better).

The way this would be implemented in terms of the quintuply linked tree would be to add one more element to the pointer arrays, and to let this value (num_nodes) correspond to the ID of the fake root. Then, tree.left_child_array[-1] would correspond to the left root, etc.

This is a breaking change in terms of the low-level details of how we do tree traversals etc in C, so I think it would be good to get it done before 1.0.

@jeromekelleher jeromekelleher added this to the C API 1.0.0 milestone Sep 10, 2021
@jeromekelleher jeromekelleher self-assigned this Sep 10, 2021
@molpopgen
Copy link
Member

In one sense, this is adding a new head to the root list, so tree.root_head or tree.head_root may be a decent name?

@benjeffery
Copy link
Member

virtual_root?

@benjeffery benjeffery added C API Issue is about the C API enhancement New feature or request labels Sep 10, 2021
@jeromekelleher
Copy link
Member Author

Nice, all good ideas. Lemme hack something up and we can see how it looks and take a vote.

@jeromekelleher
Copy link
Member Author

jeromekelleher commented Sep 10, 2021

Aha, forgot how tricky this root tracking code is! I'm hoping this will make it less complicated, but I'll have to get my head back into it. Staring with Algorithm T is probably the way to go. The RootTrackingTree has a full implementation of the current root tracking code. It's always bothered me that there's more code for tracking the roots than there is for all the other cases (which are quite elegant I think).

One property we (almost) definitely want to preserve is that parent[u] == -1 for all the real roots. It would break all sorts of everything if they pointed to the virtual root instead.

If anyone would like to distract themselves with a really nice algorithm problem, then this is a cracker.

@molpopgen
Copy link
Member

Yeah, roots and the sample list updating stuff. I've banged my head on the latter a few times, and either made it worse/slower or wrong.

@molpopgen
Copy link
Member

One property we (almost) definitely want to preserve is that parent[u] == -1 for all the real roots. It would break all sorts of everything if they pointed to the virtual root instead.

Yep, that would be a BIG break.

@petrelharp
Copy link
Contributor

This is a nice idea. It also sounds very tricky.

@jeromekelleher
Copy link
Member Author

I've figured out the algorithm for doing this in #1704, and it works really nicely. It's so, so much nicer than the current approach that I think we basically have to do it. Before I jump into the breaking code changes though, here's my concrete proposal:

C changes

  • Add a new virtual_root member (that is always equal to num_nodes). Arguably this isn't needed, but I think it would lead to more readable client code and give a bit of flexibility in the future.
  • Add one extra element to the quintuply linked tree arrays.
  • Add a new right_root member to the tsk_tree_t struct, which is symmetric with left_root. Arguably we don't need the left_root struct member any more because we can access the value with tree.left_child[tree.virtual_root], but it would break quite a bit of code to remove left_root and we can maintain it easily enough, so this seems like the simplest option. Or, perhaps we should replace left_root struct member with a function tsk_tree_get_left_root(self)? This is probably the cleanest option API-wise, as it doesn't clutter up the struct with values that can be easily derived. Whichever option, we add a symmetric means of getting the right_root.
  • Update example code and internal library functions that use traversals to take advantage of virtual root (especially the parsimony method).
  • Change the semantics for the getter functions like get_parent etc to accept virtual_root as a parameter

Python changes

  • The quintuply linked tree arrays get one extra element representing the virtual root
  • Add a right_root property, symmetric to left_root in the Tree class
  • Add a virtual_root property
  • Change the semantics of Tree.parent(u) etc to accept virtual root as a parameter
  • Update test code to take advantage of the virtual root

The only way I can see these changes breaking client code is if

  1. People are depending on the error behaviour for things like Tree.parent(N)
  2. People are depending on length of the tree arrays being equal to ts.num_nodes (or in some other way)

Neither of these seem particularly likely to me.

Thoughts?

@benjeffery
Copy link
Member

This sounds great.
I seem to remember in a few places we have appended a NULL entry to a python array such that array[-1] gives the right result. I think that should still work with this.

@petrelharp
Copy link
Contributor

petrelharp commented Sep 17, 2021

Change the semantics of Tree.parent(u) etc to accept virtual root as a parameter

I'm not sure what you mean by this?

Add a new virtual_root member (that is always equal to num_nodes).

Hm - so, for checking if something is root we will still check if its parent is TSK_NULL, right? But when would we check if something is equal to num_nodes? (This isn't clear from the python mockup because both of those are equal to -1 in the python code). Doing checks for equality to num_nodes will be more confusing than checks to TSK_VIRTUAL_ROOT, I think.

@jeromekelleher
Copy link
Member Author

Change the semantics of Tree.parent(u) etc to accept virtual root as a parameter
I'm not sure what you mean by this?

It's just accepting N (num_nodes) as a valid argument and not raising a ValueError.

Hm - so, for checking if something is root we will still check if its parent is TSK_NULL, right?

Yes --- we guarantee that parent[u] == TSK_NULL for all roots. No existing code will need to change.

But when would we check if something is equal to num_nodes? (This isn't clear from the python mockup because both of those are equal to -1 in the python code). Doing checks for equality to num_nodes will be more confusing than checks to TSK_VIRTUAL_ROOT, I think.

We'll never do this. None of the quintuply linked arrays contains N (virtual root) as a value. We just have

tree.left_child[N] == left_root
tree.right_child[N] == right_root

the roots are sibs, as before, so that (if we had two roots) we'd have

tree.right_sib[left_root] == right_root
tree.left_sib[right_root] == left_root

because this is the virtual root, we still have (e.g.) tree.parent[left_root] == -1.

The parent and sib values for N are undefined, but we leave them as TSK_NULL for convenience. It doesn't seem worth the trouble making a different value for them.

The motivation for this is doing top-down tree traversal algorithms when we have multiple roots. Consider the postorder part of the Hartigan parsimony algorithm:

    # use a numpy array of 0/1 values to represent the set of states
    # to make the code as similar as possible to the C implementation.
    optimal_set = np.zeros((num_nodes + 1, num_alleles), dtype=np.int8)
    for allele, u in zip(genotypes, tree.tree_sequence.samples()):
        if allele != -1:
            optimal_set[u, allele] = 1
        else:
            optimal_set[u] = 1

    allele_count = np.zeros(num_alleles, dtype=int)
    for root in tree.roots:
        for u in tree.nodes(root, order="postorder"):
            allele_count[:] = 0
            for v in tree.children(u):
                for j in range(num_alleles):
                    allele_count[j] += optimal_set[v, j]
            if not tree.is_sample(u):
                max_allele_count = np.max(allele_count)
                optimal_set[u, allele_count == max_allele_count] = 1

    allele_count[:] = 0
    for v in tree.roots:
        for j in range(num_alleles):
            allele_count[j] += optimal_set[v, j]
    max_allele_count = np.max(allele_count)
    optimal_root_set = np.zeros(num_alleles, dtype=int)
    optimal_root_set[allele_count == max_allele_count] = 1
    ancestral_state = np.argmax(optimal_root_set)

This is ugly, because we have to (a) iterate over the roots, and (b) treat the ancestral state separately, even though it's the same logic. Using the new approach we can do this (not tested):

     # init code same as before

    allele_count = np.zeros(num_alleles, dtype=int)
    for u in tree.nodes(tree.virtual_root, order="postorder"):
        allele_count[:] = 0
        for v in tree.children(u):
            for j in range(num_alleles):
                allele_count[j] += optimal_set[v, j]
        if not tree.is_sample(u):
            max_allele_count = np.max(allele_count)
            optimal_set[u, allele_count == max_allele_count] = 1

    ancestral_state = np.argmax(optimal_set[tree.virtual_root])

which is just so much nicer!

(We'll have to be clear that tree.nodes() doesn't include the virtual_root by default too though)

@petrelharp
Copy link
Contributor

Perfect! Thanks for the explanation.

jeromekelleher added a commit to jeromekelleher/tskit that referenced this issue Oct 15, 2021
@mergify mergify bot closed this as completed in a882a5d Oct 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C API Issue is about the C API enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants