Add a "fake root" to the tsk_tree struct #1691
Comments
In one sense, this is adding a new head to the root list, so
Nice, all good ideas. Lemme hack something up and we can see how it looks and take a vote.
Aha, forgot how tricky this root tracking code is! I'm hoping this will make it less complicated, but I'll have to get my head back into it. Starting with Algorithm T is probably the way to go. The RootTrackingTree has a full implementation of the current root tracking code. It's always bothered me that there's more code for tracking the roots than there is for all the other cases (which are quite elegant, I think). One property we (almost) definitely want to preserve is that
If anyone would like to distract themselves with a really nice algorithm problem, then this is a cracker.
Yeah, roots and the sample list updating stuff. I've banged my head on the latter a few times, and either made it worse/slower or wrong.
Yep, that would be a BIG break.
This is a nice idea. It also sounds very tricky.
I've figured out the algorithm for doing this in #1704, and it works really nicely. It's so, so much nicer than the current approach that I think we basically have to do it. Before I jump into the breaking code changes though, here's my concrete proposal:
C changes
Python changes
The only way I can see these changes breaking client code is if
Neither of these seems particularly likely to me. Thoughts?
This sounds great.
I'm not sure what you mean by this?
Hm - so, for checking if something is root we will still check if its parent is
It's just accepting N (num_nodes) as a valid argument and not raising a ValueError.
Yes --- we guarantee that
We'll never do this. None of the quintuply linked arrays contains N (virtual root) as a value. We just have
the roots are sibs, as before, so that (if we had two roots) we'd have
because this is the virtual root, we still have (e.g.)
The parent and sib values for N are undefined, but we leave them as TSK_NULL for convenience. It doesn't seem worth the trouble making a different value for them.
The motivation for this is doing top-down tree traversal algorithms when we have multiple roots. Consider the postorder part of the Hartigan parsimony algorithm:

    # use a numpy array of 0/1 values to represent the set of states
    # to make the code as similar as possible to the C implementation.
    optimal_set = np.zeros((num_nodes + 1, num_alleles), dtype=np.int8)
    for allele, u in zip(genotypes, tree.tree_sequence.samples()):
        if allele != -1:
            optimal_set[u, allele] = 1
        else:
            optimal_set[u] = 1

    allele_count = np.zeros(num_alleles, dtype=int)
    for root in tree.roots:
        for u in tree.nodes(root, order="postorder"):
            allele_count[:] = 0
            for v in tree.children(u):
                for j in range(num_alleles):
                    allele_count[j] += optimal_set[v, j]
            if not tree.is_sample(u):
                max_allele_count = np.max(allele_count)
                optimal_set[u, allele_count == max_allele_count] = 1

    allele_count[:] = 0
    for v in tree.roots:
        for j in range(num_alleles):
            allele_count[j] += optimal_set[v, j]
    max_allele_count = np.max(allele_count)
    optimal_root_set = np.zeros(num_alleles, dtype=int)
    optimal_root_set[allele_count == max_allele_count] = 1
    ancestral_state = np.argmax(optimal_root_set)

This is ugly, because we have to (a) iterate over the roots, and (b) treat the ancestral state separately, even though it's the same logic. Using the new approach we can do this (not tested):

    # init code same as before
    allele_count = np.zeros(num_alleles, dtype=int)
    for u in tree.nodes(tree.virtual_root, order="postorder"):
        allele_count[:] = 0
        for v in tree.children(u):
            for j in range(num_alleles):
                allele_count[j] += optimal_set[v, j]
        if not tree.is_sample(u):
            max_allele_count = np.max(allele_count)
            optimal_set[u, allele_count == max_allele_count] = 1
    ancestral_state = np.argmax(optimal_set[tree.virtual_root])

which is just so much nicer! (We'll have to be clear that
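For anyone who wants to try the single-traversal idea without a tskit build that has the virtual root, here is a rough, self-contained sketch in plain Python. Everything in it is an assumption for illustration: the six-node forest, the `children` lists, and the `postorder` helper are hypothetical stand-ins for the tree API, and missing data is left out for brevity. The only thing taken from the discussion is the convention that index `num_nodes` is the virtual root whose children are the roots.

```python
NULL = -1  # stands in for TSK_NULL

# Hypothetical forest with two roots (4 and 5); nodes 0-3 are samples.
parent = [4, 4, 5, 5, NULL, NULL]
num_nodes = len(parent)
virtual_root = num_nodes  # the extra slot at index num_nodes
samples = [0, 1, 2, 3]
genotypes = [0, 0, 1, 0]  # one allele index per sample, no missing data
num_alleles = 2

# children[u] lists the children of u; the roots are children of the virtual root.
children = [[] for _ in range(num_nodes + 1)]
for u, p in enumerate(parent):
    children[virtual_root if p == NULL else p].append(u)


def postorder(root):
    """Yield root and everything below it, children before parents."""
    stack = [(root, False)]
    while stack:
        u, expanded = stack.pop()
        if expanded:
            yield u
        else:
            stack.append((u, True))
            for v in reversed(children[u]):
                stack.append((v, False))


# optimal_set[u][j] == 1 if allele j is in the Hartigan set for node u.
optimal_set = [[0] * num_alleles for _ in range(num_nodes + 1)]
for allele, u in zip(genotypes, samples):
    optimal_set[u][allele] = 1

sample_set = set(samples)
for u in postorder(virtual_root):
    if u in sample_set:
        continue  # sample sets are fixed by the genotypes
    allele_count = [0] * num_alleles
    for v in children[u]:
        for j in range(num_alleles):
            allele_count[j] += optimal_set[v][j]
    max_count = max(allele_count)
    for j in range(num_alleles):
        if allele_count[j] == max_count:
            optimal_set[u][j] = 1

# The virtual root's set plays the role of the separate root pass.
ancestral_state = optimal_set[virtual_root].index(1)
print(ancestral_state)
```

The loop visits the virtual root last, so the ancestral state falls out of the same update rule as every other internal node, with no second pass over the roots.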
Perfect! Thanks for the explanation.
Closes tskit-dev#1691 Closes tskit-dev#1706
When writing algorithms with the quintuply linked tree structure for trees that contain multiple roots, it's quite tedious that we need to do things like
It would be nicer if we could do something like
(I don't like the term fake_root but can't think of anything better.)
The way this would be implemented in terms of the quintuply linked tree would be to add one more element to the pointer arrays, and to let this value (num_nodes) correspond to the ID of the fake root. Then, tree.left_child_array[-1] would correspond to the left root, etc.
This is a breaking change in terms of the low-level details of how we do tree traversals etc. in C, so I think it would be good to get it done before 1.0.
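The array layout described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `build_links` and the example `parent` list are hypothetical, and the code is not how tskit builds its arrays. It just shows what "one more element at index num_nodes" means for the child and sibling arrays.

```python
NULL = -1  # stands in for TSK_NULL


def build_links(parent):
    """Build child/sibling arrays with one extra slot for the fake root.

    `parent` maps each node ID to its parent, or NULL for a root.
    Hypothetical helper for illustration; not tskit's implementation.
    """
    n = len(parent)
    fake_root = n  # the extra element lives at index num_nodes
    left_child = [NULL] * (n + 1)
    right_child = [NULL] * (n + 1)
    left_sib = [NULL] * (n + 1)
    right_sib = [NULL] * (n + 1)
    for u in range(n):
        # Roots hang off the fake root, but parent[u] itself stays NULL.
        p = parent[u] if parent[u] != NULL else fake_root
        last = right_child[p]
        if last == NULL:
            left_child[p] = u
        else:
            right_sib[last] = u
            left_sib[u] = last
        right_child[p] = u
    return left_child, right_child, left_sib, right_sib


# Two trees: 4 is the parent of 0 and 1; 5 is the parent of 2 and 3.
parent = [4, 4, 5, 5, NULL, NULL]
left_child, right_child, left_sib, right_sib = build_links(parent)

# The last element of the child arrays describes the root list.
roots = []
u = left_child[-1]
while u != NULL:
    roots.append(u)
    u = right_sib[u]
print(roots)
```

Here the roots are ordinary siblings of one another, and the last slot of `left_child` and `right_child` gives the two ends of the root list, while `parent` for a root is still NULL.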