-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Method to 'prune back' a tree sequence #281
Comments
This is great.
|
If removing all nodes ancestral to a certain time is decapitating, then this would be to import tskit.gory_ops |
Maybe I'm not fully understanding here, but isn't this basically what |
You're right, it's definitely closely related @gtsambos. It would be excellent if we can implement it using ps. Let's talk about the census event soon --- I do actually need this as it happens. |
Hi @jeromekelleher, I just did a mock-up of this idea in this gist
It's actually even simpler than this - you just need to set the census nodes as your Btw, here's a Python function that puts nodes on trees at a given 'census' time:
Yes! 💻 |
The major difference between my implementation and yours is that you treat any (parent, child) combo as the same lineage,
even where the edges with this combo span non-adjacent intervals of the sequences. Whereas my I guess there might be contextual reasons to prefer one or the other, but my way makes a bit more sense to me because it ensures that each distinct 'leaf' corresponds to no more than one distinct historical chromosome. Ie. a particular ancestor may have contributed two distinct segments to a particular descendant's chromosome, but that doesn't necessarily mean that both segments have passed through the same sequence of individuals (which is my usual interpretation of lineage) |
This would be cool to have implemented. There are some subtleties that arise for handling mutations that don't occur in the case of |
Shouldn't you just set all the census-and-older nodes to be your samples?
Oh, this is a good point. |
So, perhaps the thing to do is to add in the 'census' nodes separately then, as I can imagine this being useful without the pollarding step. Handling the mutations is tricky, and I think we'll have to make the handling pluggable somehow (a callback that decides if a mutation goes above or below the census node, or something), as there's no general way for tskit to know what time the mutation happens at. |
Exactly. For each parent/child combo spanning the census time, you can pass the parent time, child time, census time, and the index of the mutation on the child node to a function returning |
Right; then we could just |
Related to #382 |
It turns out we'd like to do the opposite of this (cutting off the top of a tree sequence) for some applications (it'd give us an easy way to do time windowed statistics). |
Note that cutting off the top of a tree sequence is less problematic than the bottom - we don't want to re-map sample nodes or anything; I'd say just discard anything above a certain time - however, we do have to figure out what to call the newly-created roots. |
Yes, cutting off the top is much easier and simpler since we don't have to worry about samples. I'll make a new function for and an issue to track. Since moving the samples up the tree is much more subtle and would need more options, I think this should remain its own function, as discussed here. |
I think the general way to do this sort of operation should be to
This "adding new nodes" step removes a lot of the possible weirdnesses. Note that even cutting off the top runs into issues about samples (since there may be samples older than that time). I think we don't want to be movign around times of nodes, and if we add new nodes we don't have to. Also note that "everything" could include the nodes outside the time range or not; it could just mean removing the edges and mutations. |
Yep, agreed. That's basically what I've done in #2240 |
Note that you get quite different numbers of lineages when "pruning back" if you leave in the regions where a coalescent node is unary (or, I assume if you use @petrelharp 's import msprime
import tskit
import numpy as np
def node_is_coalescent(ts):
is_coalescent = np.zeros(ts.num_nodes, dtype=bool)
for tree in ts.trees():
is_coalescent[tree.num_children_array[:-1] > 1] = True
return np.where(is_coalescent)[0]
def keep_unary_regions_of_coalescent_nodes(ts):
tables = ts.dump_tables()
individual_arr = tables.nodes.individual
for u in node_is_coalescent(ts):
individual_arr[u] = tables.individuals.add_row()
tables.nodes.individual = individual_arr
tables.simplify(keep_unary_in_individuals=True, filter_nodes=False)
tables.nodes.individual = ts.nodes_individual # set the individuals back to the original
tables.individuals.truncate(ts.num_individuals)
return tables.tree_sequence()
def simplify_to_census(ts, census_time):
tables = tskit.TableCollection(sequence_length=ts.sequence_length)
tables.time_units = ts.time_units
tables.nodes.append_columns(time=ts.nodes_time, flags=ts.nodes_flags)
keep_nodes = list(np.where(ts.nodes_time == census_time)[0])
edge_list = []
for e in ts.edges():
edge_list.append(e)
if (
e.id + 1 == ts.num_edges or
ts.edges_parent[e.id + 1] != ts.edges_parent[e.id]
):
for e in edge_list:
if ts.nodes_time[e.parent] > census_time and ts.nodes_time[e.child] < census_time:
v = tables.nodes.add_row(time=census_time)
keep_nodes.append(v)
tables.edges.add_row(e.left, e.right, v, e.child)
tables.edges.add_row(e.left, e.right, e.parent, v)
else:
tables.edges.add_row(e.left, e.right, e.parent, e.child)
edge_list = []
# Sort output.
tables.sort()
tables.simplify(keep_nodes)
return tables.tree_sequence()
ts = msprime.sim_ancestry(
5, sequence_length=1e6, population_size=1e4, recombination_rate=1e-8,
record_full_arg=True,
)
ts = keep_unary_regions_of_coalescent_nodes(ts)
print("With unary regions of coalescent nodes")
for t in np.insert(np.logspace(0, np.log10(ts.max_time), 10), 0, 0):
nts = simplify_to_census(ts, t)
print("At time", t, nts.time_units, nts.num_samples, "lineages")
print("\nFully simplified")
sts = ts.simplify()
for t in np.insert(np.logspace(0, np.log10(ts.max_time), 10), 0, 0):
nts = simplify_to_census(sts, t)
print("At", t, nts.time_units, nts.num_samples, "lineages") Giving:
|
I want a function that will remove the topology after (forwards in time) a specific time, and mark each extant lineage with as a sample. The intuition is that we're drawing a line accross the tees, and sort of "pruning back" to this line. (This will allow us to generate the haplotypes extant at this time.)
Here's a simple implementation:
There's probably some subtleties and corner case I've missed out here, but this probably the basic idea.
I think this would be a useful addition to tskit. I don't know what we'd call it though, and I think it's worth keeping this in mind with the discussion going on over in #261 where we're chopping up the tree sequence in the space rather than time direction.
The text was updated successfully, but these errors were encountered: