Some questions about missing data #716
Comments
@hyanwong has brought up (2) before as a weirdness. It does seem strange all right - IIRC the main reason for it is that's how genotypes are generated. Definitely open to suggestions on how we can make it better. For (1), weird things happen with simplify, but if someone wants to think through the details I'm definitely open to ideas. We haven't fully documented missing data yet, so we are free to make some changes I think.
Is there any problem with saying that it only counts as missing data if the node is isolated and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".
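The rule proposed in this comment can be sketched in plain Python. This is a toy model and does not call tskit: `Sample`, `genotype`, and `MISSING` are hypothetical names introduced only for illustration, and `mutation_state` is assumed to hold the state of the nearest mutation at or above the node, if there is one.

```python
from dataclasses import dataclass
from typing import Optional

MISSING = None  # hypothetical marker for a missing genotype


@dataclass
class Sample:
    isolated: bool                 # True if no edge connects this node in the current tree
    mutation_state: Optional[str]  # state of the nearest mutation at or above the node, if any


def genotype(sample: Sample, ancestral_state: str) -> Optional[str]:
    """Proposed rule: a sample is missing only if it is isolated AND has no mutation above it."""
    if sample.mutation_state is not None:
        return sample.mutation_state  # a mutation always determines the state
    if sample.isolated:
        return MISSING                # isolated with no mutation: unknown
    return ancestral_state            # connected with no mutation: ancestral
```

Under this sketch an isolated node carrying a mutation to 'A' reports 'A', which is the "not related to anyone else, but has allele 'A'" case described above.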
I don't know, tbh. Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent? |
I've created a "missing data" project to keep track of stuff. |
Not at all, although if we think we might change it, we should put a note in the documentation. |
The genotypes thing may be being reworked by @mufernando in #678 so implementing Peter's suggestion could be made part of that? |
I realized that things would be a lot less surprising if we set the default of |
I think it's too late for 0.3 anyway @petrelharp - we can pick this up afterwards. We do need to sort out missing data soon, but let's try to ship this big update first.
Putting it off until after 0.3 seems OK as long as there will not then be a problem of "Well, we can't handle this the way we would like to, because we don't want to break backward compatibility with 0.3". Are we locking ourselves in to what we may conclude later is a bad policy? |
The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case. There are plenty of version numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.
Hmm, is that entirely true? I see that @petrelharp has had to make some code changes in SLiM to avoid new problems with "missing data" that were not an issue previously, so it seems like something has changed with 0.3.
OK. |
Maybe I hadn't updated the tskit code since before 0.2? |
@bhaller has some important context about how this looks to SLiM in a different thread: MesserLab/SLiM#101 (comment) |
Here's another (probably silly) suggestion, which may or may not help. It's always bugged me that a site has an "ancestral state", as it's only ancestral relative to the current samples. It becomes impossible to extend the simulation backwards to a time when the ancestral state was something different. It also means that a Now that we can have multiple mutations at a site, I think it is more logical to imagine the ancestral state as a mutation above the (current) root. In this case, we could default to the ancestral state being missing ( If all sites have this property, then any node that is isolated in a tree would automatically take the ancestral state and therefore be marked as missing, with no further action needed on our part. Nodes that had a non-missing state but whose ancestry was unclear would then have to be marked with a mutation above them. |
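One way to picture this suggestion is a lookup that walks from a node toward its root and returns the first mutation it meets, falling back to a site-level default that is missing. The sketch below is a toy model under those assumptions, not tskit code; `state_at`, `parent`, `mutations`, and `above_root_state` are hypothetical names.

```python
from typing import Dict, Optional


def state_at(
    node: int,
    parent: Dict[int, Optional[int]],
    mutations: Dict[int, str],
    above_root_state: Optional[str] = None,
) -> Optional[str]:
    """Return the state of the first mutation found on the path from node to its root.

    If no mutation is found, fall back to above_root_state, which defaults
    to None (missing) as in the suggestion above.
    """
    u: Optional[int] = node
    while u is not None:
        if u in mutations:
            return mutations[u]
        u = parent.get(u)
    return above_root_state


# Nodes 0 and 1 hang off root 2; node 3 is isolated in this tree.
parent = {0: 2, 1: 2, 2: None, 3: None}
# The "ancestral state" A is modelled as a mutation above root 2; node 1 carries T.
mutations = {2: "A", 1: "T"}

states = {u: state_at(u, parent, mutations) for u in (0, 1, 3)}
```

In this toy tree the isolated node 3 comes out as missing with no special handling, and giving it a known state is a matter of adding an entry such as `mutations[3] = "A"`.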
As mentioned in an aside at MesserLab/SLiM#101 (comment), I think that internal non-sample nodes in a tree sequence, whether produced forwards or backwards in time, can be logically thought of as having "missing information" for some of the genome. In other words, if an internal node appears in some trees but not others, it is indicative of the fact that, in the parts of the genome where it does not appear, we no longer have enough information to reconstruct the ancestral haplotype. The idea of marking some regions of a sample as missing by pruning away the edges that attach them to the rest of the tree in that location is a logical extension of this way of looking at it. |
Hi, folks - I want to make a plug for changing the name of What "impute missing data" is doing is not really imputing at all, I'd argue. "Impute", for genomes, usually means "fancy guessing based on other nearby genotypes". What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as "imputing", but it's still kinda surprising as at first guess people would assume that tskit would 'impute' using the trees somehow. And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set So: my proposal is to rename |
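The distinction drawn here, between treating isolated samples as missing and simply assigning them the ancestral state, can be sketched as a single switch. This is a plain-Python toy and does not call tskit; the function name `site_genotypes` and the keyword `treat_isolated_as_missing` are hypothetical and are not the option names under discussion in this thread.

```python
from typing import List, Optional, Tuple

# Each sample is (isolated, mutation_state); mutation_state is None when
# no mutation applies to the sample at this site.
SampleInfo = Tuple[bool, Optional[str]]


def site_genotypes(
    samples: List[SampleInfo],
    ancestral_state: str,
    treat_isolated_as_missing: bool = True,
) -> List[Optional[str]]:
    """Return one state per sample, with None standing for missing data."""
    result: List[Optional[str]] = []
    for isolated, mutation_state in samples:
        if mutation_state is not None:
            result.append(mutation_state)
        elif isolated and treat_isolated_as_missing:
            result.append(None)             # isolated, no mutation: missing
        else:
            result.append(ancestral_state)  # no guessing from neighbours, just the ancestral value
    return result


samples = [(False, None), (False, "T"), (True, None)]
default_view = site_genotypes(samples, "A")
simulation_view = site_genotypes(samples, "A", treat_isolated_as_missing=False)
```

With the switch off, nothing is inferred from the trees or from nearby genotypes: the isolated sample just receives the ancestral state, which is the behaviour a forward simulation with isolated but fully known samples would want.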
I completely agree. Passing |
Or more specifically, |
OK, looks like we need to push the release back then. I've tagged this with 0.3.0. I think we need to have a discussion about this as a group, so let's try and make a time for next week. I'll start a thread on the slack. |
@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here are our conclusions:
I'll take a look at the feasibility of 1 and 2 here and report back. Any thoughts/objections? |
(Moved my comment to be below yours, @jeromekelleher , and added a bit.) So, we didn't actually choose a name for the flag in our chat today. On the C side, I'd propose TSK_DISABLE_MISSINGNESS. (A bit unfortunate that it's a negative, but since the default is for missingness to be on, I guess that's how it goes.) I'm not sure I'm a fan of TSK_MISSING_AS_ANCESTRAL. To me that doesn't convey what I think of as the essential function of this flag: turning off tskit's heuristics about what is "missing" and what is "not missing" entirely, and marking the tree sequence as not involving "missingness" at all. |
In |
Maybe something more explicit like |
Thinking this through... for the name of the flag/option, the two options are essentially either (a) "do we intend isolated samples to be missing data" or (b) "let isolated samples take the ancestral state". The first is more general, so we could re-use the name of the option in other methods. But the second is more precise, and says what the option is actually making the method do. Thinking ahead to statistics: what we'll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with
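For statistics, the two policies mentioned here give different answers even for something as simple as an allele frequency. The following is a minimal sketch in plain Python, not tskit's statistics code; `allele_frequencies` and `isolated_as_ancestral` are hypothetical names, and `None` is assumed to mark a sample that is isolated with no mutation above it.

```python
from collections import Counter
from typing import Dict, List, Optional


def allele_frequencies(
    genotypes: List[Optional[str]],
    ancestral_state: str,
    isolated_as_ancestral: bool,
) -> Dict[str, float]:
    """Allele frequencies at one site under one of the two policies."""
    if isolated_as_ancestral:
        # Policy 1: isolated samples take the ancestral state and are counted.
        observed = [ancestral_state if g is None else g for g in genotypes]
    else:
        # Policy 2: isolated samples are missing and are dropped from the denominator.
        observed = [g for g in genotypes if g is not None]
    total = len(observed)
    if total == 0:
        return {}
    return {allele: n / total for allele, n in Counter(observed).items()}


genotypes = ["A", "T", None, None]
as_ancestral = allele_frequencies(genotypes, "A", isolated_as_ancestral=True)
as_missing = allele_frequencies(genotypes, "A", isolated_as_ancestral=False)
```

Here the derived allele T has frequency 0.25 when isolated samples count as ancestral and 0.5 when they are dropped, so whichever option name is chosen also needs to make sense for statistics.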
This sounds very sensible @petrelharp - my only reservation is that
Gee, For the stats, |
|
By the way, my (slight) reservation about using the term "missing" in the description is that it implies that this flag is slightly exceptional. But I suspect that as tree sequences start to be used for real life, rather than simple simulated datasets, missing data in samples will become the norm, rather than the exception. Or to put it another way, @petrelharp said "option (a) better connotes what we will do if there is missing data", but I think that the default mindset of most users will be to expect tree sequences to deal naturally with missing data. Therefore it might be better to describe the exceptional case: what we do if instead of allowing missing data, we substitute something else in there. But this probably reflects my biases coming from the world of tsinfer & real sequence data, rather than SLiM + simulation.
|
Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.
Unless there is any issue shall we go with these? |
I agree with @bhaller that the word "isolated" in there is very useful. I marginally prefer [edit - I guess that would mean we could have an exact equivalent C name, |
I'm coming around to |
I'm with @jeromekelleher, and |
I've probably already made my position clear, but I'll say it again for the record. :-> I vote for |
A draft PR for these changes is at #794 |
Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:
simplify() can make a formerly non-missing genotype become missing.
The first thing is defensible, because we treat the state of every root as the "ancestral state" even though there's got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to *have some way to mark an isolated node as "not actually missing"*.
The second thing does not feel right.
Here's a demonstration of (2), btw:
I don't remember discussion of this point when working out missing-ness, but maybe I'm forgetting it?