Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some questions about missing data #716

Closed
petrelharp opened this issue Jul 10, 2020 · 35 comments · Fixed by #794
Closed

Some questions about missing data #716

petrelharp opened this issue Jul 10, 2020 · 35 comments · Fixed by #794
Milestone

Comments

@petrelharp
Copy link
Contributor

Sorry to bring this up again, but #714 made @bhaller ask some good questions. Two observations:

  1. It is surprising that simplify() can make a formerly non-missing genotype become missing.
  2. An isolated node is considered to have a missing genotype even if it has a mutation above it, which is also surprising.

The first thing is defensible, because we treat the state of every root as the "ancestral state" even though there's got to be some element of unknown-ness. But it is certainly surprising, and makes me wonder if we can modify this behavior. The easiest way I could think of to modify it would be to *have some way to mark an isolated node as "not actually missing". *

The second thing does not feel right.

Here's a demonstration of (2), btw:

tables = tskit.TableCollection(sequence_length=1.0)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=0, flags=1)
tables.nodes.add_row(time=1, flags=0)
tables.edges.add_row(parent=3, child=1, left=0, right=1)
tables.edges.add_row(parent=3, child=2, left=0, right=1)
tables.sites.add_row(position=0.5, ancestral_state='x')
tables.mutations.add_row(site=0, node=0, derived_state='y')
ts = tables.tree_sequence()
ts.genotype_matrix()
# array([[-1,  0,  0]], dtype=int8)

I don't remember discussion of this point when working out missing-ness, but maybe I'm forgetting it?

@jeromekelleher
Copy link
Member

@hyanwong has brought up the (2) before as a weirdness. It does seem strange all right - IIRC I think the main reason for it is that's how genotypes are generated. Definitely open to suggestions on how we can make it better.

For (1), weird things happen with simplify, but if someone wants to think through the details I'm definitely open to ideas. We haven't fully documented missing data yet, so we are free to make some changes I think.

@petrelharp
Copy link
Contributor Author

Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".

@jeromekelleher
Copy link
Member

Is there any problem with saying that it only counts as missing data if the nice is isolation and has no mutation above it? That way we would at least be able to represent that "this node isn't related to anyone else, but has allele 'A'".

I don't know, tbh. Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?

@jeromekelleher
Copy link
Member

I've created a "missing data" project to keep track of stuff.

@petrelharp
Copy link
Contributor Author

Missing data is hard, and I'd like to put fully nailing it down off until the current batch of releases have been done. Is this urgent?

Not at all, although if we think we might change it, we should put a note in the documentation.

@hyanwong
Copy link
Member

The genotypes thing may be being reworked by @mufernando in #678 so implementing Peter's suggestion could be made part of that?

@petrelharp
Copy link
Contributor Author

I realized that things would be a lot less surprising if we set the default of impute_missing_data to True (and, in retrospect it would have been nice to call that option isolated_samples_are_missing or something). But, it's probably too late to make that switch?

@jeromekelleher
Copy link
Member

I think it's too late for 0.3 anyway @petrelharp - we can pick this up afterwards. We do need to sort out missing data soon, but lets try to ship this big update first.

@bhaller
Copy link

bhaller commented Aug 7, 2020

Putting it off until after 0.3 seems OK as long as there will not then be a problem of "Well, we can't handle this the way we would like to, because we don't want to break backward compatibility with 0.3". Are we locking ourselves in to what we may conclude later is a bad policy?

@jeromekelleher
Copy link
Member

jeromekelleher commented Aug 7, 2020

The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case. There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.

@bhaller
Copy link

bhaller commented Aug 7, 2020

The way things are done now has been this way since 0.2.0 @bhaller, so if we change we're going to have to consider how to manage breakage in either case.

Hmm, is that entirely true? I see that @petrelharp has had to make some code changes in SLiM to avoid new problems with "missing data" that were not an issue previously, so it seems like something has changed with 0.3.

There's plenty of numbers, so if we have to go to 0.4 immediately after 0.3 then that doesn't worry me - but we really do need to ship 0.3 ASAP.

OK.

@petrelharp
Copy link
Contributor Author

Maybe I hadn't updated the tskit code since before 0.2?

@petrelharp
Copy link
Contributor Author

@bhaller has some important context about how this looks to SLiM in a different thread: MesserLab/SLiM#101 (comment)

@hyanwong
Copy link
Member

hyanwong commented Aug 7, 2020

Here's another (probably silly) suggestion, which may or may not help. It's always bugged me that a site has an "ancestral state", as it's only ancestral relative to the current samples. It becomes impossible to extend the simulation backwards to a time when the ancestral state was something different. It also means that a union of two tree sequences with different ancestral states isn't easy.

Now that we can have multiple mutations at a site, I think it is more logical to imagine the ancestral state as a mutation above the (current) root. In this case, we could default to the ancestral state being missing (tskit.NULL), with a mutation above the root switching that state to whatever we think the ancestral state for the current samples is.

If all sites have this property, then any node that is isolated in a tree would automatically take the ancestral state and therefore be marked as missing, with no further action needed on our part. Nodes that had a non-missing state but whose ancestry was unclear would then have to be marked with a mutation above them.

@hyanwong
Copy link
Member

hyanwong commented Aug 7, 2020

As mentioned in an aside at MesserLab/SLiM#101 (comment), I think that internal non-sample nodes in a tree sequence, whether produced forwards or backwards in time, can be logically thought of as having "missing information" for some of the genome. In other words, if an internal node appears in some trees but not others, it is indicative of the fact that, in the parts of the genome where it does not appear, we no longer have enough information to reconstruct the ancestral haplotype.

The idea of marking some regions of a sample as missing by pruning away the edges that attach them to the rest of the tree in that location is a logical extension of this way of looking at it.

@petrelharp
Copy link
Contributor Author

Hi, folks - I want to make a plug for changing the name of impute_missing_data before the release - maybe it doesn't matter, since we'll have to keep the old name around anyhow? - but anyhow, here goes:

What "impute missing data" is doing is not really imputing at all, I'd argue. "Impute", for genomes, usually means "fancy guessing based on other nearby genotypes". What the option here is doing is just assigning the ancestral value. That is a natural guess for missing data, so that does count as "imputing", but it's still kinda surprising as at first guess people would assume that tskit would 'impute' using the trees somehow.

And, more seriously, there are other situations where isolated samples are not actually supposed to represent missing data. SLiM simulations, for instance. To get things to work out correctly for those, we need to set impute_missing_data=True, when really what we mean is isolated_samples_are_missing=False. So, we can certainly communicate that this is what everyone needs to do, but I anticipate a fair amount of confusion.

So: my proposal is to rename impute_missing_data=True to isolated_samples_are_missing=False or perhaps even just missing_data=False.

@bhaller
Copy link

bhaller commented Aug 13, 2020

So: my proposal is to rename impute_missing_data=True to isolated_samples_are_missing=False or perhaps even just missing_data=False.

I completely agree. Passing TSK_IMPUTE_MISSING_DATA feels almost like I'm asking tskit to do the opposite of what I want, in SLiM; I want isolated samples never to be considered "missing", and yet I'm asking tskit to "impute missing data". Very misleading/confusing.

@hyanwong
Copy link
Member

Or more specifically, missing_is_ancestral ?

@jeromekelleher jeromekelleher added this to the Python 0.3.0 milestone Aug 14, 2020
@jeromekelleher
Copy link
Member

OK, looks like we need to push the release back then. I've tagged this with 0.3.0. I think we need to have a discussion about this as a group, so let's try and make a time for next week. I'll start a thread on the slack.

@jeromekelleher
Copy link
Member

@benjeffery, @petrelharp, @bhaller and I had a meeting to discuss this today. Here's our conclusions:

  1. The current names for controlling the behaviour are confusing (TSK_IMPUTE_MISSING_DATA), so we need an alternative. I like @hyanwong's idea of TSK_MISSING_AS_ANCESTRAL, but open to ideas here.
  2. An isolated node that has a mutation above it shouldn't be considered missing. We should try to fix this if at all possible.
  3. Most of the reason that missing data is problematic currently in SLiM is because of the current behaviour of simplify, where we remove unary branches from the MRCA of samples back to more ancient nodes. We should change the default behaviour of simplify to keep these edges, because there's important information in knowing how long back your simulation actually started.

I'll take a look at the feasibility of 1 and 2 here and report back.

Any thoughts/objections?

@bhaller
Copy link

bhaller commented Aug 18, 2020

(Moved my comment to be below yours, @jeromekelleher , and added a bit.) So, we didn't actually choose a name for the flag in our chat today. On the C side, I'd propose TSK_DISABLE_MISSINGNESS. (A bit unfortunate that it's a negative, but since the default is for missingness to be on, I guess that's how it goes.) I'm not sure I'm a fan of TSK_MISSING_AS_ANCESTRAL. To me that doesn't convey what I think of as the essential function of this flag: turning off tskit's heuristics about what is "missing" and what is "not missing" entirely, and marking the tree sequence as not involving "missingness" at all.

@hyanwong
Copy link
Member

In TSK_MISSING_AS_ANCESTRAL, you could think of the "MISSING" as meaning that the ancestry is missing from that sample, not (just) that this is "missing data".

@benjeffery
Copy link
Member

Maybe something more explicit like TSK_ISOLATED_AS_ANCESTRAL?

@petrelharp
Copy link
Contributor Author

Thinking this through... for the name of the flag/option, the two options are essentially either (a) "do we intend isolated samples to be missing data" or (b) "let isolated samples take the ancestral state". The first is more general, so we could re-use the name of the option in more other methods. But the second is more precise, and says what the option is actually making the method do.

Thinking ahead to statistics: what we'll want to do is either treat isolated samples as ancestral, or remove them because they are missing (as with na.rm=TRUE in R). Either of the options above works, but I think option (a) better connotes what we will do if there is missing data? What about isolated_samples_missing, defaulting to True, and a C flag of TSK_ISOLATED_SAMPLES_NOT_MISSING?

@jeromekelleher
Copy link
Member

This sounds very sensible @petrelharp - my only reservations is that isolated_samples_missing is quite a long name. I'm tempted to have something like model_missing=True, or missingness=True, or allow_missing, or enable_missing or even missing_data (that's too terse an uninterpretable, though). Really I think we want a flag (like @bhaller says) which just says allow for missingness in the data (modelled by isolated samples) or just assume that everyone gets the ancestral state.

@petrelharp
Copy link
Contributor Author

Gee, missing_data doesn't sound too cryptic to me?

For the stats, remove_missing would be more descriptive, but that doesn't work for e.g. .variants(). So, I'd vote for missing_data ... and maybe TSK_NO_MISSING_DATA? What do you all think?

@hyanwong
Copy link
Member

isolated_as_missing is a bit shorter than isolated_samples_missing, and we don't need the work "samples", since the word "isolated" is only ever applied to samples, ISTR.

@hyanwong
Copy link
Member

By the way, my (slight) reservation about using the term "missing" in the description is that implies that this flag is slightly exceptional. But I suspect that as tree sequences start to be used for real life, rather than simple simulated datasets, missing data in samples will become the norm, rather than the exception. Or to put it another way, @petrelharp said "option (a) better connotes what we will do if there is missing data", but I think that the default mindset of most users will be to expect tree sequences to deal naturally with missing data. Therefore it might be better to describe the exceptional case: what we do if instead of allowing missing data, we substitute something else in there.

But this probably reflects my biases coming from the world of tsinfer & real sequence data, rather than SLiM + simulation.

@bhaller
Copy link

bhaller commented Aug 19, 2020

isolated_as_missing seems pretty good to me; maybe isolated_are_missing would be even better? And on the C side, TSK_ISOLATED_ARE_NOT_MISSING or (less verbosely, and it still seems clear) TSK_ISOLATED_NOT_MISSING? missing_data is cryptic to me, I agree, because it sounds like you could be saying, by passing false, "I guarantee that missing data is not present" rather than "Do not consider any data to be missing". And I like "isolated" being in there because it really makes clear what's being talked about in practice – what the flag will actually govern.

@benjeffery
Copy link
Member

Hi all, 0.3.0 is very close now thanks to #782. The name changes discussed above are the only remaining item.

isolated_are_missing seems to have most support if I've read the thread correctly. On the C side there is less consensus, but TSK_ISOLATED_NOT_MISSING seems to mirror the python name.

Unless there is any issue shall we go with these?

@hyanwong
Copy link
Member

hyanwong commented Aug 26, 2020

I agree with @bhaller that the word "isolated" in there is very useful. I marginally prefer isolated_as_ancestral default False (or s/as/are/, whichever sounds nicer), but that's because I think of this as the unusual case, which needs a description to explain it, whereas from my POV it's clear that an isolated node should be thought of as unknown (=missing) and doesn't require a description. But I'm probably unusual in my thinking here, so I'd be happy to be outvoted.

[edit - I guess that would mean we could have an exact equivalent C name, TSK_ISOLATED_ARE_ANCESTRAL, which could be an advantage?]

@jeromekelleher
Copy link
Member

I'm coming around to isolated_as_missing=True - this seems to convey the most information, and since people will need to use the option very rarely (missing data will be handled correctly, by default), it's OK that it's a little on the long side.

@petrelharp
Copy link
Contributor Author

I'm with @jeromekelleher, and ISOLATED_NOT_MISSING on the C side.

@bhaller
Copy link

bhaller commented Aug 26, 2020

I've probably already made my position clear, but I'll say it again for the record. :-> I vote for isolated_as_missing or isolated_are_missing, and with TSK_ISOLATED_NOT_MISSING. Not a fan of the ancestral variants.

@benjeffery
Copy link
Member

A draft PR for these changes is at #794

@mergify mergify bot closed this as completed in #794 Aug 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants