Add samples and sites options to ts.genotype_matrix? #678
Comments
I think there's a Sites is a bit less obvious, but I guess a list of site IDs to output would be fine. For now, we could just skip any sites that are not in this list as we're iterating over the variants. But, ah, this is done in C at the moment IIRC. That's probably pointless, to be honest, and we should probably just implement the
I'm interested in this, so might have a go at it soon.
Great, thanks @mufernando! The first thing to do is to figure out if a Python version of the current If we don't lose any real perf by doing it in Python, then we can get rid of the current Python-C function, and our lives will be much easier.
To really do this efficiently for sites we'd need a method for efficiently seeking to particular trees, so I'm putting this in the random access project.
In removing SlimTreeSequence over in pyslim, I've run into the
Edit: actually, this method is used in an example recipe in the SLiM manual ("18.13 - Tree-sequence recording and nucleotide-based models") to count up the trinucleotide mutation spectrum (i.e. the contexts in which mutations occurred), so I will be including the functionality somehow.
We have #2176 planned (soon) to bring the new C variant methods up to Python. After this, it will be pretty easy to make convenient methods on top. What's your timescale on needing this?
Ah, awesome. I can just move over my Python code, so I don't need it at any particular time. Which sort of 'soon' is this one, do you think? That'd help me figure out how to deal with it in the short term.
I would hope before the paper release! So a couple of months?
@benjeffery has suggested doing the following tasks (as separate PRs) to solve this issue.
Samples argument is already done, and can be ticked off.
As agreed in #2494 we will revisit adding
As noted in the docs, the `genotype_matrix` method can be very memory intensive. As we get to larger datasets, I suspect users might want to only extract portions of the matrix, e.g. certain samples, or certain regions. For example, one extreme would be to export the haplotype for a single individual. The obvious way to do this is to run `simplify` to reduce the number of samples, and `keep_intervals` to reduce the number of sites, but this isn't likely to be obvious to the average user. Should we provide 2 parameters to the `genotype_matrix` method itself which, if not `None`, run these two preprocessing steps to reduce the output before returning the genotype matrix?