Add samples and sites options to ts.genotype_matrix? #678

hyanwong · 2020-06-13T09:21:14Z

As noted in the docs, the genotype_matrix method can be very memory intensive. As we get to larger datasets, I suspect users might want to only extract portions of the matrix, e.g. certain samples, or certain regions. For example one extreme would be to export the haplotype for a single individual. The obvious way to do this is to run simplify to reduce the number of samples, and keep_intervals to reduce the number of sites, but this isn't likely to be obvious to the average user. Should we provide 2 parameters to the genotype_matrix method itself which, if not None, run these two preprocessing steps to reduce the output before returning the genotype matrix?

The text was updated successfully, but these errors were encountered:

jeromekelleher · 2020-06-15T09:48:09Z

I think there's a samples argument to variants which we could also easily add to genotype matrix. That would be an easy PR, I think. Going through simplify etc seems excessive.

Sites is a bit less obvious, but I guess list of site IDs to output would be fine. For now, we could just skip any sites that are not in this list as we're iterating over the variants.

But, ah - this is done in C at the moment IIRC. That's probably pointless, to be honest, and we should probably just implement the genotype_matrix method in Python.

mufernando · 2020-06-24T00:21:20Z

I'm interested in this, so might have a go at it soon.

jeromekelleher · 2020-06-24T08:28:22Z

I'm interested in this, so might have a go at it soon.

Great, thanks @mufernando! The first thing to do is to figure out if a Python version of the current genotype_matrix function performs much worse than what we have now. To do this, create a large tree sequence (mininum 100K samples) and do some timing checks.

If we don't lose any real perf by doing it in python, then we can get rid of the current Python-C function, and our lives will be much easier.

jeromekelleher · 2020-09-29T14:11:07Z

To really do this efficiently for sites we'd need a method for efficiently seeking to particular trees, so I'm putting this in the random access project.

petrelharp · 2022-05-22T14:36:45Z

In removing SlimTreeSequence over in pyslim, I've run in to the ts.mutations_at( ) method I provided there, that provides more or less this functionality. It'd be nice to point to a replacement in tskit. What's the proposal for this API? Is it a method of Variant? I'm not sure if anyone is using it, so perhaps I won't actually do it right away (maybe, though?), but it'd at least be nice to have a concrete path forward worked out, and perhaps I can re-use my testing code from pyslim at least.

petrelharp · 2022-05-22T15:47:02Z

Edit: actually, this method is used in an example recipe in the SLiM manual ("18.13 - Tree-sequence recording and nucleotide-based models") to count up the trinucleotide mutation spectrum (ie the contexts in which mutations occurred), so I will be including the functionality somehow.

benjeffery · 2022-05-23T00:45:08Z

We have #2176 planned (soon) to bring the new C variant methods up to Python. After this, it will be pretty easy to make convenient methods on top. What's your timescale on needing this?

petrelharp · 2022-05-23T15:48:48Z

Ah, awesome. I can just move over my python code, so I don't need it at any particular time. Which sort of 'soon' is this one, do you think? That'd help me figure out how to deal with it in the short term.

benjeffery · 2022-05-23T15:49:40Z

I would hope before the paper release! So a couple of months?

szhan · 2022-08-16T12:43:56Z

@benjeffery has suggested doing the following tasks (as separate PRs) to solve this issue.

~~0 - Add ts.genotype_matrix to performance benchmark - PR 1~~
~~1 - Change current implementation to python similar to ts.variants - no changes to tests - PR 2~~
~~1.5 - Check for perf regression~~
2 - Add sites argument to this function and tests for it - PR 3
~~3 - Add samples argument and tests - PR 4~~

hyanwong · 2022-09-12T16:20:52Z

Samples argument is already done, and can be ticked off.

szhan · 2022-09-16T15:47:37Z

As agreed in #2494 we will revisit adding sites argument at some time.

mufernando self-assigned this Jun 24, 2020

hyanwong mentioned this issue Jul 13, 2020

Some questions about missing data #716

Closed

benjeffery added C API Issue is about the C API enhancement New feature or request Python API Issue is about the Python API labels Sep 29, 2020

This was referenced Apr 12, 2021

Add nodes argument to genotype_matrix #1207

Open

method to compute how many copies of each mutation some individuals carry #1308

Open

benjeffery added this to the Python 0.5.1 (paper) milestone May 24, 2022

szhan self-assigned this Aug 16, 2022

benjeffery unassigned mufernando Aug 16, 2022

szhan removed the C API Issue is about the C API label Aug 16, 2022

This was referenced Aug 16, 2022

Add genotypes_matrix performance benchmark #2464

Merged

Add Python implementation of genotype_matrix #2465

Merged

This was referenced Sep 6, 2022

Add samples arg, updated docs, and add tests #2493

Merged

Add sites argument to genotype matrix #2494

Closed

szhan closed this as completed Sep 16, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add samples and sites options to ts.genotype_matrix? #678

Add samples and sites options to ts.genotype_matrix? #678

hyanwong commented Jun 13, 2020

jeromekelleher commented Jun 15, 2020

mufernando commented Jun 24, 2020 •

edited

Loading

jeromekelleher commented Jun 24, 2020

jeromekelleher commented Sep 29, 2020

petrelharp commented May 22, 2022

petrelharp commented May 22, 2022

benjeffery commented May 23, 2022

petrelharp commented May 23, 2022

benjeffery commented May 23, 2022

szhan commented Aug 16, 2022 •

edited by benjeffery

Loading

hyanwong commented Sep 12, 2022

szhan commented Sep 16, 2022

Add samples and sites options to ts.genotype_matrix? #678

Add samples and sites options to ts.genotype_matrix? #678

Comments

hyanwong commented Jun 13, 2020

jeromekelleher commented Jun 15, 2020

mufernando commented Jun 24, 2020 • edited Loading

jeromekelleher commented Jun 24, 2020

jeromekelleher commented Sep 29, 2020

petrelharp commented May 22, 2022

petrelharp commented May 22, 2022

benjeffery commented May 23, 2022

petrelharp commented May 23, 2022

benjeffery commented May 23, 2022

szhan commented Aug 16, 2022 • edited by benjeffery Loading

hyanwong commented Sep 12, 2022

szhan commented Sep 16, 2022

mufernando commented Jun 24, 2020 •

edited

Loading

szhan commented Aug 16, 2022 •

edited by benjeffery

Loading