working towards a slice that will work on intervals #261

andrewkern · 2019-07-11T22:42:12Z

this is a baby step PR, aimed at getting the slice() function to work on multiple intervals. in talking with @petrelharp we figured the best start would be to implement this in a test, open a PR, and then move on. Thoughts on this inefficient implementation?

codecov · 2019-07-11T23:12:27Z

Codecov Report

Merging #261 into master will decrease coverage by 0.89%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master     #261     +/-   ##
=========================================
- Coverage   86.82%   85.92%   -0.9%     
=========================================
  Files          18       19      +1     
  Lines       10319    14227   +3908     
  Branches     1884     2773    +889     
=========================================
+ Hits         8959    12224   +3265     
- Misses        815     1048    +233     
- Partials      545      955    +410

Flag	Coverage Δ
#c_tests	`86.87% <100%> (+0.04%)`	⬆️
#python_c_tests	`90% <100%> (?)`
#python_tests	`99.18% <100%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
python/tskit/trees.py	`98.54% <ø> (-0.05%)`	⬇️
python/tskit/util.py	`100% <100%> (ø)`	⬆️
python/tskit/tables.py	`99.8% <100%> (ø)`	⬆️
python/tskit/__init__.py	`100% <100%> (ø)`	⬆️
python/_tskitmodule.c	`83.38% <0%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f676cd9...e0b85c9.

python/tests/test_topology.py

petrelharp · 2019-07-11T23:23:44Z

Looks good, and brings up some API questions:

- As written, start and stop need to be lists/vectors. To avoid backwards-incompatible changes, we could check if start or stop are single numbers and if so replace them by e.g. [start]. It would also be nice to not have to remember the [ ] in the one-window case.
- What should reset_coordinates do? See notes in the code.
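A minimal sketch of the scalar check suggested in the first item, assuming start and stop are either single numbers or sequences of numbers. The helper name `as_interval_lists` is hypothetical and is not part of this PR:

```python
from numbers import Real


def as_interval_lists(start, stop):
    """Hypothetical helper: wrap single numbers so callers can omit the [ ]."""
    starts = [start] if isinstance(start, Real) else list(start)
    stops = [stop] if isinstance(stop, Real) else list(stop)
    if len(starts) != len(stops):
        raise ValueError("start and stop must have the same length")
    return starts, stops


# The one-window case no longer needs brackets.
print(as_interval_lists(0, 10))
# The multi-window case is passed through unchanged.
print(as_interval_lists([0, 20], [10, 30]))
```

With a check like this, existing single-interval calls would keep working while the multi-interval form is added.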

andrewkern · 2019-07-11T23:26:08Z

the first item here is easy enough. I can alter that.

the second is a whole can of worms. what would be the point of resetting the coordinate space on a multiple interval slice? when would that be useful? not saying it isn't, I'm just not sure when we would want that at all.

petrelharp · 2019-07-12T05:06:52Z

> I'm just not sure when we would want that at all.

I guess that I'm not sure either. I'm just trying to make things match up with the previous .slice() function. An alternative would be to keep slice as is and make this function called .mask(). This seems redundant.

How about go ahead and implement without reset_coordinates, we can see how easy it is to add?

andrewkern · 2019-07-12T05:56:30Z

worked on it this evening and think i have the reset_coordinates thing working. i also began fleshing out a test-- see test_multi_interval_slice(). obviously a bogus test still

python/tests/test_topology.py

jeromekelleher · 2019-07-12T06:58:15Z

LGTM. I'm not sure I can see a way of doing it efficiently with numpy though - I think we'll have to do this in C. If this is something we want urgently for stdpopsim I can jump in - just let me know.

andrewkern · 2019-07-12T14:03:35Z

okay i changed that camelCase name (sorry), rebased to squash commits, and force pushed. i do think we want this for stdpopsim so if you could jump in on the C implementation @jeromekelleher that would be amazing!

looping over edges and ranges is far from efficient. i'm curious to see how to better implement this.

jeromekelleher · 2019-07-12T14:55:58Z

OK, I'll have a go when I get a chance over the next few days. I'll push to this branch to keep it all the same PR.

petrelharp · 2019-07-13T05:18:25Z

The numpy implementation should be good as long as there are not a large number of intervals. I suppose you're thinking of something that doesn't scale with the number of intervals? Oh, I see - it would be pretty straightforward using sortedness and edge insertion/removal indices. It still sounds tedious: I'm very tempted to say we should just see how the python implementation does with whatever we need in stdpopsim, though.
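For reference, the "loop over edges and intervals" idea under discussion could look roughly like the plain-Python sketch below. This is not the PR's code: the tuple layout `(left, right, parent, child)` and the function name are assumptions for illustration, and the cost grows with (number of edges) × (number of intervals), which is the scaling concern raised above.

```python
def keep_edge_intervals(edges, intervals):
    """Clip each edge to each retained interval (hypothetical sketch).

    edges: list of (left, right, parent, child) tuples.
    intervals: list of (start, stop) tuples, treated as half-open [start, stop).
    """
    kept = []
    for left, right, parent, child in edges:
        for start, stop in intervals:
            new_left = max(left, start)
            new_right = min(right, stop)
            # Keep only the part of the edge that overlaps this interval.
            if new_left < new_right:
                kept.append((new_left, new_right, parent, child))
    return kept


edges = [(0, 10, 4, 0), (0, 5, 4, 1), (5, 10, 5, 1)]
print(keep_edge_intervals(edges, [(2, 4), (6, 8)]))
```

An approach that uses the sortedness of the edge table, as suggested in the comment, would avoid visiting every edge once per interval.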

jeromekelleher · 2019-07-14T14:23:23Z

I've made some updates here and started writing out a set of tests. I've realised that actually covering all the corner cases here properly will take a lot of effort, and I'm starting to wonder if this is the right approach in the first place.

First, I've reset the slice function back to its original definition, as it seemed that providing a list of intervals was more natural than two lists of start, stop values.

I then added a dice function, which seemed like a nice generalisation of slice. However, after making a start on probing the corner cases for this, I think it's the wrong thing to do:

- I think the UI is in the wrong direction: we should specify the intervals we want to remove rather than the ones we want to keep. This is what we actually want for stdpopsim anyway.
- I think reset_coordinates is a bad idea here. Firstly, there is a tricky numerical problem, where the sum of the interval lengths is not exactly equal to the right coordinate of the rightmost edge. I'm sure there'd be others as well if we tested this thoroughly. Secondly, why would we want to do this anyway? Surely we want to preserve the coordinates? For example, if we wanted to plot the tree sequence against the original recombination map, it'll be an error-prone mess if we squish all the coordinates together like this. There's no harm in having gaps with no topology in the middle of the tree sequence - they don't contribute any sites (if we remove them) and they don't affect the branch length statistics.
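The numerical problem with squashing coordinates can be illustrated with plain floating-point arithmetic. The sketch below assumes three hypothetical retained intervals of lengths 0.1, 0.2 and 0.3, and shows that the total length depends on the order in which the lengths are accumulated, so two code paths computing the "same" right-hand coordinate need not agree exactly:

```python
def accumulate(lengths):
    """Add lengths one at a time, as a coordinate-resetting loop might do."""
    total = 0.0
    for length in lengths:
        total += length
    return total


# Hypothetical lengths of three retained intervals.
lengths = [0.1, 0.2, 0.3]

forward = accumulate(lengths)             # (0.1 + 0.2) + 0.3
backward = accumulate(reversed(lengths))  # (0.3 + 0.2) + 0.1

# The two totals differ in the last bit.
print(forward == backward)
print(abs(forward - backward))
```

Keeping the original coordinates avoids this class of problem, because no coordinate is ever recomputed.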

So, I propose we get rid of dice, and have a new method which removes information from a given list of intervals. We did something similar for the tree sequences inferred by tsinfer for 1000 genomes etc., where we snipped out the topology for centromeres (but that was only a single interval). Not sure what it should be called though...

Also, not convinced that slice is the best name for the current slice function in this light. We haven't released it, so we're free to change it if we want.

andrewkern · 2019-07-14T23:07:02Z

I agree about reset_coordinates -- it's a bit esoteric.

dice could run either direction i suppose (erase or retain the intervals). should we just add an option to the function?

jeromekelleher · 2019-07-15T07:51:26Z

> I agree about reset_coordinates -- it's a bit esoteric.

Great - I'm much happier about implementing/testing this if we don't have to do this.

> dice could run either direction i suppose (erase or retain the intervals). should we just add an option to the function?

That's true, you probably would want to do it in both directions for different applications. Is dice the right name, I wonder?

andrewkern · 2019-07-15T17:00:43Z

one possible name for this function would be subset_ranges or simply subset.

so what is the plan for this PR @jeromekelleher? do you want to merge a renamed dice?

jeromekelleher · 2019-07-15T18:42:00Z

> so what is the plan for this PR @jeromekelleher? do you want to merge a renamed dice?

I'd like to add a function in this PR that'll support what we need for the analysis repo. I think there's two options:

1. Have something like dice, which can return a table collection with either a set of intervals removed or retained (but, without reset_coordinates).
2. Make something that just deletes a set of intervals from the edges and sites. We could call this delete_intervals or delete:

def delete(self, intervals):
    """
    Returns a copy of this TableCollection with the tree topology and site data
    in the specified list of genomic intervals deleted. Intervals must be
    disjoint and sorted in increasing order.
    """

This seems clear and straightforward to implement, without any unnecessary complications. I guess we might want 'delete' for something later, but I can't see what.
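A rough sketch of what the "disjoint and sorted" requirement and the site-deletion half of this proposal might look like on a plain list of site positions. The function names are hypothetical, and treating intervals as half-open `[start, stop)` is an assumption that the proposal above does not spell out:

```python
def check_intervals(intervals):
    """Raise ValueError unless intervals are well formed, disjoint and sorted."""
    last_stop = float("-inf")
    for start, stop in intervals:
        if start >= stop:
            raise ValueError(f"bad interval ({start}, {stop})")
        if start < last_stop:
            raise ValueError("intervals must be disjoint and sorted")
        last_stop = stop


def delete_sites(positions, intervals):
    """Return the site positions that fall outside every interval."""
    check_intervals(intervals)
    return [x for x in positions if not any(a <= x < b for a, b in intervals)]


print(delete_sites([1, 3, 5, 7], [(2, 4), (6, 8)]))
```

Validating the intervals up front keeps the deletion logic itself simple, since it can then assume well-formed input.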

jeromekelleher · 2019-07-15T18:43:07Z

@petrelharp, have you any thoughts here? The delete option seems simplest to me and does the job we want.

jeromekelleher · 2019-07-15T18:45:35Z

On a different note, maybe we should rename slice to extract_slice or something as well.

andrewkern · 2019-07-16T05:02:37Z

@petrelharp and I talked more about this today. For the analysis repo I'm becoming less certain that we even want to dice the tree sequence, rather than just mask out the sites from analysis. If we were to use a nicely implemented dice, we would still have to simulate the full tree sequence for the chromosome, so there is no savings in computation.

fodder for tomorrow's call.

jeromekelleher · 2019-07-16T07:56:24Z

True enough - we should also think about what would happen to branch length statistics. I guess either way we'd want to mask out the windows that include these areas of the genome, so it's not going to make any difference whether we dice/delete it first or not.

andrewkern · 2019-07-16T21:03:49Z

i've been playing with the dice function and added two more tests with more intervals to dice (10, 100). It seems to be very fast with this number of intervals. probably in fine shape for use with the stdpopsim analysis.

petrelharp · 2019-07-16T21:22:46Z

@andrewkern just did some speed tests on a tree sequence with >1M edges, and found that with 10 intervals it took <1sec, with 100 intervals it took 4sec and with 1000 intervals, 42sec. We will never have 1000 intervals, and we only have to do this once, so the current implementation seems good enough for now.

petrelharp · 2019-07-16T21:24:52Z

LGTM except for documentation of dice.

petrelharp · 2019-07-16T21:27:25Z

Well, let me discuss the name for a minute? Seems to me this could just be called slice, and have only one function - they do just the same thing, no? The name dice is nice but also risks initial confusion with regular polyhedra with numbered faces.

petrelharp · 2019-07-17T02:08:47Z

> what would happen to branch length statistics

... or, any statistic that is span normalised. Good point. A nice solution to this would add a mask argument to the statistic, or even as a property of the tree sequence. But in the meantime at least it's easy to have windows with breakpoints aligning with masked regions and omit the masked-out ones.
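The interim workaround described here (windows whose breakpoints align with the masked regions, with the masked windows omitted afterwards) might be set up as in the sketch below. The helper name is hypothetical, and it assumes the masked intervals are sorted, disjoint and lie within `[0, sequence_length]`:

```python
def masked_windows(sequence_length, masked):
    """Return (breakpoints, keep) with one keep flag per window.

    keep[i] is False when window i is one of the masked intervals.
    """
    breakpoints = [0]
    keep = []
    for start, stop in masked:
        if start > breakpoints[-1]:
            # Unmasked window between the previous breakpoint and this mask.
            breakpoints.append(start)
            keep.append(True)
        breakpoints.append(stop)
        keep.append(False)
    if breakpoints[-1] < sequence_length:
        # Unmasked window after the last masked interval.
        breakpoints.append(sequence_length)
        keep.append(True)
    return breakpoints, keep


breakpoints, keep = masked_windows(100, [(20, 30), (50, 60)])
print(breakpoints)
print(keep)
```

The breakpoints could then be used as the windows for a windowed statistic, and the results filtered with `keep` so that masked windows are dropped before any averaging.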

jeromekelleher · 2019-07-17T08:08:07Z

> Well, let me discuss the name for a minute? Seems to me this could just be called slice, and have only one function - they do just the same thing, no? The name dice is nice but also risks initial confusion with regular polyhedra with numbered faces.

Maybe this got lost above, but I don't think it should be called dice either, and I'm not sure 'slice' should be called slice either. I'm voting for delete, which removes a set of intervals. We rename slice to extract_slice. We want to mask for stdpopsim, as we're finding the intervals we don't want, right?

petrelharp · 2019-07-18T05:04:28Z

I think we'll have to square this away on Monday, and I've got a bit confused, but: is the proposal to have two functions, one which removes intervals, and the other which keeps intervals? Obviously, there'd only have to be one implementation under the hood. This seems fine. If so, I think we should name the functions something close, e.g. extract_slice and delete_slice, so it's obvious they are in the same class of operations. (and, those names are fine with me)
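One way to get "one implementation under the hood" is to implement only the keep operation and express deletion as keeping the complement. A minimal sketch of the complement step, assuming sorted, disjoint intervals on `[0, sequence_length)`; the function name is hypothetical:

```python
def complement_intervals(intervals, sequence_length):
    """Return the gaps left by sorted, disjoint intervals on [0, sequence_length)."""
    gaps = []
    previous_stop = 0
    for start, stop in intervals:
        if start > previous_stop:
            gaps.append((previous_stop, start))
        previous_stop = stop
    if previous_stop < sequence_length:
        gaps.append((previous_stop, sequence_length))
    return gaps


to_delete = [(20, 30), (50, 60)]
to_keep = complement_intervals(to_delete, 100)
print(to_keep)
# Taking the complement twice recovers the original intervals.
print(complement_intervals(to_keep, 100))
```

Deleting `to_delete` is then equivalent to keeping `to_keep`, so only one of the two operations needs a real implementation.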

petrelharp · 2019-08-07T15:57:58Z

I vote for delete_intervals( ) and keep_intervals( ) since there's other things one could delete, etc.

hyanwong · 2019-08-07T16:11:28Z

FWIW delete_spans is shorter to type, if you need a qualifier. Not sure what I prefer TBH.

petrelharp · 2019-08-07T16:20:25Z

> FWIW delete_spans is shorter to type, if you need a qualifier. Not sure what I prefer TBH.

I'm good with that.

andrewkern · 2019-08-07T17:34:37Z

> @petrelharp, @andrewkern, any objections to moving to delete(intervals) and keep(intervals) here? I'll finish this up ASAP if we're all happy with this direction.

happy with this. also like @hyanwong's suggestion of using _spans to disambiguate the object

jeromekelleher · 2019-08-07T19:00:05Z

I'm going to go with delete_intervals(intervals) --- fewer words to explain!

hyanwong · 2019-08-08T18:34:32Z

I'm coding up trim() and friends now. It seems like I should remove any sites that are in the trimmed regions. A function such as ts.remove_sites(self, site_ids, record_provenance=True) seems a sensible addition to the tskit armoury, and might be useful for delete_intervals mightn't it (which is why I bring it up here)? I don't think this is already done is it? Any objections to my coding it up?

jeromekelleher · 2019-08-08T21:37:49Z

> I'm coding up trim() and friends now. It seems like I should remove any sites that are in the trimmed regions. A function such as ts.remove_sites(self, site_ids, record_provenance=True) seems a sensible addition to the tskit armoury, and might be useful for delete_intervals mightn't it (which is why I bring it up here)? I don't think this is already done is it? Any objections to my coding it up

We're removing the sites here so I don't see any need for a remove_sites method right now.

hyanwong · 2019-08-09T07:57:39Z

Oh yes, I see - should have looked at the PR code, sorry.

Nevertheless, I think we need to remove_sites when trimming, as there might be cases which haven't gone through the keep_intervals process (e.g. if we have missing data). If we trim these without removing sites, we might end up with sites at negative positions. But I suggest moving this discussion to -> #292
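The negative-position concern can be seen with simple arithmetic on a list of site positions. The sketch below is only an illustration with hypothetical function names, not the trim code discussed in #292: shifting the origin without filtering leaves a site at a negative position, while filtering first does not.

```python
def shift_positions(site_positions, new_origin):
    """Shift coordinates so that new_origin becomes 0, keeping every site."""
    return [x - new_origin for x in site_positions]


def trim_sites(site_positions, left, right):
    """Drop sites outside [left, right), then shift so that left becomes 0."""
    return [x - left for x in site_positions if left <= x < right]


sites = [5, 12, 40, 95]
# Shifting alone leaves the site at position 5 with a negative coordinate.
print(shift_positions(sites, 10))
# Removing sites outside the retained region first avoids this.
print(trim_sites(sites, 10, 90))
```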

jeromekelleher · 2019-08-10T01:41:14Z

OK, hopefully this is ready for final review and merge.

Improve testing for keep/delete intervals and document. Also remove the ``test_dice`` file, as this isn't being used.

jeromekelleher · 2019-08-13T00:56:02Z

Nobody has complained, so I'm going to merge this.

petrelharp · 2019-08-13T05:46:20Z

Summary: we now have

TableCollection.delete_intervals( )
TableCollection.copy( ) 💯

and some useful things in util: pack_bytes, unpack_bytes, pack_strings, unpack_strings, and a few interval operations.

:yay:!

jeromekelleher · 2019-08-13T13:04:16Z

Also, TableCollection.keep_intervals(). I moved the packing functions into util as it seemed like the right home for them.

petrelharp reviewed Jul 11, 2019

python/tests/test_topology.py

petrelharp reviewed Jul 11, 2019

python/tests/test_topology.py

jeromekelleher reviewed Jul 12, 2019

python/tests/test_topology.py

andrewkern force-pushed the interval_slice branch from a24b1db to e83ba72 Compare July 12, 2019 14:01

jeromekelleher mentioned this pull request Jul 16, 2019

Output masking feature popsim-consortium/stdpopsim#104

Closed

hyanwong mentioned this pull request Aug 8, 2019

First pass at trim functions #292

Closed

jeromekelleher force-pushed the interval_slice branch 2 times, most recently from 335d16f to 95b7726 Compare August 10, 2019 01:11

jeromekelleher mentioned this pull request Aug 10, 2019

Faster version of keep_intervals #295

Closed

jeromekelleher force-pushed the interval_slice branch 2 times, most recently from 9b51122 to 94e54ad Compare August 10, 2019 01:38

andrewkern and others added 7 commits August 12, 2019 20:54

working towards a slice that will work on intervals

f47d4cb

created a numpy implementation of the interval slice with Peter

abad059

Add TableCollection.copy()

7db7c72

Add 'dice' method

820fa47

python version of fast dice

6df17fc

Add keep_intervals and delete_intervals.

b1c8f0c

Move interval and packing code to util.py

e0b85c9

Improve testing for keep/delete intervals and document. Also remove the ``test_dice`` file, as this isn't being used.

jeromekelleher force-pushed the interval_slice branch from 94e54ad to e0b85c9 Compare August 13, 2019 00:54

jeromekelleher merged commit be86a85 into tskit-dev:master Aug 13, 2019

jeromekelleher mentioned this pull request Aug 27, 2019

add mask option to stats #346

Open

hyanwong mentioned this pull request Sep 27, 2019

Trim functions #373

Closed

petrelharp mentioned this pull request Oct 1, 2020

clarification to keep_intervals docs #889

Merged
