Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

working towards a slice that will work on intervals #261

Merged
merged 7 commits into from
Aug 13, 2019

Conversation

andrewkern
Copy link
Member

this is a baby step PR, aimed at getting the slice() function to work on multiple intervals. in talking with @petrelharp we figured the best start would be to implement this in a test, open a PR, and then move on. Thoughts on this inefficient implementation?

@codecov
Copy link

codecov bot commented Jul 11, 2019

Codecov Report

Merging #261 into master will decrease coverage by 0.89%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master     #261     +/-   ##
=========================================
- Coverage   86.82%   85.92%   -0.9%     
=========================================
  Files          18       19      +1     
  Lines       10319    14227   +3908     
  Branches     1884     2773    +889     
=========================================
+ Hits         8959    12224   +3265     
- Misses        815     1048    +233     
- Partials      545      955    +410
Flag Coverage Δ
#c_tests 86.87% <100%> (+0.04%) ⬆️
#python_c_tests 90% <100%> (?)
#python_tests 99.18% <100%> (+0.01%) ⬆️
Impacted Files Coverage Δ
python/tskit/trees.py 98.54% <ø> (-0.05%) ⬇️
python/tskit/util.py 100% <100%> (ø) ⬆️
python/tskit/tables.py 99.8% <100%> (ø) ⬆️
python/tskit/__init__.py 100% <100%> (ø) ⬆️
python/_tskitmodule.c 83.38% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f676cd9...e0b85c9. Read the comment docs.

@petrelharp
Copy link
Contributor

Looks good, and brings up some API questions:

  • As written, start and stop need to be lists/vectors. To avoid backwards-incompatible changes, we could check if start or stop are single numbers and if so replace them by e.g. [start]. It would also be nice to not have to remember the [ ] in the one-window case.
  • what should reset_coordinates do? see notes in the code.

@andrewkern
Copy link
Member Author

the first item here is easy enough. I can alter that.

the second is a whole can of worms. what would be the point of resetting the coordinate space on a multiple interval slice? when would that be useful? not saying it isn't, I'm just not sure when we would want that at all.

@petrelharp
Copy link
Contributor

I'm just not sure when we would want that at all.

I guess that I'm not sure either. I'm just trying to make things match up with the previous, .slice() function. An alternative would be to keep slice as is and make this function called .mask(). This seems redundant.

How about go ahead an implement without reset_coordinates, we can see how easy it is to add?

@andrewkern
Copy link
Member Author

worked on it this evening and think i have the reset_coordinates thing working. i also began fleshing out a test-- see test_multi_interval_slice(). obviously a bogus test still

@jeromekelleher
Copy link
Member

LGTM. I'm not sure I can see a way to doing it efficiently with numpy though - I think we'll have to do this in C. If this is something we want urgently for stdpopsim I can jump in - just let me know.

@andrewkern
Copy link
Member Author

andrewkern commented Jul 12, 2019

okay i changed that camelCase name (sorry), rebased to squash commits, and force pushed. i do think we want this for stdpopsim so if you could jump in on the C implementation @jeromekelleher that would be amazing!

looping over edges and ranges is far from efficient. i'm curious to see how to better implement this.

@jeromekelleher
Copy link
Member

OK, I'll have a go when I get a chance over the next few days. I'll push to this branch to keep it all the same PR.

@petrelharp
Copy link
Contributor

The numpy implementation should be good as long as there are not a large number of intervals. I suppose you're thinking of something that doesn't scale with the number of intervals? Oh, I see - it would be pretty straightforward using sortedness and edge insertion/removal indices. It still sounds tedious: I'm very tempted to say we should just see how the python implementation does with whatever we need in stdpopsim, though.

@jeromekelleher
Copy link
Member

I've made some updates here and started writing out a set of tests. I've realised that actually properly covering all the corner cases here properly will take a lot of effort, and I'm starting to wonder if this is the right approach in the first place.

First, I've reset the slice function back to its original definition, as it seemed that providing a list of intervals was more natural that two lists of start, stop values.

I then added a dice function, which seemed like a nice generalisation of slice. However, after making a start on probing the corner cases for this, I think it's the wrong thing to do:

  • I think the UI is in the wrong direction: we should specify the intervals we want to remove rather than the ones we want to keep. This is what we actually want for stdpopsim anyway.
  • I think reset_coordinates is a bad idea here. Firstly, there is a tricky numerical problems, where the sum of the interval lengths is not exactly equal to the right coordinate of the rightmost edge. I'm sure there'd be others as well if we tested this thoroughly. Secondly, why would we want to do this anyway? Surely we want to preserve the coordinates? For example, if we wanted to plot the tree sequence against the original recombination map, it'll be an error-prone mess if we squish all the coordinates together like this. There's no harm in having gaps with no topology in the middle of the tree sequence - they don't contribute any sites (if we remove them) and they don't affect the branch length statistics.

So, I propose we get rid of dice, and have a new method which removes information from a given list of intervals. We did something similar for the tree sequences inferring by tsinfer for 1000 genomes etc where we snipped out the toplogy for centromeres (but that was only a single interval). Not sure what it should be called though...

Also, not convinced that slice is the best name for the current slice function in this light. We haven't released it, so we're free to change it if we want.

@andrewkern
Copy link
Member Author

I agree about reset_coordinates -- it's a bit esoteric.

dice could run either direction i suppose (erase or retain the intervals). should we just add an option to the function?

@jeromekelleher
Copy link
Member

I agree about reset_coordinates -- it's a bit esoteric.

Great - I'm much happier about implementing/testing this if we don't have to do this.

dice could run either direction i suppose (erase or retain the intervals). should we just add an option to the function?

That's true, you probably would want to do it in both directions for different applications. Is dice the right name, I wonder?

@andrewkern
Copy link
Member Author

one possible name for this function would be subset_ranges or simply subset.

so what is the plan for this PR @jeromekelleher? do you want to merge a renamed dice?

@jeromekelleher
Copy link
Member

so what is the plan for this PR @jeromekelleher? do you want to merge a renamed dice?

I'd like to add a function in this PR that'll support what we need for the analysis repo. I think there's two options:

  • Have something like dice, which can return a table collection with either a set of intervals removed or retained (but, without reset_coordinates).
  • Make something that just deletes a set of intervals from the edges and sites. We could call this delete_intervals or delete:
def delete(self, intervals):
     """
     Returns a copy of this TableCollection with the tree topology and site data in the specified 
     list of genomic intervals deleted.  Intervals must be disjoint and sorted in increasing order.
      """

This seems clear and straightforward to implement, without an unnecessary complications. I guess we might want 'delete' for something later, but I can't see what.

@jeromekelleher
Copy link
Member

@petrelharp, have you any thoughts here? The delete option seems simplest to me and does the job we want.

@jeromekelleher
Copy link
Member

On a different note, maybe we should rename slice to extract_slice or something as well.

@andrewkern
Copy link
Member Author

@petrelharp and I talked more about this today. For the analysis repo I'm becoming less certain that we even want to dice the tree sequence, rather than just mask out the sites from analysis. If we were to use a nicely implemented dice, we would still have to simulate the full tree sequence for the chromosome, so there is no savings in computation.

fodder for tomorrow's call.

@jeromekelleher
Copy link
Member

True enough - we should also think about what would happen to branch length statistics. I guess either way we'd want to mask out the windows that include these areas of the genome, so it's not going to make any difference whether we dice/delete it first or not.

@andrewkern
Copy link
Member Author

i've been playing with the dice function and added two more tests with more intervals to dice (10, 100). It seems to be very fast with this number of intervals. probably in fine shape for use with the stdpopsim analysis.

@petrelharp
Copy link
Contributor

@andrewkern just did some speed tests on a tree sequence with >1M edges, and found that with 10 intervals it took <1sec, with 100 intervals it took 4sec and with 1000 intervals, 42sec. We will never have 1000 intervals, and we only have to do this once, so the current implementation seems good enough for now.

@petrelharp
Copy link
Contributor

LGTM except for documentation of dice.

@petrelharp
Copy link
Contributor

Well, let me discuss the name for a minute? Seems to me this could just be called slice, and have only one function - they do just the same thing, no? The name dice is nice but also risks initial confusion with regular polyhedra with numbered faces.

@petrelharp
Copy link
Contributor

what would happen to branch length statistics

... or, any statistic that is span normalised. Good point. A nice solution to this would add a mask argument to the statistic, or even as a property of the tree sequence. But in the meantime at least it's easy to have windows with breakpoints aligning with masked regions and omit the masked-out ones.

@jeromekelleher
Copy link
Member

Well, let me discuss the name for a minute? Seems to me this could just be called slice, and have only one function - they do just the same thing, no? The name dice is nice but also risks initial confusion with regular polyhedra with numbered faces.

Maybe this got lost above, but I don't think it should be called dice either, and I'm not sure 'slice' should be called slice as well. I'm voting for delete, which removes a set of intervals. We rename slice to extract_slice. We want to mask for stdpopsim, as we're finding the intervals we don't want, right?

@petrelharp
Copy link
Contributor

I think we'll have to square this away on Monday, and I've got a bit confused, but: is the proposal is to have two functions, one which removes intervals, and the other which keeps intervals? Obviously, there'd only have to be one implementation under the hood. This seems fine. If so, I think we should name the functions something close, e.g. extract_slice and delete_slice, so it's obvious they are in the same class of operations. (and, those names are fine with me)

@petrelharp
Copy link
Contributor

I vote for delete_intervals( ) and keep_intervals( ) since there's other things one could delete, etc.

@hyanwong
Copy link
Member

hyanwong commented Aug 7, 2019

FWIW delete_spans is shorter to type, if you need a qualifier. Not sure what I prefer TBH.

@petrelharp
Copy link
Contributor

FWIW delete_spans is shorter to type, if you need a qualifier. Not sure what I prefer TBH.

I'm good with that.

@andrewkern
Copy link
Member Author

@petrelharp, @andrewkern, any objections to moving to delete(intervals) and keep(intervals) here? I'll finish this up ASAP if we're all happy with this direction.

happy with this. also like @hyanwong's suggestion of using _spans to disambiguate the object

@jeromekelleher
Copy link
Member

I'm going to go with delete_intervals(intervals) --- fewer words to explain!

@hyanwong
Copy link
Member

hyanwong commented Aug 8, 2019

I'm coding up trim() and friends now. It seems like I should remove any sites that are in the trimmed regions. A function such as ts.remove_sites(self, site_ids, record_provenance=True) seems a sensible addition to the tskit armoury, and might be useful for delete_intervals mightn't it (which is why I bring it up here)? I don't think this is already done is it? Any objections to my coding it up?

@jeromekelleher
Copy link
Member

I'm coding up trim() and friends now. It seems like I should remove any sites that are in the trimmed regions. A function such as ts.remove_sites(self, site_ids, record_provenance=True) seems a sensible addition to the tskit armoury, and might be useful for delete_intervals mightn't it (which is why I bring it up here)? I don't think this is already done is it? Any objections to my coding it up

We're removing the sites here so I don't see any need for a remove_sites method right now.

@hyanwong
Copy link
Member

hyanwong commented Aug 9, 2019

Oh yes, I see - should have looked at the PR code, sorry.

Nevertheless, I think we need to remove_sites when trimming, as there might be cases which haven't gone through the keep_intervals process (e.g. if we have missing data). If we trim these without removing sites, we might end up with sites at negative positions. But I suggest moving this discussion to -> #292

@jeromekelleher jeromekelleher force-pushed the interval_slice branch 2 times, most recently from 335d16f to 95b7726 Compare August 10, 2019 01:11
@jeromekelleher jeromekelleher force-pushed the interval_slice branch 2 times, most recently from 9b51122 to 94e54ad Compare August 10, 2019 01:38
@jeromekelleher
Copy link
Member

OK, hopefully this is ready for final review and merge.

@jeromekelleher
Copy link
Member

Nobody has complained, so I'm going to merge this.

@jeromekelleher jeromekelleher merged commit be86a85 into tskit-dev:master Aug 13, 2019
@petrelharp
Copy link
Contributor

Summary: we now have

  • TableCollection.delete_intervals( )
  • TableCollection.copy( ) 💯

and some useful things in util: pack_bytes, unpack_bytes, pack_strings, unpack_strings, and a few interval operations.

:yay:!

@jeromekelleher
Copy link
Member

Also, TableCollection.keep_intervals(). I move the packing functions into util as it seemed like the right home for them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants