working towards a slice that will work on intervals #261
Codecov Report
@@            Coverage Diff            @@
##           master    #261     +/-   ##
=========================================
- Coverage   86.82%   85.92%   -0.9%
=========================================
  Files          18       19      +1
  Lines       10319    14227   +3908
  Branches     1884     2773    +889
=========================================
+ Hits         8959    12224   +3265
- Misses        815     1048    +233
- Partials      545      955    +410
Looks good, and brings up some API questions:
the first item here is easy enough. I can alter that. the second is a whole can of worms. what would be the point of resetting the coordinate space on a multiple interval slice? when would that be useful? not saying it isn't, I'm just not sure when we would want that at all. |
I guess that I'm not sure either. I'm just trying to make things match up with the previous, How about go ahead and implement without |
worked on it this evening and think i have the |
LGTM. I'm not sure I can see a way of doing it efficiently with numpy though - I think we'll have to do this in C. If this is something we want urgently for stdpopsim I can jump in - just let me know. |
a24b1db to e83ba72
okay i changed that camelCase name (sorry), rebased to squash commits, and force pushed. i do think we want this for stdpopsim so if you could jump in on the C implementation @jeromekelleher that would be amazing! looping over edges and ranges is far from efficient. i'm curious to see how to better implement this. |
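For readers following along, the "looping over edges and ranges" approach can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: edges are modelled as (left, right, parent, child) tuples, intervals are half-open [start, end) pairs, and the helper name keep_intervals_naive is hypothetical.

```python
def keep_intervals_naive(edges, intervals):
    """Clip (left, right, parent, child) edges to the given intervals.

    Hypothetical helper for illustration only; not the code from this PR.
    Intervals are assumed half-open [start, end), disjoint and sorted.
    """
    kept = []
    for left, right, parent, child in edges:
        # Every edge is compared against every interval: O(edges * intervals).
        for start, end in intervals:
            new_left = max(left, start)
            new_right = min(right, end)
            if new_left < new_right:  # non-empty overlap
                kept.append((new_left, new_right, parent, child))
    return kept


edges = [(0, 10, 4, 0), (0, 5, 4, 1), (5, 10, 5, 1)]
print(keep_intervals_naive(edges, [(2, 4), (6, 8)]))
```

The nested loop is what makes the cost grow with the number of intervals, which is the inefficiency being discussed here.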
OK, I'll have a go when I get a chance over the next few days. I'll push to this branch to keep it all the same PR. |
The numpy implementation should be good as long as there are not a large number of intervals. I suppose you're thinking of something that doesn't scale with the number of intervals? Oh, I see - it would be pretty straightforward using sortedness and edge insertion/removal indices. It still sounds tedious: I'm very tempted to say we should just see how the python implementation does with whatever we need in stdpopsim, though. |
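One way to exploit sortedness, sketched here under assumptions, is to binary-search the sorted interval list for each edge rather than scanning all intervals. Note that this is a simpler stand-in technique and not the edge insertion/removal index approach mentioned above; the name keep_intervals_sorted and the tuple layout (left, right, parent, child) are hypothetical.

```python
from bisect import bisect_right


def keep_intervals_sorted(edges, intervals):
    """Clip edges to sorted, disjoint, half-open intervals using bisection.

    Hypothetical sketch: each edge only visits the intervals it overlaps.
    """
    ends = [end for _, end in intervals]
    kept = []
    for left, right, parent, child in edges:
        # Index of the first interval whose end lies strictly right of `left`.
        i = bisect_right(ends, left)
        # Walk forward while intervals still start before the edge ends.
        while i < len(intervals) and intervals[i][0] < right:
            start, end = intervals[i]
            kept.append((max(left, start), min(right, end), parent, child))
            i += 1
    return kept


edges = [(0, 10, 4, 0), (0, 5, 4, 1), (5, 10, 5, 1)]
print(keep_intervals_sorted(edges, [(2, 4), (6, 8)]))
```

With this layout the per-edge cost is a logarithmic search plus the number of intervals the edge actually overlaps, so edges that miss every interval are cheap.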
I've made some updates here and started writing out a set of tests. I've realised that properly covering all the corner cases here will take a lot of effort, and I'm starting to wonder if this is the right approach in the first place. First, I've reset the I then added a
So, I propose we get rid of Also, not convinced that |
I agree about
Great - I'm much happier about implementing/testing this if we don't have to do this.
That's true, you probably would want to do it in both directions for different applications. Is |
one possible name for this function would be so what is the plan for this PR @jeromekelleher? do you want to merge a renamed |
I'd like to add a function in this PR that'll support what we need for the analysis repo. I think there's two options:
def delete(self, intervals):
    """
    Returns a copy of this TableCollection with the tree topology and site data in the specified
    list of genomic intervals deleted. Intervals must be disjoint and sorted in increasing order.
    """
This seems clear and straightforward to implement, without any unnecessary complications. I guess we might want 'delete' for something later, but I can't see what. |
@petrelharp, have you any thoughts here? The |
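The proposed delete() docstring requires intervals to be disjoint and sorted in increasing order. A minimal sketch of how that precondition could be checked follows; the function name check_intervals, the half-open [start, end) convention, and the sequence_length argument are assumptions for illustration, not part of the proposal.

```python
def check_intervals(intervals, sequence_length):
    """Raise ValueError unless intervals are sorted, disjoint and in range.

    Hypothetical helper: intervals are assumed to be half-open [start, end)
    pairs, so two intervals that merely touch are still accepted.
    """
    previous_end = 0
    for start, end in intervals:
        # Each interval must begin at or after the previous one ends,
        # be non-empty, and stay within [0, sequence_length].
        if not (previous_end <= start < end <= sequence_length):
            raise ValueError(f"bad interval ({start}, {end})")
        previous_end = end


check_intervals([(0, 2), (2, 5), (7, 10)], sequence_length=10)  # passes silently
```

Checking everything in one chained comparison keeps the validation to a single pass over the list.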
On a different note, maybe we should rename |
@petrelharp and I talked more about this today. For the fodder for tomorrow's call. |
True enough - we should also think about what would happen to branch length statistics. I guess either way we'd want to mask out the windows that include these areas of the genome, so it's not going to make any difference whether we |
i've been playing with the |
@andrewkern just did some speed tests on a tree sequence with >1M edges, and found that with 10 intervals it took <1sec, with 100 intervals it took 4sec and with 1000 intervals, 42sec. We will never have 1000 intervals, and we only have to do this once, so the current implementation seems good enough for now. |
LGTM except for documentation of |
Well, let me discuss the name for a minute? Seems to me this could just be called |
... or, any statistic that is span normalised. Good point. A nice solution to this would add a |
Maybe this got lost above, but I don't think it should be called dice either, and I'm not sure 'slice' should be called slice either. I'm voting for |
I think we'll have to square this away on Monday, and I've got a bit confused, but: is the proposal to have two functions, one which removes intervals, and the other which keeps intervals? Obviously, there'd only have to be one implementation under the hood. This seems fine. If so, I think we should name the functions something close, e.g. |
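The "one implementation under the hood" idea can be illustrated by noting that deleting a set of intervals is the same as keeping their complement. The sketch below is an assumption-laden illustration, not this PR's code: complement_intervals is a hypothetical name, and intervals are taken to be sorted, disjoint, half-open [start, end) pairs within [0, sequence_length).

```python
def complement_intervals(intervals, sequence_length):
    """Return the gaps left by sorted, disjoint, half-open intervals.

    Hypothetical helper: deleting `intervals` is then equivalent to
    keeping complement_intervals(intervals, sequence_length).
    """
    gaps = []
    position = 0
    for start, end in intervals:
        if position < start:
            gaps.append((position, start))
        position = end
    # Trailing gap between the last interval and the end of the sequence.
    if position < sequence_length:
        gaps.append((position, sequence_length))
    return gaps


print(complement_intervals([(2, 4), (6, 8)], sequence_length=10))
```

Under this framing the "remove" function only needs to compute the complement and hand it to the "keep" function, so only one of the two has to do the real work.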
I vote for |
FWIW |
I'm good with that. |
happy with this. also like @hyanwong's suggestion of using |
I'm going to go with |
I'm coding up |
We're removing the sites here so I don't see any need for a remove_sites method right now. |
Oh yes, I see - should have looked at the PR code, sorry. Nevertheless, I think we need to |
335d16f to 95b7726
9b51122 to 94e54ad
OK, hopefully this is ready for final review and merge. |
Improve testing for keep/delete intervals and document. Also remove the ``test_dice`` file, as this isn't being used.
94e54ad to e0b85c9
Nobody has complained, so I'm going to merge this. |
Summary: we now have
and some useful things in :yay:! |
Also, TableCollection.keep_intervals(). I moved the packing functions into util as it seemed like the right home for them. |
this is a baby step PR, aimed at getting the slice() function to work on multiple intervals. in talking with @petrelharp we figured the best start would be to implement this in a test, open a PR, and then move on. Thoughts on this inefficient implementation?