-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Intervalindex #15309
ENH: Intervalindex #15309
Conversation
63900d9
to
8415bb7
Compare
@shoyer so
what other signature is useful here? |
Codecov Report
@@ Coverage Diff @@
## master #15309 +/- ##
=========================================
Coverage ? 91%
=========================================
Files ? 145
Lines ? 50236
Branches ? 0
=========================================
Hits ? 45718
Misses ? 4518
Partials ? 0
Continue to review full report at Codecov.
|
Yes, that's one use case, though it's not necessarily worth doing it that's all it does. I was thinking of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got about half-way through, still need to look at the Cython.
I'm not convinced that it makes sense to add mask
. It introduces potential ambiguity in the data model, because now there are two ways of having null values (in the sub-indexes and mask) that we have the keep in sync.
doc/source/whatsnew/v0.20.0.txt
Outdated
@@ -96,6 +97,9 @@ support for bz2 compression in the python 2 c-engine improved (:issue:`14874`). | |||
|
|||
.. _whatsnew_0200.enhancements.uint64_support: | |||
|
|||
Enanced UInt64 Support |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enhanced
pandas/core/algorithms.py
Outdated
if dropna and (result.values == 0).all(): | ||
result = result.iloc[0:0] | ||
|
||
# normalizing is by len of all (regarless of dropna) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regardless
pandas/indexes/interval.py
Outdated
if right is None: | ||
|
||
if not isinstance(left, IntervalIndex): | ||
return Index(left, closed=closed, name=name, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would raise here instead. Otherwise something like IntervalIndex(np.arange(10))
would return a Int64Index
, which is clearly a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise with mixtures of Intervals
with inconsistent closed
values
In [46]: pd.IntervalIndex([pd.Interval(0, 1), pd.Interval(2, 3, closed='left')])
Out[46]: Index([(0, 1], [2, 3)], dtype='object')
That should probably raise.
pandas/indexes/interval.py
Outdated
def __new__(cls, left, right=None, closed='right', mask=None, | ||
name=None, copy=False, dtype=None, fastpath=False): | ||
|
||
if right is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you probably know, I'm not a big fan of these highly overloaded constructor.
I do get that it is convenient for consistency with other index types to write IntervalIndex(values)
.
pandas/indexes/interval.py
Outdated
if right is None: | ||
|
||
if not isinstance(left, IntervalIndex): | ||
return Index(left, closed=closed, name=name, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same concern as above
pandas/indexes/interval.py
Outdated
raise ValueError('left and right must have the same length') | ||
left_valid = notnull(self.left) | ||
right_valid = notnull(self.right) | ||
if not (left_valid == right_valid).all(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check should be updated to account for the mask.
pandas/indexes/interval.py
Outdated
closed='right') | ||
""" | ||
breaks = np.asarray(breaks) | ||
mask = isnull(breaks[:-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is reasonable, but note that the only valid IntervalIndex you can construct with from_breaks
that contains nulls is all null.
return len(self.left) | ||
|
||
@cache_readonly | ||
def values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I had a cython function for this (maybe not necessary).
pandas/indexes/interval.py
Outdated
def _maybe_cast_slice_bound(self, label, side, kind): | ||
return getattr(self, side)._maybe_cast_slice_bound(label, side, kind) | ||
|
||
def _convert_list_indexer(self, keyarr, kind=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which methods call _convert_list_indexer
?
pandas/indexes/interval.py
Outdated
|
||
else: | ||
if isinstance(target, IntervalIndex): | ||
raise NotImplementedError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be an important. I think it's necessary to ensure df.loc[df.index]
and df.reindex(df.index)
work?
9865000
to
a7f097b
Compare
Is [6] just not well-represented in the intervaltrees? I actually think this is a pretty common case. |
a7f097b
to
46f5179
Compare
Shouldn't this be a CategoricalIndex with IntervalIndex categories?
I think it's just as well represented for interval trees as it is for hash tables. But my understanding is that
So actually, for consistency we should probably raise in For this sort of behavior, I think you want |
interesting you should say that, I fiddled with this a bit
so So this actually makes a lot of sense. BUT, if you assign it to a DataFrame what should this do. I can leave it as a categorical, but then you get weird inconsistences, see here: bfd99d3, so instead I coerce this to a ndarray of intervals. Thus the index because an |
ahh, yes I can easily try different paths for unique and non-unique, ok. Do you have an implementation for non-unique? (e.g. given indexers return the included intervals for that indexer). |
@shoyer ahh I see you DO have a |
this all looks ok
This is suspect, this should be returning -1 for the indexers (rather than raising). So going to fix.
|
pandas/indexes/interval.py
Outdated
'for IntervalIndex indexers') | ||
else: | ||
return self._engine.get_indexer(target.values) | ||
indexer = self._engine.get_indexer(target.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should go in a separate Intervalndex.get_indexer_non_unique
method, right?
My understanding of the contract of Index.get_indexer
is that it returns an integer array with length equal to target
.
this is from the original issue
why should an |
Yeah, I don't remember why now but this does seem strange. The only reason why this makes any sense to me is that there are optimizations one can do for look-ups if the IntervalIndex consists of non-overlapping, continuous intervals (you can use binary search instead of the interval tree). |
If you don't me asking... if I wanted to use the new
The code now has the changes made in this PR, but |
@qAp This PR has changes in the c code, so you need to rebuild pandas. So after the commands you used above to checkout the correct branch, you need to do |
cc37b9d
to
067375c
Compare
@qAp happy to have you test this out.....pls report any issues here. |
be778c4
to
9844010
Compare
@jreback Any news on this? |
@zfrenchee I have to make a few more internal changes to make some things consistent. can I ask your usecases? / examples (e.g. so can have additional tests mainly.....) |
b1c0c61
to
275d951
Compare
going to merge in next day or 2. Will almost certainly need a follow up in any event. |
closes pandas-dev#7640 closes pandas-dev#8625
rename _is_contained_in -> contains add sorting test
merged. give this a whirl. |
Jeff -- thanks for pushing this one past the finish line!
…On Fri, Apr 14, 2017 at 6:33 AM, Jeff Reback ***@***.***> wrote:
merged.
give this a whirl.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#15309 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABKS1l706ZxVNoIWKJIn36ypkUn8-KZJks5rv3WVgaJpZM4L3m2_>
.
|
:> and let the bugs begin! hahah. |
closes #7640
closes #8625
reprise of #8707
Similar indicies
indexing