-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP/API/ENH: IntervalIndex #8707
Conversation
so the whole notion of boxing - eg having an underlying impl then getting s boxed scalar type out is inherent in what u r doing you get this for free - see here though prob need to strip out the box/iter stuff (and maybe some other stuff - contains) can put in another mixin that u can use (it could live in index.py maybe BoxMixIn) |
so one of the big use cases here is as the index in a series coming out of cut/qcut |
def __ne__(self, other): | ||
return not self == other | ||
|
||
def __lt__(self, other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can generically do the index ops like this: https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L43
alternatively you can have a _make_index_op
that returns the a function
looks really good! Instead of cythonizing You are using an |
@shoyer sort of confused why you don't need it for indexing (e.g you have left/right for that purpose). ONLY on boxing do you need to do this (e.g. you need to implement the Think of this like |
@jreback Yes, not including _data at all and skipping all the standard index machinery was my original idea. But, there are actually a few advantages to keeping that representation around (the object index):
|
I disagree entirely. The whole point is that you don't need the object representation. You already have the impl in
caveat: what I did is naive. And not sure of the exact lookup semantics for IN an interval, but easy enough to maybe keep a 'freq' for left/right, (or can just do right-left), e.g. to find the 'natural' interval of the Index. Then you can take a number and find the left and right of it (WITHOUT using searchsorted, but indexing which is O(1)) then do a simple comparison. you ONLY need to use search sorted I think if their is no freq. E.g. you have a bunch of non-regular intervals (but since you have left and right you can prob do a pretty good job, e.g. if you find where it is on the lft, then you know about where it is on the right). You may need a custom searchsorted type of this 'irregular' intervals. |
(this is w/o boxing)
With Boxing
remember these still bee the original construction + ._data and keeping it around has memory overhead (much more than 2x the original) |
@jreback OK, I will give this a try without using |
haha, ok. so the freq I am talking about is between 2 Intervals and not the distance in a single interval (which is a value like the left/right). Hmm do they have to be the same? I guess for a regular interval series they do.
have a look at tseries.frequency.FrequencyInferer, though it this case its almost trivial e.g.
BTW, my point of this example is to have tests for the possible types of interval operands: integer, float, Timedeltas come to mind, though I suppose Timestamps/Periods make sense too. hmm, then their is:
Which i suppose 'works', but not sure how searching this would work. Maybe only allow certain types (and the left/right of an Interval should be the same). I think CategoricalIndex is the way to the above. |
I think you do need both the space between intervals and the distance for single intervals to be at a constant frequency in order to be able to do fast lookups. Everything needs to be able to map to a I may try to get things working first for the general case and then add the optimized Also agreed that we need tests for many types. Strings can't have a well-defined frequency, but I don't much harm in allowing their use. |
yup though you don't need it to map to This Index is just a dispatcher really to the combination of the underlying indexes. Much like MultiIndex is a collection of Index objects (that is well-ordered). |
|
||
@cache_readonly | ||
def mid(self): | ||
# TODO: figure out how to do add/sub as arithemtic even on Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I need some way to take the difference of two DatetimeIndex
objects (i.e., returning a TimedeltaIndex
). Because +
/-
still means union/set difference for indexes, I can't think of any way to do this right now short of converting to series objects (ugh) or converting the indexes to ndarrays and doing the NaT
checks manually (also ugh).
What do you think? I think we either need to define add
and sub
methods for indexes (kind of pointless in the long term) or make +
/-
on indexes do arithmetic in 0.16. Or we could define temporary private _add
and _sub
methods on indexes for this strict benefit of this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess in reality all I really need is the mean of two indexes. So I could also write a one-off function for that, doing the special work arounds for time indexes.
Just do this to subtract the 2 dti's (and not so set difference). bti I would for now not allow tz's in any passed in left/right (may work though)
|
@jreback you're not worried about NaT screwing things up? I suppose we might as well insist that left and right cannot have missing values, anyways. |
Actually, turns out subtracting
I am shocked! |
hahha their is a simple routine to do this though let me see about making this not do set difference and do subtraction on - it does make more sense maybe
|
@shoyer hows this coming? |
I suggest moving this code to the new |
+1 on moving as well |
can you rebase/update |
@shoyer can u rebase it? |
@shoyer status (can you rebase as well) |
I still would love to see It feels like this change is 80% of the way there, but that last 20% (tests, docs, handling edge cases, compat with other parts of the library) is a lot of work. I also worry about adding in more ad-hoc Cython templating when we are on the verge of a core rewrite that will provide a much more sustainable way to do this (with C++ templates). |
status? |
See above. Nothing new to add. |
ok closing then |
@shoyer thinking of finishing this assume no objections and still quite useful feature? |
@jreback Absolutely, I would be totally supportive if you or anyone else wants to finish this! I agree that tempita is the way to go for templating. |
so this is possible as a numpy array, but does not make sense as an This comes from this (a test in
? |
@jreback Yes, that was one of the trickier aspects of this PR. Basically, the fundamental problem here is the
The lowest category is an interval that is closed on both sides, different from all the other intervals. Possible the best path forward here is to deprecate |
hmm, this should be then equivalent (at least for floats) to:
or reallly just (this would work for Timestamp/Timedelta as well, though for ints would end up changing the dtype..... |
So there's also a case for ditching |
hmm, ok, have it mostly working. rebasing was easy. but have to fix things to be in the style of Index, and the templating. soon. |
@shoyer I think was your intention, and after mulling it over for a while it makes sense. Even though this might be a bit odd (as the API now will return different 'types' depending if the labels are specified), this makes intuitive sense (as if a user provides labels, I could make an In point of fact these 2 things (Categoricals and IntervalIndex) work very similarly when grouping and such, so I think this is ok. 0.19.2
with PR
|
@shoyer pls have a look, all tests finally passing, though a few issues remain. here's summary of changes since your version.
Going to work some more on this before a PR, but any comments welcome.
the I had a heckava time getting |
Re the |
@jorisvandenbossche yeah I think we did talk about doing a
|
actually this just works (so will fix API).
and in 0.19.2
|
ok pretty simple. added
|
closes #7640 closes #8625 reprise of #8707 Author: Jeff Reback <jeff@reback.net> Author: Stephan Hoyer <shoyer@climate.com> Closes #15309 from jreback/intervalindex and squashes the following commits: 11ab1e1 [Jeff Reback] merge conflicts 834df76 [Jeff Reback] more docs fbc1cf8 [Jeff Reback] doc example and bug 7577335 [Jeff Reback] fixup on merge of changes in algorithms.py 3a3e02e [Jeff Reback] sorting example 4333937 [Jeff Reback] api-types test fixing f0e3ad2 [Jeff Reback] pep b2d26eb [Jeff Reback] more docs e5f8082 [Jeff Reback] allow pd.cut to take an IntervalIndex for bins 4a5ebea [Jeff Reback] more tests & fixes for non-unique / overlaps rename _is_contained_in -> contains add sorting test 340c98b [Jeff Reback] CLN/COMPAT: IntervalIndex 74162aa [Stephan Hoyer] API/ENH: IntervalIndex
closes #7640
closes #8625
This is a work in progress, but it's far enough along that I'd love to get
some feedback.
TODOs (more called out in the code):
get_loc
get_indexer
slice_locs
get_loc
get_indexer
(for non-overlapping monotonic indexes)is_monotonic
always works (also lexicographic)order
workspd.Index
values
from_intervals
?Interval
?from_tuples
?Categorical
/cut
get_indexer
in full generality (for randomly ordered indexes)MultiIndex
-- should at least give a sane error messagesNotImplementedError
interval_range
-- likeperiod_range
to_interval
for casting strings (e.g.,to_interval('(0, 1]')
)Series.loc.__getitem__
Series.loc.__setitem__
CC @jreback @cpcloud @immerrr