Proposal: New Index type for binned data (IntervalIndex) #7640
Comments
FWIW, in our local in-house n-dim library we have something similar (an IntervalAxis), and it works quite well. |
@shoyer all for this! I know you are against this, but I would encourage you to inherit from |
+1 here too. tho i think @shoyer great idea and excellent write up :) |
Thanks for the support! I'm not sure when I'll get around to implementing this, but I will add it to my open source backlog :). @jreback Agreed, for a new index class inside pandas, it is OK to subclass from @cpcloud Also agreed, |
This would be very useful for me, too. Currently I'm using a This also seems like it will help make contiguous groupby (#5494) easier, since it gives a natural choice of index for the groupings. |
if u (or anyone else) essentially test cases for everything from construction to various indexing ops eg for Int64Index result = Index([1,2,3]) |
Thanks for including me; sorry I didn't notice earlier (mail filter was throwing github alerts out). Indeed, I think a general interval index is probably a great addition; although, I lack the breadth in vision to see a general solution. I did actually implement a hacky version of an interval index in pyuvvis that converts a datetime index to intervals of seconds, minutes etc... The main lesson I learned is that your interval index should be able to map back to the original data. To do this, I actually retain the original datetimeindex, and use metadata like "_interval=True" to navigate between all of the logic. In my case, this mapping is stored on the TimeSpectra object (dataframe + metadata). I put a demo of this up in case seeing a hack in action might help in the design of a general solution. http://nbviewer.ipython.org/github/hugadams/pyuvvis/blob/master/examples/Notebooks/intervals.ipynb |
@hugadams Looking at your notebook, it appears you may be thinking of a TimedeltaIndex? The idea behind IntervalIndex is somewhat distinct -- although I can imagine that an IntervalIndex wrapping a TimedeltaIndex could be useful in some cases. @jreback Sounds like a good idea, when I get the chance I will start writing some test cases and add them to this issue. |
Ha ya exactly! Thanks, never even saw this thread. I'll post my notebook there for reference as well. I must not understand the intervalindex then. |
Here are a bunch of test cases: shoyer@838a597 I can open a PR if that makes things easier. |
@shoyer that's a nice test suite...link is good for now. but of course expand to an actual impl! |
I have updated the first post with some revisions to implementation details (per my test cases). Basically, I realized that there is not a strong need to require that intervals be contiguous, and dropping that requirement should add some nice flexibility (e.g., the ability to subsample intervals with @jreback Haha, I thought that was your job? ;) In all seriousness, I will probably get around to this at some point but the existing Index objects are pretty complex. #5080 would help -- I'm not looking forward to writing kludges around this also being an |
well this is the removing of ndarray from Index! https://github.com/jreback/pandas/tree/index almost done |
Fixes pandas-dev#7640, pandas-dev#8625

This is a work in progress, but it's far enough along that I'd love to get some feedback.

TODOs (more called out in the code):

- [ ] documentation + docstrings
- [ ] finish the index methods:
  - [ ] `get_loc`
  - [ ] `get_indexer`
  - [ ] `slice_locs`
- [ ] comparison operations
- [ ] fix `is_monotonic` (pending pandas-dev#8680)
- [ ] ensure sorting works
- [ ] arithmetic operations (not essential for MVP)
- [ ] cythonize the bottlenecks:
  - [ ] `from_breaks`
  - [ ] `_data`
  - [ ] `Interval`?
- [ ] `MultiIndex`
- [ ] `Categorical`/`cut`
- [ ] serialization
- [ ] lots more tests

CC @jreback @cpcloud @immerrr
First draft PR is up in #8707. So far, this is actually much easier than I feared... |
For those of you not following along in #901 (which is honestly a dup of this issue), I am now thinking that the implementation here should probably use an actual interval-tree rather than relying on sortedness. Also, for future reference: a suitable data-structure for an index of multi-dimensional intervals (an |
closes pandas-dev#7640 closes pandas-dev#8625
I'm not sure if this belongs here or elsewhere. However, I'm trying to not clutter everything uselessly by just adding to the ever growing list of issues. If this belongs elsewhere, I'm happy to move it. Is there a reason

```python
>>> pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index().index
CategoricalIndex([ (0.0, 11.222], (11.222, 22.444], (22.444, 33.667],
                  (33.667, 44.889], (44.889, 56.111], (56.111, 67.333],
                  (67.333, 78.556], (78.556, 89.778], (89.778, 101.0]],
                 categories=[(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], ...], ordered=True, dtype='category')
```

instead of the following.

```python
IntervalIndex([(0.0, 11.222], (11.222, 22.444], (22.444, 33.667], (33.667, 44.889], (44.889, 56.111], (56.111, 67.333], (67.333, 78.556], (78.556, 89.778], (89.778, 101.0]]
              closed='right',
              dtype='interval[float64]')
```

I would naively think that an

```python
cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.plot(index_part="mid")
```

would plot the counts vs. the index mid point. |
the categories are an interval index or whatever type we are actually binning
cut/qcut return categorical always |
I understand. However, a categorical index does not have the same methods
and properties available as an interval index. Is it at all reasonable to
return an interval index when the categories are purely numeric? Are there
reasons to use a categorical over an interval?
Ben
|
@bla1089 I did consider returning an
So in theory it is possible, but I don't really see a compelling reason to switch. cats are a nicer holder type of data like this. What exactly is the issue? |
@jreback I find myself doing the following type of thing rather often:

```python
import numpy as np
import pandas as pd

# Bin the data, count per bin, then relabel each bin by its midpoint for plotting.
cut = pd.cut(np.linspace(0, 100), bins=np.linspace(0, 101, 10)).value_counts().sort_index()
cut.index = pd.IntervalIndex(cut.index).mid.astype(float)
cut.plot(drawstyle="steps-mid")
```

I hadn't seen a particular issue for it and I was wondering if I was missing something. The backwards compat issue is certainly relevant. |
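A possible shortcut for the workaround above, offered only as a sketch: in current pandas the categories produced by `pd.cut` on numeric bins are themselves an `IntervalIndex`, so the midpoints can be read from them directly. The variable names (`binned`, `intervals`, `by_mid`) are illustrative and not part of the thread.

```python
import numpy as np
import pandas as pd

values = np.linspace(0, 100)
binned = pd.cut(values, bins=np.linspace(0, 101, 10))  # a Categorical

# The categories carry the interval structure (left, right, mid, closed).
intervals = binned.categories

# value_counts() on a Categorical reports one count per category, in category order.
counts = binned.value_counts()

# Relabel the counts by bin midpoint, ready for plotting.
by_mid = pd.Series(counts.to_numpy(), index=intervals.mid)
```

Calling `by_mid.plot(drawstyle="steps-mid")` would then give the same kind of plot as the snippet above.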
Design
The idea is to have a natural representation of the grids that ubiquitously appear in simulations and measurements of physical systems. Instead of referencing a single value, a grid cell references a range of values, based on the chosen discretization. Typically, cell boundaries would be specified by floating point numbers. In one dimension, a grid cell corresponds to an interval, the name we use here.
The key feature of `IntervalIndex` is that looking up an indexer should return all intervals in which the indexer's values fall. `FloatIndex` is a poor substitute, because of floating point precision issues, and because I don't want to label values by a single point.

An `IntervalIndex` is uniquely identified by two properties: `intervals`, an ndarray of shape `(len(idx), 2)` holding the bounds of each interval, and `closed` (`'left'` or `'right'`). Other useful properties for `IntervalIndex` would include `left`, `right` and `mid`, which should return arrays (indexes?) corresponding to the left, right or mid-points of each interval.

The constructor should allow the optional keyword argument `breaks` (an array of length `len(idx) + 1`) to be specified instead of `intervals`.

It's not entirely obvious what `idx.values` should be (`idx.mid`? strings like `'(0, 1]'`? an array of tuples or `Interval` objects?). I think the most useful choice for cross compatibility would probably be an ndarray like `idx.mid`.

`IntervalIndex` should support mathematical operations (e.g., `idx + 1`), which are calculated by vectorizing the operation over the breaks.

Examples

An example already in pandas that should be an `IntervalIndex` is the `levels` property of the categorical returned by `cut`, which is currently an object array of strings:

Example usage:
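As a rough illustration of the behaviour described in this section, here is a minimal sketch using the `IntervalIndex` that pandas later shipped, which differs from the API proposed here. In particular, `IntervalIndex.from_breaks` stands in for the proposed `breaks` keyword, and the proposed `idx + 1` is emulated by shifting the breaks and rebuilding the index. The data values are made up.

```python
import numpy as np
import pandas as pd

# len(idx) + 1 breaks define len(idx) intervals: (0, 1], (1, 2.5], (2.5, 4].
breaks = np.array([0.0, 1.0, 2.5, 4.0])
idx = pd.IntervalIndex.from_breaks(breaks, closed="right")

# Per-interval bounds and midpoints.
left, right, mid = idx.left, idx.right, idx.mid

# Looking up a scalar finds the interval that contains it.
counts = pd.Series([10, 20, 30], index=idx)
pos = idx.get_loc(1.7)       # position of the interval (1, 2.5]
value = counts.iloc[pos]

# With closed="right", a value on a break belongs to the interval on its left.
edge_pos = idx.get_loc(1.0)  # position of the interval (0, 1]

# Assumption: arithmetic "over the breaks" emulated by rebuilding the index.
shifted = pd.IntervalIndex.from_breaks(breaks + 1, closed=idx.closed)
```

Labelling by interval rather than by a single float avoids exact floating point matches: any value inside a cell resolves to that cell.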
Implementation
An `IntervalIndex` would be a monotonic and non-overlapping one-dimensional array of intervals. It is not required to be contiguous. A scalar `Interval` would correspond to a contiguous interval between start and stop values (e.g., given by integers, floating point numbers or datetimes).

For index lookups, I propose to do a binary search (`np.searchsorted`) on `idx.left`. If we add the constraint that all intervals must have a fixed width, we could calculate the bin using a formula in constant time, but I'm not sure the loss in flexibility would be worth the speedup.

`IntervalIndex` should play nicely when used as the levels for a `Categorical` variable (#7217), but it is not the same as a `CategoricalIndex` (#7629). For example, an `IntervalIndex` should not allow for redundant values. To represent redundant or non-contiguous intervals, you would need to use a `Categorical` or `CategoricalIndex` which uses an `IntervalIndex` for the levels. Calling `df.reset_index()` on a `DataFrame` with an `IntervalIndex` would create a new `Categorical` column.

Note: I'm not entirely sure if this design doc belongs here or on the mailing list (I'm happy to post it there if requested).
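The binary-search lookup proposed under Implementation, and the constant-time alternative for fixed-width intervals, could be sketched in plain NumPy roughly as follows. The function names `interval_lookup` and `fixed_width_lookup`, their signatures, and the `-1` "not found" convention are hypothetical and not taken from any pandas implementation.

```python
import numpy as np


def interval_lookup(left, right, values, closed="right"):
    """Position of the interval containing each value, or -1 if none does.

    Assumes the intervals are sorted and non-overlapping; gaps are allowed.
    """
    left, right, values = map(np.asarray, (left, right, values))
    if closed == "right":
        # Interval i is (left[i], right[i]]: find the last left edge < value.
        pos = np.searchsorted(left, values, side="left") - 1
        upper_ok = values <= right[np.clip(pos, 0, None)]
    elif closed == "left":
        # Interval i is [left[i], right[i]): find the last left edge <= value.
        pos = np.searchsorted(left, values, side="right") - 1
        upper_ok = values < right[np.clip(pos, 0, None)]
    else:
        raise ValueError("closed must be 'left' or 'right'")
    return np.where((pos >= 0) & upper_ok, pos, -1)


def fixed_width_lookup(start, width, n, values):
    """Constant-time variant for n left-closed intervals of equal width."""
    pos = np.floor((np.asarray(values) - start) / width).astype(int)
    return np.where((pos >= 0) & (pos < n), pos, -1)


# Intervals (0, 1], (1, 2], (3, 4] with a gap between 2 and 3.
found = interval_lookup([0, 1, 3], [1, 2, 4], [1.0, 0.0, 2.5, 3.5, 5.0])
fixed = fixed_width_lookup(0.0, 0.5, 4, [0.0, 0.75, 1.99, 2.0, -0.1])
```

The binary search costs O(log n) per lookup but handles gaps and uneven widths; the fixed-width formula is O(1) but gives up that flexibility, which is the trade-off weighed above.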
Here is the comment where I brought this up previously: #5460 (comment)
CC @hugadams -- I expect `IntervalIndex` would be very handy for your pyuvvis.