-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: IntervalIndex[IntervalArray] #20611
Conversation
Hello @TomAugspurger! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on July 13, 2018 at 01:01 Hours UTC |
@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None): | |||
ret = take_1d(self.categories.values, self._codes) | |||
if dtype and not is_dtype_equal(dtype, self.categories.dtype): | |||
return np.asarray(ret, dtype) | |||
if is_extension_array_dtype(ret): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
__array__
has to return an ndarray. Without this, Categorical[ExtensionArray]would fail, as
take_1d(...)` would be an ExtensionArray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have an update on this section already in intna
@@ -683,6 +682,11 @@ def construct_from_string(cls, string): | |||
msg = "a string needs to be passed, got type {typ}" | |||
raise TypeError(msg.format(typ=type(string))) | |||
|
|||
@property | |||
def type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the correct scalar type for IntervalArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc-string
@@ -270,7 +270,8 @@ def __new__(cls, data=None, dtype=None, copy=False, name=None, | |||
**kwargs) | |||
|
|||
# interval | |||
if is_interval_dtype(data) or is_interval_dtype(dtype): | |||
if ((is_interval_dtype(data) or is_interval_dtype(dtype)) and | |||
not is_object_dtype(dtype)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so Index(intervals, dtype=object)
works correctly (Index, not IntervalIndex)
@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key): | |||
------- | |||
ndarray of hashed values array, same size as len(c) | |||
""" | |||
hashed = hash_array(c.categories.values, encoding, hash_key, | |||
# Convert ExtensionArrays to ndarrays | |||
values = np.asarray(c.categories.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Categorica[ExtensionArray].values
will be an EA, and this has to be an ndarray.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, shouldn't you need to check if this is an EA here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, asarray
won't copy an ndarray, so calling it on an ndarray should be fine.
is_extension_array_dtype
takes ~ 2 microseconds, and asarray
takes ~10 ns, so better to just always call asarray instead of checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need this change? what is it for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried removing it, and test_hash_scalar
failed for Interval(0, 1)
:
pandas/pandas/tests/util/test_hashing.py
Lines 91 to 102 in 1033e8b
def test_hash_scalar(self): | |
for val in [1, 1.4, 'A', b'A', u'A', pd.Timestamp("2012-01-01"), | |
pd.Timestamp("2012-01-01", tz='Europe/Brussels'), | |
datetime.datetime(2012, 1, 1), | |
pd.Timestamp("2012-01-01", tz='EST').to_pydatetime(), | |
pd.Timedelta('1 days'), datetime.timedelta(1), | |
pd.Period('2012-01-01', freq='D'), pd.Interval(0, 1), | |
np.nan, pd.NaT, None]: | |
result = _hash_scalar(val) | |
expected = hash_array(np.array([val], dtype=object), | |
categorize=True) | |
assert result[0] == expected[0] |
@@ -402,13 +403,17 @@ def encode(obj): | |||
u'freq': u_safe(getattr(obj, 'freqstr', None)), | |||
u'tz': tz, | |||
u'compress': compressor} | |||
elif isinstance(obj, IntervalIndex): | |||
return {u'typ': u'interval_index', | |||
elif isinstance(obj, (IntervalIndex, IntervalArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should talk about this more generally in #20612
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is very very tricky from a back-compat perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm, that doesn't sound fun. I'll see what I can do to avoid that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, should this be reverted? Do any msgpack tests fail if it is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test below fails on the last copied line for Interval
data if reverted:
pandas/pandas/tests/io/test_packers.py
Lines 357 to 360 in da6e26d
def test_basic_index(self): | |
for s, i in self.d.items(): | |
i_rec = self.encode_decode(i) |
@@ -546,10 +546,8 @@ def test_basic(self): | |||
|
|||
s = Series(ii, name='A') | |||
|
|||
# dtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API change.
@@ -886,7 +886,7 @@ def test_hasnans_isnans(self): | |||
assert not idx.hasnans | |||
|
|||
idx = index.copy() | |||
values = idx.values | |||
values = np.asarray(idx.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this change since IntervalArray
currently isn't mutable. Worth talking about that.
@@ -51,7 +51,7 @@ def test_astype_category(self, index): | |||
'timedelta64', 'timedelta64[ns]', 'datetime64', 'datetime64[ns]', | |||
'datetime64[ns, US/Eastern]']) | |||
def test_astype_cannot_cast(self, index, dtype): | |||
msg = 'Cannot cast IntervalIndex to dtype' | |||
msg = 'Cannot cast IntervalArray to dtype' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm shouldn't have changed this. Looking into it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level summary:
- Still have IntervalIndex composed with IntervalArray. This seems the most sensible for me.
- IntervalArray currently isn't mutable. That could be changed but will require some care to ensure that
.left
and.right
are modified inplace atomically
pandas/core/indexes/interval.py
Outdated
_typ = 'intervalindex' | ||
_comparables = ['name'] | ||
_attributes = ['name', 'closed'] | ||
_allow_index_ops = True | ||
_exception_rewrite = lambda: rewrite_exception('IntervalArray', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what people's thoughts are. Without this, the exception would be
In [6]: pd.IntervalIndex(2)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: IntervalArray(...) must be called with a collection of some kind, 2 was passed
This lets the message be IntervalIndex(...)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments -1 unless much more shared code
pandas/core/arrays/__init__.py
Outdated
from .interval import IntervalArray | ||
|
||
|
||
__all__ = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don’t need the all
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's so the #noqa
aren't needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don’t do this anywhere else - don’t need introduce new ways of doing things
u less u r fixing it everywhere
result = IntervalMixin.__new__(cls) | ||
|
||
closed = closed or 'right' | ||
left = _ensure_index(left, copy=copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is SO duplicating II
have to do something about this - this is a technical debt nightmare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you see IntervalIndex._simple_new? There isn't any duplicated code between the two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so EA takes a different take on construction that in Index. Where we use _simple_new and _shalllow_copy, but conceptually these are the same things. Would be ok with changing the spellings inside Index to conform this. This is my main objection generally. We spell things one way and a different way in EA, this is so confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback Can you elaborate on this comment (as you say this is your main objection, but I don't fully understand it).
So IntervalArray._simple_new
construct from left/right, while IntervalIndex._simple_new
now constructs from values (IntervalArray), while previously it constructed from left/right. So yes, this is different between array and index, but, for IntervalIndex it is now more consistent with the base Index._simple_new (which constructs from values), so I think that is a good thing.
you copied code from II this is completely he wrong direction to go |
Could you maybe demo a single method that you think can be moved to a mixin? |
this should be an extremely simple wrapper around the Index code my point is evey method here should be shared |
pandas/core/indexes/interval.py
Outdated
value itself if False, ``nan``. | ||
|
||
..versionadded:: 0.23.0 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what the heck is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return cls.from_arrays(left, right, closed, name=name, copy=False, | ||
dtype=dtype) | ||
|
||
def to_tuples(self, na_tuple=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger i do see u ripped some shared code out of this
but the top level EA is just adding way too much
Again, I really don't think there's any code duplication in the current implementation. We've been discussing Indexes subclassing EAs in #19696, but the prevailing opinion there is that composition is better, specifically #19696 (comment) and #19696 (comment). If you want to weigh in there, then please do. But assuming we do have
Ok... I don't think that's worth the extra time it'll take to rewrite CategoricalIndex and IntervalIndex to use inheritance. |
Writable docstrings. Removed lambda as class attribute.
@@ -1347,7 +926,7 @@ def _format_data(self, name=None): | |||
tail = [formatter(x) for x in self] | |||
summary = '[{tail}]'.format(tail=', '.join(tail)) | |||
|
|||
return summary + self._format_space() | |||
return summary + ',' + self._format_space() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: This was missing a ,
at the end of the first line.
In [2]: pd.IntervalIndex.from_breaks([1, 2])
Out[2]:
IntervalIndex([(1, 2]] # <-- missing comma
closed='right',
dtype='interval[int64]')
I think also closes #19453, as it achieves the desired end result? At the very least it looks to implement the behavior I wanted from that issue. Will try taking a closer look at the changes here after work. |
Thanks, forgot about that issue. Adding to the closed list.
…On Wed, Apr 4, 2018 at 2:03 PM, jschendel ***@***.***> wrote:
I think also closes #19453
<#19453>, as it achieves the
desired end result? At the very least it looks to implement the behavior I
wanted from that issue. Will try taking a closer look at the changes here
after work.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20611 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIupYF34AACMJv8OlV5cGnSZ53l8qks5tlRj2gaJpZM4THJIe>
.
|
Codecov Report
@@ Coverage Diff @@
## master #20611 +/- ##
=========================================
Coverage ? 91.96%
=========================================
Files ? 166
Lines ? 50274
Branches ? 0
=========================================
Hits ? 46232
Misses ? 4042
Partials ? 0
Continue to review full report at Codecov.
|
Did a basic version of I think that we have to copy both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I won't have time this week (offline the next days) to do a thorough review, but +1 on general approach. Thanks a lot for this!
For IntervalIndex, if we want to reduce the lines of code, we could rather easily add the interval-specific methods and attributes (left, right, closed, length, mid, is_non_overlapping_monotonic, ...) in an automatic way (like we add arithmetic methods on DataFrame etc).
Basically the methods that just passthrough the self._data
equivalent and could also inherit the docstring automatically.
pandas/core/arrays/interval.py
Outdated
return self.values | ||
|
||
def get_values(self): | ||
return self.values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method needed?
pandas/core/arrays/interval.py
Outdated
|
||
return self._shallow_copy(new_left, new_right) | ||
|
||
take_nd = take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think this should be needed?
pandas/core/arrays/interval.py
Outdated
|
||
# Conversion | ||
@property | ||
def values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not call this .values
(is this property required somewhere else?)
This is basically __array__
?
pandas/core/arrays/interval.py
Outdated
tuples = np.where(~self.isna(), tuples, np.nan) | ||
return tuples | ||
|
||
def tolist(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed somewhere?
pandas/core/arrays/interval.py
Outdated
# TODO: think about putting this in a parent | ||
return self.values.tolist() | ||
|
||
def repeat(self, repeats): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem, is this needed?
# Avoid materializing self.values | ||
return self.left.size | ||
# Avoid materializing ndarray[Interval] | ||
return self._data.size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could in principle be moved to the base Index class (same for shape, itemsize, dtype, ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few quick comments; will review more later.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -299,6 +299,41 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist | |||
df['A'].dtype | |||
df['B'].dtype | |||
|
|||
.. _whatsnew_023.enhancements.interval: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_whatsnew_023.
--> _whatsnew_0230.
and likewise on line 337.
|
||
return self._shallow_copy(left, right) | ||
|
||
def __setitem__(self, key, value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this support the case where value
is np.nan
? Doesn't look like that currently works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it currently fails. I'll see what can be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's also something strange going on with datetime data:
In [2]: ia = pd.interval_range(pd.Timestamp('20180101'), periods=3).values
In [3]: ia
Out[3]:
IntervalArray([(2018-01-01, 2018-01-02], (2018-01-02, 2018-01-03], (2018-01-03, 2018-01-04]],
closed='right',
dtype='interval[datetime64[ns]]')
In [4]: new_iv = pd.Interval(pd.Timestamp('20180101'), pd.Timestamp('20180105'))
In [5]: new_iv
Out[5]: Interval('2018-01-01', '2018-01-05', closed='right')
In [6]: ia[0] = new_iv
In [7]: ia
Out[7]: ---------------------------------------------------------------------------
ValueError: left side of interval must be <= right side
In [8]: ia.left
Out[8]: DatetimeIndex(['2018-01-01', '2018-01-05', '2018-01-03'], dtype='datetime64[ns]', freq='D')
In [9]: ia.right
Out[9]: DatetimeIndex(['2018-01-05', '2018-01-03', '2018-01-04'], dtype='datetime64[ns]', freq='D')
pandas/core/arrays/interval.py
Outdated
return self._shallow_copy(left, right) | ||
|
||
def __setitem__(self, key, value): | ||
if not (is_interval_dtype(value) or isinstance(value, Interval)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should ABCInterval
be used for the isinstance
check?
pandas/core/arrays/interval.py
Outdated
|
||
if not isinstance(value, Interval): | ||
msg = ("Interval.fillna only supports filling with scalar " | ||
"'value'. Got a '{}' instead of a 'pandas.Interval'." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message seems a little unclear/confusing. Maybe something like: "IntervalArray.fillna only supports filling with pandas.Interval, got a '{}' instead."
pandas/core/arrays/interval.py
Outdated
self._left = left | ||
self._right = right | ||
|
||
def fillna(self, value=None, method=None, limit=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs to be some extra checks on value
here. Right now if I pass an Interval
with a different closed
, it silently gets changed to match the array's closed
:
In [2]: ia = pd.interval_range(0, 4).insert(0, np.nan).values
In [3]: ia
Out[3]:
IntervalArray([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
closed='right',
dtype='interval[float64]')
In [4]: ia.fillna(pd.Interval(10, 20, closed='both'))
Out[4]:
IntervalArray([(10.0, 20.0], (0.0, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]],
closed='right',
dtype='interval[float64]')
Could probably create some type of helper function/method for this, as I'd guess it'd be somewhat of a common thing to do.
pandas/core/arrays/interval.py
Outdated
def __setitem__(self, key, value): | ||
if not (is_interval_dtype(value) or isinstance(value, Interval)): | ||
msg = "'value' should be an interval type, got {} instead." | ||
raise TypeError(msg.format(type(value))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like generally type(value).__name__
is used in messages instead of just type(value)
. Not super concerned with this distinction though, just a minor inconsistency I noticed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left some comments. But this move shows a lot of rough edges. you had to copy quite a bit of code that is already in the index by base class. I really dont' want to re-invent the wheel. Need to decide what goes to EA base class (but that can't be used in Index ATM), and what should be in a shared EA/Index Mixin. This is getting pretty messy.
pandas/compat/__init__.py
Outdated
@@ -451,3 +451,11 @@ def is_platform_mac(): | |||
|
|||
def is_platform_32bit(): | |||
return struct.calcsize("P") * 8 < 64 | |||
|
|||
|
|||
class _WritableDoc(type): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be in pandas.util._decorators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a decorator. Still want it there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we have a pandas.util._doctools, move there
@@ -1218,6 +1219,8 @@ def __array__(self, dtype=None): | |||
ret = take_1d(self.categories.values, self._codes) | |||
if dtype and not is_dtype_equal(dtype, self.categories.dtype): | |||
return np.asarray(ret, dtype) | |||
if is_extension_array_dtype(ret): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment this.
@@ -683,6 +682,11 @@ def construct_from_string(cls, string): | |||
msg = "a string needs to be passed, got type {typ}" | |||
raise TypeError(msg.format(typ=type(string))) | |||
|
|||
@property | |||
def type(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc-string
pandas/core/indexes/interval.py
Outdated
_typ = 'intervalindex' | ||
_comparables = ['name'] | ||
_attributes = ['name', 'closed'] | ||
_allow_index_ops = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a remnant of a failed refactor. Will remove.
# both left and right are values | ||
pass | ||
|
||
result = self._data._shallow_copy(left=left, right=right) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so you are copying then constructing again???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left & right are shallow copied just like before, and _simple_new
is called just like before. Unless I made a mistake (which I don't think I did), this is identical to before.
pandas/core/arrays/interval.py
Outdated
right = np.concatenate([interval.right for interval in to_concat]) | ||
return cls._simple_new(left, right, closed=closed, copy=False) | ||
|
||
def _shallow_copy(self, left=None, right=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a method in EA? it should be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as this is only uses internally within the IntervalArray class, it is not necessarily needed to make this part of the official ExtensionArray interface.
What are the advantages in doing so? (also the signature will be different for each array type)
pandas/core/arrays/interval.py
Outdated
verify_integrity=False) | ||
|
||
# TODO: doc | ||
def copy(self, deep=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this in EA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, copy
is part of the extension array interface
pandas/core/arrays/interval.py
Outdated
return self.left.itemsize + self.right.itemsize | ||
|
||
def take(self, indices, axis=0, allow_fill=True, fill_value=None, | ||
**kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc. wonder how of this you can push to EA base clasess (I suspect almost all of it), this is a pure indexing operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be in favor of adding some basic implementation of take
to EA, but in this case, IntervalArray
would have to override it anyhow, as it needs to take both from left
and right
, which is something very specific to intervals.
pandas/core/arrays/interval.py
Outdated
|
||
# Formatting | ||
|
||
def _format_data(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is really a generic formatter Index. should this be in EA?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or rather somewhere in the formatting code? (and then it could still be called from within a default EA repr)
head = [] | ||
tail = [formatter(x) for x in self] | ||
summary = '[{tail}]'.format(tail=', '.join(tail)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so there is lots of copying of other code because you can't share the Index base things for this. You need prob push some of this code (e.g. formatting) to something like IndexOpsMixin (or the like). and use that here.
Pushed some pre-rebase commits:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good. I think you're good to merge in master anytime.
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -299,7 +299,42 @@ Supplying a ``CategoricalDtype`` will make the categories in each column consist | |||
df['A'].dtype | |||
df['B'].dtype | |||
|
|||
.. _whatsnew_023.enhancements.extension: | |||
.. _whatsnew_0230.enhancements.interval: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will need changing during the rebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good.
doc/source/whatsnew/v0.24.0.txt
Outdated
Storing Interval Data in Series and DataFrame | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Interval data may now be stored in a Series or DataFrame, in addition to an |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double backticks on Series/DataFrame.
doc/source/whatsnew/v0.24.0.txt
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Interval data may now be stored in a Series or DataFrame, in addition to an | ||
:class:`IntervalIndex` like before (:issue:`19453`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before -> previously.
doc/source/whatsnew/v0.24.0.txt
Outdated
ser | ||
ser.dtype | ||
|
||
Previously, these would be cast to a NumPy array of Interval objects. In general, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double backticks around Interval, or use a class ref. & Series
doc/source/whatsnew/v0.24.0.txt
Outdated
a Series. | ||
|
||
Note that the ``.values`` of a Series containing intervals is no longer a NumPy | ||
array. Rather, it's an ``ExtensionArray``, composed of two arrays ``left`` and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to show the internal impl. (e.g. just the first sentence is ok)
doc/source/whatsnew/v0.24.0.txt
Outdated
|
||
ser.values | ||
|
||
To recover the NumPy array of Interval objects, use :func:`numpy.asarray`: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you are going to have the below section I would not use this here (showing asarray) as its duplicative, just point to the ref for the below section)
pandas/core/arrays/interval.py
Outdated
return cls._simple_new(left, right, closed=closed, copy=False) | ||
|
||
def _shallow_copy(self, left=None, right=None, closed=None): | ||
if left is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a doc-string
# Formatting | ||
|
||
def _format_data(self): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can do in a followup (though prob easy here) to use
format_object_summary
from pandas.core.formats.printing
as this is vastly simplified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its used generally for all Indexes now
@@ -205,7 +205,9 @@ def _hash_categorical(c, encoding, hash_key): | |||
------- | |||
ndarray of hashed values array, same size as len(c) | |||
""" | |||
hashed = hash_array(c.categories.values, encoding, hash_key, | |||
# Convert ExtensionArrays to ndarrays | |||
values = np.asarray(c.categories.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we still need this change? what is it for?
@@ -240,6 +240,7 @@ def test_reindex_non_na_fill_value(self, data_missing): | |||
na = data_missing[0] | |||
|
|||
array = data_missing._from_sequence([na, valid]) | |||
print(array) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extraneous print
|
||
|
||
@contextlib.contextmanager | ||
def rewrite_exception(old_name, new_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is being used to overwrite exception messages so they display the correct class name. Looks like it's currently just used for IntervalIndex
when it calls an underlying IntervalArray
method that could raise some with an IntervalArray
specific message, e.g. 'category, object, and string subtypes are not supported for IntervalArray'
.
I suppose we could try to either make these messages slightly more generic ("IntervalArray" --> "IntervalArray-like" or "Interval classes" maybe) or try implementing an additional common interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like defining a custom exception type in arrays/interval.py and then
def _raise_exception(self):
# We have access to the correct `type(self).__name__`
pass
def method(self):
try:
do_thing()
except _IntervalException:
self._raise_exception()
does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could just use IntervalArray in the exception message. Doesn't really matter to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't really have a preference either. I might end up trying a few things to see what's cleanest.
Addressed the latest set of review comments, aside from the |
pandas/core/arrays/interval.py
Outdated
if not isinstance(value, ABCInterval): | ||
msg = ("'IntervalArray.fillna' only supports filling with a " | ||
"'scalar pandas.Interval'. Got a '{}' instead." | ||
.format(type(value).__name__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single quote is in the wrong place. Should be 'pandas.Interval'
.
pandas/core/arrays/interval.py
Outdated
|
||
def __setitem__(self, key, value): | ||
# na value: need special casing to set directly on numpy arrays | ||
_needs_float_conversion = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_needs_float_conversion -> needs_float_conversion?
pandas/core/arrays/interval.py
Outdated
raise ValueError("Intervals must all be closed on the same side.") | ||
closed = closed.pop() | ||
|
||
# TODO: avoid intermediate list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO (which I think I added) seems unavoidable, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I don't think it can be avoided. Removed the TODO. IIUC the intermediate list doesn't copy data, so not a big deal (np.concatenate
does of course).
""" | ||
|
||
@Appender(_interval_shared_docs['set_closed'] % _shared_docs_kwargs) | ||
def set_closed(self, closed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make IntervalArray.closed
a setter as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this would be a problem with the current implementation, but #19371 makes me a bit hesitant since it'd make a closed
setter a (limited) dtype
setter.
@@ -402,13 +403,17 @@ def encode(obj): | |||
u'freq': u_safe(getattr(obj, 'freqstr', None)), | |||
u'tz': tz, | |||
u'compress': compressor} | |||
elif isinstance(obj, IntervalIndex): | |||
return {u'typ': u'interval_index', | |||
elif isinstance(obj, (IntervalIndex, IntervalArray)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder, should this be reverted? Do any msgpack tests fail if it is?
|
||
|
||
@contextlib.contextmanager | ||
def rewrite_exception(old_name, new_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like defining a custom exception type in arrays/interval.py and then
def _raise_exception(self):
# We have access to the correct `type(self).__name__`
pass
def method(self):
try:
do_thing()
except _IntervalException:
self._raise_exception()
does that make sense?
|
||
|
||
@contextlib.contextmanager | ||
def rewrite_exception(old_name, new_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could just use IntervalArray in the exception message. Doesn't really matter to me.
doc/source/basics.rst
Outdated
* :ref:`Categorical <categorical>` | ||
* :ref:`Datetime with Timezone <timeseries.timezone_series>` | ||
* :ref:`Period <timeseries.periods>` | ||
* Interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have a ref for this? (maybe should create one if not)?
doc/source/whatsnew/v0.24.0.txt
Outdated
@@ -345,6 +414,7 @@ Interval | |||
^^^^^^^^ | |||
|
|||
- Bug in the :class:`IntervalIndex` constructor where the ``closed`` parameter did not always override the inferred ``closed`` (:issue:`19370`) | |||
- Bug in the ``IntervalIndex`` repr where a trailing comma was missing after the list of intervals (:issue`20611`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need extra colon
copy=False, dtype=None, verify_integrity=True): | ||
result = IntervalMixin.__new__(cls) | ||
|
||
closed = closed or 'right' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe closed should always be not-None here by-definition?
raise TypeError(msg.format(type(value))) | ||
|
||
# Need to ensure that left and right are updated atomically, so we're | ||
# forced to copy, update the copy, and swap in the new values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure I am clear why this is problematic. is there a failure possible when updating the 2nd one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without the forcing the copy I'm seeing test_setitem_sequence
fail:
pandas/pandas/tests/extension/base/setitem.py
Lines 17 to 23 in 1dd05cc
def test_setitem_sequence(self, data): | |
arr = pd.Series(data) | |
original = data.copy() | |
arr[[0, 1]] = [data[1], data[0]] | |
assert arr[0] == original[1] | |
assert arr[1] == original[0] |
pandas/core/arrays/interval.py
Outdated
self._right = right | ||
|
||
def fillna(self, value=None, method=None, limit=None): | ||
if method is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally add a doc-string
class TestReshaping(BaseInterval, base.BaseReshapingTests): | ||
pass | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose should move some of these tests to pandas/tests/arrays/interval
, maybe leave the extension API specific ones here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't follow. I didn't add any tests to the base class, so not sure what should be moved here?
@jschendel how do you feel about this? IMO, things looked quite nice the last time I looked. Are the outstanding issues
Anything else? Are we comfortable doing those as followups? (fine by me) |
Yes, I believe those are the only outstanding issues, aside from the question I had about @jreback's last comment. I'm fine with doing those items as followups. |
Cool, +1 from me then.
…On Fri, Jul 13, 2018 at 3:10 PM, Jeremy Schendel ***@***.***> wrote:
Yes, I believe those are the only outstanding issues, aside from the
question I had about @jreback <https://github.com/jreback>'s last
comment. I'm fine with doing those items as followups.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#20611 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIl209JtiUWQNdHlT5hf3MW1J9F5Kks5uGP6ggaJpZM4THJIe>
.
|
@jschendel what comment? i am ok with merging and doing follow ups as well |
"I suppose should move some of these tests to Regardless, doesn't seem like something that would block this from being merged, and could be handled in a followup too. |
|
||
@property | ||
def itemsize(self): | ||
return self.left.itemsize + self.right.itemsize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel a bit after the fact, but I don't think that we should add this to IntervalArray
. We deprecated itemsize
some time ago for Series/Index (#20721).
Just noticed this because we get the warning during the test:
pandas/tests/indexes/interval/test_interval.py::TestIntervalIndex::()::test_itemsize
/home/travis/build/pandas-dev/pandas/pandas/core/arrays/interval.py:694: FutureWarning: Int64Index.itemsize is deprecated and will be removed in a future version
return self.left.itemsize + self.right.itemsize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was originally part of IntervalIndex
and is currently what IntervalIndex
calls under the hood, so if we remove it from IntervalArray
we'd need to add it back IntervalIndex
. Can certainly do that, unless there's some objection based on it making the two inconsistent. Removing it from IntervalArray
seems like the best course of action though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, removing it from IntervalArray
is the best way forward I think.
Also, on IntervalIndex
it should be deprecated (now it is deprecated on the base IndexOpsMixin class, but by overriding it in IntervalIndex, it lost that deprecation. But apparently IntervalIndex is the only index that is not tested in the base tests, so this 'regression' was not catched by the tests).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened #22049 so we don't loose track of this discussion
Co-authored-by: Jeremy Schendel <jschendel@users.noreply.github.com>
A run of
Bisection done on Arch Linux and reproducible on Ubuntu 18.04. |
Closes #19453
Closes #19209