ENH: Allow storing ExtensionArrays in containers #19520

Merged: 139 commits, Feb 23, 2018.
Changes shown from 12 commits.

Commits (139):
b7069ef
ENH: non-interval changes
TomAugspurger Feb 2, 2018
9cd92c7
COMPAT: py2 Super
TomAugspurger Feb 3, 2018
9211bbd
BUG: Use original object for extension array
TomAugspurger Feb 3, 2018
80f83a6
Consistent boxing / unboxing
TomAugspurger Feb 3, 2018
ca004d8
32-bit compat
TomAugspurger Feb 3, 2018
5d4a686
Add a test array
TomAugspurger Feb 5, 2018
9f4ad42
linting
TomAugspurger Feb 5, 2018
b1db4e8
Default __iter__
TomAugspurger Feb 5, 2018
00d6bb3
Tests for value_counts
TomAugspurger Feb 5, 2018
1608e3d
Implement value_counts
TomAugspurger Feb 5, 2018
52e2180
Py2 compat
TomAugspurger Feb 5, 2018
e6d06e2
Fixed dropna
TomAugspurger Feb 5, 2018
d356f19
Test fixups
TomAugspurger Feb 6, 2018
a6ae340
Started setitem
TomAugspurger Feb 6, 2018
41f09d8
REF/Clean: Internal / External values
TomAugspurger Feb 3, 2018
29cfd7c
Move to index base
TomAugspurger Feb 6, 2018
05eb9f3
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 6, 2018
82fd0c6
Setitem tests, decimal example
TomAugspurger Feb 7, 2018
8b1e7d6
Compat
TomAugspurger Feb 7, 2018
10af4b6
Fixed extension block tests.
TomAugspurger Feb 7, 2018
cd5f1eb
Clarify binop tests
TomAugspurger Feb 7, 2018
0a9d9fd
TST: Removed ops tests
TomAugspurger Feb 7, 2018
3185f4e
Cleanup unique handling
TomAugspurger Feb 7, 2018
ffa888a
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 7, 2018
5a59591
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 7, 2018
476f75d
Simplify object concat
TomAugspurger Feb 7, 2018
b15ee5a
Use values for intersection
TomAugspurger Feb 7, 2018
659073f
hmm
TomAugspurger Feb 7, 2018
b15ecac
More failing tests
TomAugspurger Feb 7, 2018
88b8f4f
remove bad test
TomAugspurger Feb 7, 2018
349ac1a
better setitem
TomAugspurger Feb 8, 2018
27ab045
Dropna works.
TomAugspurger Feb 8, 2018
8358fb1
Restore xfail test
TomAugspurger Feb 8, 2018
8ef34a9
Test Categorical
TomAugspurger Feb 8, 2018
340d11b
Xfail setitem tests
TomAugspurger Feb 8, 2018
8297888
TST: Skip JSON tests on py2
TomAugspurger Feb 8, 2018
7accb67
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 8, 2018
9b8d2a5
Additional testing
TomAugspurger Feb 8, 2018
9fbac29
More tests
TomAugspurger Feb 8, 2018
55305dc
ndarray_values
TomAugspurger Feb 8, 2018
0e63708
API: Default ExtensionArray.astype
TomAugspurger Feb 8, 2018
fbbbc8a
Simplify concat_as_object
TomAugspurger Feb 8, 2018
46a0a49
Py2 compat
TomAugspurger Feb 8, 2018
2c4445a
Set-ops ugliness
TomAugspurger Feb 8, 2018
5612cda
better docstrings
TomAugspurger Feb 8, 2018
b012c19
tolist
TomAugspurger Feb 8, 2018
d49e6aa
linting
TomAugspurger Feb 8, 2018
d7d31ee
Moved dtypes
TomAugspurger Feb 9, 2018
7b89f1b
clean
TomAugspurger Feb 9, 2018
b0dbffd
cleanup
TomAugspurger Feb 9, 2018
66b936f
NumPy compat
TomAugspurger Feb 9, 2018
32ee0ef
Use base _values for CategoricalIndex
TomAugspurger Feb 9, 2018
a9882e2
Update dev docs
TomAugspurger Feb 9, 2018
f53652a
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
2425621
cleanup
TomAugspurger Feb 9, 2018
fc337de
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
6abe9da
cleanup
TomAugspurger Feb 9, 2018
512fb89
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
c5db5da
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
0b112f2
cleanup
TomAugspurger Feb 9, 2018
170d0c7
Linting
TomAugspurger Feb 9, 2018
402620f
Precision in tests
TomAugspurger Feb 9, 2018
b556f83
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
268aabc
Linting
TomAugspurger Feb 9, 2018
d671259
Move to extension
TomAugspurger Feb 9, 2018
cb740ed
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
d9e8dd6
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
815d202
Push _ndarray_values to ExtensionArray
TomAugspurger Feb 11, 2018
a727b21
Clean up tolist
TomAugspurger Feb 11, 2018
f368c29
Move test locations
TomAugspurger Feb 11, 2018
d74c5c9
Fixed test
TomAugspurger Feb 12, 2018
0ec6958
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
8104ee5
REF: Update per comments
TomAugspurger Feb 12, 2018
4d08218
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
f8e29b9
lint
TomAugspurger Feb 12, 2018
0cd9faa
REF: Use _values for size and shape
TomAugspurger Feb 12, 2018
8fcdb70
PERF: Implement size, shape for IntervalIndex
TomAugspurger Feb 12, 2018
34a6a22
PERF: Avoid materializing values for PeriodIndex shape, size
TomAugspurger Feb 12, 2018
b4de5c4
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
c233c28
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 13, 2018
d6e8051
Cleanup
TomAugspurger Feb 13, 2018
3af8a21
Override nbytes
TomAugspurger Feb 13, 2018
3a5118b
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 13, 2018
1e8e87e
Remove unused change
TomAugspurger Feb 13, 2018
0f5e4f0
Docs
TomAugspurger Feb 13, 2018
1436d1d
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 13, 2018
c4dab88
Test cleanpu
TomAugspurger Feb 13, 2018
a312ba5
Always set PANDAS_TESTING_MODE
TomAugspurger Feb 13, 2018
758689f
Revert "Always set PANDAS_TESTING_MODE"
TomAugspurger Feb 13, 2018
02c3d40
Explicitly catch warnings or not
TomAugspurger Feb 13, 2018
9e17037
fastparquet warnings
TomAugspurger Feb 13, 2018
4599db4
Unicode literals strikes again.
TomAugspurger Feb 14, 2018
d34d9ca
Restore circle env var
TomAugspurger Feb 14, 2018
29d2528
More parquet test catching
TomAugspurger Feb 14, 2018
905d72e
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 14, 2018
412c951
No stacklevel
TomAugspurger Feb 14, 2018
78834f1
Lower bound on FP
TomAugspurger Feb 15, 2018
f8eac55
Exact bound for FP
TomAugspurger Feb 15, 2018
f09c863
Don't use fastpath for ExtensionBlock make_block
TomAugspurger Feb 15, 2018
cedb63d
Consistently use _values
TomAugspurger Feb 16, 2018
cae2c26
TST: Additional constructor tests
TomAugspurger Feb 16, 2018
8088096
CLN: de-nested a bit
TomAugspurger Feb 16, 2018
8aed325
_fill_value handling
TomAugspurger Feb 16, 2018
453728a
Handle user provided dtype in constructors.
TomAugspurger Feb 16, 2018
cc13c8d
Document ExtensionBlock._maybe_coerce_values
TomAugspurger Feb 16, 2018
f90ac07
Created ABCExtensionArray
TomAugspurger Feb 16, 2018
4a03b26
TST: Tests for is_object_dtype and is_string_dtype and EAs
TomAugspurger Feb 16, 2018
635223f
fixup! Handle user provided dtype in constructors.
TomAugspurger Feb 17, 2018
cf423a7
Doc for setitem
TomAugspurger Feb 17, 2018
2d1a66c
Split base tests
TomAugspurger Feb 17, 2018
c849865
Revert test_parquet changes
TomAugspurger Feb 17, 2018
c3ec822
API: Removed _fill_value from the interface
TomAugspurger Feb 17, 2018
f4cf45c
Push coercion to extension dtype till later
TomAugspurger Feb 17, 2018
9c5d479
Linting
TomAugspurger Feb 17, 2018
1175c0d
ERR: Better error message for coercion to 3rd party dtypes
TomAugspurger Feb 17, 2018
c816d99
CLN: Make take_nd EA aware
TomAugspurger Feb 17, 2018
9c9f59e
Revert sparse changes
TomAugspurger Feb 17, 2018
08af9a3
Other _typ for ABCExtensionArray
TomAugspurger Feb 17, 2018
2e992f7
Test cleanup and expansion.
TomAugspurger Feb 17, 2018
cc5cc3e
Copy if copy
TomAugspurger Feb 17, 2018
704ee67
TST: remove self param for fixture
TomAugspurger Feb 18, 2018
c262968
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 19, 2018
8bf0334
Remove unnescessary EA handling in Series ctor
TomAugspurger Feb 20, 2018
c8d88da
API: Removed value_counts
TomAugspurger Feb 20, 2018
24f3b60
More doc notes
TomAugspurger Feb 20, 2018
50bd5dd
Handle expanding a DataFrame with an EA
TomAugspurger Feb 20, 2018
879bc84
Added ExtensionDtype.__eq__
TomAugspurger Feb 20, 2018
33c9d1f
linting
TomAugspurger Feb 21, 2018
f07c166
REF: is_dtype_equal refactor
TomAugspurger Feb 21, 2018
79d43b1
Remove reference to dtype being a class
TomAugspurger Feb 21, 2018
91c629b
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 21, 2018
a1ebf53
move
TomAugspurger Feb 22, 2018
aa57cad
Moved sparse check to take_nd
TomAugspurger Feb 22, 2018
c82748c
Docstring
TomAugspurger Feb 22, 2018
e919343
Split tests
TomAugspurger Feb 22, 2018
1ea74da
Revert index change
TomAugspurger Feb 22, 2018
0c41a34
Copy changes
TomAugspurger Feb 22, 2018
009bece
Simplify EA implementation names
TomAugspurger Feb 22, 2018
ea5562b
Linting
TomAugspurger Feb 22, 2018
pandas/core/algorithms.py (2 additions & 1 deletion)
@@ -15,6 +15,7 @@
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
is_extension_array_dtype,
is_categorical_dtype, is_sparse,
is_period_dtype,
is_numeric_dtype, is_float_dtype,
@@ -542,7 +543,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

else:

if is_categorical_dtype(values) or is_sparse(values):
if is_extension_array_dtype(values) or is_sparse(values):
Comment (Contributor):

is_sparse should be recognized as an extension array type

Comment (Contributor Author):

It's on my todo list. I'll do it after interval.


# handle Categorical and sparse,
result = Series(values).values.value_counts(dropna=dropna)
Comment (Member):

Shouldn't this be Series(values)._values ? Because I would expect this to fail for anything that is not a categorical (as a category series returns a categorical, but for now others return a ndarray there?)

Comment (Contributor Author):

This should be ._values, but

but for now others return a ndarray there?

What types do you mean by others? Internally, our only type returning true here is Categorical (or sparse, via is_sparse).

Comment (Member):

Yeah, forgot again for a moment that we don't yet have those other internal extension arrays apart from Categorical :)

Comment (Contributor Author @TomAugspurger, Feb 16, 2018):

I also wonder why we're doing Series(values).values here. I suspect to handle any of categorical / series / index inputs in a consistent way.

Probably better to make an _unbox_pandas method that extracts the underlying array from an array (no-op) or Series/Index (._values).

pandas/core/arrays/base.py (27 additions & 9 deletions)
@@ -1,4 +1,6 @@
"""An interface for extending pandas with custom arrays."""
import numpy as np

from pandas.errors import AbstractMethodError

_not_implemented_message = "{} does not implement {}."
@@ -23,14 +25,14 @@ class ExtensionArray(object):
* isna
* take
* copy
* _formatting_values
* _concat_same_type

Some additional methods are required to satisfy pandas' internal, private
block API.

* _concat_same_type
* _can_hold_na
* _formatting_values
* _fill_value

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
@@ -51,9 +53,6 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.

Additionally, certain methods and interfaces are required for proper
this array to be properly stored inside a ``DataFrame`` or ``Series``.
"""
# ------------------------------------------------------------------------
# Must be a Sequence
@@ -105,6 +104,16 @@ def __len__(self):
# type: () -> int
raise AbstractMethodError(self)

def __iter__(self):
"""Iterate over elements.

This needs to be implemented so that pandas recognizes extension arrays
Comment (Contributor):

can be a better doc-string here

as list-like. The default implementation makes successive calls to
``__getitem__``, which may be slower than necessary.
"""
for i in range(len(self)):
Comment (Contributor):

why are you not just raising AbstractMethodError?

Comment (Contributor):

this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.

Comment (Contributor Author):

rather you would iterate on some underlying value

We don't know how the subclass is storing things though, so there's no .values to iterate over.

Comment (Contributor):

so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.

Comment (Member):

you cannot

Yes, we can (see the code). You "don't want" it; that's something different :)
But I don't really understand why the above is not a good default implementation. I think a lot of subclasses will do exactly this.

Comment (Contributor Author):

And some context: If I remove __iter__, the object will still be iterable with essentially this implementation since it defines __getitem__ and __len__. We just need to implement it explicitly so that is_iterable works correctly.

Comment (Contributor):

I think it still might be better raise here by default and force the subclass to implement this (even if this impl). It is SO important for subclasses to pay attention to this, it should just be required.

Comment (Member):

You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object) is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.

yield self[i]
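The exchange above can be grounded with a standalone sketch. `DecimalLikeArray` below is a hypothetical container (not pandas code) implementing the `__getitem__`-based default iteration under discussion; defining `__iter__` explicitly also makes iterability checks that look for the attribute pass:

```python
class DecimalLikeArray:
    """Hypothetical extension-array-like container backed by a list."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, i):
        # Boxing happens here: each access returns one scalar object.
        return self._data[i]

    def __iter__(self):
        # The default implementation under discussion: successive
        # __getitem__ calls, one boxed scalar per iteration.
        for i in range(len(self)):
            yield self[i]


arr = DecimalLikeArray([1, 2, 3])
print(list(arr))  # [1, 2, 3]
```

As the thread notes, the object would be iterable even without `__iter__` (Python falls back to the sequence protocol), so the explicit method changes detectability more than behavior.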

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@@ -177,9 +186,9 @@ def take(self, indexer, allow_fill=True, fill_value=None):

Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array somehow backed by a NumPy array and that
the underlying structured array is stored as ``self.data``. Then
``take`` may be written as

.. code-block:: python

@@ -219,7 +228,7 @@ def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)
Comment (Contributor):

NO, this should be implemented only by subclasses.

Comment (Contributor Author):

Why?

Comment (Member):

This is a very sensible default IMO, subclasses can always override if needed


@classmethod
def _concat_same_type(cls, to_concat):
@@ -236,6 +245,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
Comment (Contributor Author):

Interface change: This should have been a property.

# type: () -> bool
"""Whether your array can hold missing values. True by default.
@@ -245,3 +255,11 @@ def _can_hold_na(self):
Setting this to false will optimize some operations like fillna.
"""
return True

def value_counts(self, dropna=True):
from pandas import value_counts

if dropna:
self = self[~self.isna()]

return value_counts(np.array(self))
Comment (Contributor):

This is too tied to materializing using ndarary. rather this should raise AbstractMethodError

Comment (Contributor Author):

We may have a difference in philosophy here then. I'm trying to follow the lead of standard library in providing default implementations for as much as possible, even if they're sub-optimal.

Comment (Member):

What I find a bit "strange" about this method in adding it to the required interface of an extension array, is that it returns a Series

Comment (Contributor Author):

What's strange about that? The fact that it's being added, or that it's returning a Series? I'm not sure what else it would return.

Ideally, we'll someday have an Index[extension_dtype], so that you get a Series with an extension-array-backed index. That seems like the most natural return value.

Comment (Contributor):

you need to wrap this on the return, IOW call EA(....) here

Comment (Contributor):

hmm, not real happy with this interface, but I guess ok for now. let's revisit this at somepoint in the future.

Comment (Member):

@TomAugspurger I suppose you want this in here, so you can implement a more efficient method on IPArray?
Because we could also, for now, keep this implementation in pd.value_counts (so for people who want this functionality, they can use it) instead of adding it as a method on ExtensionArray interface itself. And we can revisit when talking about unique/factorize/duplicated/.. algos support for ExtensionArrays in general?

Comment (Contributor Author):

I mostly added it since that's what Categorical does. I can see if removing it breaks anything.

Comment (Contributor Author):

Just to make sure, we're OK with pd.value_counts and Series[EA] currently failing for 3rd party EAs? By removing that from the interface, we'll see

AttributeError: 'DecimalArray' object has no attribute 'value_counts'

I could make that error message better, but I'd prefer waiting on that till we flesh out what we want pd.value_counts(EA) to do.

Comment (Member):

But in principle you can easily add this fallback to do a value_counts on the array of objects there?
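The fallback being discussed (materialize to objects, optionally drop missing values, then count) can be sketched without pandas internals. `TinyArray` and its NaN-based `isna` are hypothetical stand-ins; real pandas would return a `Series` rather than a `Counter`:

```python
from collections import Counter
import math


class TinyArray:
    """Hypothetical extension array; NaN marks a missing value."""

    def __init__(self, data):
        self._data = list(data)

    def isna(self):
        return [isinstance(x, float) and math.isnan(x) for x in self._data]

    def value_counts(self, dropna=True):
        data = self._data
        if dropna:
            # Drop elements flagged by the array's own missing mask.
            data = [x for x, missing in zip(data, self.isna()) if not missing]
        # pandas would wrap this in a Series; a Counter shows the idea.
        return Counter(data)


counts = TinyArray([1, 1, 2, float("nan")]).value_counts()
print(counts)  # Counter({1: 2, 2: 1})
```

The objection in the thread is exactly this materialization step: a backend with a cheap native count would want to override it.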

pandas/core/dtypes/base.py (9 additions & 3 deletions)
@@ -1,4 +1,6 @@
"""Extend pandas with custom array types"""
import inspect

from pandas.errors import AbstractMethodError


@@ -106,7 +108,8 @@ def is_dtype(cls, dtype):

Parameters
----------
dtype : str or dtype
dtype : str, object, or type
The dtype to check.

Returns
-------
@@ -118,12 +121,15 @@ def is_dtype(cls, dtype):

1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. 'dtype' is a class and is ``cls`` or a subclass of ``cls``.
Comment (Contributor):

shouldn't this have double-ticks?

"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
elif inspect.isclass(dtype):
Comment (Contributor):

you have 2 conditions the same here

Comment (Contributor Author):

issubclass vs. isinstance :) Want to be able to handle .is_dtype(ExtensionDtype) and .is_dtype(ExtensionDtype()).

return issubclass(dtype, cls)
else:
return isinstance(dtype, cls)
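The issubclass-versus-isinstance point from the exchange above, shown in isolation with a hypothetical `ToyDtype` mirroring the three-branch logic of the diff (string, class, instance):

```python
import inspect


class ToyDtype:
    """Hypothetical dtype with the same three-branch is_dtype logic."""

    name = "toy"

    @classmethod
    def construct_from_string(cls, string):
        if string != cls.name:
            raise TypeError("cannot construct a {} from {!r}"
                            .format(cls.__name__, string))
        return cls()

    @classmethod
    def is_dtype(cls, dtype):
        if isinstance(dtype, str):
            try:
                return isinstance(cls.construct_from_string(dtype), cls)
            except TypeError:
                return False
        elif inspect.isclass(dtype):
            # Handles .is_dtype(ToyDtype): class or subclass of cls.
            return issubclass(dtype, cls)
        else:
            # Handles .is_dtype(ToyDtype()): instance of cls.
            return isinstance(dtype, cls)


print(ToyDtype.is_dtype("toy"))       # True, via construct_from_string
print(ToyDtype.is_dtype(ToyDtype))    # True, class branch
print(ToyDtype.is_dtype(ToyDtype()))  # True, instance branch
print(ToyDtype.is_dtype("int64"))     # False
```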
pandas/core/dtypes/common.py (2 additions & 1 deletion)
@@ -1708,9 +1708,10 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, ABCSeries):
arr_or_dtype = arr_or_dtype._values
elif isinstance(arr_or_dtype, ABCIndexClass):
arr_or_dtype = arr_or_dtype._as_best_array()
return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))
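A minimal sketch of the unwrap-then-check pattern in this hunk, with hypothetical `FakeSeries`/`FakeExtensionArray` stand-ins (the real function also unwraps Index objects and accepts dtype instances):

```python
class FakeExtensionArray:
    """Hypothetical stand-in for an ExtensionArray subclass."""


class FakeSeries:
    """Hypothetical Series-like container holding a backing array."""

    def __init__(self, values):
        self._values = values


def is_extension_array_dtype_sketch(obj):
    # Unwrap the container to its backing array first, then check
    # the array itself.
    if isinstance(obj, FakeSeries):
        obj = obj._values
    return isinstance(obj, FakeExtensionArray)


print(is_extension_array_dtype_sketch(FakeSeries(FakeExtensionArray())))  # True
print(is_extension_array_dtype_sketch([1, 2, 3]))  # False
```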


pandas/core/dtypes/missing.py (21 additions & 11 deletions)
@@ -10,9 +10,10 @@
is_datetimelike_v_numeric, is_float_dtype,
is_datetime64_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_complex_dtype, is_categorical_dtype,
is_complex_dtype,
is_string_like_dtype, is_bool_dtype,
is_integer_dtype, is_dtype_equal,
is_extension_array_dtype,
needs_i8_conversion, _ensure_object,
pandas_dtype,
is_scalar,
@@ -52,12 +53,15 @@ def isna(obj):


def _isna_new(obj):
from ..arrays import ExtensionArray
Comment (Contributor):

pls don't do this. rather define an ABCExtensionArray. instead. This is the purpose of these, to avoid smell like this.


if is_scalar(obj):
return libmissing.checknull(obj)
# hack (for now) because MI registers as ndarray
elif isinstance(obj, ABCMultiIndex):
raise NotImplementedError("isna is not defined for MultiIndex")
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass)):
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
ExtensionArray)):
return _isna_ndarraylike(obj)
elif isinstance(obj, ABCGeneric):
return obj._constructor(obj._data.isna(func=isna))
@@ -124,17 +128,20 @@ def _use_inf_as_na(key):


def _isna_ndarraylike(obj):

values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if is_categorical_dtype(values):
from pandas import Categorical
if not isinstance(values, Categorical):
values = values.values
result = values.isna()
elif is_interval_dtype(values):
if is_extension_array_dtype(obj):
if isinstance(obj, ABCIndexClass):
values = obj._as_best_array()
elif isinstance(obj, ABCSeries):
values = obj._values
else:
values = obj
result = values.isna()
elif is_string_dtype(dtype):
Comment (Member):

I know it was already there, but I find the use of is_string_dtype rather confusing here, as intervals are not strings at all (but it just checks for string or object dtype ..).
But maybe can do

elif is_interval_dtype(values):
    ...
elif is string_dtype(values):
    ...

Comment (Contributor):

instead of this, pls fix is_string_dtype (to exclude interval as well as period dtypes (you might be able to say object and not extension_type).

Comment (Contributor Author):

This will fix itself once IntervalArray is a proper extension type. Added tests to confirm that extension types are not True for is_string_dtype and is_object_dtype.

if is_interval_dtype(values):
# TODO(IntervalArray): remove this if block
from pandas import IntervalIndex
result = IntervalIndex(obj).isna()
else:
@@ -406,4 +413,7 @@ def remove_na_arraylike(arr):
"""
Return array-like containing only true/non-NaN values, possibly empty.
"""
return arr[notna(lib.values_from_object(arr))]
if is_extension_array_dtype(arr):
Comment (Contributor):

ultimately values_from_object just calls .get_values() on the object if it exists. This smells like something is missing from the EA interface.

Comment (Contributor Author):

I don't think this if / elif can be avoided. values_from_object gets and calls arr.get_values if defined, which returns an ndarray. But we need to call notna on the EA itself, since it has special missing value handling.

return arr[notna(arr)]
else:
return arr[notna(lib.values_from_object(arr))]
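The EA branch above exists because only the array knows its own missing-value representation; an ndarray-based `notna` would miss a custom sentinel. A pure-Python sketch with a hypothetical `SentinelArray` that uses -1 as its sentinel:

```python
class SentinelArray:
    """Hypothetical extension array using -1 as its missing sentinel."""

    def __init__(self, data):
        self._data = list(data)

    def isna(self):
        # A generic ndarray check cannot know this convention; only
        # the array itself can compute the mask.
        return [x == -1 for x in self._data]

    def filter(self, keep):
        return SentinelArray([x for x, k in zip(self._data, keep) if k])

    def tolist(self):
        return list(self._data)


def remove_na_sketch(arr):
    # The EA branch of remove_na_arraylike: arr[notna(arr)], i.e. the
    # array filters itself by its own missing-value mask.
    return arr.filter([not m for m in arr.isna()])


print(remove_na_sketch(SentinelArray([0, -1, 2])).tolist())  # [0, 2]
```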
pandas/core/frame.py (9 additions & 11 deletions)
@@ -39,6 +39,7 @@
is_categorical_dtype,
is_object_dtype,
is_extension_type,
is_extension_array_dtype,
is_datetimetz,
is_datetime64_any_dtype,
is_datetime64tz_dtype,
@@ -71,7 +72,7 @@
create_block_manager_from_arrays,
create_block_manager_from_blocks)
from pandas.core.series import Series
from pandas.core.arrays import Categorical
from pandas.core.arrays import Categorical, ExtensionArray
import pandas.core.algorithms as algorithms
from pandas.compat import (range, map, zip, lrange, lmap, lzip, StringIO, u,
OrderedDict, raise_with_traceback)
@@ -511,7 +512,7 @@ def _get_axes(N, K, index=index, columns=columns):
index, columns = _get_axes(len(values), 1)
return _arrays_to_mgr([values], columns, index, columns,
dtype=dtype)
elif is_datetimetz(values):
elif (is_datetimetz(values) or is_extension_array_dtype(values)):
Comment (Member):

What happens if the dtype is an extension dtype? (like the check above for categorical)
But certainly fine with not supporting it for now.

Similar for Series, we need to either define what pd.Series(values, dtype=SomeExtensionDtype()) does, or either error on it.

Comment (Contributor Author):

In [7]: pd.Series([0, 1, 2], dtype=cyberpandas.IPType())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-7fc01964da1d> in <module>()
----> 1 pd.Series([0, 1, 2], dtype=cyberpandas.IPType())

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    246             else:
    247                 data = _sanitize_array(data, index, dtype, copy,
--> 248                                        raise_cast_failure=True)
    249
    250                 data = SingleBlockManager(data, index, fastpath=True)

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   3214         if dtype is not None:
   3215             try:
-> 3216                 subarr = _try_cast(data, False)
   3217             except Exception:
   3218                 if raise_cast_failure:  # pragma: no cover

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _try_cast(arr, take_fast_path)
   3168             subarr = maybe_cast_to_datetime(arr, dtype)
   3169             if not is_extension_type(subarr):
-> 3170                 subarr = np.array(subarr, dtype=dtype, copy=copy)
   3171         except (ValueError, TypeError):
   3172             if is_categorical_dtype(dtype):

TypeError: data type not understood

How should we handle this? I'm fine with raising with a better error message. We don't know how to cast from (arbitrary array, extension_dtype) -> extension array.

Although... maybe we could support Series(extension_array, dtype=extension_dtype).

Oh hey, that "works"

In [10]: pd.Series(cyberpandas.IPArray([0, 1, 2]), dtype=cyberpandas.IPType())
Out[10]:
0    0.0.0.0
1    0.0.0.1
2    0.0.0.2
dtype: ip

But it only works since we ignore the dtype entirely :)

I will raise if data.dtype is not dtype equal with dtype I think.

Comment (Member):

also for things like astype(SomeExtensionDtype) we will need an informative error message
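One possible shape for such a message (purely illustrative; the function name and the caster-registry mechanism here are hypothetical, not pandas API):

```python
# Illustrative only: a friendlier error for astype(SomeExtensionDtype) when
# no casting path is known, instead of a bare "data type not understood".

def astype_with_message(values, dtype, casters=None):
    """Cast via a registered converter, or raise an informative TypeError."""
    casters = casters or {}
    key = type(dtype).__name__
    if key not in casters:
        raise TypeError(
            "Cannot cast object of type {!r} to extension dtype {!r}; "
            "no conversion is registered. Construct the extension array "
            "directly instead.".format(type(values).__name__, key))
    return casters[key](values)
```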

Contributor:

Isn't a datetimetz an extension type now? Why can't you remove the first part of the clause?

Contributor Author:

Nope, not yet. Just Categorical so far. I have Interval and Period started, but no PRs yet.

# GH19157
if columns is None:
columns = [0]
@@ -2796,15 +2797,15 @@ def reindexer(value):
# now align rows
value = reindexer(value).T

elif isinstance(value, Categorical):
elif isinstance(value, ExtensionArray):
value = value.copy()

elif isinstance(value, Index) or is_sequence(value):
from pandas.core.series import _sanitize_index

# turn me into an ndarray
value = _sanitize_index(value, self.index, copy=False)
if not isinstance(value, (np.ndarray, Index)):
if not isinstance(value, (np.ndarray, Index, ExtensionArray)):
Contributor:

you can use is_array_like here (and maybe in other places)

Contributor:

?

Contributor Author:

I'll just remove this change until we support array-backed indexes.

if isinstance(value, list) and len(value) > 0:
value = maybe_convert_platform(value)
else:
@@ -2826,7 +2827,7 @@ def reindexer(value):
value = maybe_cast_to_datetime(value, value.dtype)

# return internal types directly
if is_extension_type(value):
if is_extension_type(value) or is_extension_array_dtype(value):
Contributor Author:

This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up this eventually.
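A rough illustration of that transitional split (the real `is_extension_type` / `is_extension_array_dtype` live in pandas' dtype-checking module; the simplified versions and the `_is_extension_dtype` flag below are stand-ins for this sketch, not the actual implementation):

```python
# Simplified stand-ins for the two predicates combined above. In the
# transitional state, is_extension_type covers the legacy internal types
# (categorical, sparse, datetime w/ tz) while is_extension_array_dtype
# covers new-style ExtensionArrays; a value is returned directly from the
# sanitizing machinery if either check passes.

LEGACY_EXTENSION_NAMES = {"category", "Sparse", "datetime64[ns, tz]"}


def is_extension_type(value):
    """Legacy check: pandas' own internal extension types, by dtype name."""
    dtype = getattr(value, "dtype", None)
    return getattr(dtype, "name", None) in LEGACY_EXTENSION_NAMES


def is_extension_array_dtype(value):
    """New-style check: anything exposing an ExtensionDtype-like dtype."""
    dtype = getattr(value, "dtype", None)
    return getattr(dtype, "_is_extension_dtype", False)


def should_return_directly(value):
    """The combined transitional predicate used in the hunk above."""
    return is_extension_type(value) or is_extension_array_dtype(value)
```

Once the legacy internal types are themselves backed by ExtensionArrays, the first predicate folds into the second and the `or` disappears.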

return value

# broadcast across multiple columns if necessary
@@ -3355,12 +3356,9 @@ class max type
new_obj = self.copy()

def _maybe_casted_values(index, labels=None):
if isinstance(index, PeriodIndex):
values = index.astype(object).values
elif isinstance(index, DatetimeIndex) and index.tz is not None:
values = index
else:
values = index.values
values = index._as_best_array()
# TODO: Check if necessary...
if not isinstance(index, (PeriodIndex, DatetimeIndex)):
Contributor:

what are you trying to do here?

Contributor Author:

This is identical to what happened before in the else branch. Seems like it's doing some type inference / coercion.
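A toy version of that inference step (the real converter is `pandas._libs.lib.maybe_convert_objects`, written in Cython; this much-simplified stand-in operates on plain lists just to show the idea):

```python
# After pulling values off the index, object-dtype data is run through a
# converter that tries to infer a tighter type; otherwise it stays object.

def maybe_convert_objects(values):
    """Return (values, inferred_dtype_name) for a list of Python objects."""
    if values and all(isinstance(v, bool) for v in values):
        return values, "bool"
    if values and all(
            isinstance(v, int) and not isinstance(v, bool) for v in values):
        return values, "int64"
    if values and all(isinstance(v, float) for v in values):
        return values, "float64"
    return values, "object"
```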

if values.dtype == np.object_:
values = lib.maybe_convert_objects(values)

34 changes: 33 additions & 1 deletion pandas/core/indexes/base.py
@@ -13,6 +13,7 @@
from pandas import compat

from pandas.core.accessor import CachedAccessor
from pandas.core.arrays import ExtensionArray
from pandas.core.dtypes.generic import (
ABCSeries, ABCDataFrame,
ABCMultiIndex,
@@ -1038,6 +1039,31 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
Contributor Author:

I found a need for a method like this. It may be good to add to Series as well, and maybe make public.

Essentially, we want a clear way to get the "best" .values. For Indexes with ExtensionArray types (Categorical for now; Interval and Period soon) that's the ExtensionArray. Otherwise, it's an ndarray (or, I think, the `DatetimeIndex` itself for datetime w/ tz right now).

Member:

(This will also depend on the decision of whether we change the return value of .values for Series with period and interval objects.)
For Series we already have something like ._values. Won't that be the equivalent there?

Contributor Author:

Yes, I think so. Maybe I should re-use that convention.

Unfortunately, DatetimeIndex[tz]._values returns an ndarray, w/o the TZ, where it "should" return a DatetimeIndex (and eventually an ExtensionArray). I'll see how badly changing that breaks things.

Contributor:

Yes, this should be the equivalent of _values. DTI should be fixed as a precursor PR. No problem making a different method name for this, but then simply rename _values to that name (I don't personally like _as_best_array; maybe _as_array is better). Should again be a precursor, independent PR.

Contributor Author:

Opened #19548

# type: () -> Union[ExtensionArray, ndarray]
"""Return the underlying values as the best array type.

Indexes backed by ExtensionArrays will return the ExtensionArray.
Otherwise, an ndarray is returned.

Examples
--------
>>> pd.Index([0, 1, 2])._as_best_array()
array([0, 1, 2])

>>> pd.CategoricalIndex(['a', 'a', 'b'])._as_best_array()
[a, a, b]
Categories (2, object): [a, b]

>>> pd.IntervalIndex.from_breaks([0, 1, 2])._as_best_array()
IntervalArray([(0, 1], (1, 2]])
"""
# We need this since CategoricalIndex.values -> Categorical
# but IntervalIndex.values -> ndarray[object]
# TODO: IntervalIndex defines _array_values. Would be nice to
# have an unambiguous way of getting an ndarray (or just use asarray?)
return self.values

_index_shared_docs['astype'] = """
Create an Index with values cast to dtypes. The class of a new Index
is determined by dtype. When conversion is impossible, a ValueError
@@ -1946,6 +1972,12 @@ def _format_with_header(self, header, na_rep='NaN', **kwargs):

if is_categorical_dtype(values.dtype):
values = np.array(values)

elif isinstance(values, ExtensionArray):
# This is still un-exercised within pandas, since all our
# extension dtypes have custom indexes.
values = values._formatting_values()

elif is_object_dtype(values.dtype):
values = lib.maybe_convert_objects(values, safe=1)

@@ -2525,7 +2557,7 @@ def get_value(self, series, key):
# if we have something that is Index-like, then
# use this, e.g. DatetimeIndex
s = getattr(series, '_values', None)
if isinstance(s, Index) and is_scalar(key):
if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
Contributor:

is_array_like might work here

try:
return s[key]
except (IndexError, ValueError):
3 changes: 3 additions & 0 deletions pandas/core/indexes/category.py
@@ -297,6 +297,9 @@ def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()

def _as_best_array(self):
return self._data

def tolist(self):
return self._data.tolist()

7 changes: 7 additions & 0 deletions pandas/core/indexes/datetimes.py
@@ -1034,6 +1034,13 @@ def _to_embed(self, keep_tz=False, dtype=None):

return self.values.copy()

def _as_best_array(self):
# no-tz -> ndarray
# tz -> DatetimeIndex (for now)
if self.tz is not None:
return self
return self.values

def to_pydatetime(self):
"""
Return DatetimeIndex as object ndarray of datetime.datetime objects