Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Allow storing ExtensionArrays in containers #19520

Merged
merged 139 commits into from
Feb 23, 2018
Merged
Show file tree
Hide file tree
Changes from 116 commits
Commits
Show all changes
139 commits
Select commit Hold shift + click to select a range
b7069ef
ENH: non-interval changes
TomAugspurger Feb 2, 2018
9cd92c7
COMPAT: py2 Super
TomAugspurger Feb 3, 2018
9211bbd
BUG: Use original object for extension array
TomAugspurger Feb 3, 2018
80f83a6
Consistent boxing / unboxing
TomAugspurger Feb 3, 2018
ca004d8
32-bit compat
TomAugspurger Feb 3, 2018
5d4a686
Add a test array
TomAugspurger Feb 5, 2018
9f4ad42
linting
TomAugspurger Feb 5, 2018
b1db4e8
Default __iter__
TomAugspurger Feb 5, 2018
00d6bb3
Tests for value_counts
TomAugspurger Feb 5, 2018
1608e3d
Implement value_counts
TomAugspurger Feb 5, 2018
52e2180
Py2 compat
TomAugspurger Feb 5, 2018
e6d06e2
Fixed dropna
TomAugspurger Feb 5, 2018
d356f19
Test fixups
TomAugspurger Feb 6, 2018
a6ae340
Started setitem
TomAugspurger Feb 6, 2018
41f09d8
REF/Clean: Internal / External values
TomAugspurger Feb 3, 2018
29cfd7c
Move to index base
TomAugspurger Feb 6, 2018
05eb9f3
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 6, 2018
82fd0c6
Setitem tests, decimal example
TomAugspurger Feb 7, 2018
8b1e7d6
Compat
TomAugspurger Feb 7, 2018
10af4b6
Fixed extension block tests.
TomAugspurger Feb 7, 2018
cd5f1eb
Clarify binop tests
TomAugspurger Feb 7, 2018
0a9d9fd
TST: Removed ops tests
TomAugspurger Feb 7, 2018
3185f4e
Cleanup unique handling
TomAugspurger Feb 7, 2018
ffa888a
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 7, 2018
5a59591
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 7, 2018
476f75d
Simplify object concat
TomAugspurger Feb 7, 2018
b15ee5a
Use values for intersection
TomAugspurger Feb 7, 2018
659073f
hmm
TomAugspurger Feb 7, 2018
b15ecac
More failing tests
TomAugspurger Feb 7, 2018
88b8f4f
remove bad test
TomAugspurger Feb 7, 2018
349ac1a
better setitem
TomAugspurger Feb 8, 2018
27ab045
Dropna works.
TomAugspurger Feb 8, 2018
8358fb1
Restore xfail test
TomAugspurger Feb 8, 2018
8ef34a9
Test Categorical
TomAugspurger Feb 8, 2018
340d11b
Xfail setitem tests
TomAugspurger Feb 8, 2018
8297888
TST: Skip JSON tests on py2
TomAugspurger Feb 8, 2018
7accb67
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 8, 2018
9b8d2a5
Additional testing
TomAugspurger Feb 8, 2018
9fbac29
More tests
TomAugspurger Feb 8, 2018
55305dc
ndarray_values
TomAugspurger Feb 8, 2018
0e63708
API: Default ExtensionArray.astype
TomAugspurger Feb 8, 2018
fbbbc8a
Simplify concat_as_object
TomAugspurger Feb 8, 2018
46a0a49
Py2 compat
TomAugspurger Feb 8, 2018
2c4445a
Set-ops ugliness
TomAugspurger Feb 8, 2018
5612cda
better docstrings
TomAugspurger Feb 8, 2018
b012c19
tolist
TomAugspurger Feb 8, 2018
d49e6aa
linting
TomAugspurger Feb 8, 2018
d7d31ee
Moved dtypes
TomAugspurger Feb 9, 2018
7b89f1b
clean
TomAugspurger Feb 9, 2018
b0dbffd
cleanup
TomAugspurger Feb 9, 2018
66b936f
NumPy compat
TomAugspurger Feb 9, 2018
32ee0ef
Use base _values for CategoricalIndex
TomAugspurger Feb 9, 2018
a9882e2
Update dev docs
TomAugspurger Feb 9, 2018
f53652a
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
2425621
cleanup
TomAugspurger Feb 9, 2018
fc337de
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
6abe9da
cleanup
TomAugspurger Feb 9, 2018
512fb89
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
c5db5da
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
0b112f2
cleanup
TomAugspurger Feb 9, 2018
170d0c7
Linting
TomAugspurger Feb 9, 2018
402620f
Precision in tests
TomAugspurger Feb 9, 2018
b556f83
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
268aabc
Linting
TomAugspurger Feb 9, 2018
d671259
Move to extension
TomAugspurger Feb 9, 2018
cb740ed
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
d9e8dd6
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
815d202
Push _ndarray_values to ExtensionArray
TomAugspurger Feb 11, 2018
a727b21
Clean up tolist
TomAugspurger Feb 11, 2018
f368c29
Move test locations
TomAugspurger Feb 11, 2018
d74c5c9
Fixed test
TomAugspurger Feb 12, 2018
0ec6958
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
8104ee5
REF: Update per comments
TomAugspurger Feb 12, 2018
4d08218
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
f8e29b9
lint
TomAugspurger Feb 12, 2018
0cd9faa
REF: Use _values for size and shape
TomAugspurger Feb 12, 2018
8fcdb70
PERF: Implement size, shape for IntervalIndex
TomAugspurger Feb 12, 2018
34a6a22
PERF: Avoid materializing values for PeriodIndex shape, size
TomAugspurger Feb 12, 2018
b4de5c4
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
c233c28
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 13, 2018
d6e8051
Cleanup
TomAugspurger Feb 13, 2018
3af8a21
Override nbytes
TomAugspurger Feb 13, 2018
3a5118b
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 13, 2018
1e8e87e
Remove unused change
TomAugspurger Feb 13, 2018
0f5e4f0
Docs
TomAugspurger Feb 13, 2018
1436d1d
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 13, 2018
c4dab88
Test cleanpu
TomAugspurger Feb 13, 2018
a312ba5
Always set PANDAS_TESTING_MODE
TomAugspurger Feb 13, 2018
758689f
Revert "Always set PANDAS_TESTING_MODE"
TomAugspurger Feb 13, 2018
02c3d40
Explicitly catch warnings or not
TomAugspurger Feb 13, 2018
9e17037
fastparquet warnings
TomAugspurger Feb 13, 2018
4599db4
Unicode literals strikes again.
TomAugspurger Feb 14, 2018
d34d9ca
Restore circle env var
TomAugspurger Feb 14, 2018
29d2528
More parquet test catching
TomAugspurger Feb 14, 2018
905d72e
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 14, 2018
412c951
No stacklevel
TomAugspurger Feb 14, 2018
78834f1
Lower bound on FP
TomAugspurger Feb 15, 2018
f8eac55
Exact bound for FP
TomAugspurger Feb 15, 2018
f09c863
Don't use fastpath for ExtensionBlock make_block
TomAugspurger Feb 15, 2018
cedb63d
Consistently use _values
TomAugspurger Feb 16, 2018
cae2c26
TST: Additional constructor tests
TomAugspurger Feb 16, 2018
8088096
CLN: de-nested a bit
TomAugspurger Feb 16, 2018
8aed325
_fill_value handling
TomAugspurger Feb 16, 2018
453728a
Handle user provided dtype in constructors.
TomAugspurger Feb 16, 2018
cc13c8d
Document ExtensionBlock._maybe_coerce_values
TomAugspurger Feb 16, 2018
f90ac07
Created ABCExtensionArray
TomAugspurger Feb 16, 2018
4a03b26
TST: Tests for is_object_dtype and is_string_dtype and EAs
TomAugspurger Feb 16, 2018
635223f
fixup! Handle user provided dtype in constructors.
TomAugspurger Feb 17, 2018
cf423a7
Doc for setitem
TomAugspurger Feb 17, 2018
2d1a66c
Split base tests
TomAugspurger Feb 17, 2018
c849865
Revert test_parquet changes
TomAugspurger Feb 17, 2018
c3ec822
API: Removed _fill_value from the interface
TomAugspurger Feb 17, 2018
f4cf45c
Push coercion to extension dtype till later
TomAugspurger Feb 17, 2018
9c5d479
Linting
TomAugspurger Feb 17, 2018
1175c0d
ERR: Better error message for coercion to 3rd party dtypes
TomAugspurger Feb 17, 2018
c816d99
CLN: Make take_nd EA aware
TomAugspurger Feb 17, 2018
9c9f59e
Revert sparse changes
TomAugspurger Feb 17, 2018
08af9a3
Other _typ for ABCExtensionArray
TomAugspurger Feb 17, 2018
2e992f7
Test cleanup and expansion.
TomAugspurger Feb 17, 2018
cc5cc3e
Copy if copy
TomAugspurger Feb 17, 2018
704ee67
TST: remove self param for fixture
TomAugspurger Feb 18, 2018
c262968
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 19, 2018
8bf0334
Remove unnescessary EA handling in Series ctor
TomAugspurger Feb 20, 2018
c8d88da
API: Removed value_counts
TomAugspurger Feb 20, 2018
24f3b60
More doc notes
TomAugspurger Feb 20, 2018
50bd5dd
Handle expanding a DataFrame with an EA
TomAugspurger Feb 20, 2018
879bc84
Added ExtensionDtype.__eq__
TomAugspurger Feb 20, 2018
33c9d1f
linting
TomAugspurger Feb 21, 2018
f07c166
REF: is_dtype_equal refactor
TomAugspurger Feb 21, 2018
79d43b1
Remove reference to dtype being a class
TomAugspurger Feb 21, 2018
91c629b
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 21, 2018
a1ebf53
move
TomAugspurger Feb 22, 2018
aa57cad
Moved sparse check to take_nd
TomAugspurger Feb 22, 2018
c82748c
Docstring
TomAugspurger Feb 22, 2018
e919343
Split tests
TomAugspurger Feb 22, 2018
1ea74da
Revert index change
TomAugspurger Feb 22, 2018
0c41a34
Copy changes
TomAugspurger Feb 22, 2018
009bece
Simplify EA implementation names
TomAugspurger Feb 22, 2018
ea5562b
Linting
TomAugspurger Feb 22, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 17 additions & 7 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
is_extension_array_dtype,
is_categorical_dtype, is_sparse,
is_period_dtype,
is_numeric_dtype, is_float_dtype,
Expand Down Expand Up @@ -542,10 +543,10 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

else:

if is_categorical_dtype(values) or is_sparse(values):
if is_extension_array_dtype(values) or is_sparse(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_sparse should be recognized as an extension array type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's on my todo list. I'll do it after interval.


# handle Categorical and sparse,
result = Series(values).values.value_counts(dropna=dropna)
result = Series(values)._values.value_counts(dropna=dropna)
result.name = name
counts = result.values

Expand Down Expand Up @@ -1290,10 +1291,12 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
"""
Specialized Cython take which sets NaN values in one pass

This dispatches to ``take`` defined on ExtensionArrays.

Parameters
----------
arr : ndarray
Input array
arr : ndarray, ExtensionArray, DatetimeIndex, IntervalIndex, SparseArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure this is necessary , you can just say array-like (maybe should define that as this)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Input array. SparseArray is densified with ``get_values``
indexer : ndarray
1-D array of indices to take, subarrays corresponding to -1 value
indicies are filed with fill_value
Expand All @@ -1313,16 +1316,23 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None,
If False, indexer is assumed to contain no -1 values so no filling
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.

Returns
-------
subarray : object
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

array

May be the same type as the input, or cast to an ndarray.
"""

# TODO(EA): Remove these if / elifs as datetimeTZ, interval, become EAs
# dispatch to internal type takes
if is_categorical(arr):
return arr.take_nd(indexer, fill_value=fill_value,
allow_fill=allow_fill)
if is_extension_array_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_datetimetz(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_interval_dtype(arr):
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
elif is_sparse(arr):
arr = arr.get_values()

if indexer is None:
indexer = np.arange(arr.shape[axis], dtype=np.int64)
Expand Down
98 changes: 79 additions & 19 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,14 +25,13 @@ class ExtensionArray(object):
* isna
* take
* copy
* _formatting_values
* _concat_same_type

Some additional methods are required to satisfy pandas' internal, private
Some additional methods are available to satisfy pandas' internal, private
block API.

* _concat_same_type
* _can_hold_na
* _formatting_values

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
Expand All @@ -53,13 +52,12 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.

Additionally, certain methods and interfaces are required for proper
this array to be properly stored inside a ``DataFrame`` or ``Series``.
"""
_typ = 'extension' # For pandas.core.dtypes.generic.ABCExtensionArray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add "don't override" for extension authors?

# ------------------------------------------------------------------------
# Must be a Sequence
# ------------------------------------------------------------------------

def __getitem__(self, item):
# type (Any) -> Any
"""Select a subset of self.
Expand Down Expand Up @@ -92,7 +90,41 @@ def __getitem__(self, item):
raise AbstractMethodError(self)

def __setitem__(self, key, value):
# type: (Any, Any) -> None
# type: (Union[int, np.ndarray], Any) -> None
"""Set one or more values inplace.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some notes on the semantics of setitem if people want to take a look.


Parameters
----------
key : int or ndarray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is incongruous with the below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added slice

When called from, e.g. ``Series.__setitem__``, ``key`` will be
one of

* scalar int
* ndarray of integers.
* boolean ndarray
* slice object

value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
value or values to be set of ``key``.

Notes
-----
This method is not required to satisfy the interface. If an
ExtensionArray chooses to implement __setitem__, then some semantics
should be observed:

* Setting multiple values : ExtensionArrays should support setting
multiple values at once, ``key`` will be a sequence of integers and
``value`` will be a same-length sequence.

* Broadcasting : For a sequence ``key`` and a scalar ``value``,
each position in ``key`` should be set to ``value``.

* Coercion : Most users will expect basic coercion to work. For
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is very hard to support / do. I would not mention this, and its up to the EA subclass to do this (and to be honest kind of breaks the contract, IOW this is supposed to be an integer / slice and not a coercible)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed up to the EA author to decide on this, but this is about coercing value, not key, so it does not infer with the contract of what key can be

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section may not be appropriate for the docstring anyway. I'll remove it and add it to the narrative docs (which I've started on another branch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section may not be appropriate for the docstring anyway.

To clarify, we have two consumers of the docs. Library authors implementing an EA and users of that library's EA.

To the extent possible, I think we should write docstrings for users of the EA. We can put notes on expected implementation elsewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possibility (or in addition) is to leave some implementation comments as actual comments. Like

def method(..):
    """
    User oriented docstring
    """
    # some additional
    # developer oriented comments

as I think it is useful to keep some of those things in the code itself, for someone looking at the base class implementation.
I think eg __iter__, nbytes, .. are other examples where we could do such a split

example, a string like ``'2018-01-01'`` is coerced to a datetime
when setting on a datetime64ns array. In general, if the
``__init__`` method coerces that value, then so should ``__setitem__``.
"""
raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)
Expand All @@ -107,6 +139,16 @@ def __len__(self):
# type: () -> int
raise AbstractMethodError(self)

def __iter__(self):
"""Iterate over elements.

This needs to be implemented so that pandas recognizes extension arrays
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be a better doc-string here

as list-like. The default implementation makes successive calls to
``__getitem__``, which may be slower than necessary.
"""
for i in range(len(self)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you not just raising AbstractMethodError?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather you would iterate on some underlying value

We don't know how the subclass is storing things though, so there's no .values to iterate over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you cannot

Yes, we can (see the code). You "don't want" it, that something different :)
But I don't really understand why the above is not a good default implementation. I think a lot subclasses will do exactly this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And some context: If I remove __iter__, the object will still be iterable with essentially this implementation since it defines __getitem__ and __len__. We just need to implement it explicitly so that is_iterable works correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it still might be better raise here by default and force the subclass to implement this (even if this impl). It is SO important for subclasses to pay attention to this, it should just be required.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object) is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.

yield self[i]

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
Expand Down Expand Up @@ -167,6 +209,25 @@ def isna(self):
"""
raise AbstractMethodError(self)

def value_counts(self, dropna=True):
"""Compute a histogram of the counts of non-null values.

Parameters
----------
dropna : bool, default True
Don't include counts of NaN

Returns
-------
value_counts : Series
"""
from pandas import value_counts

if dropna:
self = self[~self.isna()]

return value_counts(np.array(self))

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
Expand All @@ -184,8 +245,8 @@ def take(self, indexer, allow_fill=True, fill_value=None):
will be done. This short-circuits computation of a mask. Result is
undefined if allow_fill == False and -1 is present in indexer.
fill_value : any, default None
Fill value to replace -1 values with. By default, this uses
the missing value sentinel for this type, ``self._fill_value``.
Fill value to replace -1 values with. If applicable, this should
use the sentinel missing value for this type.

Notes
-----
Expand All @@ -198,17 +259,20 @@ def take(self, indexer, allow_fill=True, fill_value=None):

Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array is backed by a NumPy array stored as
``self.data``. Then ``take`` may be written as

.. code-block:: python

def take(self, indexer, allow_fill=True, fill_value=None):
mask = indexer == -1
result = self.data.take(indexer)
result[mask] = self._fill_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, is there a reason we don't define _na_value? (e.g we do on indexes) as a property?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally there was a self._fill_value, but because this was not used internally, we left it out for the moment. The discussion about that is a bit spread, but see eg #19520 (comment) (4th bullet point is about fill_value, and comment below with Tom's answer) and #19520 (comment) and the comments below that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

result[mask] = self._fill_value # NA for this type
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"self._fill_value" might be a bit confusing if that attribute is now removed

return type(self)(result)

See Also
--------
numpy.take
"""
raise AbstractMethodError(self)

Expand All @@ -230,17 +294,12 @@ def copy(self, deep=False):
# ------------------------------------------------------------------------
# Block-related methods
# ------------------------------------------------------------------------
@property
def _fill_value(self):
# type: () -> Any
"""The missing value for this type, e.g. np.nan"""
return None

def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NO, this should be implemented only by subclasses.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very sensible default IMO, subclasses can always override if needed


@classmethod
def _concat_same_type(cls, to_concat):
Expand All @@ -257,6 +316,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interface change: This should have been a property.

# type: () -> bool
"""Whether your array can hold missing values. True by default.
Expand Down
12 changes: 9 additions & 3 deletions pandas/core/dtypes/base.py
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
"""Extend pandas with custom array types"""
import inspect

from pandas.errors import AbstractMethodError


Expand Down Expand Up @@ -106,7 +108,8 @@ def is_dtype(cls, dtype):

Parameters
----------
dtype : str or dtype
dtype : str, object, or type
The dtype to check.

Returns
-------
Expand All @@ -118,12 +121,15 @@ def is_dtype(cls, dtype):

1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. ``dtype`` is a class and is ``cls`` or a subclass of ``cls``.
"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
elif inspect.isclass(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have 2 conditions the same here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issubclass vs. isinstance :) Want to be able to handle .is_dtype(ExtensionDtype) and .is_dtype(ExtensionDtype()).

return issubclass(dtype, cls)
else:
return isinstance(dtype, cls)
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
Original file line number Diff line number Diff line change
Expand Up @@ -1708,9 +1708,9 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
arr_or_dtype = arr_or_dtype._values

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


Expand Down
2 changes: 2 additions & 0 deletions pandas/core/dtypes/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,8 @@ def _check(cls, inst):
ABCDateOffset = create_pandas_abc_type("ABCDateOffset", "_typ",
("dateoffset",))
ABCInterval = create_pandas_abc_type("ABCInterval", "_typ", ("interval", ))
ABCExtensionArray = create_pandas_abc_type("ABCExtensionArray", "_typ",
("extension",))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in practice, this ABCExtensionArray will only work for externally defined extension arrays? (as for the internal ones they have already another _type, like Categorical)
I am not really sure that is actually what we want? As then you can't use this to check if an array-like is an instance of ExtensionArray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or, you can add here 'categorical' to the list of possible _typ's ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We'll need to continue to update that as we make our other internal extension arrays.



class _ABCGeneric(type):
Expand Down
51 changes: 28 additions & 23 deletions pandas/core/dtypes/missing.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,14 +5,16 @@
from pandas._libs import lib, missing as libmissing
from pandas._libs.tslib import NaT, iNaT
from .generic import (ABCMultiIndex, ABCSeries,
ABCIndexClass, ABCGeneric)
ABCIndexClass, ABCGeneric,
ABCExtensionArray)
from .common import (is_string_dtype, is_datetimelike,
is_datetimelike_v_numeric, is_float_dtype,
is_datetime64_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_complex_dtype, is_categorical_dtype,
is_complex_dtype,
is_string_like_dtype, is_bool_dtype,
is_integer_dtype, is_dtype_equal,
is_extension_array_dtype,
needs_i8_conversion, _ensure_object,
pandas_dtype,
is_scalar,
Expand Down Expand Up @@ -57,7 +59,8 @@ def _isna_new(obj):
# hack (for now) because MI registers as ndarray
elif isinstance(obj, ABCMultiIndex):
raise NotImplementedError("isna is not defined for MultiIndex")
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass)):
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
ABCExtensionArray)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above, this won't catch Categoricals ? (while isinstance(.., ExtensionArray) before did catch them)

return _isna_ndarraylike(obj)
elif isinstance(obj, ABCGeneric):
return obj._constructor(obj._data.isna(func=isna))
Expand Down Expand Up @@ -124,30 +127,29 @@ def _use_inf_as_na(key):


def _isna_ndarraylike(obj):

values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if is_categorical_dtype(values):
from pandas import Categorical
if not isinstance(values, Categorical):
values = values.values
result = values.isna()
elif is_interval_dtype(values):
from pandas import IntervalIndex
result = IntervalIndex(obj).isna()
if is_extension_array_dtype(obj):
if isinstance(obj, (ABCIndexClass, ABCSeries)):
values = obj._values
else:
values = obj
result = values.isna()
elif is_interval_dtype(values):
# TODO(IntervalArray): remove this if block
from pandas import IntervalIndex
result = IntervalIndex(obj).isna()
elif is_string_dtype(dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it was already there, but I find the use of is_string_dtype rather confusing here, as intervals are not strings at all (but it just checks for string or object dtype ..).
But maybe can do

elif is_interval_dtype(values):
    ...
elif is string_dtype(values):
    ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of this, pls fix is_string_dtype (to exclude interval as well as period dtypes (you might be able to say object and not extension_type).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix itself once IntervalArray is a proper extension type. Added tests to confirm that extension types are not True for is_string_dtype and is_object_dtype.

# Working around NumPy ticket 1542
shape = values.shape

# Working around NumPy ticket 1542
shape = values.shape

if is_string_like_dtype(dtype):
result = np.zeros(values.shape, dtype=bool)
else:
result = np.empty(shape, dtype=bool)
vec = libmissing.isnaobj(values.ravel())
result[...] = vec.reshape(shape)
if is_string_like_dtype(dtype):
result = np.zeros(values.shape, dtype=bool)
else:
result = np.empty(shape, dtype=bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on what the expected is in the else, e.g. an object array of non-strings

vec = libmissing.isnaobj(values.ravel())
result[...] = vec.reshape(shape)

elif needs_i8_conversion(obj):
# this is the NaT pattern
Expand Down Expand Up @@ -406,4 +408,7 @@ def remove_na_arraylike(arr):
"""
Return array-like containing only true/non-NaN values, possibly empty.
"""
return arr[notna(lib.values_from_object(arr))]
if is_extension_array_dtype(arr):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ultimately values_from_object just calls .get_values() on the object if it exists. This smells like something is missing from the EA interface.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this if / elif can be avoided. values_from_object gets and calls arr.get_values if defined, which returns an ndarray. But we need to call notna on the EA itself, since it has special missing value handling.

return arr[notna(arr)]
else:
return arr[notna(lib.values_from_object(arr))]
Loading