
ENH: Allow storing ExtensionArrays in containers #19520

Merged
merged 139 commits into from
Feb 23, 2018
Merged
Changes from 99 commits
139 commits
b7069ef
ENH: non-interval changes
TomAugspurger Feb 2, 2018
9cd92c7
COMPAT: py2 Super
TomAugspurger Feb 3, 2018
9211bbd
BUG: Use original object for extension array
TomAugspurger Feb 3, 2018
80f83a6
Consistent boxing / unboxing
TomAugspurger Feb 3, 2018
ca004d8
32-bit compat
TomAugspurger Feb 3, 2018
5d4a686
Add a test array
TomAugspurger Feb 5, 2018
9f4ad42
linting
TomAugspurger Feb 5, 2018
b1db4e8
Default __iter__
TomAugspurger Feb 5, 2018
00d6bb3
Tests for value_counts
TomAugspurger Feb 5, 2018
1608e3d
Implement value_counts
TomAugspurger Feb 5, 2018
52e2180
Py2 compat
TomAugspurger Feb 5, 2018
e6d06e2
Fixed dropna
TomAugspurger Feb 5, 2018
d356f19
Test fixups
TomAugspurger Feb 6, 2018
a6ae340
Started setitem
TomAugspurger Feb 6, 2018
41f09d8
REF/Clean: Internal / External values
TomAugspurger Feb 3, 2018
29cfd7c
Move to index base
TomAugspurger Feb 6, 2018
05eb9f3
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 6, 2018
82fd0c6
Setitem tests, decimal example
TomAugspurger Feb 7, 2018
8b1e7d6
Compat
TomAugspurger Feb 7, 2018
10af4b6
Fixed extension block tests.
TomAugspurger Feb 7, 2018
cd5f1eb
Clarify binop tests
TomAugspurger Feb 7, 2018
0a9d9fd
TST: Removed ops tests
TomAugspurger Feb 7, 2018
3185f4e
Cleanup unique handling
TomAugspurger Feb 7, 2018
ffa888a
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 7, 2018
5a59591
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 7, 2018
476f75d
Simplify object concat
TomAugspurger Feb 7, 2018
b15ee5a
Use values for intersection
TomAugspurger Feb 7, 2018
659073f
hmm
TomAugspurger Feb 7, 2018
b15ecac
More failing tests
TomAugspurger Feb 7, 2018
88b8f4f
remove bad test
TomAugspurger Feb 7, 2018
349ac1a
better setitem
TomAugspurger Feb 8, 2018
27ab045
Dropna works.
TomAugspurger Feb 8, 2018
8358fb1
Restore xfail test
TomAugspurger Feb 8, 2018
8ef34a9
Test Categorical
TomAugspurger Feb 8, 2018
340d11b
Xfail setitem tests
TomAugspurger Feb 8, 2018
8297888
TST: Skip JSON tests on py2
TomAugspurger Feb 8, 2018
7accb67
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 8, 2018
9b8d2a5
Additional testing
TomAugspurger Feb 8, 2018
9fbac29
More tests
TomAugspurger Feb 8, 2018
55305dc
ndarray_values
TomAugspurger Feb 8, 2018
0e63708
API: Default ExtensionArray.astype
TomAugspurger Feb 8, 2018
fbbbc8a
Simplify concat_as_object
TomAugspurger Feb 8, 2018
46a0a49
Py2 compat
TomAugspurger Feb 8, 2018
2c4445a
Set-ops ugliness
TomAugspurger Feb 8, 2018
5612cda
better docstrings
TomAugspurger Feb 8, 2018
b012c19
tolist
TomAugspurger Feb 8, 2018
d49e6aa
linting
TomAugspurger Feb 8, 2018
d7d31ee
Moved dtypes
TomAugspurger Feb 9, 2018
7b89f1b
clean
TomAugspurger Feb 9, 2018
b0dbffd
cleanup
TomAugspurger Feb 9, 2018
66b936f
NumPy compat
TomAugspurger Feb 9, 2018
32ee0ef
Use base _values for CategoricalIndex
TomAugspurger Feb 9, 2018
a9882e2
Update dev docs
TomAugspurger Feb 9, 2018
f53652a
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
2425621
cleanup
TomAugspurger Feb 9, 2018
fc337de
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
6abe9da
cleanup
TomAugspurger Feb 9, 2018
512fb89
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
c5db5da
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
0b112f2
cleanup
TomAugspurger Feb 9, 2018
170d0c7
Linting
TomAugspurger Feb 9, 2018
402620f
Precision in tests
TomAugspurger Feb 9, 2018
b556f83
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 9, 2018
268aabc
Linting
TomAugspurger Feb 9, 2018
d671259
Move to extension
TomAugspurger Feb 9, 2018
cb740ed
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 9, 2018
d9e8dd6
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 9, 2018
815d202
Push _ndarray_values to ExtensionArray
TomAugspurger Feb 11, 2018
a727b21
Clean up tolist
TomAugspurger Feb 11, 2018
f368c29
Move test locations
TomAugspurger Feb 11, 2018
d74c5c9
Fixed test
TomAugspurger Feb 12, 2018
0ec6958
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
8104ee5
REF: Update per comments
TomAugspurger Feb 12, 2018
4d08218
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
f8e29b9
lint
TomAugspurger Feb 12, 2018
0cd9faa
REF: Use _values for size and shape
TomAugspurger Feb 12, 2018
8fcdb70
PERF: Implement size, shape for IntervalIndex
TomAugspurger Feb 12, 2018
34a6a22
PERF: Avoid materializing values for PeriodIndex shape, size
TomAugspurger Feb 12, 2018
b4de5c4
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 12, 2018
c233c28
Merge remote-tracking branch 'upstream/master' into index-values
TomAugspurger Feb 13, 2018
d6e8051
Cleanup
TomAugspurger Feb 13, 2018
3af8a21
Override nbytes
TomAugspurger Feb 13, 2018
3a5118b
Merge branch 'index-values' into pandas-array-upstream+fu1
TomAugspurger Feb 13, 2018
1e8e87e
Remove unused change
TomAugspurger Feb 13, 2018
0f5e4f0
Docs
TomAugspurger Feb 13, 2018
1436d1d
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 13, 2018
c4dab88
Test cleanpu
TomAugspurger Feb 13, 2018
a312ba5
Always set PANDAS_TESTING_MODE
TomAugspurger Feb 13, 2018
758689f
Revert "Always set PANDAS_TESTING_MODE"
TomAugspurger Feb 13, 2018
02c3d40
Explicitly catch warnings or not
TomAugspurger Feb 13, 2018
9e17037
fastparquet warnings
TomAugspurger Feb 13, 2018
4599db4
Unicode literals strikes again.
TomAugspurger Feb 14, 2018
d34d9ca
Restore circle env var
TomAugspurger Feb 14, 2018
29d2528
More parquet test catching
TomAugspurger Feb 14, 2018
905d72e
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 14, 2018
412c951
No stacklevel
TomAugspurger Feb 14, 2018
78834f1
Lower bound on FP
TomAugspurger Feb 15, 2018
f8eac55
Exact bound for FP
TomAugspurger Feb 15, 2018
f09c863
Don't use fastpath for ExtensionBlock make_block
TomAugspurger Feb 15, 2018
cedb63d
Consistently use _values
TomAugspurger Feb 16, 2018
cae2c26
TST: Additional constructor tests
TomAugspurger Feb 16, 2018
8088096
CLN: de-nested a bit
TomAugspurger Feb 16, 2018
8aed325
_fill_value handling
TomAugspurger Feb 16, 2018
453728a
Handle user provided dtype in constructors.
TomAugspurger Feb 16, 2018
cc13c8d
Document ExtensionBlock._maybe_coerce_values
TomAugspurger Feb 16, 2018
f90ac07
Created ABCExtensionArray
TomAugspurger Feb 16, 2018
4a03b26
TST: Tests for is_object_dtype and is_string_dtype and EAs
TomAugspurger Feb 16, 2018
635223f
fixup! Handle user provided dtype in constructors.
TomAugspurger Feb 17, 2018
cf423a7
Doc for setitem
TomAugspurger Feb 17, 2018
2d1a66c
Split base tests
TomAugspurger Feb 17, 2018
c849865
Revert test_parquet changes
TomAugspurger Feb 17, 2018
c3ec822
API: Removed _fill_value from the interface
TomAugspurger Feb 17, 2018
f4cf45c
Push coercion to extension dtype till later
TomAugspurger Feb 17, 2018
9c5d479
Linting
TomAugspurger Feb 17, 2018
1175c0d
ERR: Better error message for coercion to 3rd party dtypes
TomAugspurger Feb 17, 2018
c816d99
CLN: Make take_nd EA aware
TomAugspurger Feb 17, 2018
9c9f59e
Revert sparse changes
TomAugspurger Feb 17, 2018
08af9a3
Other _typ for ABCExtensionArray
TomAugspurger Feb 17, 2018
2e992f7
Test cleanup and expansion.
TomAugspurger Feb 17, 2018
cc5cc3e
Copy if copy
TomAugspurger Feb 17, 2018
704ee67
TST: remove self param for fixture
TomAugspurger Feb 18, 2018
c262968
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 19, 2018
8bf0334
Remove unnescessary EA handling in Series ctor
TomAugspurger Feb 20, 2018
c8d88da
API: Removed value_counts
TomAugspurger Feb 20, 2018
24f3b60
More doc notes
TomAugspurger Feb 20, 2018
50bd5dd
Handle expanding a DataFrame with an EA
TomAugspurger Feb 20, 2018
879bc84
Added ExtensionDtype.__eq__
TomAugspurger Feb 20, 2018
33c9d1f
linting
TomAugspurger Feb 21, 2018
f07c166
REF: is_dtype_equal refactor
TomAugspurger Feb 21, 2018
79d43b1
Remove reference to dtype being a class
TomAugspurger Feb 21, 2018
91c629b
Merge remote-tracking branch 'upstream/master' into pandas-array-upst…
TomAugspurger Feb 21, 2018
a1ebf53
move
TomAugspurger Feb 22, 2018
aa57cad
Moved sparse check to take_nd
TomAugspurger Feb 22, 2018
c82748c
Docstring
TomAugspurger Feb 22, 2018
e919343
Split tests
TomAugspurger Feb 22, 2018
1ea74da
Revert index change
TomAugspurger Feb 22, 2018
0c41a34
Copy changes
TomAugspurger Feb 22, 2018
009bece
Simplify EA implementation names
TomAugspurger Feb 22, 2018
ea5562b
Linting
TomAugspurger Feb 22, 2018
1 change: 1 addition & 0 deletions circle.yml
@@ -2,6 +2,7 @@ machine:
environment:
# these are globally set
MINICONDA_DIR: /home/ubuntu/miniconda3
PANDAS_TESTING_MODE: deprecate


database:
3 changes: 2 additions & 1 deletion pandas/core/algorithms.py
@@ -15,6 +15,7 @@
is_unsigned_integer_dtype, is_signed_integer_dtype,
is_integer_dtype, is_complex_dtype,
is_object_dtype,
is_extension_array_dtype,
is_categorical_dtype, is_sparse,
is_period_dtype,
is_numeric_dtype, is_float_dtype,
@@ -542,7 +543,7 @@ def value_counts(values, sort=True, ascending=False, normalize=False,

else:

if is_categorical_dtype(values) or is_sparse(values):
if is_extension_array_dtype(values) or is_sparse(values):
Contributor:

is_sparse should be recognized as an extension array type

Contributor Author:

It's on my todo list. I'll do it after interval.


# handle Categorical and sparse,
result = Series(values).values.value_counts(dropna=dropna)
Member:

Shouldn't this be Series(values)._values ? Because I would expect this to fail for anything that is not a categorical (as a category series returns a categorical, but for now others return a ndarray there?)

Contributor Author:

This should be ._values, but

but for now others return a ndarray there?

What types do you mean by others? Internally, our only type returning true here is Categorical (or sparse of is_sparse).

Member:

Yeah, forgot again for a moment that we don't yet have those other internal extension arrays apart from Categorical :)

Contributor Author (TomAugspurger, Feb 16, 2018):

I also wonder why we're doing Series(values).values here. I suspect to handle any of categorical / series / index inputs in a consistent way.

Probably better to make an _unbox_pandas method that extracts the underlying array from an array (no-op) or Series/Index (._values).

80 changes: 70 additions & 10 deletions pandas/core/arrays/base.py
@@ -25,14 +25,14 @@ class ExtensionArray(object):
* isna
* take
* copy
* _formatting_values
* _concat_same_type

Some additional methods are required to satisfy pandas' internal, private
block API.

* _concat_same_type
* _can_hold_na
* _formatting_values
* _fill_value

This class does not inherit from 'abc.ABCMeta' for performance reasons.
Methods and properties required by the interface raise
@@ -53,9 +53,6 @@ class ExtensionArray(object):
Extension arrays should be able to be constructed with instances of
the class, i.e. ``ExtensionArray(extension_array)`` should return
an instance, not error.

Additionally, certain methods and interfaces are required for proper
this array to be properly stored inside a ``DataFrame`` or ``Series``.
"""
# ------------------------------------------------------------------------
# Must be a Sequence
@@ -92,7 +89,37 @@ def __getitem__(self, item):
raise AbstractMethodError(self)

def __setitem__(self, key, value):
# type: (Any, Any) -> None
# type: (Union[int, np.ndarray], Any) -> None
"""Set one or more values inplace.
Contributor Author:

Added some notes on the semantics of setitem if people want to take a look.


Parameters
----------
key : int or ndarray
Contributor:

this is incongruous with the below

Contributor Author:

Added slice

When called from, e.g. ``Series.__setitem__``, ``key`` will
always be an ndarray of integers.
Member:

Is a boolean mask also converted to array of integers by pandas?

If so, maybe explicitly mention that the developer can also implement boolean mask and slicing (to be similar to ndarray), but that this is not required for pandas?

Contributor Author:

Boolean setitem seems to be broken right now. Adding a test.

Member:

Should the above sentence in the docstring be updated?

Contributor:

could also be a slice

value : ExtensionDtype.type, Sequence[ExtensionDtype.type], or object
ExtensionArrays may
Member:

may ... (something missing?)


Notes
-----
This method is not required to satisfy the interface. If an
ExtensionArray chooses to implement __setitem__, then some semantics
should be observed.

* Setting multiple values : ExtensionArrays should support setting
multiple values at once, ``key`` will be a sequence of integers.

* Broadcasting : For a sequence ``key`` and a scalar ``value``,
each position in ``key`` should be set to ``value``.

* Coercion : Most users will expect basic coercion to work. For
Contributor:

this is very hard to support / do. I would not mention this, and its up to the EA subclass to do this (and to be honest kind of breaks the contract, IOW this is supposed to be an integer / slice and not a coercible)

Member:

It's indeed up to the EA author to decide on this, but this is about coercing value, not key, so it does not interfere with the contract of what key can be

Contributor Author:

This section may not be appropriate for the docstring anyway. I'll remove it and add it to the narrative docs (which I've started on another branch).

Contributor Author:

This section may not be appropriate for the docstring anyway.

To clarify, we have two consumers of the docs. Library authors implementing an EA and users of that library's EA.

To the extent possible, I think we should write docstrings for users of the EA. We can put notes on expected implementation elsewhere.

Member:

Another possibility (or in addition) is to leave some implementation comments as actual comments. Like

def method(..):
    """
    User oriented docstring
    """
    # some additional
    # developer oriented comments

as I think it is useful to keep some of those things in the code itself, for someone looking at the base class implementation.
I think eg __iter__, nbytes, .. are other examples where we could do such a split

example, a string like ``'2018-01-01'`` is coerced to a datetime
when setting on a datetime64ns array. In general, if the
``__init__`` method coerces that value, then so should ``__setitem__``.

When called from, e.g. ``Series.__setitem__``, ``key`` will always
be an ndarray of positions.
"""
raise NotImplementedError(_not_implemented_message.format(
type(self), '__setitem__')
)
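The semantics listed in the docstring above (integer/slice/ndarray keys, setting multiple values, broadcasting a scalar, coercing the value the same way `__init__` would) can be sketched with a toy NumPy-backed array; the class name and its `data` attribute are illustrative assumptions, not pandas API:

```python
import numpy as np


class NumpyBackedArray:
    """Toy array sketching the __setitem__ semantics described above.
    Not a full ExtensionArray implementation."""

    def __init__(self, values):
        # coercion applied here is mirrored by __setitem__ below
        self.data = np.asarray(values, dtype="float64")

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item):
        return self.data[item]

    def __setitem__(self, key, value):
        # NumPy assignment already handles int / slice / ndarray keys,
        # broadcasts a scalar value across a sequence key, and coerces
        # value to the array's dtype (float64 here).
        self.data[key] = value


arr = NumpyBackedArray([1, 2, 3, 4])
arr[np.array([0, 2])] = 10   # scalar broadcast across integer positions
arr[1:3] = [7, 8]            # slice key, multiple values
```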
@@ -107,6 +134,16 @@ def __len__(self):
# type: () -> int
raise AbstractMethodError(self)

def __iter__(self):
"""Iterate over elements.

This needs to be implemented so that pandas recognizes extension arrays
Contributor:

can be a better doc-string here

as list-like. The default implementation makes successive calls to
``__getitem__``, which may be slower than necessary.
"""
for i in range(len(self)):
Contributor:

why are you not just raising AbstractMethodError?

Contributor:

this is tied intimately with the boxing concept of scalars, but this should be decoupled from getitem (yes its a valid implementation, but you would never actually do it this way), rather you would iterate on some underlying value and box it, not index into and as a side effect use getitem to box.

Contributor Author:

rather you would iterate on some underlying value

We don't know how the subclass is storing things though, so there's no .values to iterate over.

Contributor:

so my point is that you cannot implement a default here, make the sub-classes do it. yes it might repeat some code, but its much better there i think.

Member:

you cannot

Yes, we can (see the code). You "don't want" it, that's something different :)
But I don't really understand why the above is not a good default implementation. I think a lot of subclasses will do exactly this.

Contributor Author:

And some context: If I remove __iter__, the object will still be iterable with essentially this implementation since it defines __getitem__ and __len__. We just need to implement it explicitly so that is_iterable works correctly.

Contributor:

I think it still might be better to raise here by default and force the subclass to implement this (even if it is this exact implementation). It is SO important for subclasses to pay attention to this, it should just be required.

Member:

You can't do much better than this implementation.
To give you some context: in geopandas, I have a cython method that converts a GeometryArray to numpy object array of Geometry objects, which takes ca 80ms on an array of 10000 points. Simply doing np.array([a[i] for i in range(len(a))], dtype=object) is ca 108 ms. This small difference is because by definition in each iteration I need to create a python object.

yield self[i]
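The point made in the thread above — that an object defining only `__getitem__` and `__len__` still iterates via Python's old sequence protocol, but an explicit `__iter__` is needed for `hasattr`-based checks like pandas' `is_iterable` — can be demonstrated with two toy classes (both hypothetical, for illustration only):

```python
class NoIter:
    """Iterable only through the legacy __getitem__/__len__ protocol."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, i):
        return self.data[i]


class WithIter(NoIter):
    """Adds the same successive-__getitem__ default as the base class
    in the diff above, making hasattr(obj, '__iter__') succeed."""

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]
```

Both classes produce identical iteration results; the difference only shows up in attribute-based iterability checks.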

# ------------------------------------------------------------------------
# Required attributes
# ------------------------------------------------------------------------
@@ -167,6 +204,25 @@ def isna(self):
"""
raise AbstractMethodError(self)

def value_counts(self, dropna=True):
"""Compute a histogram of the counts of non-null values.

Parameters
----------
dropna : bool, default True
Don't include counts of NaN

Returns
-------
value_counts : Series
"""
from pandas import value_counts

if dropna:
self = self[~self.isna()]

return value_counts(np.array(self))
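The default `value_counts` above is roughly equivalent to the following, spelled out on a plain NumPy array instead of `self` (using `Series.value_counts` here as a stand-in for the `pandas.value_counts` function the diff imports):

```python
import numpy as np
import pandas as pd

# the dropna=True path of the default implementation above
values = np.array([1.0, 1.0, 2.0, np.nan])
kept = values[~pd.isna(values)]           # self = self[~self.isna()]
counts = pd.Series(kept).value_counts()   # value_counts(np.array(self))
```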

# ------------------------------------------------------------------------
# Indexing methods
# ------------------------------------------------------------------------
@@ -198,9 +254,8 @@ def take(self, indexer, allow_fill=True, fill_value=None):

Examples
--------
Suppose the extension array somehow backed by a NumPy structured array
and that the underlying structured array is stored as ``self.data``.
Then ``take`` may be written as
Suppose the extension array is backed by a NumPy array stored as
``self.data``. Then ``take`` may be written as

.. code-block:: python

@@ -209,6 +264,10 @@ def take(self, indexer, allow_fill=True, fill_value=None):
result = self.data.take(indexer)
result[mask] = self._fill_value
Contributor:

hmm, is there a reason we don't define _na_value? (e.g we do on indexes) as a property?

Member:

Originally there was a self._fill_value, but because this was not used internally, we left it out for the moment. The discussion about that is a bit spread, but see eg #19520 (comment) (4th bullet point is about fill_value, and comment below with Tom's answer) and #19520 (comment) and the comments below that

Contributor:

ok that's fine. I think we do need / want a way to set this and a property is good, and _na_value is used elsewhere (eg. in Index)

return type(self)(result)

See Also
--------
numpy.take
"""
raise AbstractMethodError(self)
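The `take` sketch in the docstring above can be made runnable for a plain float array; `data` and `fill_value` here stand in for the hypothetical `self.data` and `self._fill_value`, and the `-1`-means-missing convention follows the pandas indexer convention the docstring assumes:

```python
import numpy as np


def take_with_fill(data, indexer, fill_value=np.nan):
    """Runnable version of the docstring sketch: -1 in the indexer
    marks a missing position, filled with fill_value afterwards."""
    indexer = np.asarray(indexer)
    mask = indexer == -1
    # take() on -1 grabs the last element; the mask overwrites it below
    result = np.asarray(data, dtype="float64").take(indexer)
    result[mask] = fill_value
    return result


out = take_with_fill([10.0, 20.0, 30.0], [0, -1, 2])
```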

@@ -240,7 +299,7 @@ def _formatting_values(self):
# type: () -> np.ndarray
# At the moment, this has to be an array since we use result.dtype
"""An array of values to be printed in, e.g. the Series repr"""
raise AbstractMethodError(self)
return np.array(self)
Contributor:

NO, this should be implemented only by subclasses.

Contributor Author:

Why?

Member:

This is a very sensible default IMO, subclasses can always override if needed


@classmethod
def _concat_same_type(cls, to_concat):
@@ -257,6 +316,7 @@ def _concat_same_type(cls, to_concat):
"""
raise AbstractMethodError(cls)

@property
def _can_hold_na(self):
Contributor Author:

Interface change: This should have been a property.

# type: () -> bool
"""Whether your array can hold missing values. True by default.
4 changes: 4 additions & 0 deletions pandas/core/arrays/categorical.py
@@ -2141,6 +2141,10 @@ def repeat(self, repeats, *args, **kwargs):
def _can_hold_na(self):
return True

@property
def _fill_value(self):
return np.nan
Member:

In general this fill_value would depend on the type of the categories ...
Did you get an error somewhere if this was not added?

Contributor Author:

Yes... I'll fix that.

Contributor:

should be:

self.categories._na_value
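The reviewer's point — that the right fill value depends on the dtype of the categories — can be seen via the private `Index._na_value` attribute referenced above (private pandas API, shown here only to illustrate the suggestion):

```python
import numpy as np
import pandas as pd

# object categories -> nan; datetime categories -> NaT
obj_na = pd.Categorical(["a", "b"]).categories._na_value
dt_na = pd.Categorical(pd.to_datetime(["2018-01-01"])).categories._na_value
```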


@classmethod
def _concat_same_type(self, to_concat):
from pandas.core.dtypes.concat import _concat_categorical
12 changes: 9 additions & 3 deletions pandas/core/dtypes/base.py
@@ -1,4 +1,6 @@
"""Extend pandas with custom array types"""
import inspect

from pandas.errors import AbstractMethodError


@@ -106,7 +108,8 @@ def is_dtype(cls, dtype):

Parameters
----------
dtype : str or dtype
dtype : str, object, or type
The dtype to check.

Returns
-------
@@ -118,12 +121,15 @@ def is_dtype(cls, dtype):

1. ``cls.construct_from_string(dtype)`` is an instance
of ``cls``.
2. 'dtype' is ``cls`` or a subclass of ``cls``.
2. ``dtype`` is an object and is an instance of ``cls``
3. 'dtype' is a class and is ``cls`` or a subclass of ``cls``.
Contributor:

shouldn't this have double-ticks?

"""
if isinstance(dtype, str):
try:
return isinstance(cls.construct_from_string(dtype), cls)
except TypeError:
return False
else:
elif inspect.isclass(dtype):
Contributor:

you have 2 conditions the same here

Contributor Author:

issubclass vs. isinstance :) Want to be able to handle .is_dtype(ExtensionDtype) and .is_dtype(ExtensionDtype()).

return issubclass(dtype, cls)
else:
return isinstance(dtype, cls)
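The issubclass-vs-isinstance distinction discussed above can be reproduced with a toy dtype class (hypothetical, not pandas code) using the same three-way branching as the diff:

```python
import inspect


class ToyDtype:
    """Toy stand-in reproducing the is_dtype branching above."""

    @classmethod
    def construct_from_string(cls, string):
        if string != "toy":
            raise TypeError("cannot construct from %r" % string)
        return cls()

    @classmethod
    def is_dtype(cls, dtype):
        if isinstance(dtype, str):
            try:
                return isinstance(cls.construct_from_string(dtype), cls)
            except TypeError:
                return False
        elif inspect.isclass(dtype):
            # handles .is_dtype(ToyDtype) -- the class itself
            return issubclass(dtype, cls)
        else:
            # handles .is_dtype(ToyDtype()) -- an instance
            return isinstance(dtype, cls)
```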
2 changes: 1 addition & 1 deletion pandas/core/dtypes/common.py
@@ -1708,9 +1708,9 @@ def is_extension_array_dtype(arr_or_dtype):
"""
from pandas.core.arrays import ExtensionArray

# we want to unpack series, anything else?
if isinstance(arr_or_dtype, (ABCIndexClass, ABCSeries)):
arr_or_dtype = arr_or_dtype._values

return isinstance(arr_or_dtype, (ExtensionDtype, ExtensionArray))


30 changes: 19 additions & 11 deletions pandas/core/dtypes/missing.py
@@ -10,9 +10,10 @@
is_datetimelike_v_numeric, is_float_dtype,
is_datetime64_dtype, is_datetime64tz_dtype,
is_timedelta64_dtype, is_interval_dtype,
is_complex_dtype, is_categorical_dtype,
is_complex_dtype,
is_string_like_dtype, is_bool_dtype,
is_integer_dtype, is_dtype_equal,
is_extension_array_dtype,
needs_i8_conversion, _ensure_object,
pandas_dtype,
is_scalar,
@@ -52,12 +53,15 @@ def isna(obj):


def _isna_new(obj):
from ..arrays import ExtensionArray
Contributor:

pls don't do this. rather define an ABCExtensionArray. instead. This is the purpose of these, to avoid smell like this.


if is_scalar(obj):
return libmissing.checknull(obj)
# hack (for now) because MI registers as ndarray
elif isinstance(obj, ABCMultiIndex):
raise NotImplementedError("isna is not defined for MultiIndex")
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass)):
elif isinstance(obj, (ABCSeries, np.ndarray, ABCIndexClass,
ExtensionArray)):
return _isna_ndarraylike(obj)
elif isinstance(obj, ABCGeneric):
return obj._constructor(obj._data.isna(func=isna))
@@ -124,17 +128,18 @@ def _use_inf_as_na(key):


def _isna_ndarraylike(obj):

values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if is_categorical_dtype(values):
from pandas import Categorical
if not isinstance(values, Categorical):
values = values.values
result = values.isna()
elif is_interval_dtype(values):
if is_extension_array_dtype(obj):
if isinstance(obj, (ABCIndexClass, ABCSeries)):
values = obj._values
else:
values = obj
result = values.isna()
elif is_string_dtype(dtype):
Member:

I know it was already there, but I find the use of is_string_dtype rather confusing here, as intervals are not strings at all (but it just checks for string or object dtype ..).
But maybe can do

elif is_interval_dtype(values):
    ...
elif is string_dtype(values):
    ...

Contributor:

instead of this, pls fix is_string_dtype (to exclude interval as well as period dtypes (you might be able to say object and not extension_type).

Contributor Author:

This will fix itself once IntervalArray is a proper extension type. Added tests to confirm that extension types are not True for is_string_dtype and is_object_dtype.

if is_interval_dtype(values):
# TODO(IntervalArray): remove this if block
from pandas import IntervalIndex
result = IntervalIndex(obj).isna()
else:
@@ -406,4 +411,7 @@ def remove_na_arraylike(arr):
"""
Return array-like containing only true/non-NaN values, possibly empty.
"""
return arr[notna(lib.values_from_object(arr))]
if is_extension_array_dtype(arr):
Contributor:

ultimately values_from_object just calls .get_values() on the object if it exists. This smells like something is missing from the EA interface.

Contributor Author:

I don't think this if / elif can be avoided. values_from_object looks up arr.get_values and calls it if it is defined, which returns an ndarray. But we need to call notna on the EA itself, since it has special missing-value handling.

+        return arr[notna(arr)]
+    else:
+        return arr[notna(lib.values_from_object(arr))]
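The resulting branch can be sketched with public APIs only (`remove_na_arraylike_sketch` is a hypothetical stand-in, with `pd.notna` and `np.asarray` replacing the private helpers):

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype

def remove_na_arraylike_sketch(arr):
    # Extension arrays do their own missing-value handling, so notna is
    # called on the array itself rather than on an ndarray view of it.
    if is_extension_array_dtype(arr):
        return arr[pd.notna(arr)]
    return arr[pd.notna(np.asarray(arr))]

s = pd.Series(pd.Categorical(["a", None, "b"]))
print(list(remove_na_arraylike_sketch(s)))
```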
21 changes: 9 additions & 12 deletions pandas/core/frame.py
@@ -39,6 +39,7 @@
     is_categorical_dtype,
     is_object_dtype,
     is_extension_type,
+    is_extension_array_dtype,
     is_datetimetz,
     is_datetime64_any_dtype,
     is_datetime64tz_dtype,
@@ -71,7 +72,7 @@
     create_block_manager_from_arrays,
     create_block_manager_from_blocks)
 from pandas.core.series import Series
-from pandas.core.arrays import Categorical
+from pandas.core.arrays import Categorical, ExtensionArray
 import pandas.core.algorithms as algorithms
 from pandas.compat import (range, map, zip, lrange, lmap, lzip, StringIO, u,
                            OrderedDict, raise_with_traceback)
@@ -511,7 +512,7 @@ def _get_axes(N, K, index=index, columns=columns):
             index, columns = _get_axes(len(values), 1)
             return _arrays_to_mgr([values], columns, index, columns,
                                   dtype=dtype)
-        elif is_datetimetz(values):
+        elif (is_datetimetz(values) or is_extension_array_dtype(values)):
Member:

What happens if the dtype is an extension dtype? (like the check above for categorical)
But certainly fine with not supporting it for now.

Similar for Series, we need to either define what pd.Series(values, dtype=SomeExtensionDtype()) does, or either error on it.

Contributor Author:

In [7]: pd.Series([0, 1, 2], dtype=cyberpandas.IPType())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-7fc01964da1d> in <module>()
----> 1 pd.Series([0, 1, 2], dtype=cyberpandas.IPType())

~/sandbox/pandas-ip/pandas/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    246             else:
    247                 data = _sanitize_array(data, index, dtype, copy,
--> 248                                        raise_cast_failure=True)
    249
    250                 data = SingleBlockManager(data, index, fastpath=True)

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _sanitize_array(data, index, dtype, copy, raise_cast_failure)
   3214         if dtype is not None:
   3215             try:
-> 3216                 subarr = _try_cast(data, False)
   3217             except Exception:
   3218                 if raise_cast_failure:  # pragma: no cover

~/sandbox/pandas-ip/pandas/pandas/core/series.py in _try_cast(arr, take_fast_path)
   3168             subarr = maybe_cast_to_datetime(arr, dtype)
   3169             if not is_extension_type(subarr):
-> 3170                 subarr = np.array(subarr, dtype=dtype, copy=copy)
   3171         except (ValueError, TypeError):
   3172             if is_categorical_dtype(dtype):

TypeError: data type not understood

How should we handle this? I'm fine with raising with a better error message. We don't know how to cast from (arbitrary array, extension_dtype) -> extension array.

Although... maybe we could support Series(extension_array, dtype=extension_dtype).

Oh hey, that "works"

In [10]: pd.Series(cyberpandas.IPArray([0, 1, 2]), dtype=cyberpandas.IPType())
Out[10]:
0    0.0.0.0
1    0.0.0.1
2    0.0.0.2
dtype: ip

But it only works since we ignore the dtype entirely :)

I will raise if data.dtype is not dtype-equal to the passed dtype, I think.
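The agreeing-dtype case can be sketched with Categorical, the one extension type already in pandas at this point:

```python
import pandas as pd

# Constructing a Series from an extension array with a matching explicit
# dtype: the dtypes agree, so no cast is needed.
cat = pd.Categorical(["a", "b", "a"])
s = pd.Series(cat, dtype=cat.dtype)
print(s.dtype)
```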

Member:

also for things like astype(SomeExtensionDtype) we will need an informative error message

Contributor:

isn't a datetimetz an extension type now? Why can't you remove the first part of the clause?

Contributor Author:

Nope, not yet. Just Categorical so far. I have Interval and Period started, but no PRs yet.

             # GH19157
             if columns is None:
                 columns = [0]
@@ -2819,15 +2820,15 @@ def reindexer(value):
             # now align rows
             value = reindexer(value).T
 
-        elif isinstance(value, Categorical):
+        elif isinstance(value, ExtensionArray):
             value = value.copy()
 
         elif isinstance(value, Index) or is_sequence(value):
             from pandas.core.series import _sanitize_index
 
             # turn me into an ndarray
             value = _sanitize_index(value, self.index, copy=False)
-            if not isinstance(value, (np.ndarray, Index)):
+            if not isinstance(value, (np.ndarray, Index, ExtensionArray)):
Contributor:

you can use is_array_like here (and maybe in other places)

Contributor:

?

Contributor Author:

I'll just remove this change until we support array-backed indexes.

                 if isinstance(value, list) and len(value) > 0:
                     value = maybe_convert_platform(value)
                 else:
@@ -2849,7 +2850,7 @@ def reindexer(value):
             value = maybe_cast_to_datetime(value, value.dtype)
 
         # return internal types directly
-        if is_extension_type(value):
+        if is_extension_type(value) or is_extension_array_dtype(value):
Contributor Author:

This is an unfortunate intermediate stage where some of our internal extension types (sparse, datetime w/ tz) are not yet actual extension array types. We'll be able to clean this up eventually.
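In later pandas releases both categorical and tz-aware datetime data became genuine extension dtypes, so a single check suffices; a quick illustration:

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype

# Both of these now report as extension dtypes, which is why the twin
# checks eventually collapsed into one.
cat_check = is_extension_array_dtype(pd.Series(pd.Categorical(["a"])))
tz_check = is_extension_array_dtype(
    pd.Series(pd.date_range("2018-01-01", periods=1, tz="UTC")))
print(cat_check, tz_check)
```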

             return value
 
         # broadcast across multiple columns if necessary
@@ -3386,12 +3387,8 @@ class max type
         new_obj = self.copy()
 
         def _maybe_casted_values(index, labels=None):
-            if isinstance(index, PeriodIndex):
-                values = index.astype(object).values
-            elif isinstance(index, DatetimeIndex) and index.tz is not None:
-                values = index
-            else:
-                values = index.values
+            values = index._values
Member:

Can you put a small comment here what the intent of the function is?

Contributor Author:

hmm I'm not really sure :) It also seems to just be used in one place, so this seems like an opportunity for cleanup.

This change is possible because PeriodIndex.values / _values is just .astype(object).values, and DatetimeIndex._values is self with a TZ.
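The tz case is easy to verify in current pandas, where `.values` drops the timezone while `._values` keeps it:

```python
import pandas as pd

dti = pd.date_range("2018-01-01", periods=2, tz="US/Eastern")
# .values is a timezone-naive UTC ndarray; ._values preserves the tz.
naive_dtype = dti.values.dtype
aware_dtype = dti._values.dtype
print(naive_dtype, aware_dtype)
```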

+            if not isinstance(index, (PeriodIndex, DatetimeIndex)):
Contributor:

what are you trying to do here?

Contributor Author:

This is identical to what happened before in the else branch. Seems like it's doing some type inference / coercion.

                 if values.dtype == np.object_:
                     values = lib.maybe_convert_objects(values)
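The inference step amounts to the following (note `lib.maybe_convert_objects` is a private pandas API, shown here only for illustration):

```python
import numpy as np
from pandas._libs import lib

# An object array of plain ints is inferred back to a proper int64 array.
vals = np.array([1, 2, 3], dtype=object)
converted = lib.maybe_convert_objects(vals)
print(converted.dtype)
```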

@@ -5640,7 +5637,7 @@ def count(self, axis=0, level=None, numeric_only=False):
         if len(frame._get_axis(axis)) == 0:
             result = Series(0, index=frame._get_agg_axis(axis))
         else:
-            if frame._is_mixed_type:
Contributor:

This whole count thing should be cleaned up anyhow; can you add an issue about this? The optimization is not worth it and makes the code more complex (not about your changes, but the mixed-type stuff).

+            if frame._is_mixed_type or frame._data.any_extension_types:
                 result = notna(frame).sum(axis=axis)
             else:
                 counts = notna(frame.values).sum(axis=axis)
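The fallback path simply counts non-missing cells column-wise, which works for extension-backed columns too:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": pd.Categorical(["x", None, "x"])})
# notna works element-wise on both the float and the categorical column.
counts = pd.notna(df).sum(axis=0)
print(counts.tolist())
```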
4 changes: 3 additions & 1 deletion pandas/core/indexes/base.py
@@ -13,6 +13,7 @@
 from pandas import compat
 
 from pandas.core.accessor import CachedAccessor
+from pandas.core.arrays import ExtensionArray
 from pandas.core.dtypes.generic import (
     ABCSeries, ABCDataFrame,
     ABCMultiIndex,
@@ -2002,6 +2003,7 @@ def _format_with_header(self, header, na_rep='NaN', **kwargs):
 
         if is_categorical_dtype(values.dtype):
             values = np.array(values)
+
         elif is_object_dtype(values.dtype):
             values = lib.maybe_convert_objects(values, safe=1)
 
@@ -2601,7 +2603,7 @@ def get_value(self, series, key):
         # if we have something that is Index-like, then
         # use this, e.g. DatetimeIndex
         s = getattr(series, '_values', None)
-        if isinstance(s, Index) and is_scalar(key):
+        if isinstance(s, (ExtensionArray, Index)) and is_scalar(key):
Contributor:

is_array_like might work here

         try:
             return s[key]
         except (IndexError, ValueError):
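What this change enables, seen from the user side: scalar lookup on a Series backed by an extension array goes straight through to the array:

```python
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c"]))
# Scalar label lookup dispatches into the backing Categorical.
val = s[1]
print(val)
```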