ENH: Extending Pandas with custom types #19174
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -108,3 +108,4 @@ doc/tmp.sv | |
doc/source/styled.xlsx | ||
doc/source/templates/ | ||
env/ | ||
.mypy_cache |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -140,3 +140,64 @@ As an example of fully-formed metadata: | |
'metadata': None} | ||
], | ||
'pandas_version': '0.20.0'} | ||
|
||
.. _developer.custom-array-types: | ||
|
||
Custom Array Types | ||
------------------ | ||
|
||
.. versionadded:: 0.23.0 | ||
|
||
.. warning:: | ||
Support for custom array types is experimental. | ||
|
||
Sometimes the NumPy type system isn't rich enough for your needs. Pandas has | ||
made a few extensions internally (e.g. ``Categorical``). While this has worked | ||
well for pandas, not all custom data types belong in pandas itself. | ||
|
||
Pandas defines an interface for custom arrays. Arrays implementing this | ||
interface will be stored correctly in ``Series`` or ``DataFrame``. The ABCs | ||
that must be implemented are | ||
|
||
1. :class:`ExtensionDtype`: A class describing your data type itself. This is | ||
similar to a ``numpy.dtype``. | ||
2. :class:`ExtensionArray`: A container for your data. | ||
|
||
Throughout this document, we'll use the example of storing IPv6 addresses. An | ||
IPv6 address is 128 bits, so NumPy doesn't have a native data type for it. We'll | ||
model it as a structured array with two ``uint64`` fields, which together | ||
represent the 128-bit integer that is the IP address. | ||
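As a rough illustration of the two-field layout described above, the sketch below splits a 128-bit address into high and low 64-bit halves and reassembles it. It uses only the standard-library ``ipaddress`` module instead of a NumPy structured array, and the helper names ``split_ipv6`` and ``join_ipv6`` are hypothetical; they are not part of this PR.

```python
import ipaddress

# Mask selecting the low 64 bits of a 128-bit integer.
MASK64 = (1 << 64) - 1


def split_ipv6(address):
    """Return the (hi, lo) 64-bit halves of an IPv6 address string."""
    value = int(ipaddress.IPv6Address(address))
    return value >> 64, value & MASK64


def join_ipv6(hi, lo):
    """Rebuild an IPv6Address from its two 64-bit halves."""
    return ipaddress.IPv6Address((hi << 64) | lo)


hi, lo = split_ipv6("2001:db8::1")
roundtrip = join_ipv6(hi, lo)
```

Each ``(hi, lo)`` pair corresponds to one record of the ``[('hi', '>u8'), ('lo', '>u8')]`` structured dtype used in the example that follows.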
|
||
Extension Dtype | ||
''''''''''''''' | ||
|
||
This class should describe your data type. The most important fields are | ||
``name`` and ``base``: | ||
|
||
.. code-block:: python | ||
|
||
class IPv6Type(ExtensionDtype): | ||
name = 'IPv6' | ||
base = np.dtype([('hi', '>u8'), ('lo', '>u8')]) | ||
type = IPTypeType | ||
kind = 'O' | ||
fill_value = np.array([(0, 0)], dtype=base) | ||
Review comment: Clarify that these fields can be properties, e.g., based on parameters set in the constructor. |
||
|
||
``base`` describes the underlying storage of individual items in your array. | ||
TODO: is this true? Or does ``.base`` refer to the original memory this | ||
is a view on? Different meanings for ``np.dtype.base`` vs. ``np.ndarray.base``? | ||
|
||
In our IPAddress case, we're using a NumPy structured array with two fields. | ||
|
||
Extension Array | ||
''''''''''''''' | ||
|
||
This is the actual array container for your data, similar to a ``Categorical``, | ||
and requires the most work to implement correctly. *pandas makes no assumptions | ||
about how you store the data*. You're free to use NumPy arrays or PyArrow | ||
arrays, or even just Python lists. That said, several of the methods required by | ||
the interface expect NumPy arrays as the return value. | ||
|
||
* ``dtype``: Should be an *instance* of your custom ``ExtensionDtype`` | ||
* ``formatting_values(self)``: Used for printing Series and DataFrame | ||
* ``concat_same_type(concat)``: Used in :func:`pd.concat` |
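To make the shape of the interface more concrete, here is a minimal sketch of a container backed by a plain Python list that exposes the three members listed above. ``ToyDtype`` and ``ListBackedArray`` are hypothetical names, the classes do not subclass the real ABCs, and the signatures (for example ``concat_same_type`` as a classmethod taking a sequence) are assumptions rather than the interface defined in this PR. It also returns lists where the real interface may expect NumPy arrays.

```python
class ToyDtype:
    """Hypothetical dtype object; it only carries a name."""

    name = "toy"


class ListBackedArray:
    """Toy container storing its data in a Python list."""

    def __init__(self, values):
        self._data = list(values)
        self._dtype = ToyDtype()

    @property
    def dtype(self):
        # An *instance* of the dtype class, not the class itself.
        return self._dtype

    def __len__(self):
        return len(self._data)

    def formatting_values(self):
        # Values as they should appear when printed.
        return [repr(value) for value in self._data]

    @classmethod
    def concat_same_type(cls, to_concat):
        # Join several arrays of this same type into one.
        merged = []
        for arr in to_concat:
            merged.extend(arr._data)
        return cls(merged)


left = ListBackedArray([1, 2])
right = ListBackedArray([3])
combined = ListBackedArray.concat_same_type([left, right])
```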
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,2 @@ | ||
""" public toolkit API """ | ||
from . import types, extensions # noqa |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
from pandas.core.extensions import ( # noqa | ||
ExtensionArray, | ||
ExtensionDtype, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -6,6 +6,7 @@ | |
from warnings import warn, catch_warnings | ||
import numpy as np | ||
|
||
from pandas.core.extensions import ExtensionArray | ||
from pandas.core.dtypes.cast import ( | ||
maybe_promote, construct_1d_object_array_from_listlike) | ||
from pandas.core.dtypes.generic import ( | ||
|
@@ -22,7 +23,7 @@ | |
is_categorical, is_datetimetz, | ||
is_datetime64_any_dtype, is_datetime64tz_dtype, | ||
is_timedelta64_dtype, is_interval_dtype, | ||
is_scalar, is_list_like, | ||
is_scalar, is_list_like, is_extension_type, | ||
_ensure_platform_int, _ensure_object, | ||
_ensure_float64, _ensure_uint64, | ||
_ensure_int64) | ||
|
@@ -542,9 +543,12 @@ def value_counts(values, sort=True, ascending=False, normalize=False, | |
|
||
else: | ||
|
||
if is_categorical_dtype(values) or is_sparse(values): | ||
|
||
# handle Categorical and sparse, | ||
if (is_extension_type(values) and not | ||
is_datetime64tz_dtype(values)): | ||
# Need the not is_datetime64tz_dtype since it's actually | ||
Review comment: huh? this is not friendly |
||
# an ndarray. It doesn't have a `.values.value_counts`. | ||
# Perhaps we need a new is_extension_type method that | ||
# distinguishes these... | ||
result = Series(values).values.value_counts(dropna=dropna) | ||
result.name = name | ||
counts = result.values | ||
|
@@ -1323,6 +1327,8 @@ def take_nd(arr, indexer, axis=0, out=None, fill_value=np.nan, mask_info=None, | |
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill) | ||
elif is_interval_dtype(arr): | ||
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill) | ||
elif isinstance(arr, ExtensionArray): | ||
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill) | ||
|
||
if indexer is None: | ||
indexer = np.arange(arr.shape[axis], dtype=np.int64) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -880,7 +880,7 @@ def _map_values(self, mapper, na_action=None): | |
if isinstance(mapper, ABCSeries): | ||
# Since values were input this means we came from either | ||
# a dict or a series and mapper should be an index | ||
if is_extension_type(self.dtype): | ||
if is_extension_type(self): | ||
values = self._values | ||
else: | ||
values = self.values | ||
|
@@ -891,7 +891,8 @@ def _map_values(self, mapper, na_action=None): | |
return new_values | ||
|
||
# we must convert to python types | ||
if is_extension_type(self.dtype): | ||
Review comment: it needs to be iterable |
||
# TODO: is map part of the interface? | ||
if is_extension_type(self) and hasattr(self._values, 'map'): | ||
values = self._values | ||
if na_action is not None: | ||
raise NotImplementedError | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -43,6 +43,7 @@ | |
from pandas.io.formats.terminal import get_terminal_size | ||
from pandas.util._validators import validate_bool_kwarg | ||
from pandas.core.config import get_option | ||
from pandas.core.extensions import ExtensionArray | ||
|
||
|
||
def _cat_compare_op(op): | ||
|
@@ -409,6 +410,11 @@ def dtype(self): | |
"""The :class:`~pandas.api.types.CategoricalDtype` for this instance""" | ||
return self._dtype | ||
|
||
@property | ||
def _block_type(self): | ||
Review comment: I'm still thinking about how best to handle this. The conflict is
Review comment: Why do we want to use CategoricalBlock instead of ExtensionBlock with categorical dtype? Because that would require more changes in the code?
Review comment: Yeah, I'm looking through at the methods
Trying to balance complexity of implementation for 3rd-parties here. |
||
from pandas.core.internals import CategoricalBlock | ||
return CategoricalBlock | ||
|
||
@property | ||
def _constructor(self): | ||
return Categorical | ||
|
@@ -2131,6 +2137,15 @@ def repeat(self, repeats, *args, **kwargs): | |
return self._constructor(values=codes, categories=self.categories, | ||
ordered=self.ordered, fastpath=True) | ||
|
||
|
||
# TODO: Categorical does not currently implement | ||
# - concat_same_type | ||
# - can_hold_na | ||
# We don't need to implement these, since they're just for | ||
# Block things, and we only use CategoricalBlocks for categoricals. | ||
# We could move that logic from CategoricalBlock to Categorical, | ||
# but holding off for now. | ||
ExtensionArray.register(Categorical) | ||
# The Series.cat accessor | ||
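The ``ExtensionArray.register(Categorical)`` call in the hunk above relies on virtual-subclass registration from Python's ``abc`` module. The sketch below shows that mechanism in isolation; ``ArrayInterface`` and ``LegacyArray`` are hypothetical stand-ins for ``ExtensionArray`` and ``Categorical``, not the classes from this PR.

```python
from abc import ABC, abstractmethod


class ArrayInterface(ABC):
    """Hypothetical stand-in for the ExtensionArray ABC."""

    @abstractmethod
    def take(self, indexer):
        """Return a new array holding the items at the given positions."""


class LegacyArray:
    """Stand-in for an existing class that does not inherit from the ABC."""

    def __init__(self, values):
        self.values = list(values)

    def take(self, indexer):
        return LegacyArray(self.values[i] for i in indexer)


# Register LegacyArray as a virtual subclass. No methods are checked or
# added; only isinstance/issubclass results change.
ArrayInterface.register(LegacyArray)

legacy = LegacyArray("abc")
taken = legacy.take([2, 0])
```

Because registration performs no verification, a registered class can omit parts of the interface, which is what the TODO about ``concat_same_type`` and ``can_hold_na`` in the hunk describes.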
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -568,7 +568,6 @@ def is_string_dtype(arr_or_dtype): | |
""" | ||
|
||
# TODO: gh-15585: consider making the checks stricter. | ||
|
||
if arr_or_dtype is None: | ||
return False | ||
try: | ||
|
@@ -1624,11 +1623,13 @@ def is_bool_dtype(arr_or_dtype): | |
|
||
def is_extension_type(arr): | ||
""" | ||
Check whether an array-like is of a pandas extension class instance. | ||
Check whether an array-like is a pandas extension class instance. | ||
|
||
Extension classes include categoricals, pandas sparse objects (i.e. | ||
classes represented within the pandas library and not ones external | ||
to it like scipy sparse matrices), and datetime-like arrays. | ||
to it like scipy sparse matrices), and datetime-like arrays with | ||
timezones, or any third-party objects satisfying the pandas array | ||
interface. | ||
|
||
Parameters | ||
---------- | ||
|
@@ -1646,39 +1647,44 @@ def is_extension_type(arr): | |
False | ||
>>> is_extension_type(np.array([1, 2, 3])) | ||
False | ||
>>> | ||
|
||
Categoricals | ||
>>> cat = pd.Categorical([1, 2, 3]) | ||
>>> | ||
>>> is_extension_type(cat) | ||
True | ||
>>> is_extension_type(pd.Series(cat)) | ||
True | ||
|
||
pandas' Sparse arrays | ||
>>> is_extension_type(pd.SparseArray([1, 2, 3])) | ||
True | ||
>>> is_extension_type(pd.SparseSeries([1, 2, 3])) | ||
True | ||
>>> | ||
>>> from scipy.sparse import bsr_matrix | ||
>>> is_extension_type(bsr_matrix([1, 2, 3])) | ||
False | ||
>>> is_extension_type(pd.DatetimeIndex([1, 2, 3])) | ||
False | ||
|
||
pandas' datetime with timezone | ||
>>> is_extension_type(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern")) | ||
True | ||
>>> | ||
>>> dtype = DatetimeTZDtype("ns", tz="US/Eastern") | ||
>>> s = pd.Series([], dtype=dtype) | ||
>>> is_extension_type(s) | ||
True | ||
""" | ||
|
||
if is_categorical(arr): | ||
return True | ||
elif is_sparse(arr): | ||
return True | ||
elif is_datetimetz(arr): | ||
return True | ||
return False | ||
# XXX: we have many places where we call this with a `.dtype`, | ||
# instead of a type. Think about supporting that too... | ||
from pandas.core.extensions import ExtensionArray, ExtensionDtype | ||
return (isinstance(arr, ExtensionArray) or | ||
Review comment: this needs just to satisfy |
||
isinstance(getattr(arr, 'values', None), ExtensionArray) or | ||
# XXX: I don't like this getattr('dtype'), but I think it's | ||
# necessary since DatetimeIndex().values of a datetime w/ tz | ||
# is just a regular numpy array, and not an instance of | ||
# ExtensionArray. I think that's since | ||
# datetime (without tz) is *not* an extension type, but | ||
# datetime[tz] *is* an extension type. | ||
isinstance(getattr(arr, 'dtype', None), ExtensionDtype)) | ||
|
||
|
||
def is_complex_dtype(arr_or_dtype): | ||
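The new return expression in ``is_extension_type`` checks three things in turn: the object itself, its ``.values`` attribute, and its ``.dtype`` attribute. A self-contained sketch of that pattern follows, using hypothetical stand-in classes (``FakeExtensionArray``, ``FakeExtensionDtype``, ``Container``) in place of the pandas types.

```python
class FakeExtensionArray:
    """Hypothetical stand-in for ExtensionArray."""


class FakeExtensionDtype:
    """Hypothetical stand-in for ExtensionDtype."""


class Container:
    """Stand-in for a Series/Index-like wrapper with values and dtype."""

    def __init__(self, values, dtype=None):
        self.values = values
        self.dtype = dtype


def looks_like_extension(obj):
    # getattr with a None default keeps plain objects (lists, ints)
    # from raising AttributeError.
    return (
        isinstance(obj, FakeExtensionArray)
        or isinstance(getattr(obj, "values", None), FakeExtensionArray)
        or isinstance(getattr(obj, "dtype", None), FakeExtensionDtype)
    )


wrapped_array = Container(FakeExtensionArray())
dtype_only = Container([1, 2], dtype=FakeExtensionDtype())
plain = Container([1, 2])
```

The ``dtype_only`` case mirrors the situation described in the code comment, where the stored values are a regular array and only the dtype marks the object as an extension type.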
|
Review comment: you mentioned you removed `base` from the interface