ENH: add NDArrayBackedExtensionArray to public API #45544

Closed
wants to merge 31 commits into from
Changes from 22 commits
1f93779
ENH: add NDArrayBackedExtensionArray to public API
tswast Jan 21, 2022
522b548
add whatsnew
tswast Jan 21, 2022
ee4e23d
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Jan 23, 2022
945f840
add NDArrayBackedExtensionArray to pandas.core.arrays.__init__
tswast Jan 24, 2022
721ae11
add tests for extensions api
tswast Jan 24, 2022
ae68f9d
add docs
tswast Jan 24, 2022
05d0e08
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 24, 2022
1ad0338
Merge remote-tracking branch 'origin/python-db-dtypes-pandas-issue28'…
tswast Jan 24, 2022
38113c8
add autosummary for methods and attributes
tswast Jan 24, 2022
18ec784
remove unreferenced methods from docs
tswast Jan 24, 2022
2919f60
fix docstrings
tswast Jan 25, 2022
0c52366
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 25, 2022
319ac2b
use doc decorator
tswast Jan 26, 2022
8513863
add code samples and reference to test suite
tswast Jan 26, 2022
5309895
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Jan 26, 2022
827f483
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Mar 19, 2022
2cd9b31
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Apr 6, 2022
cc75eda
add missing methods to extension docs
tswast Apr 6, 2022
ca323bb
Merge remote-tracking branch 'origin/python-db-dtypes-pandas-issue28'…
tswast Apr 6, 2022
bfd31f0
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Apr 7, 2022
396da54
Merge branch 'main' into python-db-dtypes-pandas-issue28
jreback Apr 10, 2022
27cf80e
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast May 20, 2022
c716826
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Jun 7, 2022
f4df0e9
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Aug 25, 2022
8876b9a
clarify _validate_searchsorted_value and 2d backing array
tswast Aug 26, 2022
1bdd1cd
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Aug 29, 2022
4b0a948
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Nov 22, 2022
5920778
Merge branch 'python-db-dtypes-pandas-issue28' of github.com:tswast/p…
tswast Nov 22, 2022
38018e6
DOC: make insert docstring have single line summary
tswast Nov 23, 2022
9277cf5
Merge remote-tracking branch 'upstream/main' into python-db-dtypes-pa…
tswast Nov 23, 2022
0b86bd5
Merge branch 'main' into python-db-dtypes-pandas-issue28
tswast Nov 28, 2022
102 changes: 102 additions & 0 deletions doc/source/development/extending.rst
@@ -134,6 +134,108 @@ by some other storage type, like Python lists.
See the `extension array source`_ for the interface definition. The docstrings
and comments contain guidance for properly implementing the interface.

:class:`~pandas.api.extensions.NDArrayBackedExtensionArray`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For ExtensionArrays backed by a single NumPy array, the
:class:`~pandas.api.extensions.NDArrayBackedExtensionArray` class can save you
some effort. It contains a private property ``_ndarray`` with the backing NumPy
array and implements the extension array interface.

Implement the following:

``_box_func``
    Convert from array values to the type you wish to expose to users.

``_internal_fill_value``
    Scalar used to denote an ``NA`` value inside ``self._ndarray``, e.g. ``-1``
    for ``Categorical``, ``iNaT`` for ``Period``.

``_validate_scalar``
    Convert from an object to a value which can be stored in the NumPy array.

``_validate_setitem_value``
    Convert a value or values for use in setting a value or values in the backing
    NumPy array.

``_validate_searchsorted_value``
Member

In 2.0 i think this is going away and we'll re-use _validate_setitem_value for this

Contributor Author

Clarified that most implementations will be identical to _validate_setitem_value.

Member

_validate_searchsorted_value is gone now

Member

can you remove _validate_searchsorted_value here

    Convert a value for use in searching for a value in the backing NumPy array.

.. code-block:: python

    class DateArray(NDArrayBackedExtensionArray):
        _internal_fill_value = numpy.datetime64("NaT")

        def __init__(self, values):
            backing_array_dtype = "<M8[ns]"
Member

can you make this a np.dtype object instead of a string

            super().__init__(values=values, dtype=backing_array_dtype)

        def _box_func(self, x):
            if pandas.isna(x):
                return pandas.NaT
            return x.astype("datetime64[us]").item().date()

        def _validate_scalar(self, scalar):
            if pandas.isna(scalar):
                return numpy.datetime64("NaT")
            elif isinstance(scalar, datetime.date):
                return pandas.Timestamp(
                    year=scalar.year, month=scalar.month, day=scalar.day
                ).to_datetime64()
            else:
                raise TypeError("Invalid value type", scalar)

        def _validate_setitem_value(self, value):
            if pandas.api.types.is_list_like(value):
                return [self._validate_scalar(v) for v in value]
Member

this should be an ndarray of the same dtype as self._ndarray

            return self._validate_scalar(value)

        def _validate_searchsorted_value(self, value):
            return self._validate_setitem_value(value)
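
The conversions these hooks perform can be tried without subclassing anything. The following is a minimal sketch using only NumPy, not the PR's code: the function names (`validate_scalar`, `validate_setitem_value`, `box`) are hypothetical stand-ins for the methods above, `None` stands in for `pandas.NaT`, and returning an ndarray of the backing dtype for list-likes follows the reviewer's suggestion rather than the list comprehension shown in the diff.

```python
import datetime

import numpy as np

# Same backing dtype as the DateArray example (assumption: nanosecond resolution).
BACKING_DTYPE = np.dtype("<M8[ns]")


def validate_scalar(scalar):
    """Convert a user-facing value into something storable in the backing array."""
    if scalar is None:
        return np.datetime64("NaT", "ns")
    if isinstance(scalar, datetime.date):
        # Build a datetime64 from the date, then cast to the backing resolution.
        return np.datetime64(scalar).astype(BACKING_DTYPE)
    raise TypeError("Invalid value type", scalar)


def validate_setitem_value(value):
    """Accept a scalar or a list/tuple; sequences become an ndarray of the backing dtype."""
    if isinstance(value, (list, tuple)):
        return np.array([validate_scalar(v) for v in value], dtype=BACKING_DTYPE)
    return validate_scalar(value)


def box(x):
    """Convert a stored value back into the user-facing type."""
    if np.isnat(x):
        return None
    # item() on a nanosecond value yields an int, so go through microseconds first.
    return x.astype("datetime64[us]").item().date()
```

Round-tripping a `datetime.date` through `validate_scalar` and `box` should give back the same date, which is the property the real `_validate_scalar` and `_box_func` pair is expected to satisfy.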


To support 2D arrays, wrap array-valued results with the ``_from_backing_data``
helper method when a method is called on multi-dimensional data.
Member

specify the data should be of the same dtype as self._ndarray?


.. code-block:: python

    class CustomArray(NDArrayBackedExtensionArray):

        ...

        def min(self, *, axis: Optional[int] = None, skipna: bool = True, **kwargs):
            pandas.compat.numpy.function.validate_min((), kwargs)
            result = pandas.core.nanops.nanmin(
                values=self._ndarray, axis=axis, mask=self.isna(), skipna=skipna
            )
            if axis is None or self.ndim == 1:
                return self._box_func(result)
            return self._from_backing_data(result)
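
The branch at the end of `min` decides between boxing a scalar and re-wrapping an array. A minimal NumPy-only sketch of that decision follows; `Wrapped` is a hypothetical stand-in class (not a pandas type), and `np.nanmin` is used in place of the pandas-internal `nanops.nanmin`.

```python
import numpy as np


class Wrapped:
    """Hypothetical stand-in for an array type that wraps one NumPy array."""

    def __init__(self, values):
        self._ndarray = np.asarray(values, dtype="float64")

    @property
    def ndim(self):
        return self._ndarray.ndim

    def _box_func(self, x):
        # Scalars are exposed to users as plain Python floats.
        return float(x)

    def _from_backing_data(self, arr):
        # Re-wrap an array-valued result in the same container type.
        return type(self)(arr)

    def min(self, *, axis=None):
        result = np.nanmin(self._ndarray, axis=axis)
        if axis is None or self.ndim == 1:
            # The reduction collapsed everything to a single value.
            return self._box_func(result)
        # Reducing a 2D array along one axis leaves a 1D array.
        return self._from_backing_data(result)
```

With 2D data, `min()` returns a boxed scalar while `min(axis=0)` returns another `Wrapped` instance, so callers keep working with the extension type instead of a bare ndarray.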


Subclass the tests in :mod:`pandas.tests.extension.base` in your test suite to
validate your implementation.

.. code-block:: python

    @pytest.fixture
    def data():
        return CustomArray(numpy.arange(-10, 10, 1))


    class Test2DCompat(base.NDArrayBacked2DTests):
        pass


    class TestComparisonOps(base.BaseComparisonOpsTests):
        pass

    ...

    class TestSetitem(base.BaseSetitemTests):
        pass


.. _extending.extension.operator:

:class:`~pandas.api.extensions.ExtensionArray` operator support
21 changes: 21 additions & 0 deletions doc/source/reference/extensions.rst
@@ -24,6 +24,7 @@ objects.
:template: autosummary/class_without_autosummary.rst

api.extensions.ExtensionArray
api.extensions.NDArrayBackedExtensionArray
arrays.PandasArray

.. We need this autosummary so that methods and attributes are generated.
@@ -62,6 +63,26 @@ objects.
api.extensions.ExtensionArray.ndim
api.extensions.ExtensionArray.shape
api.extensions.ExtensionArray.tolist
api.extensions.NDArrayBackedExtensionArray.dtype
api.extensions.NDArrayBackedExtensionArray.argmax
api.extensions.NDArrayBackedExtensionArray.argmin
api.extensions.NDArrayBackedExtensionArray.argsort
api.extensions.NDArrayBackedExtensionArray.astype
api.extensions.NDArrayBackedExtensionArray.dropna
api.extensions.NDArrayBackedExtensionArray.equals
api.extensions.NDArrayBackedExtensionArray.factorize
api.extensions.NDArrayBackedExtensionArray.fillna
api.extensions.NDArrayBackedExtensionArray.insert
api.extensions.NDArrayBackedExtensionArray.isin
api.extensions.NDArrayBackedExtensionArray.isna
api.extensions.NDArrayBackedExtensionArray.searchsorted
api.extensions.NDArrayBackedExtensionArray.shift
api.extensions.NDArrayBackedExtensionArray.take
api.extensions.NDArrayBackedExtensionArray.to_numpy
api.extensions.NDArrayBackedExtensionArray.tolist
api.extensions.NDArrayBackedExtensionArray.unique
api.extensions.NDArrayBackedExtensionArray.value_counts
api.extensions.NDArrayBackedExtensionArray.view

Additionally, we have some utility methods for ensuring your object
behaves correctly.
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -136,6 +136,7 @@ Other enhancements
- :meth:`to_numeric` now preserves float64 arrays when downcasting would generate values not representable in float32 (:issue:`43693`)
- :meth:`Series.reset_index` and :meth:`DataFrame.reset_index` now support the argument ``allow_duplicates`` (:issue:`44410`)
- :meth:`.GroupBy.min` and :meth:`.GroupBy.max` now support `Numba <https://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`45428`)
- :class:`NDArrayBackedExtensionArray` is now exposed in the public API (:issue:`45544`)
- :func:`read_csv` now supports ``defaultdict`` as a ``dtype`` parameter (:issue:`41574`)
- :meth:`DataFrame.rolling` and :meth:`Series.rolling` now support a ``step`` parameter with fixed-length windows (:issue:`15354`)
- Implemented a ``bool``-dtype :class:`Index`; passing a bool-dtype array-like to ``pd.Index`` will now retain ``bool`` dtype instead of casting to ``object`` (:issue:`45061`)
2 changes: 2 additions & 0 deletions pandas/api/extensions/__init__.py
@@ -18,6 +18,7 @@
from pandas.core.arrays import (
    ExtensionArray,
    ExtensionScalarOpsMixin,
    NDArrayBackedExtensionArray,
)

__all__ = [
@@ -30,4 +31,5 @@
    "take",
    "ExtensionArray",
    "ExtensionScalarOpsMixin",
    "NDArrayBackedExtensionArray",
]
2 changes: 2 additions & 0 deletions pandas/core/arrays/__init__.py
@@ -1,3 +1,4 @@
from pandas.core.arrays._mixins import NDArrayBackedExtensionArray
from pandas.core.arrays.base import (
    ExtensionArray,
    ExtensionOpsMixin,
@@ -32,6 +33,7 @@
    "FloatingArray",
    "IntegerArray",
    "IntervalArray",
    "NDArrayBackedExtensionArray",
    "PandasArray",
    "PeriodArray",
    "period_array",
2 changes: 2 additions & 0 deletions pandas/core/arrays/_mixins.py
@@ -116,6 +116,7 @@ def _validate_scalar(self, value):

    # ------------------------------------------------------------------------

    @doc(ExtensionArray.view)
    def view(self, dtype: Dtype | None = None) -> ArrayLike:
        # We handle datetime64, datetime64tz, timedelta64, and period
        # dtypes here. Everything else we pass through to the underlying
@@ -149,6 +150,7 @@ def view(self, dtype: Dtype | None = None) -> ArrayLike:
        # Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]"
        return arr.view(dtype=dtype)  # type: ignore[arg-type]

    @doc(ExtensionArray.take)
    def take(
        self: NDArrayBackedExtensionArrayT,
        indices: TakeIndexer,
28 changes: 28 additions & 0 deletions pandas/tests/api/test_api.py
@@ -8,6 +8,7 @@
import pandas as pd
from pandas import api
import pandas._testing as tm
from pandas.api import extensions


class Base:
@@ -280,6 +281,33 @@ def test_api(self):
        self.check(api, self.allowed)


class TestExtensions(Base):
    # top-level classes
    classes = [
        "ExtensionDtype",
        "ExtensionArray",
        "ExtensionScalarOpsMixin",
        "NDArrayBackedExtensionArray",
    ]

    # top-level functions
    funcs = [
        "register_extension_dtype",
        "register_dataframe_accessor",
        "register_index_accessor",
        "register_series_accessor",
        "take",
    ]

    # misc
    misc = ["no_default"]

    def test_api(self):
        checkthese = self.classes + self.funcs + self.misc

        self.check(namespace=extensions, expected=checkthese)
Member

im not that familiar with this test file. what is being tested here?
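
The `check` helper is defined on the `Base` class, which this diff does not show. A minimal sketch of the idea, assuming it compares the public names of a namespace against an expected list; the `check` function and `fake_extensions` namespace below are hypothetical illustrations, not the pandas implementation.

```python
import types


def check(namespace, expected, ignored=None):
    """Assert that the public names of `namespace` are exactly `expected`."""
    result = {name for name in dir(namespace) if not name.startswith("_")}
    if ignored is not None:
        result -= set(ignored)
    assert sorted(result) == sorted(expected), (sorted(result), sorted(expected))


# A hypothetical namespace standing in for pandas.api.extensions.
fake_extensions = types.SimpleNamespace(
    ExtensionArray=object,
    NDArrayBackedExtensionArray=object,
    take=len,
    _private_helper=None,  # underscore-prefixed names are not part of the API
)

check(fake_extensions, ["ExtensionArray", "NDArrayBackedExtensionArray", "take"])
```

Under that assumption, a test like `TestExtensions.test_api` fails whenever a name is added to or removed from the namespace without the expected list being updated, which guards against accidental changes to the public API.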



class TestTesting(Base):
    funcs = [
        "assert_frame_equal",