diff --git a/doc/source/development/extending.rst b/doc/source/development/extending.rst
index 8bee0452c2207..9a218e0afadfd 100644
--- a/doc/source/development/extending.rst
+++ b/doc/source/development/extending.rst
@@ -208,6 +208,155 @@ will
 2. call ``result = op(values, ExtensionArray)``
 3. re-box the result in a ``Series``
 
+:class:`~pandas.api.extensions.ExtensionArray` Series Operations Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. versionadded:: 0.25.0
+
+In addition to operators like ``__mul__`` and ``__add__``, the pandas Series
+namespace provides a long list of useful operations such as :meth:`Series.round`,
+:meth:`Series.sum`, :meth:`Series.abs`, etc. Some of these are handled by
+pandas' own algorithm implementations (via a dispatch function), while others
+simply call an equivalent numpy function with data from the underlying array.
+In order to support these operations in a new ExtensionArray, you must provide
+an implementation for them.
+
+As of 0.25.0, pandas provides its own implementations for some
+reduction operations such as min, max, and sum. For your ExtensionArray
+to support these methods, it must include an implementation of
+:meth:`ExtensionArray._reduce`. See its docstring for a complete list of
+the Series operations it handles. Once your EA implements
+:meth:`ExtensionArray._reduce`, your implementation will be called
+whenever one of the related Series methods is called. All these
+methods are reduction functions, and so are expected to return a scalar value
+of some type.
+
+Series operations which are not handled by :meth:`ExtensionArray._reduce`,
+such as :meth:`Series.round`, will generally invoke an equivalent numpy
+function with your extension array as the argument.
+Pandas only guarantees
+that your array will be passed to a numpy function; it does not dictate
+how your ExtensionArray should interact with numpy's dispatch logic
+in order to achieve its goal, since there are several alternative ways
+of achieving similar results.
+
+For the most basic support, the default implementation of :meth:`ExtensionArray.__array__`
+will transparently convert your EA to a numpy object array. You can also
+override it to return any numpy array which suits your case. However,
+this solution usually falls short, because any Series method you then
+use casts your EA into an object ndarray, while you usually want the
+result to remain an instance of your EA.
+
+In most cases, you will want to provide your own implementations of the
+methods. This takes more work, but does a proper job of maintaining the
+ExtensionArray's dtype through operations. Understanding how to do this
+requires a more detailed understanding of how numpy functions operate on
+non-ndarray objects.
+
+Just as pandas handles some operations via :meth:`ExtensionArray._reduce`
+and others by delegating to numpy, numpy makes a distinction
+between two types of operations: ufuncs (such as ``np.floor``, ``np.ceil``,
+and ``np.abs``), and non-ufuncs (for example ``np.round`` and ``np.repeat``).
+
+.. note::
+    Although your methods will override numpy's own methods, they
+    are *not* required to return numpy arrays or built-in Python types. In
+    fact, you will often want your method to return a new instance of your
+    :class:`pandas.api.extensions.ExtensionArray` as the return value.
+
+We will deal with ufuncs first. You can find a list of numpy's ufuncs here
+(TBD).
+In order to support numpy ufuncs, a convenient approach is to implement
+numpy's ``__array_ufunc__`` interface, specified in
+`NEP-13 `__.
+If your ExtensionArray implements a compliant ``__array_ufunc__`` interface,
+when a numpy ufunc such as ``np.floor`` is invoked on your array, its
+implementation of ``__array_ufunc__`` will be called first and given the
+opportunity to compute the function. The return value needn't be a numpy
+ndarray (though it can be). In general, you want the return value to be an
+instance of your ExtensionArray.
+
+With ufuncs out of the way, we turn to the remaining numpy operations, such as
+``np.round``. The simplest way to support these operations is to
+implement a compatible method on your ExtensionArray. For example, if your
+ExtensionArray has a compatible ``round`` method, then when
+:meth:`Series.round` is called, it in turn calls ``np.round(self.array)``,
+passing your EA into numpy's dispatch logic. Numpy will detect that your EA
+implements a compatible ``round`` method and use it instead of its own
+version. As in the ufunc case, your implementation will perform
+the calculation on its internal data, and then usually wrap the
+result in a new instance of your EA class, and return that as the result.
+
+It is usually possible to write generic code to handle most ufuncs,
+instead of providing a special case for each. For an example, see TBD.
+
+.. important::
+
+    When providing implementations of numpy functions such as ``np.round``,
+    you must ensure that the method signature is compatible with the numpy method
+    it implements. If the signatures do not match, numpy will ignore your method.
+
+    For example, the signature for ``np.round`` is ``np.round(a, decimals=0, out=None)``.
+    If you implement a round function which omits the ``out`` keyword,
+
+.. code-block:: python
+
+    def round(self, decimals=0):
+        pass
+
+
+\... numpy will ignore it. The following will work, however:
+
+.. code-block:: python
+
+    def round(self, decimals=0, **kwds):
+        pass
+
+
+A second possible approach to implementing individual operations is to override
+``__getattr__`` in your ExtensionArray, and to intercept requests for method
+names which you wish to support (such as ``round``). For most functions,
+you can return a dynamically generated function, which simply calls
+the numpy function on your existing backing numeric array, wraps
+the result in your ExtensionArray, and returns it. This approach can
+reduce boilerplate significantly, but you have to maintain a whitelist of
+supported names, and you may need more than one case, depending on the signature.
+
+A third possible approach is to use the ``__array_function__`` mechanism
+introduced by numpy's
+`NEP-18 `__
+proposal. NEP-18 is an experimental mechanism introduced in numpy 1.16, and is
+enabled by default starting with numpy 1.17 (to enable it in 1.16, you must
+set the environment variable ``NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=1`` in your
+shell). NEP-18 is an "opt-in, all-in" solution, meaning that if you choose to
+make use of it in your class, by implementing the ``__array_function__``
+interface, it will always be used when (non-ufunc) numpy functions are called
+with an instance of your EA as the argument. Numpy will not make use of an ``__array__``
+method if you have one. If you include both an ``__array_function__`` method and an
+implementation of ``round``, for example, numpy will always invoke ``__array_function__``
+when ``np.round`` is passed an instance of your EA.
+
+.. important::
+    Even if you choose to implement ``__array_function__``, you still need to
+    implement ``__array_ufunc__`` in order to override ufuncs. Each of these
+    two interfaces covers a separate portion of numpy's functionality.
+
+
+With this overview in hand, you hopefully have the information you need
+to develop rich, full-featured ExtensionArrays that seamlessly plug in to pandas.
+EA support is still being actively worked on, so if you encounter a bug, or behaviour
+which does not match what is described here, please report it to the team.
+
+.. important::
+    You are not required to provide implementations for the full complement of Series
+    operations in your ExtensionArray. In fact, some of them may not even make sense
+    within its context. You may also choose to add implementations incrementally,
+    as the need arises.
+
+
+Formatting Extension Arrays
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+TBD
+
 .. _extending.extension.testing:
 
 Testing Extension Arrays