
ENH: Extending Pandas with custom types #19174

Closed

Conversation

TomAugspurger
Contributor

Adds an interface for 3rd party libraries to define custom array types. When
pandas encounters these objects, we'll resist the temptation to cast them to
object dtype.

This isn't near being ready yet, but I wanted to make a PR to get some thoughts
on the general approach / scope. I wouldn't recommend reviewing the code in
detail yet.
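For readers skimming the diff, the overall shape being proposed is roughly a pair of base classes: a dtype object describing the logical type, and an array object holding the values. A minimal, hypothetical sketch (names and required methods are illustrative only, not the final interface):

```python
import abc

import numpy as np


class ExtensionDtype(metaclass=abc.ABCMeta):
    """The dtype half of the interface: describes the logical type."""
    name = None  # a string identifying the dtype, e.g. 'ip'
    kind = 'O'   # NumPy-style kind code


class ExtensionArray(metaclass=abc.ABCMeta):
    """The array half: roughly the minimum pandas would need to call."""

    @property
    @abc.abstractmethod
    def dtype(self):
        """An ExtensionDtype instance describing the values."""

    @abc.abstractmethod
    def __len__(self):
        pass

    @abc.abstractmethod
    def isna(self):
        """Boolean ndarray marking missing entries."""


class ToyDtype(ExtensionDtype):
    name = 'toy'


class ToyArray(ExtensionArray):
    """A trivial implementation backed by a float ndarray."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    @property
    def dtype(self):
        return ToyDtype()

    def __len__(self):
        return len(self._values)

    def isna(self):
        return np.isnan(self._values)
```

A third-party library would subclass both, and when pandas encounters such an array it would store it as-is rather than casting to object dtype.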

Some questions:

  1. Do we want to do this?

    The big concern being the pandas2 transition. I've tried to hide all
    block-related changes in an adapter class. I don't think that allowing
    extension types will hinder the transition in a major way.

  2. Should our current extension types implement the interface?

    Categorical essentially satisfies it already.

    Interval basically does, but there isn't an IntervalArray class really,
    just an IntervalIndex.

    Period is the same as Interval.

    SparseArray is a bit weird since it's an actual subclass of ndarray.
    Needs some more thought.

    datetime-with-tz is also weird (ndarray, just with a custom type).

  3. What to do with indexes?

    Having an ExtensionIndex will be useful I think (value_counts, groupby,
    etc.). But I haven't tackled it yet. Haven't really thought about indexing at
    all, since it sounds hard.

  4. How much of the ndarray implementation should we push onto the pandas array
    interface?

Probably some others, but that should kick things off. For those interested in
following along, I'm using this over in
https://github.com/continuumio/pandas-ip. It's nice to have a concrete problem
to solve.

Closes #18767

Define an interface for custom types.
@TomAugspurger TomAugspurger added API Design Internals Related to non-user accessible pandas implementation labels Jan 10, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Jan 10, 2018
@pep8speaks

pep8speaks commented Jan 10, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 12, 2018 at 20:10 UTC

@@ -409,6 +410,11 @@ def dtype(self):
"""The :class:`~pandas.api.types.CategoricalDtype` for this instance"""
return self._dtype

@property
def _block_type(self):
Contributor Author

I'm still thinking about how best to handle this. The conflict is

  1. We don't want 3rd party libs to worry about Blocks at all
  2. We (maybe) want Categorical, etc. to be instances of ExtensionArray
  3. We do want to use CategoricalBlock instead of the default ExtensionBlock

Member

Why do we want to use CategoricalBlock instead of ExtensionBlock with categorical dtype? Because that would require more changes in the code?
As in theory /ideally, the categorical custom operations should be all defined in the array (Categorical) and the Block should not have to worry about the type?

Contributor Author

Yeah, I'm looking through at the methods CategoricalBlock implements, that aren't part of the interface

  1. fillna: should probably be added
  2. interpolate: Actual interpolation probably isn't important enough, but I think this is called indirectly for fillna?
  3. shift
  4. to_native_types

Trying to balance complexity of implementation for 3rd-parties here.

Member

@shoyer shoyer left a comment

In general I think this is great. It might be worth trying to simplify the interface here a little bit, e.g., by insisting that it's only for 1D data.

My current understanding from @wesm (which I agree with) is that the plan is for pandas2 to be a new namespace and exist alongside pandas. So this shouldn't be a problem from a forwards-compatibility perspective.

base = np.dtype([('hi', '>u8'), ('lo', '>u8')])
type = IPTypeType
kind = 'O'
fill_value = np.array([(0, 0)], dtype=base)
Member

Clarify that these fields can be properties, e.g., based on parameters set in the constructor.
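For example (hypothetical, loosely following the IP-address dtype sketched in the hunk above), the fields could be computed from constructor parameters instead of being fixed class attributes:

```python
import numpy as np


class IPDtype(object):
    """Hypothetical dtype where the interface fields are properties
    derived from constructor parameters, not fixed class attributes."""

    def __init__(self, byteorder='>'):
        self._byteorder = byteorder

    @property
    def base(self):
        # storage layout: high and low 64 bits of a 128-bit address
        return np.dtype([('hi', self._byteorder + 'u8'),
                         ('lo', self._byteorder + 'u8')])

    @property
    def name(self):
        return 'ip'

    @property
    def kind(self):
        return 'O'

    @property
    def fill_value(self):
        return np.array([(0, 0)], dtype=self.base)
```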

"""
@property
@abc.abstractmethod
def type(self) -> type:
Member

These should probably be comment type annotations instead :)

pass

@abc.abstractmethod
def to_dense(self):
Member

I don't understand what this does/indicates.

# ------------------------------------------------------------------------
@abc.abstractmethod
def take(self, indexer, allow_fill=True, fill_value=None):
"""For slicing"""
Member

It would be good to clarify if this is the NumPy version of take, or the pandas internal version that uses -1 to indicate fill values.

If it's the latter (as I suspect is the case), it could be valuable to expose some of our internal utility functions for this rather than force users to re-implement it. It is non-trivial to get right in a performant way.
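For reference, the pandas-internal convention can be sketched like this (a simple, not performance-tuned stand-in for the internal utilities mentioned; name and signature are illustrative):

```python
import numpy as np


def take_with_fill(values, indexer, fill_value=np.nan):
    """Pandas-internal 'take' semantics: -1 in the indexer marks a
    missing slot to be filled with fill_value, unlike np.take where
    -1 wraps around to the last element."""
    indexer = np.asarray(indexer, dtype='int64')
    # take first (a -1 temporarily grabs the last element), then
    # overwrite the marked slots with the fill value
    result = np.asarray(values, dtype='float64').take(indexer)
    result[indexer == -1] = fill_value
    return result
```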


@property
@abc.abstractmethod
def shape(self) -> T.Tuple[int, ...]:
Member

Perhaps all ExtensionArray objects should be required to be 1D?

That would be quite a bit simpler. Full support for ND data would probably require quite a few other methods (e.g., multidimensional indexing, transpose, multi-dimensional concatenation).

Contributor Author

That's reasonable for now I think.

return issubclass(dtype, cls)


class ExtensionArray(metaclass=abc.ABCMeta):
Member

Suppose you want to implement arithmetic or aggregations for your custom dtype (e.g., consider a units dtype). In principle almost any function that does computation might need to be overloaded in a dtype specific way. (This is what NumPy does for ufuncs.)

Would this be done via defining special methods on ExtensionArray? Or do we simply say that this isn't possible?

Contributor Author

At the moment I've ignored arithmetic methods and reductions. I'm OK with saying it isn't possible for now, but am open to allowing it in the future.

Member

I've been working on making Series/Index/DataFrame arithmetic methods behave predictably by sharing implementations (in particular datetime and timedelta types). Eventually the implementations need to get out of core.ops and into some type of mixin classes that get mixed into both Index subclasses and Block subclasses (e.g. TimestampVector --> DatetimeIndex, DatetimeBlock). The base classes for that look a lot like ExtensionArray here; it's probably worth figuring out how these relate sooner rather than later.

Contributor Author

@TomAugspurger TomAugspurger Jan 12, 2018

Agreed. If you look at core/period.py I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work, which doesn't feel great.

Member

I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work

But does a Series with periods also rely on those methods? (which for real extensions would be important; and if not, I am not sure we should add it to PeriodArray et al)

Member

I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work, which doesn't feel great.

Which methods did you need to get from DatetimeIndexOpsMixin? There's a lot of Index-specific stuff in that class that isn't needed for a vector base class.

But does a Series with periods also rely on those methods? (which for real extensions would be important; and if not, I am not sure we should add it to PeriodArray et al)

@jorisvandenbossche not at the moment, no. But the idea as I understand it is that PeriodArray would get mixed into something like PeriodBlock, then Series[Period] would dispatch certain operations to self._data.blocks[0].__add__ or whatever.


@classmethod
def concat_same_type(cls, to_concat):
return cls(np.concatenate([arr.data.to_pandas() for arr in to_concat]))
Member

You probably don't want to actually use np.concatenate() here, since converting to NumPy arrays can be slow and/or lossy.
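A sketch of the alternative: concatenate the backing storage directly, so no per-element boxing or lossy conversion happens. RecordArray and its layout here are made up for illustration:

```python
import numpy as np


class RecordArray(object):
    """Toy extension array backed by a structured ndarray. Concatenation
    works on the backing store directly, avoiding a slow and/or lossy
    round-trip through boxed object arrays."""

    base = np.dtype([('hi', 'u8'), ('lo', 'u8')])

    def __init__(self, data):
        self.data = np.asarray(data, dtype=self.base)

    def __len__(self):
        return len(self.data)

    @classmethod
    def concat_same_type(cls, to_concat):
        # all inputs share cls.base, so this is exact and cheap
        return cls(np.concatenate([arr.data for arr in to_concat]))
```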

@property
@abc.abstractmethod
def base(self) -> np.dtype:
# TODO
Member

It is not clear to me how this makes sense if the data is not backed by a NumPy array. I would be inclined to remove it from the interface if possible.

Perhaps it could be used by an optional NumpyExtensionArray base-class that pre-implements some methods if you are using a NumPy array as the backing store.

@@ -76,7 +78,169 @@
from pandas.compat import range, map, zip, u


class Block(PandasObject):
class BlockOpsMixin(object):
Member

It looks like there is a core.internal_ops that is empty. Was some of this intended to go in there? I very much like the idea of having blocks ("vectors"?) implement their arithmetic methods, so Series/DataFrame can just pass through to those implementations. shoyer opened an issue to that effect IIRC.

Contributor Author

Ah, these were temporarily in internal_ops and shared between a couple files, but then I made a refactor to remove internal_ops.

Contributor

@jreback jreback left a comment

this needs to be done in a different way.

  1. add a reasonable interface to the ExtensionDtype/Array
  2. move pandas extensions to subclass this one by one
  3. then actually change core code

it's a bit counterintuitive to do it this way, but you end up changing so much code that needs review that I don't think it is avoidable.

There are many many things done to 'just make things work'. So these need to be cleaned up to be more general before you can even attempt this change.

# handle Categorical and sparse,
if (is_extension_type(values) and not
is_datetime64tz_dtype(values)):
# Need the not is_datetime64tz_dtype since it's actually
Contributor

huh? this is not friendly

@@ -891,7 +891,8 @@ def _map_values(self, mapper, na_action=None):
return new_values

# we must convert to python types
if is_extension_type(self.dtype):
Contributor

it needs to be iterable

# XXX: we have many places where we call this with a `.dtype`,
# instead of a type. Think about supporting that too...
from pandas.core.extensions import ExtensionArray, ExtensionDtype
return (isinstance(arr, ExtensionArray) or
Contributor

this needs just to satisfy is_array_like (which is an iterable with a dtype)
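A minimal sketch of that criterion (illustrative only, not the actual pandas implementation of is_array_like):

```python
import numpy as np


def is_array_like(obj):
    """Sketch of the criterion named above: an iterable that also
    carries a dtype attribute."""
    return hasattr(obj, '__iter__') and hasattr(obj, 'dtype')
```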

@jreback jreback removed this from the 0.23.0 milestone Jan 11, 2018
@TomAugspurger
Contributor Author

move pandas extensions to subclass this one by one

Yeah, I wanted to make sure we were comfortable with that. So we'll have a PeriodArray / PeriodIndex in the same way we have a Categorical / CategoricalIndex. Likewise with Interval.

@TomAugspurger
Contributor Author

  1. Removed ExtensionDtype.base from the interface.
  2. Removed ExtensionArray.to_dense from the interface.

Refactored PeriodIndex to be backed by PeriodArray, and IntervalIndex to be backed by IntervalArray.

Still a WIP.

@TomAugspurger
Contributor Author

FYI, I haven't thought through whether PeriodArray / IntervalArray should be part of the public API in the same way that Categorical is. On the one hand, you'll get them when you do pd.Series(pd.period_range("2000", periods=10)), so they should probably be documented, etc. On the other hand, it's just one more thing to worry about. Is it useful to create a pandas.PeriodArray(...) on its own?

@@ -128,7 +131,9 @@ def _isna_ndarraylike(obj):
values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if isinstance(values, ExtensionArray):
Member

Any reason not to duck-type this to check for objects that implement isna method? I had a use case a while back where it would have been nice.

Contributor Author

I have a slight preference to using the interface here, instead of duck-typing, just since it's part of the interface (and I think isna is important enough to make it mandatory).

Virtual subclasses can register themselves if they want pd.isna to work, but don't want to inherit from ExtensionArray.
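The virtual-subclass mechanism is standard abc machinery; a toy illustration (class names are hypothetical stand-ins, not pandas classes):

```python
import abc

import numpy as np


class ExtensionArray(metaclass=abc.ABCMeta):
    """Stand-in for the proposed base class."""

    @abc.abstractmethod
    def isna(self):
        pass


class ThirdPartyArray(object):
    """Implements the interface without inheriting from it."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    def isna(self):
        return np.isnan(self._values)


# Registering makes isinstance()/issubclass() checks (as would be
# used by pd.isna) recognize the class, with no inheritance required.
ExtensionArray.register(ThirdPartyArray)
```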

@jorisvandenbossche
Member

Thanks a lot for working on this! This is related to the work I have been doing on allowing custom Block types, sparked by geopandas (but this is more general!), so I very much like to see this area improve. I should also test the implementation with geopandas.

Some general feedback:

I think the alternative we should consider is to better support custom Block types (although this might be harder in practice, as currently in many places arrays are passed around and not blocks), which I started for concat. From the developer point of view, I think this is very similar, as the additional methods on the ExtensionArray you need to implement are basically just wrapped in the ExtensionBlock (I mean, writing an ExtensionArray or an ExtensionBlock will probably be more or less the same code).
I agree that from a user point of view (a user of such an ExtensionArray), the array-way can be nicer. One of the main bottlenecks with the block-way, which you encountered with the ip addresses, was that you then cannot do Series(ipaddresses_array), as this doesn't work for blocks. Also, the actual block has a lot of additional methods, making it less nice as the 'array' object that users are confronted with in the extension package.
However, I think this problem can also be solved in a different way (eg agreeing on Series et al checking for a .dtype that is an ExtensionDtype subclass, or for a _data attribute that is a custom block, ...) when going with the "extension block" way.

When designing a new pandas, I think we want array objects for all dtypes (and then a dataframe is a collection of such arrays with more methods / indexing capabilities), so in that regard this is a nice move. On the other hand, it is maybe too late in the pandas evolution to go this route for current pandas itself as well? We already have blocks, and also adding those arrays might complicate things further? (the goal of having arrays should partly be to not have blocks)

This also relates to the question of to what extent the pandas arrays (PeriodArray, IntervalArray, DatetimeTZArray) should be part of the public interface. In the ideal case, I think the answer would be yes (and eg Series.values (or another more backward-compatible attribute) should return those arrays). But again, it might be too late for pandas and add additional user complexity to change this now.

What to do with indexes?

I think in principle it should be possible to have an Index with the underlying values an ExtensionArray (which is actually a good reason for the array-way, as in such a case we could try to combine things that are now duplicated in eg DatetimeIndex and Series with DatetimeBlock in a single place as the DatetimeArray)

Final note: I haven't really thought through in practice my comments above on using blocks instead, so it is quite possible that your current approach of an extension array is just much better/easier! (also for current pandas, and not only "in the ideal case of designing from scratch")

@jorisvandenbossche
Member

Sorry for my long post above. Trying to summarize my worries more briefly:
I like the idea of Arrays, but I have the feeling that implementing the internal extension types (categorical, datetimetz, etc.) based on Arrays, if we do it properly, is a lot of work with lots of changes, and thus also many opportunities to introduce new bugs and regressions. So is that worth it for current pandas?
An option is to start smaller and focus for now on only getting ExtensionArray working for actual external types (which might also be a more digestible change to review). On the other hand, the internal extensions are good dogfooding to see if the extension interface works.

To add to my comments above, I think we should at some point also solicit some more feedback (on the mailing list) on the extension array interface.

We should also think about the broader implications of the change, eg for downstream users. Eg what happens if statsmodels or scikit-learn encounter dataframes with extension types? (Not our responsibility?)

There are also plans to make the numpy dtypes more extensible, which are more concrete now with the funding, I think. Not saying that we should wait on that (as it can still take a couple of years), but it might be another reason to make our extension arrays not too public (in case we use arrays for our internal extension types).

Member

@jorisvandenbossche jorisvandenbossche left a comment

Some more concrete comments!

'''''''''''''''

This class should describe your data type. The most important fields are
``name`` and ``base``:
Member

you mentioned you removed base from the interface

@property
def type(self):
# type: () -> T.Any
"""Typically a metaclass inheriting from 'type' with no methods."""
Member

This is not a very clear explanation IMO

Contributor Author

Probably since I don't understand what it's used for :) In my next push, this won't be abstract. I'll have a default implementation that seems to work, so subclasses shouldn't have to worry about it.


@abc.abstractmethod
def __len__(self):
pass

Member

Should there be a __setitem__ as well to be able to specify how assigning new data to the series with such an array should update the array?
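A toy sketch of what such a hook could look like (hypothetical, not part of the current diff):

```python
import numpy as np


class ToyArray(object):
    """Toy array sketching the __setitem__ hook asked about above:
    Series assignment could forward to this method so the array
    controls how new data is stored."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        # the array decides validation/coercion for incoming values
        self._values[key] = value
```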

return None

@abc.abstractmethod
def formatting_values(self):
Member

I have the feeling that those block-related methods like formatting_values, concat_same_type, get_values, slice, .. should be private (so with underscore, but still part of the contract), because they are not that useful for direct usage of an ExtensionArray ?

Member

Further, in geopandas I think I also needed to implement view and ravel on the underlying GeometryArray to get things working

Contributor Author

I've gone back and forth on this. Apparently, including private methods in an interface is bad practice. But we have two sets of "users" here

  1. library authors implementing the interface. To them, formatting_values is not private
  2. End-users of those library authors. To them, formatting_values is private.

I'd be curious to hear others' thoughts here.


def slice(self, slicer):
# TODO: is this right?
# In general, no. Probably just remove it?
Member

Currently in geopandas, I needed a custom slicer on the Block level (so I think we need at least something here or on the block level)

Contributor Author

Sorry, by "remove" I meant remove the default implementation and make it abstract.


if fill_tuple is None:
fill_value = self.fill_value
new_values = algos.take_nd(values, indexer, axis=axis,
Member

Shouldn't this use the ExtensionArray's take method? (so values.take)

return "{}({})".format(self.__class__.__name__, self._format_data())

@property
def values(self):
Member

Should this be called values on the Array?
I think there are in general (although already not fully applicable in the Interval case, but it is for periods) two possible arrays: the actual storage array (eg the integers in case of Periods, the record array in case of ipaddress) or a numpy array with boxed objects.
Both should be accessible I think (although the first is maybe a specific Array class implementation detail and should only be used inside the class itself?), but I'm not sure the second should be .values.
The question is also what __array__ should be if we add it?



class PeriodArray(DatetimeIndexOpsMixin, ExtensionArray):
dtype = PeriodDtype()
Member

shouldn't this dtype hold the freq when it is an actual instance?

@jbrockmendel
Member

I think the alternative we should consider is to better support custom Block types

@jorisvandenbossche You might want to weigh in on the block-registry discussion in #19189.

@TomAugspurger
Contributor Author

Thanks Joris, some replies below. I'm still pretty convinced that a public ExtensionArray is better than a public ExtensionBlock. LMK if anything below is / isn't convincing.

I think the alternative we should consider is to better support custom Block types

Initially I had a custom block as part of the interface, but removed it since I
don't think 3rd-party libraries should have to worry about blocks. I agree that
it'll be about the same work for 3rd parties either way. And I think the
internal changes will look about the same too? Either way, when a user does
Series[CustomType].isna() we'll do either

  1. ExtensionBlock.isna(), which calls their CustomArray.isna(), or
  2. CustomBlock.isna()

My main motivation for array over block is not having to expose as much of
pandas' internals to the 3rd party dtypes. If we have an ExtensionBlock as
part of the interface, then we'll have to provide some class for a library
author to subclass. Presumably we'll want to share some code between that public
ExtensionBlock class and our own block classes, which limits the internal
refactors we can do without breaking API.

We already have blocks, and also adding those arrays might complicate things
further? (the goal of having arrays should partly be to not have blocks)

I think this argues for my approach :) If we have 3rd parties writing custom
blocks, then we end up with one new block per extension type. With my current
approach we only have a single new block, ExtensionBlock that can hold any
dtyped array.

I like the idea of Arrays, but I have the feeling that implementing the
internal extension types (categorical, datetimetz, etc) based on Arrays, if we
do it properly, is a lot of work with lots of changes.

FWIW, we don't have to refactor the current code to use extension arrays. We
can declare that they're specifically for 3rd-parties. But I do think it's a
worthwhile refactor since

  1. It's a nice enhancement: we can stick periods / intervals into a Series
    without casting to object.
  2. It can be a cleanup, since we'll be able to remove some if is_categorical(); elif is_interval() ... blocks with if is_extension()
  3. I'm mostly done with the refactor :)

Either way, I'll continue on with this since Categorical / Period /
Interval are good tests for what the interface needs to include.

There are also plans to make the numpy dtypes more extensible

Yes, I'll reach out to the NumPy devs when this is a bit more stable.

Review-wise, it may make sense to split this into multiple PRs:

@jorisvandenbossche
Member

My main motivation for array over block is not having to expose as much of
pandas' internals to the 3rd party dtypes. If we have an ExtensionBlock as
part of the interface, then we'll have to provide some class for a library
author to subclass.

Yes, this is probably a good reason. My idea at first was: this doesn't really matter, as it is only for the developers of third-party libraries and they can handle some complexity (if it was documented how to do it, the explanation for CustomArray or CustomBlock would not be that different or much more complex in either case), and because I thought of our Blocks as a bit 'set in stone' for the current version. But I agree (after working a while with it) that keeping open the possibility to do some refactorings / clean-up in this area is still potentially valuable (eg the non-consolidating discussion).

It's a nice enhancement: we can stick periods / intervals into a Series
without casting to object

Yes, that would be nice indeed (although it's not necessarily an argument, as a similar but different effort could have created blocks for them; but it would be a very nice consequence of the effort). But just to be clear, this is not yet the case in the currently pushed PR?

Either way, I'll continue on with this since Categorical / Period /
Interval are good tests for what the interface needs to include.

Yep, I know, that's certainly a good reason to do that :-) The question however still is to what extent they would be exposed publicly (but that can be a separate discussion later on)

Review-wise, it may make sense to split this into multiple PRs:

From a reviewer standpoint, that would certainly be easier. But I am not sure how easy it would be for you. Any idea how you would want to split it?
Jeff proposed a possible stepwise approach above of first doing the extension interface and then only moving our internal ones. But it is maybe also an idea to start with our internal ones: for example just implement and use a PeriodArray (with the interface we expect it to be), that would make more clear what all the consequences internally are (but again, not sure how easy it would be to split this).

@jreback
Contributor

jreback commented Jan 14, 2018

So I think a reasonable architecture would be to form a new set of modules.
pandas.core.extension, with sub-modules: block, array, dtype, exposing Block, Array, and Dtype primitives.

Then using these, re-write CategoricalBlock as a Block subclass, Categorical as an Array subclass, and CategoricalDtype as a Dtype subclass.

Then do the same for DatetimeTZBlock. Then add PeriodBlock and IntervalBlock support.

Then maybe even do this for Sparse.

Then of course, you would ideally just create a NumpyArray (that inherits Array) as well for internal conformity.

By this time I think the api will be very very clear of what needs to be in these super-classes in general.

These can all be done incrementally w/o disrupting things or causing very long and sweeping code reviews (which are bad generally).

This is not actually as much work as I am painting (at least for Categorical and dtz), as it's mostly about figuring out what should go in the super-classes and what should end up in the sub-classes.

pandas has very full featured extension classes, so what better way to design an API than to use an extensive test suite and an already full featured implementation.

Note that, on purpose, these extensions should be 'separate' from the internals; the internals will simply 'use' these blocks generically and conform to their exposed API rather than peer inside, as we do now. I suspect our internals can be quite a lot simpler if things are organized along these lines.

@TomAugspurger
Contributor Author

So I think a reasonable architecture would be to form a new set of modules.
pandas.core.extension, with sub-modules: block, array, dtype, exposing Block, Array, and Dtype primitives.

Then using these, rewrite CategoricalBlock as a Block subclass, Categorical as an Array subclass, and CategoricalDtype as a Dtype subclass.

That sounds good. Two questions though:

  1. By "expose" do you mean make part of the public API? I'd prefer that ExtensionBlock never be part of the public API, if possible.
  2. Sharing code between internals.Block and extensions.block.ExtensionBlock. Specifically, should ExtensionBlock subclass Block? Common base class? I'll see what things look like.

@jorisvandenbossche
Member

IMO the goal of an ExtensionArray should be to not have to expose ExtensionBlock (we will have to see if it is fully possible, but at least it should be the goal I think).
BTW, I think it would make more sense to keep ExtensionBlock inside internals.

@TomAugspurger
Contributor Author

BTW, I think it would make more sense to keep ExtensionBlock inside internals.

Because of circular imports, or just for organization? That file is quite long.

@jorisvandenbossche
Member

I meant for file organization, that makes more sense IMO. If the file is too long, we can always make internals a directory. But if we are going to move datetimetz and categorical to use arrays, it should get a bit shorter ..

I would rather re-use the existing core.dtypes and core.internals and create a new core.arrays, than putting all of that under core.extension.dtypes/array/block. If we use it internally, it is strange to put it all under core.extension. We can still have a core.extension (or maybe directly in api.extension) that exposes the needed abstract classes for external usage.

@jorisvandenbossche
Member

Then of course, you would ideally just create a NumpyArray (that inherits Array) as well for internal conformity.

The only problem with this is that those would need to be 2D, and I think it would be nice to limit our array-based types (and extension interface) to 1D.

@jreback
Contributor

jreback commented Jan 15, 2018

so mod of proposal; create a sub-dir:

pandas/core/internals with files
internals.py (existing)
block.py (Block)
array.py (Array)

dtype can be in pandas.core.dtypes (where it is now)

The whole intent is to allow very piecemeal and incremental changes that can be merged. IOW the very first thing that should happen (after structure) is to make Array a super-class of any classes we want to move (e.g. Categorical, Index) - Index for support of Datetime w/tz (and Periods/Intervals). This will eventually form the basis for the extension class (which certainly can be exposed in the public api.extensions at some point)

@TomAugspurger
Contributor Author

Index for support of Datetime w/tz (and Periods/Intervals)

Not sure if you're suggesting this, but I don't think e.g. PeriodIndex should be an instance of ExtensionArray (indexes are immutable for one thing). Rather, we should have a new PeriodArray like https://github.com/pandas-dev/pandas/pull/19174/files#diff-9e2f3dade777933f51723f5b020a1c08, and PeriodIndex will compose with that just like CategoricalIndex composes with Categorical.
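The composition described here could be sketched roughly as follows. PeriodArray below is a toy stand-in (ordinals stored as plain ints) to show the wrapping relationship, not the implementation from the linked diff:

```python
class PeriodArray:
    """Toy mutable array of period ordinals (illustrative only)."""
    def __init__(self, ordinals, freq):
        self.ordinals = list(ordinals)
        self.freq = freq

    def __len__(self):
        return len(self.ordinals)


class PeriodIndex:
    """Immutable index composing with a PeriodArray, the way
    CategoricalIndex composes with Categorical."""
    def __init__(self, ordinals, freq):
        # the index holds the array; it does not subclass it
        self._data = PeriodArray(ordinals, freq)

    @property
    def freq(self):
        # behavior is delegated to the wrapped array
        return self._data.freq

    def __len__(self):
        return len(self._data)


idx = PeriodIndex([0, 1, 2], freq="M")
print(len(idx), idx.freq)  # 3 M
```

Composition keeps the mutable array API and the immutable index API separate, which is the distinction the comment draws.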

@jreback
Contributor

jreback commented Jan 15, 2018

@TomAugspurger actually I am proposing ImmutableArray as well, though that just makes things even more complicated. So let's just stick to an Array which is mutable (make that very clear). But yes, we need to incorporate all of our existing functionality.

@TomAugspurger
Contributor Author

@jreback are you in favor of removing CategoricalBlock and DatetimeTZBlock in favor of a single ExtensionBlock?

The tricky thing with datetimetz is that we don't have a DatetimeTZArray, like we would for a Categorical, Period, or Interval. That means an ExtensionBlock[DatetimeTZ] doesn't have anything to dispatch to; there's no DatetimeTZ.isna() to call, since the values are just a regular ndarray.
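A toy sketch of the dispatch problem: when the wrapped values expose an isna, the block can delegate, but a plain datetime64 ndarray has nothing to delegate to and the block must fall back to its own logic. The class and the fallback here are illustrative, not pandas code:

```python
import numpy as np


class ExtensionBlock:
    """Illustrative sketch: dispatch to the array's isna when one exists."""
    def __init__(self, values):
        self.values = values

    def isna(self):
        if hasattr(self.values, "isna"):
            # Categorical / PeriodArray / IntervalArray answer for themselves
            return self.values.isna()
        # The datetime-tz case: values are just a regular datetime64 ndarray,
        # so there is no array method to dispatch to and the block has to do
        # the work itself.
        return np.isnat(self.values)


tz_values = np.array(["2018-01-01", "NaT"], dtype="datetime64[ns]")
print(ExtensionBlock(tz_values).isna())  # [False  True]
```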

@jreback
Contributor

jreback commented Jan 15, 2018

no we will still have all of the existing blocks

BUT they will have common functions in a Block; the data is held in an Array (which wraps things like DTI and Categorical, or maybe you can simply make Array a base class as I said before)

CategoricalBlock and such are just implementations of a generic Block (what you are calling an ExtensionBlock)

@jbrockmendel
Member

jbrockmendel commented Jan 15, 2018

Not sure if you're suggesting this, but I don't think e.g. PeriodIndex should be an instance of ExtensionArray (indexes are immutable for one thing).

This, along with the mention a while back of PeriodArray subclassing DatetimelikeIndexOpsMixin, leads into the discussion of what exactly needs to be defined by an Array class. I've been making an effort to refactor some bits of the Index subclasses and Block subclasses to inherit from Array-like mixin classes. What I've found is:

  • values, _na_value, fill_value, _box_func are shared by both the Blocks and the Indexes, so can go into the Array class unchanged.

  • The Datetimelike arithmetic/comparison ops use some helper methods _nat_new and _shallow_copy that the Blocks don't. Unifying that internal API would be a big help.

  • The other properties that are non-trivial to share are ones that Indexes cache because they are immutable. @jreback's discussion of ImmutableArray might solve that problem.
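As an illustration, the shared attributes listed above could live in a common mixin along these lines. This is a hypothetical sketch of the idea, not the refactor from the branch:

```python
class DatetimeLikeMixin:
    """Hypothetical mixin for the attributes shared by both the Blocks
    and the Indexes; names follow the comment, bodies are illustrative."""
    _na_value = None  # e.g. NaT for datetimelike subclasses

    @property
    def values(self):
        return self._data

    @property
    def fill_value(self):
        return self._na_value

    def _box_func(self, x):
        # subclasses override this to box raw values (e.g. ints -> Period)
        return x


class ToyIndex(DatetimeLikeMixin):
    def __init__(self, data):
        self._data = data


class ToyBlock(DatetimeLikeMixin):
    def __init__(self, data):
        self._data = data


idx = ToyIndex([1, 2])
print(idx.values, idx.fill_value)  # [1, 2] None
```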


Other Notes:

  • TimedeltaBlock.to_native_types has a FIXME note saying to use Timedelta64Formatter, which TimedeltaIndex already does. to_native_types might belong in TimedeltaArray. (really I guess, any of the Block methods that don't care about position or index?)

  • This discussion is sprawling. Is there a minimal subset of the PR that will implement the feature you most care about?

  • Update: +1 to the idea of making internals a directory.

def equals(self, other):
    if not isinstance(other, type(self)):
        return False
    return (self.freq == other.freq and

self.freq == other.freq may raise, in which case you generally want to return False here.
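A self-contained sketch of the suggested fix, using a toy Freq whose comparison raises on incompatible operands. All names here are illustrative stand-ins:

```python
class Freq:
    """Toy freq object whose comparison can raise (illustrative)."""
    def __init__(self, code):
        self.code = code

    def __eq__(self, other):
        if not isinstance(other, Freq):
            raise TypeError("incompatible frequency")
        return self.code == other.code


class PeriodLike:
    def __init__(self, freq, data):
        self.freq = freq
        self._data = data

    def equals(self, other):
        if not isinstance(other, PeriodLike):
            return False
        try:
            same_freq = self.freq == other.freq
        except TypeError:
            # per the review comment: a raising comparison means "not equal"
            return False
        return same_freq and self._data == other._data


a = PeriodLike(Freq("M"), [1, 2])
b = PeriodLike("not-a-freq", [1, 2])
print(a.equals(b))                           # False (comparison raised)
print(a.equals(PeriodLike(Freq("M"), [1, 2])))  # True
```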

    return result

def __iter__(self):
    return iter(self._data)

Isn't this iter(self._box_func(x) for x in self._data)?
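A toy sketch of the boxing iterator the comment suggests; the _box_func here is a stand-in that merely tags values, not a real boxing function:

```python
class BoxedArray:
    """Illustrative: __iter__ yields boxed scalars, not raw ordinals."""
    def __init__(self, data):
        self._data = data

    def _box_func(self, x):
        # stand-in for boxing an ordinal into e.g. a Period scalar
        return ("boxed", x)

    def __iter__(self):
        # the raw version would be: iter(self._data)
        return (self._box_func(x) for x in self._data)


print(list(BoxedArray([1, 2])))  # [('boxed', 1), ('boxed', 2)]
```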

@TomAugspurger
Contributor Author

@jbrockmendel agreed in general.

This discussion is sprawling. Is there a minimal subset of the PR that will implement the feature you most care about?

No kidding :) Let's enumerate some TODOs.

  1. I want some interface such that pandas will "do the right thing" when it sees an array-like thing satisfying that interface, even if that array-like thing is defined outside of pandas.
    a.) Define that interface (just the ABCs)
    b.) Modify pandas' internals to honor that interface (adding ExtensionBlock, isna, Series/Frame ctors, groupby, etc.)
  2. An internal refactoring of Categorical to use that array interface
  3. An internal refactoring / enhancement for PeriodArray
  4. An internal refactoring / enhancement for IntervalArray
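For item 1a, a minimal sketch of what such an ABC might look like. The method set here is an assumption for illustration (not the final interface), and IPArray is a toy third-party array loosely inspired by the pandas-ip example:

```python
from abc import ABC, abstractmethod


class ExtensionArray(ABC):
    """Illustrative minimal ABC: just enough surface for pandas
    internals to have something to dispatch to."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def isna(self):
        """Boolean mask of missing values, so internals like isna and
        the Series constructor can dispatch to the array."""
        ...


class IPArray(ExtensionArray):
    """Toy third-party array satisfying the interface."""
    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def isna(self):
        return [x is None for x in self._data]


arr = IPArray([1, None, 3])
print(len(arr), arr.isna())  # 3 [False, True, False]
```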

I think that 1 would be a relatively small, clean PR, but it wouldn't have any cleanup benefit.

Anyway, I'm partway through implementing 1 and 2 as a first PR. The sticking point is the inheritance hierarchy of blocks. Currently I have

- Block
| - ExtensionBlock
| | - CategoricalBlock
| | - DatetimeTZBlock

but there are a few issues to work out

@TomAugspurger
Contributor Author

Split the first section off to #19268

@jbrockmendel
Member

@TomAugspurger for the discussion of refactoring the arithmetic ops out of DatetimeIndexOpsMixin for PeriodArray, pls take a look at this branch and let me know if it resembles what you have in mind.

@TomAugspurger TomAugspurger changed the title [WIP]ENH: Extending Pandas with custom types ENH: Extending Pandas with custom types Jan 30, 2018
@TomAugspurger
Contributor Author

Closing in favor of #19268

@TomAugspurger TomAugspurger deleted the pandas-array-upstream branch May 2, 2018 13:09
Labels
API Design Internals Related to non-user accessible pandas implementation
6 participants