
ENH: Extending Pandas with custom types #19174

Closed

Conversation

TomAugspurger
Contributor

Adds an interface for 3rd party libraries to define custom array types. When
pandas encounters these objects, we'll resist the temptation to cast them to
object dtype.

This isn't near being ready yet, but I wanted to make a PR to get some thoughts
on the general approach / scope. I wouldn't recommend reviewing the code in
detail yet.
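For readers skimming the diff, the overall shape being proposed is roughly a pair of base classes: a dtype object describing the logical type, and an array object holding the values. A minimal, hypothetical sketch (names and required methods are illustrative only, not the final interface):

```python
import abc

import numpy as np


class ExtensionDtype(metaclass=abc.ABCMeta):
    """The dtype half of the interface: describes the logical type."""
    name = None  # a string identifying the dtype, e.g. 'ip'
    kind = 'O'   # NumPy-style kind code


class ExtensionArray(metaclass=abc.ABCMeta):
    """The array half: roughly the minimum pandas would need to call."""

    @property
    @abc.abstractmethod
    def dtype(self):
        """An ExtensionDtype instance describing the values."""

    @abc.abstractmethod
    def __len__(self):
        pass

    @abc.abstractmethod
    def isna(self):
        """Boolean ndarray marking missing entries."""


class ToyDtype(ExtensionDtype):
    name = 'toy'


class ToyArray(ExtensionArray):
    """A trivial implementation backed by a float ndarray."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    @property
    def dtype(self):
        return ToyDtype()

    def __len__(self):
        return len(self._values)

    def isna(self):
        return np.isnan(self._values)
```

A third-party library would subclass both, and when pandas encounters such an array it would store it as-is rather than casting to object dtype.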

Some questions:

  1. Do we want to do this?

    The big concern being the pandas2 transition. I've tried to hide all
    block-related changes in an adapter class. I don't think that allowing
    extension types will hinder the transition in a major way.

  2. Should our current extension types implement the interface?

    Categorical essentially satisfies it already.

    Interval basically does, but there isn't an IntervalArray class really,
    just an IntervalIndex.

    Period is the same as Interval.

    SparseArray is a bit weird since it's an actual subclass of ndarray.
    Needs some more thought.

    datetime-with-tz is also weird (ndarray, just with a custom type).

  3. What to do with indexes?

    Having an ExtensionIndex will be useful I think (value_counts, groupby,
    etc.). But I haven't tackled it yet. Haven't really thought about indexing at
    all, since it sounds hard.

  4. How much of the ndarray implementation should we push onto the pandas array
    interface?

Probably some others, but that should kick things off. For those interested in
following along, I'm using this over in
https://github.com/continuumio/pandas-ip. It's nice to have a concrete problem
to solve.

Closes #18767

Define an interface for custom types.
@TomAugspurger TomAugspurger added API Design Internals Related to non-user accessible pandas implementation labels Jan 10, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Jan 10, 2018
@pep8speaks

pep8speaks commented Jan 10, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 12, 2018 at 20:10 UTC

@@ -409,6 +410,11 @@ def dtype(self):
"""The :class:`~pandas.api.types.CategoricalDtype` for this instance"""
return self._dtype

@property
def _block_type(self):
Contributor Author

I'm still thinking about how best to handle this. The conflict is

  1. We don't want 3rd party libs to worry about Blocks at all
  2. We (maybe) want Categorical, etc. to be instances of ExtensionArray
  3. We do want to use CategoricalBlock instead of the default ExtensionBlock

Member

Why do we want to use CategoricalBlock instead of ExtensionBlock with categorical dtype? Because that would require more changes in the code?
As in theory /ideally, the categorical custom operations should be all defined in the array (Categorical) and the Block should not have to worry about the type?

Contributor Author

Yeah, I'm looking through at the methods CategoricalBlock implements, that aren't part of the interface

  1. fillna: should probably be added
  2. interpolate: Actual interpolation probably isn't important enough, but I think this is called indirectly for fillna?
  3. shift
  4. to_native_types

Trying to balance complexity of implementation for 3rd-parties here.

Member

@shoyer shoyer left a comment

In general I think this is great. It might be worth trying to simplify the interface here a little bit, e.g., by insisting that it's only for 1D data.

My current understanding from @wesm (which I agree with) is that the plan is for pandas2 to be a new namespace and exist alongside pandas. So this shouldn't be a problem from a forwards-compatibility perspective.

base = np.dtype([('hi', '>u8'), ('lo', '>u8')])
type = IPTypeType
kind = 'O'
fill_value = np.array([(0, 0)], dtype=base)
Member

Clarify that these fields can be properties, e.g., based on parameters set in the constructor.
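For example (hypothetical, loosely following the IP-address dtype sketched in the hunk above), the fields could be computed from constructor parameters instead of being fixed class attributes:

```python
import numpy as np


class IPDtype(object):
    """Hypothetical dtype where the interface fields are properties
    derived from constructor parameters, not fixed class attributes."""

    def __init__(self, byteorder='>'):
        self._byteorder = byteorder

    @property
    def base(self):
        # storage layout: high and low 64 bits of a 128-bit address
        return np.dtype([('hi', self._byteorder + 'u8'),
                         ('lo', self._byteorder + 'u8')])

    @property
    def name(self):
        return 'ip'

    @property
    def kind(self):
        return 'O'

    @property
    def fill_value(self):
        return np.array([(0, 0)], dtype=self.base)
```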

"""
@property
@abc.abstractmethod
def type(self) -> type:
Member

These should probably be comment type annotations instead :)

pass

@abc.abstractmethod
def to_dense(self):
Member

I don't understand what this does/indicates.

# ------------------------------------------------------------------------
@abc.abstractmethod
def take(self, indexer, allow_fill=True, fill_value=None):
"""For slicing"""
Member

It would be good to clarify if this is the NumPy version of take, or the pandas internal version that uses -1 to indicate fill values.

If it's the latter (as I suspect is the case), it could be valuable to expose some of our internal utility functions for this rather than force users to re-implement it. It is non-trivial to get right in a performant way.
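For reference, the pandas-internal convention can be sketched like this (a simple, not performance-tuned stand-in for the internal utilities mentioned; name and signature are illustrative):

```python
import numpy as np


def take_with_fill(values, indexer, fill_value=np.nan):
    """Pandas-internal 'take' semantics: -1 in the indexer marks a
    missing slot to be filled with fill_value, unlike np.take where
    -1 wraps around to the last element."""
    indexer = np.asarray(indexer, dtype='int64')
    # take first (a -1 temporarily grabs the last element), then
    # overwrite the marked slots with the fill value
    result = np.asarray(values, dtype='float64').take(indexer)
    result[indexer == -1] = fill_value
    return result
```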


@property
@abc.abstractmethod
def shape(self) -> T.Tuple[int, ...]:
Member

Perhaps all ExtensionArray objects should be required to be 1D?

That would be quite a bit simpler. Full support for ND data would probably require quite a few other methods (e.g., multidimensional indexing, transpose, multi-dimensional concatenation).

Contributor Author

That's reasonable for now I think.

return issubclass(dtype, cls)


class ExtensionArray(metaclass=abc.ABCMeta):
Member

Suppose you want to implement arithmetic or aggregations for your custom dtype (e.g., consider a units dtype). In principle almost any function that does computation might need to be overloaded in a dtype specific way. (This is what NumPy does for ufuncs.)

Would this be done via defining special methods on ExtensionArray? Or do we simply say that this isn't possible?

Contributor Author

At the moment I've ignored arithmetic methods and reductions. I'm OK with saying it isn't possible for now, but am open to allowing it in the future.

Member

I've been working on making Series/Index/DataFrame arithmetic methods behave predictably by sharing implementations (in particular datetime and timedelta types). Eventually the implementations need to get out of core.ops and into some type of mixin classes that get mixed into both Index subclasses and Block subclasses (e.g. TimestampVector --> DatetimeIndex, DatetimeBlock). The base classes for that look a lot like ExtensionArray here; it's probably worth figuring out how these relate sooner rather than later.

Contributor Author

@TomAugspurger TomAugspurger Jan 12, 2018

Agreed. If you look at core/period.py I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work, which doesn't feel great.

Member

I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work

But does a Series with periods also rely on those methods? (which for real extensions would be important; and if not, I am not sure we should add it to PeriodArray et al)

Member

I had PeriodArray inherit from DatetimeIndexOpsMixin to get arithmetic ops to work, which doesn't feel great.

Which methods did you need to get from DatetimeIndexOpsMixin? There's a lot of Index-specific stuff in that class that isn't needed for a vector base class.

But does a Series with periods also rely on those methods? (which for real extensions would be important; and if not, I am not sure we should add it to PeriodArray et al)

@jorisvandenbossche not at the moment, no. But the idea as I understand it is that PeriodArray would get mixed into something like PeriodBlock, then Series[Period] would dispatch certain operations to self._data.blocks[0].__add__ or whatever.


@classmethod
def concat_same_type(cls, to_concat):
return cls(np.concatenate([arr.data.to_pandas() for arr in to_concat]))
Member

You probably don't want to actually use np.concatenate() here, since converting to NumPy arrays can be slow and/or lossy.
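A sketch of the alternative: concatenate the backing storage directly, so no per-element boxing or lossy conversion happens. RecordArray and its layout here are made up for illustration:

```python
import numpy as np


class RecordArray(object):
    """Toy extension array backed by a structured ndarray. Concatenation
    works on the backing store directly, avoiding a slow and/or lossy
    round-trip through boxed object arrays."""

    base = np.dtype([('hi', 'u8'), ('lo', 'u8')])

    def __init__(self, data):
        self.data = np.asarray(data, dtype=self.base)

    def __len__(self):
        return len(self.data)

    @classmethod
    def concat_same_type(cls, to_concat):
        # all inputs share cls.base, so this is exact and cheap
        return cls(np.concatenate([arr.data for arr in to_concat]))
```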

@property
@abc.abstractmethod
def base(self) -> np.dtype:
# TODO
Member

It is not clear to me how this makes sense if the data is not backed by a NumPy array. I would be inclined to remove it from the interface if possible.

Perhaps it could be used by an optional NumpyExtensionArray base-class that pre-implements some methods if you are using a NumPy array as the backing store.

@@ -76,7 +78,169 @@
from pandas.compat import range, map, zip, u


class Block(PandasObject):
class BlockOpsMixin(object):
Member

It looks like there is a core.internal_ops that is empty. Was some of this intended to go in there? I very much like the idea of having blocks ("vectors"?) implement their arithmetic methods, so Series/DataFrame can just pass through to those implementations. shoyer opened an issue to that effect IIRC.

Contributor Author

Ah, these were temporarily in internal_ops and shared between a couple files, but then I made a refactor to remove internal_ops.

Contributor

@jreback jreback left a comment

this needs to be done in a different way.

  1. add a reasonable interface to the ExtensionDtype/Array
  2. move pandas extensions to subclass this one by one
  3. then actually change core code

it's a bit counterintuitive to do it this way, but you end up changing so much code that needs review that I don't think it is avoidable.

There are many many things done to 'just make things work'. So these need to be cleaned up to be more general before you can even attempt this change.

# handle Categorical and sparse,
if (is_extension_type(values) and not
is_datetime64tz_dtype(values)):
# Need the not is_datetime64tz_dtype since it's actually
Contributor

huh? this is not friendly

@@ -891,7 +891,8 @@ def _map_values(self, mapper, na_action=None):
return new_values

# we must convert to python types
if is_extension_type(self.dtype):
Contributor

it needs to be iterable

# XXX: we have many places where we call this with a `.dtype`,
# instead of a type. Think about supporting that too...
from pandas.core.extensions import ExtensionArray, ExtensionDtype
return (isinstance(arr, ExtensionArray) or
Contributor

this needs just to satisfy is_array_like (which is an iterable with a dtype)
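A minimal sketch of that criterion (illustrative only, not the actual pandas implementation of is_array_like):

```python
import numpy as np


def is_array_like(obj):
    """Sketch of the criterion named above: an iterable that also
    carries a dtype attribute."""
    return hasattr(obj, '__iter__') and hasattr(obj, 'dtype')
```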

@jreback jreback removed this from the 0.23.0 milestone Jan 11, 2018
@TomAugspurger
Contributor Author

move pandas extensions to subclass this one by one

Yeah, I wanted to make sure we were comfortable with that. So we'll have a PeriodArray / PeriodIndex in the same way we have a Categorical / CategoricalIndex. Likewise with Interval.

@TomAugspurger
Contributor Author

  1. Removed ExtensionDtype.base from the interface.
  2. Removed ExtensionArray.to_dense from the interface.

Refactored PeriodIndex to be backed by PeriodArray, and IntervalIndex to be backed by IntervalArray.

Still a WIP.

@TomAugspurger
Contributor Author

FYI, I haven't thought through whether PeriodArray / IntervalArray should be part of the public API in the same way that Categorical is. On the one hand, you'll get them when you do pd.Series(pd.period_range("2000", periods=10)), so they should probably be documented, etc. On the other hand, it's just one more thing to worry about. Is it useful to create a pandas.PeriodArray(...) on its own?

@@ -128,7 +131,9 @@ def _isna_ndarraylike(obj):
values = getattr(obj, 'values', obj)
dtype = values.dtype

if is_string_dtype(dtype):
if isinstance(values, ExtensionArray):
Member

Any reason not to duck-type this to check for objects that implement isna method? I had a use case a while back where it would have been nice.

Contributor Author

I have a slight preference to using the interface here, instead of duck-typing, just since it's part of the interface (and I think isna is important enough to make it mandatory).

Virtual subclasses can register themselves if they want pd.isna to work, but don't want to inherit from ExtensionArray.
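The virtual-subclass mechanism is standard abc machinery; a toy illustration (class names are hypothetical stand-ins, not pandas classes):

```python
import abc

import numpy as np


class ExtensionArray(metaclass=abc.ABCMeta):
    """Stand-in for the proposed base class."""

    @abc.abstractmethod
    def isna(self):
        pass


class ThirdPartyArray(object):
    """Implements the interface without inheriting from it."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    def isna(self):
        return np.isnan(self._values)


# Registering makes isinstance()/issubclass() checks (as would be
# used by pd.isna) recognize the class, with no inheritance required.
ExtensionArray.register(ThirdPartyArray)
```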

@jorisvandenbossche
Member

Thanks a lot for working on this! This is related to the work I have been doing on allowing custom Block types, sparked by geopandas (but this is more general!), so I very much like to see this area improve. I should also test the implementation with geopandas.

Some general feedback:

I think the alternative we should consider is to better support custom Block types (although this might be harder in practice, as currently in many places arrays are passed around and not blocks), which I started for concat. From the developer point of view, I think this is very similar, as the additional methods on the ExtensionArray you need to implement are basically just wrapped in the ExtensionBlock (I mean, writing an ExtensionArray or an ExtensionBlock will probably be more or less the same code).
I agree that from a user point of view (a user of such an ExtensionArray), the array-way can be nicer. One of the main bottlenecks with the block-way, which you encountered with the ip addresses, was that you then cannot do Series(ipaddresses_array), as this doesn't work for blocks. Also, the actual block has a lot of additional methods, making it less nice as the 'array' object that users are confronted with in the extension package.
However, I think this problem can also be solved in a different way (eg agreeing on Series et al checking for a .dtype that is an ExtensionDtype subclass, or for a _data attribute that is a custom block, ...) when going with the "extension block" way.

When designing a new pandas, I think we want array objects for all dtypes (and then a dataframe is a collection of such arrays with more methods / indexing capabilities), so in that regard this is a nice move. On the other hand, it is maybe too late in the pandas evolution to go this route for current pandas itself as well? We already have blocks, and also adding those arrays might complicate things further? (the goal of having arrays should partly be to not have blocks)

This also relates to the question of to what extent the pandas arrays (PeriodArray, IntervalArray, DatetimeTZArray) should be part of the public interface. In the ideal case, I think the answer would be yes (and eg Series.values (or another more backward-compatible attribute) should return those arrays). But again, it might be too late for pandas and add additional user complexity to change this now.

What to do with indexes?

I think in principle it should be possible to have an Index with the underlying values an ExtensionArray (which is actually a good reason for the array-way, as in such a case we could try to combine things that are now duplicated in eg DatetimeIndex and Series with DatetimeBlock in a single place as the DatetimeArray)

Final note: I haven't really thought through in practice my comments above on using blocks instead, so it is quite possible that your current approach of an extension array is just much better/easier! (also for current pandas, and not only "in the ideal case of designing from scratch")

@jorisvandenbossche
Member

Sorry for my long post above. Trying to summarize my worries more briefly:
I like the idea of Arrays, but I have the feeling that implementing the internal extension types (categorical, datetimetz, etc.) based on Arrays, if we do it properly, is a lot of work with lots of changes, and thus also many opportunities to introduce new bugs and regressions. So is that worth it for current pandas?
An option is to start smaller and focus for now on only getting ExtensionArray working for actual external types (which might also be a more digestible change to review). On the other hand, the internal extensions are good dogfooding to see if the extension interface works.

To add to my comments above, I think we should at some point also solicit some more feedback (on the mailing list) on the extension array interface.

We should also think about the broader implications of the change, eg for downstream users. Eg what happens if statsmodels or scikit-learn encounter dataframes with extension types? (Not our responsibility?)

There are also plans to make the numpy dtypes more extensible, which are more concrete now with the funding, I think. Not saying that we should wait on that (as it can still take a couple of years), but it might be another reason to make our extension arrays not too public (in case we use arrays for our internal extension types).

Member

@jorisvandenbossche jorisvandenbossche left a comment

Some more concrete comments!

'''''''''''''''

This class should describe your data type. The most important fields are
``name`` and ``base``:
Member

you mentioned you removed base from the interface

@property
def type(self):
# type: () -> T.Any
"""Typically a metaclass inheriting from 'type' with no methods."""
Member

This is not a very clear explanation IMO

Contributor Author

Probably since I don't understand what it's used for :) In my next push, this won't be abstract. I'll have a default implementation that seems to work, so subclasses shouldn't have to worry about it.


@abc.abstractmethod
def __len__(self):
pass

Member

Should there be a __setitem__ as well to be able to specify how assigning new data to the series with such an array should update the array?
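A toy sketch of what such a hook could look like (hypothetical, not part of the current diff):

```python
import numpy as np


class ToyArray(object):
    """Toy array sketching the __setitem__ hook asked about above:
    Series assignment could forward to this method so the array
    controls how new data is stored."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype='float64')

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        # the array decides validation/coercion for incoming values
        self._values[key] = value
```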

return None

@abc.abstractmethod
def formatting_values(self):
Member

I have the feeling that those block-related methods like formatting_values, concat_same_type, get_values, slice, .. should be private (so with underscore, but still part of the contract), because they are not that useful for direct usage of an ExtensionArray ?

Member

Further, in geopandas I think I also needed to implement view and ravel on the underlying GeometryArray to get things working

Contributor Author

I've gone back and forth on this. Apparently, including private methods in an interface is bad practice. But we have two sets of "users" here

  1. library authors implementing the interface. To them, formatting_values is not private
  2. End-users of those library authors. To them, formatting_values is private.

I'd be curious to hear others' thoughts here.


def slice(self, slicer):
# TODO: is this right?
# In general, no. Probably just remove it?
Member

Currently in geopandas, I needed a custom slicer on the Block level (so I think we need at least something here or on the block level)

Contributor Author

Sorry, by "remove" I meant remove the default implementation and make it abstract.


if fill_tuple is None:
fill_value = self.fill_value
new_values = algos.take_nd(values, indexer, axis=axis,
Member

Shouldn't this use the ExtensionArray's take method? (so values.take)

return "{}({})".format(self.__class__.__name__, self._format_data())

@property
def values(self):
Member

Should this be called values on the Array?
I think there are in general (although already not fully applicable in the Interval case, but it is for periods) two possible arrays: the actual storage array (eg the integers in case of Periods, the record array in case of ipaddress) or a numpy array with boxed objects.
Both should be accessible I think (although the first is maybe a specific Array class implementation detail and should only be used inside the class itself?), but I'm not sure the second should be .values.
The question is also what __array__ should be if we add it?



class PeriodArray(DatetimeIndexOpsMixin, ExtensionArray):
dtype = PeriodDtype()
Member

shouldn't this dtype hold the freq when it is an actual instance?

@jbrockmendel
Member

I think the alternative we should consider is to better support custom Block types

@jorisvandenbossche You might want to weigh in on the block-registry discussion in #19189.

@TomAugspurger
Contributor Author

Thanks Joris, some replies below. I'm still pretty convinced that a public ExtensionArray is better than a public ExtensionBlock. LMK if anything below is / isn't convincing.

I think the alternative we should consider is to better support custom Block types

Initially I had a custom block as part of the interface, but removed it since I
don't think 3rd-party libraries should have to worry about blocks. I agree that
it'll be about the same work for 3rd parties either way. And I think the
internal changes will look about the same too? Either way, when a user does
Series[CustomType].isna() we'll do either

  1. ExtensionBlock.isna(), which calls their CustomArray.isna(), or
  2. CustomBlock.isna()

My main motivation for array over block is not having to expose as much of
pandas' internals to the 3rd party dtypes. If we have an ExtensionBlock as
part of the interface, then we'll have to provide some class for a library
author to subclass. Presumably we'll want to share some code between that public
ExtensionBlock class and our own block classes, which limits the internal
refactors we can do without breaking API.

We already have blocks, and also adding those arrays might complicate things
further? (the goal of having arrays should partly be to not have blocks)

I think this argues for my approach :) If we have 3rd parties writing custom
blocks, then we end up with one new block per extension type. With my current
approach we only have a single new block, ExtensionBlock that can hold any
dtyped array.

I like the idea of Arrays, but I have the feeling that implementing the
internal extension types (categorical, datetimetz, etc) based on Arrays, if we
do it properly, is a lot of work with lots of changes.

FWIW, we don't have to refactor the current code to use extension arrays. We
can declare that they're specifically for 3rd-parties. But I do think it's a
worthwhile refactor since

  1. It's a nice enhancement: we can stick periods / intervals into a Series
    without casting to object.
  2. It can be a cleanup, since we'll be able to remove some if is_categorical(); elif is_interval() ... blocks with if is_extension()
  3. I'm mostly done with the refactor :)

Either way, I'll continue on with this since Categorical / Period /
Interval are good tests for what the interface needs to include.

There are also plans to make the numpy dtypes more extensible

Yes, I'll reach out to the NumPy devs when this is a bit more stable.

Review-wise, it may make sense to split this into multiple PRs:

@jorisvandenbossche
Member

My main motivation for array over block is not having to expose as much of
pandas' internals to the 3rd party dtypes. If we have an ExtensionBlock as
part of the interface, then we'll have to provide some class for a library
author to subclass.

Yes, this is probably a good reason. My idea at first was: this doesn't really matter, as it is only for the developers of third-party libraries and they can handle some complexity (if it was documented how to do it, the explanation for CustomArray or CustomBlock would not be that different or much more complex in either case), and because I thought of our Blocks as a bit 'set in stone' for the current version. But I agree (after working a while with it) that keeping open the possibility to do some refactorings / clean-up in this area is still potentially valuable (eg the non-consolidating discussion).

It's a nice enhancement: we can stick periods / intervals into a Series
without casting to object

Yes, that would be nice indeed (although it's not necessarily an argument, as a similar but different effort could have created blocks for them; but it would be a very nice consequence of the effort). But just to be clear, this is not yet the case in the currently pushed PR?

Either way, I'll continue on with this since Categorical / Period /
Interval are good tests for what the interface needs to include.

Yep, I know, that's certainly a good reason to do that :-) The question however still is to what extent they would be exposed publicly (but that can be a separate discussion later on)

Review-wise, it may make sense to split this into multiple PRs:

From a reviewer standpoint, that would certainly be easier. But I am not sure how easy it would be for you. Any idea how you would want to split it?
Jeff proposed a possible stepwise approach above of first doing the extension interface and then only moving our internal ones. But it is maybe also an idea to start with our internal ones: for example just implement and use a PeriodArray (with the interface we expect it to be), that would make more clear what all the consequences internally are (but again, not sure how easy it would be to split this).

@jreback
Contributor

jreback commented Jan 14, 2018

So I think a reasonable architecture would be to form a new set of modules.
pandas.core.extension, with sub-modules: block, array, dtype, exposing Block, Array, and Dtype primitives.

Then using these, re-write CategoricalBlock as a Block subclass, Categorical as an Array subclass, and CategoricalDtype as a Dtype subclass.

Then do the same for DatetimeTZBlock. Then add PeriodBlock and IntervalBlock support.

Then maybe even do this for Sparse.

Then of course, you would ideally just create a NumpyArray (that inherits Array) as well for internal conformity.

By this time I think the api will be very very clear of what needs to be in these super-classes in general.

These can all be done incrementally w/o disrupting things or causing very long and sweeping code reviews (which are bad generally).

This is not actually as much work as I am painting (at least for Categorical and dtz), as it's mostly about figuring out what should go in the super-classes and what should end up in the sub-classes.

pandas has very full featured extension classes, so what better way to design an API than to use an extensive test suite and an already full featured implementation.

Note that, on purpose, these extensions should be 'separate' from the internals; the internals will simply 'use' these blocks generically and conform to their exposed API rather than peer inside, as we do now. I suspect our internals can be quite a lot simpler if things are organized along these lines.

@TomAugspurger
Contributor Author

So I think a reasonable architecture would be to form a new set of modules.
pandas.core.extension, with sub-modules: block, array, dtype, exposing Block, Array, and Dtype primitives.

Then using these, rewrite CategoricalBlock as a Block subclass, Categorical as an Array subclass, and CategoricalDtype as a Dtype subclass.

That sounds good. Two questions though:

  1. By "expose" do you mean make part of the public API? I'd prefer that ExtensionBlock never be part of the public API, if possible.
  2. Sharing code between internals.Block and extensions.block.ExtensionBlock. Specifically, should ExtensionBlock subclass Block? Common base class? I'll see what things look like.

@jorisvandenbossche
Member

IMO the goal of an ExtensionArray should be to not have to expose ExtensionBlock (we will have to see if it is fully possible, but at least it should be the goal I think).
BTW, I think it would make more sense to keep ExtensionBlock inside internals.

@TomAugspurger
Contributor Author

BTW, I think it would make more sense to keep ExtensionBlock inside internals.

Because of circular imports, or just for organization? That file is quite long.

@jorisvandenbossche
Member

I meant for file organization, that makes more sense IMO. If the file is too long, we can always make internals a directory. But if we are going to move datetimetz and categorical to use arrays, it should get a bit shorter ..

I would rather re-use the existing core.dtypes and core.internals and create a new core.arrays, than putting all of that under core.extension.dtypes/array/block. If we use it internally, it is strange to put it all under core.extension. We can still have a core.extension (or maybe directly in api.extension) that exposes the needed abstract classes for external usage.

@jorisvandenbossche
Member

Then of course, you would ideally just create a NumpyArray (that inherits Array) as well for internal conformity.

The only problem with this is that those would need to be 2D, and I think it would be nice to limit our array-based types (and extension interface) to 1D.

@jreback
Contributor

jreback commented Jan 15, 2018

so mod of proposal; create a sub-dir:

pandas/core/internals with files
internals.py (existing)
block.py (Block)
array.py (Array)

dtype can be in pandas.core.dtypes (where it is now)

The whole intent is to allow very piecemeal and incremental changes that can be merged. IOW the very first thing that should happen (after structure) is to make Array a super-class of any classes we want to move (e.g. Categorical, Index) - Index for support of Datetime w/tz (and Periods/Intervals). This will eventually form the basis for the extension class (which certainly can be exposed in the public api.extensions at some point)

@TomAugspurger
Contributor Author

Index for support of Datetime w/tz (and Periods/Intervals)

Not sure if you're suggesting this, but I don't think e.g. PeriodIndex should be an instance of ExtensionArray (indexes are immutable for one thing). Rather, we should have a new PeriodArray like https://github.com/pandas-dev/pandas/pull/19174/files#diff-9e2f3dade777933f51723f5b020a1c08, and PeriodIndex will compose with that just like CategoricalIndex composes with Categorical.
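The composition described here could be sketched roughly as follows. PeriodArray below is a toy stand-in (ordinals stored as plain ints) to show the wrapping relationship, not the implementation from the linked diff:

```python
class PeriodArray:
    """Toy mutable array of period ordinals (illustrative only)."""
    def __init__(self, ordinals, freq):
        self.ordinals = list(ordinals)
        self.freq = freq

    def __len__(self):
        return len(self.ordinals)


class PeriodIndex:
    """Immutable index composing with a PeriodArray, the way
    CategoricalIndex composes with Categorical."""
    def __init__(self, ordinals, freq):
        # the index holds the array; it does not subclass it
        self._data = PeriodArray(ordinals, freq)

    @property
    def freq(self):
        # behavior is delegated to the wrapped array
        return self._data.freq

    def __len__(self):
        return len(self._data)


idx = PeriodIndex([0, 1, 2], freq="M")
print(len(idx), idx.freq)  # 3 M
```

Composition keeps the mutable array API and the immutable index API separate, which is the distinction the comment draws.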

@jreback
Contributor

jreback commented Jan 15, 2018

@TomAugspurger actually I am proposing ImmutableArray as well, though that just makes things even more complicated. So let's just stick to an Array which is mutable (make that very clear). But yes, we need to incorporate all of our existing functionality.

@TomAugspurger
Contributor Author

@jreback are you in favor of removing CategoricalBlock and DatetimeTZBlock in favor of a single ExtensionBlock?

The tricky thing with datetimetz is that we don't have a DatetimeTZArray, like we would for a Categorical, Period, or Interval. That means an ExtensionBlock[DatetimeTZ] doesn't have anything to dispatch to; there's no DatetimeTZ.isna() to call, since the values are just a regular ndarray.
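A toy sketch of the dispatch problem: when the wrapped values expose an isna, the block can delegate, but a plain datetime64 ndarray has nothing to delegate to and the block must fall back to its own logic. The class and the fallback here are illustrative, not pandas code:

```python
import numpy as np


class ExtensionBlock:
    """Illustrative sketch: dispatch to the array's isna when one exists."""
    def __init__(self, values):
        self.values = values

    def isna(self):
        if hasattr(self.values, "isna"):
            # Categorical / PeriodArray / IntervalArray answer for themselves
            return self.values.isna()
        # The datetime-tz case: values are just a regular datetime64 ndarray,
        # so there is no array method to dispatch to and the block has to do
        # the work itself.
        return np.isnat(self.values)


tz_values = np.array(["2018-01-01", "NaT"], dtype="datetime64[ns]")
print(ExtensionBlock(tz_values).isna())  # [False  True]
```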

@jreback
Contributor

jreback commented Jan 15, 2018

no we will still have all of the existing blocks

BUT they will have common functions in a Block; the data is held in an Array (which wraps things like DTI and Categorical, or maybe you can simply make Array a base class as I said before)

CategoricalBlock and such are just implementations of a generic Block (what you are calling an ExtensionBlock)

@jbrockmendel
Member

jbrockmendel commented Jan 15, 2018

Not sure if you're suggesting this, but I don't think e.g. PeriodIndex should be an instance of ExtensionArray (indexes are immutable for one thing).

This, along with the mention a while back of PeriodArray subclassing DatetimelikeIndexOpsMixin, leads into the discussion of what exactly needs to be defined by an Array class. I've been making an effort to refactor some bits of the Index subclasses and Block subclasses to inherit from Array-like mixin classes. What I've found is:

  • values, _na_value, fill_value, _box_func are shared by both the Blocks and the Indexes, so can go into the Array class unchanged.

  • The Datetimelike arithmetic/comparison ops use some helper methods _nat_new and _shallow_copy that the Blocks don't. Unifying that internal API would be a big help.

  • The other properties that are non-trivial to share are ones that Indexes cache because they are immutable. @jreback's discussion of ImmutableArray might solve that problem.
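As an illustration, the shared attributes listed above could live in a common mixin along these lines. This is a hypothetical sketch of the idea, not the refactor from the branch:

```python
class DatetimeLikeMixin:
    """Hypothetical mixin for the attributes shared by both the Blocks
    and the Indexes; names follow the comment, bodies are illustrative."""
    _na_value = None  # e.g. NaT for datetimelike subclasses

    @property
    def values(self):
        return self._data

    @property
    def fill_value(self):
        return self._na_value

    def _box_func(self, x):
        # subclasses override this to box raw values (e.g. ints -> Period)
        return x


class ToyIndex(DatetimeLikeMixin):
    def __init__(self, data):
        self._data = data


class ToyBlock(DatetimeLikeMixin):
    def __init__(self, data):
        self._data = data


idx = ToyIndex([1, 2])
print(idx.values, idx.fill_value)  # [1, 2] None
```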


Other Notes:

  • TimedeltaBlock.to_native_types has a FIXME note saying to use Timedelta64Formatter, which TimedeltaIndex already does. to_native_types might belong in TimedeltaArray. (really I guess, any of the Block methods that don't care about position or index?)

  • This discussion is sprawling. Is there a minimal subset of the PR that will implement the feature you most care about?

  • Update: +1 to the idea of making internals a directory.

def equals(self, other):
    if not isinstance(other, type(self)):
        return False
    return (self.freq == other.freq and

self.freq == other.freq may raise, in which case you generally want to return False here.
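A self-contained sketch of the suggested fix, using a toy Freq whose comparison raises on incompatible operands. All names here are illustrative stand-ins:

```python
class Freq:
    """Toy freq object whose comparison can raise (illustrative)."""
    def __init__(self, code):
        self.code = code

    def __eq__(self, other):
        if not isinstance(other, Freq):
            raise TypeError("incompatible frequency")
        return self.code == other.code


class PeriodLike:
    def __init__(self, freq, data):
        self.freq = freq
        self._data = data

    def equals(self, other):
        if not isinstance(other, PeriodLike):
            return False
        try:
            same_freq = self.freq == other.freq
        except TypeError:
            # per the review comment: a raising comparison means "not equal"
            return False
        return same_freq and self._data == other._data


a = PeriodLike(Freq("M"), [1, 2])
b = PeriodLike("not-a-freq", [1, 2])
print(a.equals(b))                           # False (comparison raised)
print(a.equals(PeriodLike(Freq("M"), [1, 2])))  # True
```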

    return result

def __iter__(self):
    return iter(self._data)

Isn't this iter(self._box_func(x) for x in self._data)?
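A toy sketch of the boxing iterator the comment suggests; the _box_func here is a stand-in that merely tags values, not a real boxing function:

```python
class BoxedArray:
    """Illustrative: __iter__ yields boxed scalars, not raw ordinals."""
    def __init__(self, data):
        self._data = data

    def _box_func(self, x):
        # stand-in for boxing an ordinal into e.g. a Period scalar
        return ("boxed", x)

    def __iter__(self):
        # the raw version would be: iter(self._data)
        return (self._box_func(x) for x in self._data)


print(list(BoxedArray([1, 2])))  # [('boxed', 1), ('boxed', 2)]
```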

@TomAugspurger
Contributor Author

@jbrockmendel agreed in general.

This discussion is sprawling. Is there a minimal subset of the PR that will implement the feature you most care about?

No kidding :) Let's enumerate some TODOs.

  1. I want some interface such that pandas will "do the right thing" when it sees an array-like thing satisfying that interface, even if that array-like thing is defined outside of pandas.
    a.) Define that interface (just the ABCs)
    b.) Modify pandas' internals to honor that interface (adding ExtensionBlock, isna, Series/Frame ctors, groupby, etc.)
  2. An internal refactoring of Categorical to use that array interface
  3. An internal refactoring / enhancement for PeriodArray
  4. An internal refactoring / enhancement for IntervalArray
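For item 1a, a minimal sketch of what such an ABC might look like. The method set here is an assumption for illustration (not the final interface), and IPArray is a toy third-party array loosely inspired by the pandas-ip example:

```python
from abc import ABC, abstractmethod


class ExtensionArray(ABC):
    """Illustrative minimal ABC: just enough surface for pandas
    internals to have something to dispatch to."""

    @abstractmethod
    def __len__(self):
        ...

    @abstractmethod
    def isna(self):
        """Boolean mask of missing values, so internals like isna and
        the Series constructor can dispatch to the array."""
        ...


class IPArray(ExtensionArray):
    """Toy third-party array satisfying the interface."""
    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def isna(self):
        return [x is None for x in self._data]


arr = IPArray([1, None, 3])
print(len(arr), arr.isna())  # 3 [False, True, False]
```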

I think that 1 would be a relatively small, clean PR, but it wouldn't have any cleanup benefit.

Anyway, I'm partway through implementing 1 and 2 as a first PR. The sticking point is the inheritance hierarchy of blocks. Currently I have

- Block
| - ExtensionBlock
| | - CategoricalBlock
| | - DatetimeTZBlock

but there are a few issues to work out

@TomAugspurger
Contributor Author

Split the first section off to #19268

@jbrockmendel
Member

@TomAugspurger for the discussion of refactoring the arithmetic ops out of DatetimeIndexOpsMixin for PeriodArray, pls take a look at this branch and let me know if it resembles what you have in mind.

@TomAugspurger TomAugspurger changed the title [WIP]ENH: Extending Pandas with custom types ENH: Extending Pandas with custom types Jan 30, 2018
@TomAugspurger
Contributor Author

Closing in favor of #19268

@TomAugspurger TomAugspurger deleted the pandas-array-upstream branch May 2, 2018 13:09
Labels
API Design Internals Related to non-user accessible pandas implementation
6 participants