Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion #1

Open · wants to merge 4 commits into base: master

Conversation


@wesm wesm commented Mar 14, 2020

Based on https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267, we are discussing a "protocol" method (potentially called __dataframe__) similar to __array__ for data frame-like data. The consensus so far is that this protocol should not force conversions to a particular data frame memory model like pandas. Instead, it may provide access to its metadata and data in the form desired by the user of the __dataframe__ protocol.

Questions:

  • What arguments, if any, should the abstract to_numpy method (and, respectively, the to_arrow method) accept?
  • Think about an abstract Column API for parametric types like categorical
  • What kinds of multiple-column-selection and row-selection APIs should be added?

The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later.
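
For a rough sketch of the intended usage pattern (the helper function below and its name are hypothetical, not part of the proposal; column_names, column_by_name and to_numpy refer to the draft interface in this PR), a consumer could accept any producer of __dataframe__ without depending on pandas:

def columns_as_numpy(obj):
    # Sketch of a consumer that accepts any __dataframe__ producer and
    # pulls each column into the representation it prefers (NumPy here),
    # without knowing the producer's internal memory layout.
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("expected an object implementing __dataframe__")
    df = obj.__dataframe__()
    return {name: df.column_by_name(name).to_numpy()
            for name in df.column_names}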

@devin-petersohn devin-petersohn left a comment

@wesm I really like this; there are some minor differences between the model here and Modin's model worth discussing. Some are noted inline; others I will add here:

Slicing/Masking/Projecting/Selecting
There is only one method to extract a subset of columns and rows in Modin, called mask. It accepts either labels or positions, depending on the keywords provided. This interface follows the software engineering best practice of keeping external interfaces narrow and implementations deep.

def mask(columns_by_label=None, columns_by_index=None, rows_by_label=None, rows_by_index=None)

I think it's cleaner than treating all four as separate methods, and all calls will be the same.
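
For illustration only (a hedged sketch of how such a single narrow method might be called; the keyword names follow the signature quoted above):

# Select two columns by label and the first 100 rows by position
subset = df.mask(columns_by_label=["a", "b"], rows_by_index=range(100))

# The same shape of call works for purely positional selection
subset = df.mask(columns_by_index=[0, 1], rows_by_index=range(100))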

Types
Modin does not enforce types or have its own system of types and does not place requirements on types. I think this is better.

At a high level, I think the interface should be as narrow as possible.

dataframe.py Outdated
pass


class MutableDataFrame(DataFrame, MutableMapping):


I'm in favor of dropping this completely. Ideally this interface can discourage mutability because in general it's bad practice. It is still possible to use the immutable DataFrame for mutable legacy systems as is, but moving away from this will be better for dataframes as a whole in my opinion.


In Modin, the mutability is implemented at the API layer, but underneath it is completely immutable.

Owner Author

I also don't think that mutability makes sense in an interchange API like this. I put this here to "stir the pot" and elicit reactions from others.


I think when it comes to performance/memory usage, it's important for the users/libs to be able to modify values inplace. But I guess you could argue we could convert everything first to numpy arrays and then deal with that.

Owner Author

The main problem I see with this is that an implementer of the protocol/API may have to deal with converting from foreign input. For example:

df[col] = data

so if data is a NumPy array and df doesn't use NumPy arrays internally, then it's a can of worms to account for all of the different potential kinds of data that the user may pass in


I think another downfall of inplace is that it is often deceptive and leaky. In pandas, the overwhelming majority of the time it is not really inplace, it is copying at some point, so users who think they are being more efficient or using best practices are often not getting the desired effect under the hood. Being functional (df.<op> returns another dataframe) is better because external libraries should ideally not have to think about optimizing calls for performance or memory, but rather trust that those who implemented the library did a good job.


That may be true for pandas, but some simple arithmetic operations in numpy show a different story. For instance, if a and b are random float64 arrays of shape 10000x10000, we have:

In [12]: %timeit np.add(a, b, out=a)                                                                                                                                                                                                   
116 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [13]: %timeit np.add(a, b)                                                                                                                                                                                                          
483 ms ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This may also just indicate that we should preferably be operating on a lower-level layer (like numpy) rather than on the data frame implementations, if we want to do arithmetic.

@devin-petersohn devin-petersohn Mar 17, 2020

There's one more thing to consider I did not mention before: From a library perspective, editing input data inplace is very destructive to the user's environment.

Often there is a disconnect between users and those who maintain or create libraries. You will hear and see a lot of arguments about performance, but in reality users just do not care as much about performance as they do about other things. It is easy for a library maintainer to focus on performance; it is a raw number we can try to optimize. However, the user cares more about their own productivity and about prediction accuracy. A real-world difference of a few seconds or even minutes is noise to a user who is interpreting results.

Also a minor note: your timeit example does not make sense in practice because each iteration of the In[12] trial updates a. The original a values are gone after the first iteration. It's a minor point because it doesn't actually affect performance, but the outputs of In[12] and In[13] are not the same.

Ignoring this correctness issue for a moment, you can improve the performance of the In[13] line by almost a factor of 2 if you do the following (provided you would want to store the output in some variable c):

%%timeit c = np.copy(a)
np.add(a, b, out=c)

This gets to my point about performance. Optimizing for the behaviors of some other library that is out of your control is a dangerous business to be in, especially when you destroy a user's environment to do so. If you are the user destroying your own environment, that's perfectly fine, but the purpose of this interface is for libraries to avoid depending directly on pandas (forcing conversion into pandas). From a uniform interface perspective, we want to discourage in place updates because it is a leaky abstraction (should dataframe consumers really know about or rely on internal state?) and destructive to the user's environment.

Users may think they want this, but the same thing can be achieved functionally by df = df.dont_update_inplace(). The only argument to keep it is a performance argument, and I feel this is a very weak argument. If a dataframe library is not effectively managing memory or has performance issues, that is an issue the dataframe library should fix, not something to work around by manually changing internal state in place.


Performance is something we really care about in sklearn. We may not always be optimal, but we try, and when we do change things in place, it's not the user's data we're changing; it's an intermediate array we have in our methods. The optimizations we do even touch the level of an array being F-contiguous or C-contiguous, depending on the arithmetic we need to apply to the data. So at least for our use case, these details really do matter.

# Classes representing a column in a DataFrame


class Column(ABC):


What is the benefit of having a column-specific interface vs just a DataFrame with one column? In Modin, underneath everything is a DataFrame and Column/Series objects are views into the dataframe implemented at a higher layer. I'm interested to hear the motivation for a column-specific interface.

Owner Author

The semantics that most existing applications / users expect when they do df[col_name] is to get something that semantically references a 1-dimensional array, or something convertible to a 1-dimensional array. This seems to hold not just in Python but in other PLs that have data frames:

  • In R, df$col_name yields a (1-D) R vector
  • In Julia's DataFrames.jl, df.col_name yields an Array of some type

In all the Apache Arrow implementations there's a logical / semantic distinction between a RecordBatch / Table and a field of a RecordBatch / Table (which yield an Array or ChunkedArray, respectively)

Secondly, the semantics and API of conversion methods at the column level vs. at the data frame level may differ. For example, suppose we wanted to select some number of columns from a data frame and then convert them to a dict of NumPy arrays; it might make sense to write

df.select_columns(names).to_numpy_dict()

or similar. I'm interested to see what scikit-learn and matplotlib developers think.
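
As a hedged illustration of that consumer side (the plotting helper is hypothetical; select_columns_by_name and to_dict_of_numpy are names that appear later in this draft, and their exact signatures are still open):

import matplotlib.pyplot as plt

def scatter_from_dataframe(obj, x, y):
    # A NumPy-centric consumer such as matplotlib could convert only the
    # columns it needs, leaving the conversion details to the producer.
    df = obj.__dataframe__()
    data = df.select_columns_by_name([x, y]).to_dict_of_numpy()
    return plt.scatter(data[x], data[y])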

dataframe.py Outdated
return self.column_by_name(key)

@abstractmethod
def column_by_name(self, key: Hashable) -> Column:


Is there any plan for an interface that allows getting multiple columns at a time?

Owner Author

Yeah I think multiple-column-selection and row slicing should also be in the API. I didn't write anything down for those yet to see if there's agreement on the simplest concepts (getting access to the data in a single column)


Should the API define the behaviour when the column name is not unique vs. when the key is not found?

Owner Author

Good question, open to ideas. Though we should probably avoid polymorphic output types like in some pandas APIs.


I agree about avoiding polymorphic output types. In Modin this is solved by not having a Column type, so slicing/indexing always returns another DataFrame.

dataframe.py Outdated
# DataFrame: the main public API


class DataFrame(ABC, Mapping):


Other things I would add to the DataFrame interface:

  • dtypes equivalent: some mapping from column name to type
  • some way to slice/subset the rows. See the overall comment for more on this.
  • to_numpy and to_arrow at the DataFrame level
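
A minimal sketch of what the first bullet (a dtypes equivalent) could look like on the abstract interface; the name column_types is an assumption, and DataType is the metadata class already sketched in this PR:

from abc import ABC, abstractmethod
from typing import Hashable, Mapping

class DataFrame(ABC):
    # ...existing draft methods elided...

    @property
    @abstractmethod
    def column_types(self) -> Mapping[Hashable, "DataType"]:
        """
        Mapping from column name to the column's abstract DataType
        metadata, without exposing the producer's internal representation.
        """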


Is the intent that pandas be able to implement this interface? If so, the .values attribute may cause issues, since it's a property returning an ndarray.

We're set up to deprecate DataFrame.values, with the alternative .to_numpy() which has been available for a few releases now. But we haven't made the deprecation yet since it'll be a bit disruptive.

This type of protocol would I think give enough justification for a gradual deprecation though.

Owner Author
@wesm wesm Mar 17, 2020

@TomAugspurger -- not "implement" in the way you're describing. Instead:

class PandasWrapper(dataframe_protocol.DataFrame):
    pass

# in pandas
class DataFrame:

    def __dataframe__(self):
        return PandasWrapper(self)

dataframe.py Outdated
pass

@abstractmethod
def iter_column_names(self) -> Iterable[Any]:


Why would we need both this and column_names? Can we leave it to the application to materialize and just have column_names return an Iterable?


@wesm
Owner Author

wesm commented Mar 15, 2020

Types: Modin does not enforce types or have its own system of types and does not place requirements on types. I think this is better.

Maybe not surprisingly I think types and schemas are good, and libraries like pandas being a bit "loosey goosey" about type metadata has IMHO caused a lot of problems over the last 10 years.

If the producer of a data frame can expose type metadata without requiring a potentially costly conversion, then it seems reasonable to me to permit this. If you know that a column contains integers you might pass different arguments to to_numpy than if it were strings.

As far as this interface having its "own system of types" -- what would be the alternative, to return the metadata of the internal data representation (e.g. a NumPy dtype)? That seems contrary to the goals of this project, which is to avoid exposing details of the data frame producer to the data frame consumer.

I think the requirements for df[col_name].type can also be relaxed so that a data frame producer can return a "type not indicated" value.

Like above I'm interested to see what consumer projects (that don't want to depend on pandas, say) think about this.

@adrinjalali

Regarding conversion between different data frames, how would libraries convert between the types? It'd be nice to have an API similar to what __array__ does. What I mean is to have this work, with reasonable overhead, minimal memory copy, and assuming they all implement the proposed DataFrame API:

pd.DataFrame(xarray.DataArray(np.ndarray(...)))

Not sure how easy/challenging it is to make it happen though.

@wesm
Owner Author

wesm commented Mar 16, 2020

@adrinjalali that is effectively what we are discussing here (starting from https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267) -- i.e. thinking about what is the API of the object returned by a __dataframe__ method, so you would have

class XArray:
    ...

    def __dataframe__(self):
        # This class implements the API being discussed here
        return XArrayDataFrame(self)

The __array__ method returns a numpy.ndarray, but we have to determine what kind of object __dataframe__ returns and what behavior that object has

EDIT: I just added some comments about this to the PR summary for clarity

@TomAugspurger

With respect to an __array__ analog for dataframes, doesn't that necessitate more dedicated methods like a __pandas_dataframe__, __modin_dataframe__, etc.? The intent of these methods is to give objects control over producing a concrete ndarray. So the concrete dataframe (say pandas.DataFrame's constructor) would need to check the object for a __pandas_dataframe__ implementation and hand off control of the construction to the object with the __pandas_dataframe__ method.

I'd be curious to know if there's value in a more generic __dataframe__ method, and if so what it would produce. Or do we think that consumers of this protocol fall in one of two camps:

  1. They definitely want a pandas DataFrame regardless of the input, so they call pd.DataFrame(data).
  2. They want any dataframe-like object, and so they just require that the input be an object implementing this protocol.

@wesm
Owner Author

wesm commented Mar 17, 2020

I'd be curious to know if there's value in a more generic __dataframe__ method

This is exactly what is being proposed here and what I understood to be the spirit of the discussion in https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. We need to determine what object is returned by __dataframe__ though -- what is in this PR is a proposal for that object's interface.

Both camps of users are served by this.

  1. If pandas encounters a foreign object that implements __dataframe__, then pd.DataFrame(foreign_object) will work

  2. Right, if a library wants to accept data frame-like data but not depend on pandas, then it can instead accept any object that has __dataframe__

@adrinjalali

I'm still not sure how we can have a unified object returned by __dataframe__ in this scenario. To me, this proposal has two aspects to it, and both are equally important:

The first category is a unified interface for libraries to depend on, which can be used to extract information from the dataframe. Reading feature names, dtypes, etc, is in this category.

The second one is enabling the ecosystem to easily work with one another in an efficient way. If I compare __dataframe__ to __array__, then I'd imagine that the details of how the object is converted to a specific dataframe should be done in the __dataframe__ method, and not in the [pandas, xarray, ...] constructor which reads the output of __dataframe__; the same way that numpy expects the output of __array__ to be an ndarray. And to me this needs to be done by the object which implements __dataframe__, since it's the one that knows how to efficiently convert its internal data structures into a given dataframe. So I'd expect __dataframe__ to either accept an argument such as pandas, xarray, etc., or to have specialized methods such as __xarray_dataarray__, __pandas_dataframe__, etc.

That said, for the dataframes to work nicely with one another, they don't have to implement any of these specialized __*__ methods. They can start by implementing the interface in the first category, and then if such a dataframe is given to pd.DataFrame, the constructor will first check if it has a __pandas_dataframe__ implemented, and if not, it will use the public methods to extract the information it needs to create a dataframe.

I hope this makes it a bit more clear on what I meant before.
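
A rough sketch of that dispatch logic (the __pandas_dataframe__ dunder is the hypothetical specialized method from the comment above, not an existing pandas API; the fallback uses the draft interface from this PR):

import pandas as pd

def as_pandas(obj):
    if hasattr(obj, "__pandas_dataframe__"):
        # The producer knows how to convert its internals efficiently
        return obj.__pandas_dataframe__()
    if hasattr(obj, "__dataframe__"):
        # Fall back to the generic interchange interface
        df = obj.__dataframe__()
        return pd.DataFrame({name: df.column_by_name(name).to_numpy()
                             for name in df.column_names})
    raise TypeError("object is not data frame-like")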

@datapythonista datapythonista left a comment

To me it would simplify things if we split the discussion between:

  1. The dataframe interface (without going into the details of how to access the data of a column)
  2. The interface of a column


@property
@abstractmethod
def num_columns(self):


Personally I would rather use len(df.column_names) than have this method; this API is going to be huge, and I don't think this adds much value.


I disagree strongly here. df.column_names is expected to return a Python Sequence and getting the length of that Python object may not be free. I may want to call into an explicit function in C++ for this for example.

Owner Author

FWIW, I still have df.column_names as Sequence. Maybe it should be Iterator instead


@kkraus14 do you mind expanding on the use cases you have in mind?

I don't have a strong opinion on not having the number of columns as a property or method. It's more a preference for not making users remember one more method name in an API if it's not needed. So if this is useful for some cases, I'm happy to have it.

What I've got in mind is that dataframes would have a known and somewhat limited number of columns (in the largest examples I can think of, we're talking about a few thousand). If that is the case, I think column_names simply being a Python list would probably be good enough, and IMO would make things easier. Accessing the column names is not something I imagine you would implement inside a loop (materializing the list inside a loop if it's not stored internally as a Python list, that is; looping over the list itself is of course expected). And computing its length would be trivial. Even if you prefer a C++ implementation, I guess you could create a list-like class which supports length, indexing, and iteration, and avoid materializing the whole list as Python objects.

But from your comment, I think I'm wrong in my assumptions. So, it would be great to know better what you have in mind, and which use cases you have seen where having the column names as a Python list is a bad idea. If you have some code examples of how the protocol dataframe would be used, that's even better.

Thanks!


@property
@abstractmethod
def num_rows(self):


Same here: I don't think we need an alias for len(df); using len() is the Pythonic way, and it's enough.

Owner Author

There's a certain ambiguity to len(df), though, because the length of a dict/mapping is the number of keys it has (which here is the number of columns). We take it for granted from pandas that len(df) returns the number of rows.

In R

> df <- data.frame(a=c(1,2,3,4), b=c(5,6,7,8))
> length(df)
[1] 2
> nrow(df)
[1] 4
> ncol(df)
[1] 2


I agree on that.

To me it makes more sense that len(df) is the number of rows, and not of columns. And I'd avoid making a dataframe behave like a dict of its columns. But that's just my preference, and of course that's a reasonable choice.

But my point here was more to have just one way of doing things, and not repeat num_rows or num_columns for what can be done with len. I think we should follow the recommendation from the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." That keeps things simple for users IMO.


I agree with @wesm here that len(df) is ambiguous and prefer the explicit num_rows property.


IMHO len should denote the length of list(df) so you have to decide if it's an iterable of rows or columns.

...or just punt on it, let len(df) and iter(df) raise and make the properties rows(self) -> Iterable[Row] and columns(self) -> Iterable[Column] return iterables (which themselves implement __len__)


Coming back to this, a couple of reasons why len(df) may not be appropriate (besides some people finding it ambiguous which dimension it should return).

CPython limits the value a __len__ dunder method can return to an index-sized (64-bit signed) integer:

Python 3.7.6 | packaged by conda-forge | (default, Jun  1 2020, 18:57:50) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class LenTooBig:
...     def __len__(self):
...         return 2 ** 64 + 1
... 
>>> len(LenTooBig())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: cannot fit 'int' into an index-sized integer

That is many trillions of data points, which I'm not sure anyone will ever reach (surely not with in-memory representations), but it's still worth mentioning that the limitation doesn't exist for regular methods.

Another thing, with distributed systems in mind, does it make sense that this is a property (or __len__)? Personally, if I see:

df.num_rows  # also applies to `len(df)`

I'm mostly assuming that num_rows is known / easily available. But for distributed systems that can be an expensive operation. Would something like df.count_rows() make more sense? Also, I personally find it more obvious that df.count_rows() can fail (and raise an exception) than df.num_rows, which I'd expect to always work, be fast, and not consume significant resources.

Owner Author

Making it a function makes sense to me for the reasons you listed. It's probably best in this utilitarian interface to not have __len__

dataframe.py Outdated
"""
pass

# TODO: Should this be a method or property?


A property IMHO


# TODO: Should this be a method or property?
@property
def row_names(self) -> Sequence[Any]:


I'm -1 on row names / indices. IMO that shouldn't be part of the standard (of course pandas and any other project can have this as a feature, but shouldn't be here).

Of course we could have this optional, but I think it's more a pandas thing, and we shouldn't add complexity to the standard.


R's DataFrame also has row labels, so it's not specific to pandas. Are there any dataframe implementations that don't have row labels?

Owner Author

The way I look at it, if some implementations return None while others return a Sequence, it is not very harmful.

As one can of worms, row_names may need to return an object with the same API as columns. So this might be better off returning a Column.


I think most dataframe implementations have row labels. But in the Python world I think that is because everybody is trying to replicate the pandas API for compatibility.

Having row labels is surely a good idea for many dataframe implementations. My point is that not having them is also a good idea to me. So, I would leave them out of this proposal, at least initially.

IIRC @maartenbreddels mentioned that he'd make Vaex behave more like SQL regarding indexing. If I had to implement a dataframe library from scratch I'd also have indices as regular columns that can be indexed. There would surely be some drawbacks compared to pandas, but I think the API for users would be much simpler, and I prefer that.

In any case, not trying to say that projects shouldn't have row labels. Just that my preference is to not include them here. It narrows the discussion, and it avoids biasing towards the R/pandas approach.


Do we want any kind of invariant around np.asarray(dataframe)? To me the most natural case is something like np.column_stack([col for col in df.values()])

But if you're storing your row labels "in band" (as part of the columns in the dataframe) then your row labels will be mixed with the values.


Both R and pandas data frames have row names. So I don't see a big downside of having an API that allows them to make their row names available to consumers. It does not mean that other libraries need to implement row names

While it isn't required from the protocol perspective, people will end up writing code using these optional features unnecessarily, i.e. in Pandas using df.reset_index()[...] versus df.iloc[...]. Instead of these features being used specifically when necessary, they end up getting used everywhere which becomes almost a forcing hand for libraries to have to support the feature.

Owner Author

If a consumer writes code that assumes there are always row names, even though the interface says that they are optional, isn't that the consumer's problem? The abstract API is a contract for producer and consumer, so it wouldn't be appropriate for consumers to de facto violate the contract by making an optional feature required.


It's the consumer's problem until there's enough people with the same problem (not wanting to rewrite code that assumes row names) that it becomes a problem for producers.

For what it's worth, I fought against having row indices in cuDF and ultimately lost the battle from the overwhelming amount of libraries we wanted to integrate with and user code assuming row indices purely because they were built against Pandas originally and it became too large of a battle to continue fighting.

I'm still against row indices, but would rather them be required than optional.

Owner Author

Sorry, I guess that one's mostly my fault.

In R, row names are optional and it doesn't seem to cause problems. In pandas the .index has a functional purpose beyond purely communicating data; that may be part of why people are asking for it. In this interface there is no computational functionality, purely one-way data transfer, so that seems less problematic.


In pandas the .index has a functional purpose beyond purely communicating data, that may be part of why people are asking for it.

In our case an Index is just a Column that we wrap with the syntactic sugar expected from an Index. If someone wants to grab rows by index value, we currently build a hash table and then probe that hash table each time (we have plans to memoize it in the future). No one has pointed this out or complained about the performance of this so far (outside of CUDA developers writing the libcudf C++ library 😆), so I don't think it has to do with computational functionality; it's purely about the data transfer aspects.

dataframe.py Outdated
"""
raise NotImplementedError("row_names")

def __getitem__(self, key: Hashable) -> Column:


My preference would be to force column names to be str. I think the advantages of that simplification outweigh the limitation it imposes. Then __getitem__ can unambiguously perform all reasonable operations: get one column (by name or position), get multiple columns, and slice.

Owner Author

I think we should be careful about going down the route of having a heavily-overloaded __getitem__ in this interface. It may be better to be as explicit and unambiguous as possible. For example, I am not even sure that df[start:end] should be available


This interface should perhaps not even have a __getitem__ because it's really ambiguous and specific to implementations. It should be explicitly get_columns or get_rows to avoid ambiguity between implementations.


Agreed. I'd surely avoid too much overloading, but how much is too much is difficult to say. Maybe some examples can help us see what our preference is in practice:

df.column_by_name('col2')
df.column_by_position(1)
df.columns_by_name('col2', 'col3')  # or receive a list instead of `*args`
...
df['col2']
df[1]
df['col2', 'col3']  # note that the param of `__getitem__` is a tuple, not a list like in df[['col2', 'col3']]

Not sure what the best option is (maybe something in between), but I'm biased towards the second set, while surely avoiding any ambiguity or anything too crazy. And I think requiring column names to be strings makes things easier. For example, if a tuple is accepted (it is in your code, since it's hashable), things become more complex and more easily ambiguous (see the last line of my example).

So, the exact API surely requires some discussion, but my main point was that I'd personally limit key to be a string. I understand that many people will prefer the flexibility over the simplicity, but that's my preference.


It should be explicitly get_columns or get_rows to avoid ambiguity between implementations.

I think this is a very good point. Do we want this PR to be both the API for end users and for downstream projects accepting generic dataframes? From @devin-petersohn's comment, maybe we should be discussing the API only for the latter (downstream projects), and leave the end-user API to each project. If we do that, I'd avoid using any magic methods, and prefix every method so there are no collisions with the public API of the implementations (e.g. __dataframe_columns__, __dataframe_num_rows__).

Owner Author

My understanding of the discussion so far is that objects providing data frame-like data should have a single "export" method __dataframe__ that returns an implementation of a "data frame interface" object. I do not think that e.g. pandas.DataFrame should be a subclass of this object. This PR is intending to define some of the APIs that this object should contain

@GaelVaroquaux

GaelVaroquaux commented Mar 18, 2020 via email

frame column. This metadata does not guarantee a specific underlying data
representation
"""
def __eq__(self, other: 'DataType'):


getting a mypy error here...

pandas\wesm\dataframe.py:19: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the 
argument type as "object"
pandas\wesm\dataframe.py:19: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
pandas\wesm\dataframe.py:19: note:     def __eq__(self, other: object) -> bool:
pandas\wesm\dataframe.py:19: note:         if not isinstance(other, DataType):
pandas\wesm\dataframe.py:19: note:             return NotImplemented
pandas\wesm\dataframe.py:19: note:         return <logic to compare two DataType instances>

Should we follow the mypy recommendation?

Owner Author

Should we follow the mypy recommendation?

I see no reason not to


@property
@abstractmethod
def num_rows(self):


This returns the number of rows in the DataFrame (if known)

Suggested change
def num_rows(self):
def num_rows(self) -> Optional[int]:

Should the API use Optional types, or raise (maybe a custom exception) if the value is not known?

@Dr-Irv

Dr-Irv commented Mar 19, 2020

One thing that occurred to me is that the interface is completely column-oriented. But if you have a large file or database with lots of rows, and you read it in row by row, you won't have complete column vectors until you have read everything in, and then you might have some issues with memory.

So my question is whether the interface should define a protocol for creating a DataFrame from row-oriented data.

@TomAugspurger TomAugspurger left a comment

This can be done separately, but we should consider adding an .attrs attribute to the interface as a way of propagating metadata about the dataset (cc @philippjfr). Pandas, xarray, and h5py all implement the same interface.
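
A minimal sketch of what that could look like on the draft interface (the attrs name follows pandas/xarray/h5py; the empty-dict default mirrors the column metadata method already in this PR):

from abc import ABC
from typing import Any, Dict

class DataFrame(ABC):
    # ...existing draft methods elided...

    @property
    def attrs(self) -> Dict[Any, Any]:
        """
        Arbitrary metadata about the dataset. Default implementation
        returns an empty dict, so producers without metadata need not
        override it.
        """
        return {}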


# TODO: Should this be a method or property?
@property
def row_names(self) -> Sequence[Any]:


Another note / leading question here: What's the argument for having DataFrame.row_names but not Column.row_names? I think if we want them for DataFrame then we also want them for Column (likewise if we don't want them).

@wesm
Owner Author

wesm commented Apr 8, 2020

I made another pass on this per feedback here. I removed all the dunder methods in the interest of being as conservative / explicit as possible. Take a look

Comment on lines +169 to +181
def to_numpy(self):
"""
Access column's data as a NumPy array. Recommended to return a view if
able but not required
"""
raise NotImplementedError("Conversion to NumPy not available")

def to_arrow(self, **kwargs):
"""
Access column's data in the Apache Arrow format as pyarrow.Array or
ChunkedArray. Recommended to return a view if able but not required
"""
raise NotImplementedError("Conversion to Arrow not available")

Could these APIs be generalized more to something like to_array and to_masked_array? I imagine users will use these APIs as building blocks, and if we're for example using a GPU, we wouldn't want to either return a differently typed GPU-backed object or cause an unnecessary host --> device copy.

For example, Pandas uses .values to return a numpy / Pandas array whereas in cuDF we return a cupy array to keep everything on the GPU as much as possible.

Owner Author
@wesm wesm Apr 8, 2020

I don't understand your comment.

The idea of this interface is that the data representation is not known. So we will add various methods to Column (what is here is NOT exhaustive, maybe that is where there is confusion) to allow the consumer to request the data be returned to them in the representation that they prefer. I put NumPy and Arrow as two example concrete representations


I guess my point is that if we have to_numpy and to_cupy, the standard will become to_numpy, and anyone then looking to change to a different array implementation will have to go back and rework their code. If it were instead to_array, and the returned object were expected to support some form of array protocol, it would allow for writing more general code that can run anywhere.

Maybe I was unclear as to whether to_numpy and to_arrow were placeholders for generic conversion functions for somewhat generic objects or if they were specific to returning numpy arrays and arrow arrays.

Owner Author
@wesm wesm Apr 9, 2020

I see. There is no "standard" here, and this effort is not trying to establish a standard representation. The idea is that some projects are implemented against a certain data representation (e.g. NumPy arrays) -- so indeed these methods are here so that a producer can return their data to the consumer in the representation that they are looking for. matplotlib is an example -- it deals almost exclusively in NumPy arrays so rather than have matplotlib take responsibility for the details of converting from some foreign data frame object to a NumPy array, this conversion is the responsibility of the producer of this interface.

Owner Author

I guess my point is that if we have to_numpy and to_cupy, the standard will become to_numpy, and anyone then looking to change to a different array implementation will have to go back and rework their code.

Also for clarity, efforts to achieve duck-typing between array libraries are pretty orthogonal to this project -- if such a duck-typed array standard is developed and adopted / normalized, then this interface can add it. The use case here is strictly abstract data interchange that does not force producers to convert to a particular intermediate format like pandas.


Understood, that makes sense. My fear is that if protocols can't be relied on entirely, the default path consumers will take is to use .to_numpy(), where ideally they could use something more generic like .to_array(), which produces something array-like that supports calling numpy functions against it. Without something like this, a GPU library like cuDF that wants to support this protocol would face an awful choice: either return something like a cupy array from the to_numpy() call to keep data on the GPU and do things efficiently, but possibly break for some downstream people / libraries who were expecting specifically host arrays, or have to_numpy() explicitly copy the data to host, which then possibly throws away the performance we may have gained from using the GPU.

Owner Author

I see. I don't know that it makes sense to be opinionated in this way at this level. We can provide people with choices and they can decide what makes sense for them. I don't see a problem with having a to_$something method as long as that $something is documented and something that people can understand how to use. So you can provide both the "general array like" path and the "NumPy" path and people can choose. That you have a preference about which one you want people to use seems like it shouldn't factor into the interface


The problem is that if the "NumPy" path is included in the interface whereas the "general array like" path isn't, most producers will likely only implement the "NumPy" path, which will make almost all consumers consume the "NumPy" path. This then becomes a problem for us when trying to integrate efficiently: we have to convince everyone to change to the "general array like" path, as opposed to that being the typical path.

Owner Author

Would you like to propose an API to add? In the PR description I wrote "The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later." What is written here is not exhaustive.


Apologies, I missed that. Will add a suggestion to kick off discussions on adding the generic API I was talking about.




# TODO: Should this be a method or property?
@property
def row_names(self) -> Sequence[Any]:

I'm also -1 on row names / indices. If it becomes an optional part of the protocol but say Pandas implements it, I imagine all of the other libraries will end up having their hands forced to support it from users wanting Pandas compatibility as opposed to building a new DataFrame standard.

dataframe.py Outdated
pass

@abstractmethod
def get_column_by_name(self, name: Hashable) -> Column:

I would love for name to be more strongly typed than Hashable to allow better interoperability with other languages. I would be happy to say name should be a String but that has obvious downsides as well.

Owner Author

From the perspective of one library exporting data to another, it's unclear that we should prevent a library from exporting / making available its column names if they are not strings. Even Hashable may be too strict here, since even if the name is not hashable, it can be looked up by a linear equality search.

Comment on lines +264 to +269
def to_dict_of_numpy(self):
"""
Convert DataFrame to a dict with column names as keys and values the
corresponding columns converted to NumPy arrays
"""
raise NotImplementedError("TODO")

Could this be generalized to something like to_dict_of_array? My concern is we don't want standard APIs that strongly expect a numpy object where we'd want to plug in a GPU object.

Owner Author

See my comment above

Metadata for this column. Default implementation returns empty dict
"""
return {}

@kkraus14 kkraus14 Apr 9, 2020

Proposing an additional API here for a more generic array API. I think that the constraint of supporting the __array_function__ protocol helps to prevent this from becoming too overloaded, but wanted to kick off discussion.

Suggested change
def to_array_function_protocol_object(self):
"""
Access column's data as an array object which supports the
`__array_function__` protocol. Recommended to return a view if able but
not required
"""
raise NotImplementedError("Conversion to array not available")


Will open another suggestion around to_dict_of_array once this is settled.

Owner Author

Alright. From my perspective a critical matter is making sure that this doesn't get mixed up with APIs whose purpose is to provide data in a particular memory representation. So I think the term "array" in this context may confuse, so to_array_function_protocol or something more explicit would be clearer (and __array_function__ can also be available, which calls this function).


From my perspective a critical matter is making sure that this doesn't get mixed up with APIs whose purpose is to provide data in a particular memory representation.

Agreed.

So I think the term "array" in this context may confuse, so to_array_function_protocol or something more explicit would be clearer (and __array_function__ can also be available, which calls this function).

to_array_function_protocol_object is probably the most explicit because we're exporting an object that supports the protocol as opposed to the protocol implementation.


to_array_function_protocol_object is probably the most explicit

That's... incredibly cumbersome :(

If to_array is too ambiguous how about to_arraylike

Owner Author

The reason that I'm against to_array is that then you have this situation:

  • to_numpy -> does data export to a concrete memory representation
  • to_arrow -> does data export to a concrete memory representation
  • to_array or to_array_like -> what does this do? Based on what I see above it's not doing data export

What is "array" and what is "array_like"? It's too ambiguous.

IMHO this project is first and foremost about data export at library boundaries (without one library having knowledge of the internals of the other). If you want to also expose object protocols (like __array_function__) that's fine, but that needs to not be commingled with the data export APIs.


to_array_like could export either a numpy array or a pyarrow array, or anything which implemented the array protocol. A consumer of this method could then write a generic function, agnostic to whatever the concrete representation was - so long as they stuck to methods implemented in the array protocol.

Owner Author
@wesm wesm Apr 9, 2020

I think this goes beyond the scope of the problem being solved in this first project iteration. There is exporting "data" and exporting "interfaces". If we can't first agree about exporting "data" then I think this project will be kaputt and we should all disengage.

@kkraus14 kkraus14 Apr 9, 2020

  • to_numpy -> does data export to a concrete memory representation

From my perspective, in addition to exporting to a concrete memory representation, this takes a step too far and exports to a specific object container type. It would be great if we could generalize this to exporting any object that follows a concrete memory representation, as opposed to tightly confining it to a specific object, i.e. (ignore the cumbersome names for now):

  • to_array_interface_object -> does data export to an object that supports __array_interface__ protocol
  • to_arrow_array_interface_object -> does data export to an object that supports __arrow_array_interface__ protocol (I know this doesn't exist, just for example)
  • to_buffer_protocol_object -> does data export to an object that exposes Python buffer protocol
  • to_cuda_array_interface_object -> does data export to an object that supports __cuda_array_interface__ protocol

These protocols are specifically defined surrounding memory representation so it makes sense to use them from my perspective.
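
For illustration, the consumer side of such methods could stay generic about the container (to_array_interface_object is the hypothetical name from the list above; np.asarray already understands any object exposing __array_interface__):

import numpy as np

def column_to_host_array(col):
    # `col` is a protocol Column; the method name is hypothetical.
    obj = col.to_array_interface_object()
    # np.asarray consumes __array_interface__ without caring what the
    # producer's container type is.
    return np.asarray(obj)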

Owner Author

I've reached the limit of how much time I can invest in this for now, so I'll wait a few weeks or a month for others to express their opinions. For me the things I care most about are:

  • Clear and unambiguous names (even if they are "cumbersome")
  • Clear distinction between methods that yield objects implementing some other protocol / interface (like the buffer protocol) and methods that yield particular containers. If a consumer wants a numpy.ndarray, I think they should be able to ask for that

Per my comment on the Discourse thread, I do think taking a step back and writing a requirements doc might be helpful to avoid diversions into discussions of non-goals that take up a lot of time and energy.

dataframe.py Outdated

@property
@abstractmethod
def column_names(self) -> Sequence[Any]:


How about a columns property that returns a ColumnCollection that itself has a names property, and similarly for rows

Owner Author

See comment above. I am not sure this API should be in the business of syntactic sugar


As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed. For many applications I could probably get by with a restricted API, but I'd still want to code a nice interface which uses idiomatic Python constructs. The proposed interface here strikes me as being extremely cumbersome to use in practice. Just one viewpoint though; there may well be other considerations to take into account other than aesthetics.

Owner Author

As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed.

This is not a requirement / goal of this project (see the original discussion thread).

This is exactly the motivating use case at the start of that thread:

> Bigger picture: programming to an interface

If the discussion there has veered off from that course then it is probably a good idea to write a more formal proposal to capture the current state.

Owner Author

Take it up with people on Discourse. If it turns out that people don't want interchange / export APIs I'm going to walk away from this project. I've already spent a lot of time on it.

"""
raise NotImplementedError("select_columns_by_name")

def to_dict_of_numpy(self):

Again, using type hints could avoid cumbersome method names:

  def to_dict(self) -> Dict[str, np.ndarray]:
    ...

Owner Author

-1, I think the method name needs to unambiguously describe what it's doing
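
As a concrete illustration of the naming trade-off, a hypothetical pandas-backed implementation (the class name and body are made up for this sketch, not part of the proposal) might spell the unambiguous variant like this:

from typing import Dict

import numpy as np
import pandas as pd


class PandasDataFrameExporter:
    """Hypothetical adapter wrapping a pandas.DataFrame."""

    def __init__(self, pdf: pd.DataFrame):
        self._pdf = pdf

    def to_dict_of_numpy(self) -> Dict[str, np.ndarray]:
        # The name states both the container (dict) and the value type
        # (numpy arrays), so callers know exactly what memory they get.
        return {name: self._pdf[name].to_numpy() for name in self._pdf.columns}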

"""
Return the column at the indicated position
"""
pass
@dhirschfeld dhirschfeld Apr 9, 2020

From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.

If we instead extend the columns(self) -> ColumnCollection model, we can have an API like:

In [79]: df = DataFrame('A', 'B', 'C')

In [80]: df
Out[80]: DataFrame(['A', 'B', 'C'])

In [81]: df.columns
Out[81]: ColumnCollection(['A', 'B', 'C'])

In [82]: df.columns.idx(0)
Out[82]: Column(name='A')

In [83]: df.columns.idx(0, 2)
Out[83]: DataFrame(['A', 'C'])

In [84]: df.columns['A']
Out[84]: Column(name='A')

In [85]: df.columns['A','C']
Out[85]: DataFrame(['A', 'C'])

...which, at least to me, seems a lot more natural.

Code:
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Column:
    name: str


class ColumnCollection:
    _columns: List[Column]

    def __init__(self, *columns: Column):
        self._columns = list(columns)

    @property
    def names(self) -> List[str]:
        return [col.name for col in self._columns]

    def idx(self, *indices: int) -> Union[Column, DataFrame]:
        # A single index yields a Column; several indices yield a new DataFrame.
        columns = [self._columns[idx] for idx in indices]
        if len(columns) == 1:
            return columns[0]
        return DataFrame(*(col.name for col in columns))

    def __getitem__(self, names: Union[str, Tuple[str, ...]]) -> Union[Column, DataFrame]:
        # A single name yields a Column; a tuple of names yields a new DataFrame.
        columns = dict(zip(self.names, self._columns))
        if isinstance(names, str):
            return columns[names]
        return DataFrame(*names)

    def __len__(self) -> int:
        return len(self._columns)

    def __iter__(self):
        return iter(self._columns)

    def __repr__(self) -> str:
        columns = "', '".join(self.names)
        return f"ColumnCollection(['{columns}'])"


class DataFrame:
    _columns: ColumnCollection

    def __init__(self, *columns: str):
        self._columns = ColumnCollection(*(Column(col) for col in columns))

    def __repr__(self) -> str:
        columns = "', '".join(self.columns.names)
        return f"DataFrame(['{columns}'])"

    @property
    def columns(self) -> ColumnCollection:
        return self._columns

Owner Author

> From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.

I strongly feel that syntactic sugar / conveniences (which get non-objective very quickly) belong at a higher level than this API, which is about library developers implementing unambiguous data export APIs.

I'm interested in this as a potential way to write analytics functions which are DataFrame-implementation agnostic, i.e. instead of coding to the pandas API I would code to this API, hence its usage would be front and centre in my code in the same way that pandas is now. Maybe this isn't supposed to cater to that end-user concern though?

@wesm wesm changed the title Draft strawman data frame "__dataframe__" interchange protocol for discussion Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion Apr 9, 2020
@wesm wesm commented Apr 9, 2020

I relaxed hashability of column names and changed column_names to return Iterable. PTAL
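
For readers following along, the described change presumably amounts to something like this sketch (not the exact diff):

from abc import ABC, abstractmethod
from typing import Any, Iterable


class DataFrame(ABC):

    @property
    @abstractmethod
    def column_names(self) -> Iterable[Any]:
        """Column names in column order; names are no longer required
        to be hashable."""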

pass

@property
def attrs(self) -> Mapping:

Should this be something more precise, e.g. Mapping[str, str]?

At least in xarray attrs can be arbitrary metadata so it should probably be Mapping[str, Any] (although it's usually strings in practice).

Owner Author

I've seen keys as bytes (e.g. Arrow's metadata is bytes -- though UTF-8 bytes would be preferred) so Mapping[Hashable, Any] might be the most general (maybe that's the default for Mapping, not sure)
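
A sketch of the signature under the Mapping[Hashable, Any] option floated above (illustrative only):

from abc import ABC, abstractmethod
from typing import Any, Hashable, Mapping


class DataFrame(ABC):

    @property
    @abstractmethod
    def attrs(self) -> Mapping[Hashable, Any]:
        """Arbitrary metadata; keys are commonly str, but bytes keys
        (e.g. Arrow field metadata) are also possible."""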

@wesm wesm commented Apr 10, 2020

If someone would like write access to this repository to help lead this effort, please let me know. I'm juggling a few too many projects, so I need to step away from this for a while.

"""

def __init__(self, index_type: IntegerType, category_type: DataType,
ordered: bool = False):
@datapythonista datapythonista Aug 24, 2020

I'd like to better understand the reason for having the types of the data and of the categories in the type.

As an example, I think a category column could be represented with something equivalent to:

import pyarrow

class CategoryColumn:
    dtype = 'category'
    index = pyarrow.array([0, 1, 0, 2, 0, 1], type='int8')
    categories = pyarrow.array(['red', 'green', 'blue'], type='string')

So, when accessing the data, I can imagine that we may be interested in one of two options:

  1. The underlying data (to copy the structure from implementation to implementation for example)
  2. To access the actual data (forgetting about the internal representation)

For the first option, an example of code copying the data to a numpy array could be:

import numpy

for source_column in dataframe:
    if source_column.dtype == 'category':
        # the codes' type (int8 here) is read from the underlying array,
        # not from the column's dtype
        copy_of_index = numpy.array(
            source_column.index, dtype=source_column.index.type.to_pandas_dtype()
        )

The example is very simplistic, but it shows that we can get the type of the index (int8 in the previous example) from the underlying column (and not from the data type). It feels like having it again in the data type would just be a duplicate that needs to be kept in sync. The same would apply to the data type of the categories.
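
For completeness, the second option (materializing the decoded values) could look like this, reusing the hypothetical CategoryColumn above:

import numpy

# Option 2: drop the dictionary encoding and materialize the decoded values
codes = CategoryColumn.index.to_numpy()                          # [0, 1, 0, 2, 0, 1]
categories = numpy.array(CategoryColumn.categories.to_pylist())  # ['red', 'green', 'blue']
actual_values = categories[codes]  # ['red', 'green', 'red', 'blue', 'red', 'green']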

What would be the main advantage of this implementation, as opposed to having just a categorical type with only the ordered parameter (or two types, categorical and ordered_categorical)? I guess it decouples the schema from the data, but I'm unsure whether it's worth the extra complexity. If we don't have parameters, I guess we could just identify data types across implementations by their name as a string, instead of making them implement all these subclasses, which would make things simpler.

Am I missing something? Does what I say make sense?

It's pretty common for users to compare dtypes across columns. With this proposed change, a category column with int32 codes and, say, int64 categories would have the same dtype as one with int8 codes and string categories, which may lead to confusion / frustration for end users.

I'm a +1 to including the index / categories types in the dtype as it makes things more explicit for end users.

Thanks, that's useful to know. Just to understand better, do you have any example on why users compare dtypes across columns?

Owner Author

There's lots of reasons to have access to the complete schema, from preallocating space for data (in-memory or on disk) to building query plans (including kernel selection) before data arrives. Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.

That said, if you have an application where the index or category type is simply not known, you could use an Any type to indicate that the actual type must be observed in the actual data
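
To make the schema-only use case concrete, here is a rough sketch of such a parametric categorical type (the attribute names mirror the __init__ shown in the diff above; the rest is illustrative):

class CategoricalType:
    """Illustrative parametric categorical type."""

    def __init__(self, index_type, category_type, ordered=False):
        self.index_type = index_type        # e.g. an Int8/Int32 type object
        self.category_type = category_type  # e.g. a String type object
        self.ordered = ordered


# A consumer can plan kernels or preallocate storage from the schema alone,
# before any data arrives, e.g. by inspecting:
#   column.type.index_type, column.type.category_type, column.type.ordered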

> Thanks, that's useful to know. Just to understand better, do you have any example on why users compare dtypes across columns?

Unfortunately our users don't share the "why" too often, but one use case is checking the output of dtype inference from something like a read_csv call.
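
A minimal illustration of that use case (the file name is made up):

import pandas as pd

df = pd.read_csv("measurements.csv")  # hypothetical file
# eyeball the inferred schema, column by column
print(df.dtypes)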

> Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.

Thanks @wesm, that makes sense. But I don't fully understand this part. Given a categorical column, there is access to the index and category types in my example (there was a typo in my previous code, maybe that's where I didn't make myself clear). It's just that the place where it is found is different (in the array instead of in the dtype).

In your proposal:

my_column.dtype.index_type

In my example it would be something like:

my_column.underlying_categorical_data.index.dtype

I understand that in my example there wouldn't be access to the underlying types if a dataframe schema is used without the actual dataframe. So, with your proposal we could still know the internal types of the categorical, while in my example those would be undefined, since they would live in the data, which doesn't exist. Is this what you have in mind? If so, in what cases are we considering that a schema can live without the data?

Accessing the actual data is potentially costly, I think? So that would make getting the type of the categories costly, while I assume accessing it from the dtype itself would not.

> Accessing the actual data is potentially costly, I think?

Good point. In the PR, the type (aka dtype) is a property of the Column class, so accessing the index type would be

class Column:
    @property
    def type(self):
        pass

my_column.type.index_type

The data is not specified, maybe that's the problem, but if we agree to represent the categorical data as two subcolumns, one for the index and one for the categories, then accessing the index type with an unparametrized categorical dtype would be:

my_column.index_subcolumn.type

I may be missing something, but I would say in both cases the type is metadata of the Column class that doesn't require accessing the actual data. I'm not sure whether accessing a subcolumn could be expensive in certain architectures (distributed, GPU, ...), but in pure Python both cases would be two lookups in the namespace.

I'm not sure what it would imply if we consider implementations using Arrow's dictionary type instead of the two arrays separately. Would that be costly?
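
One data point on that last question (illustrative and pyarrow-specific): a DictionaryArray's type object carries both sub-types as plain metadata, so reading them does not touch the data buffers:

import pyarrow as pa

arr = pa.array(['red', 'green', 'red', 'blue']).dictionary_encode()
print(arr.type.index_type)  # type of the codes (int32 by default)
print(arr.type.value_type)  # type of the categories (string)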
