Draft strawman data frame "__dataframe__" interchange / data export protocol for discussion #1
@wesm I really like this, there are some minor differences between the model here and Modin's model worth discussing. Some are noted inline, others I will add here:
Slicing/Masking/Projecting/Selecting
There is only one method to extract a subset of columns and rows in Modin, called `mask`. It accepts either labels or positions, depending on the keywords provided. This interface follows the software engineering best practice of keeping external interfaces narrow and implementations deep.
def mask(columns_by_label=None, columns_by_index=None, rows_by_label=None, rows_by_index=None)
I think it's cleaner than treating all four as separate methods, and all calls will be the same.
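For illustration, a hedged sketch of how that single entry point might be used; the `df` object and the exact keyword behavior are assumptions based on the signature above:

```python
# Hypothetical usage of a Modin-style `mask` method; each call returns a new frame.
by_label = df.mask(columns_by_label=["a", "b"])                    # columns by name
by_position = df.mask(columns_by_index=[0, 1])                     # columns by position
first_rows = df.mask(rows_by_index=list(range(100)))               # first 100 rows
combined = df.mask(columns_by_label=["a"], rows_by_index=[0, 5])   # mix column and row selection
```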
Types
Modin does not enforce types or have its own system of types and does not place requirements on types. I think this is better.
At a high level, I think the interface should be as narrow as possible.
dataframe.py (outdated):

    class MutableDataFrame(DataFrame, MutableMapping):
I'm in favor of dropping this completely. Ideally this interface can discourage mutability because in general it's bad practice. It is still possible to use the immutable DataFrame for mutable legacy systems as is, but moving away from this will be better for dataframes as a whole in my opinion.
In Modin, the mutability is implemented at the API layer, but underneath it is completely immutable.
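A minimal sketch (not Modin's actual code, just an assumption about the general pattern) of what "mutable at the API layer, immutable underneath" can look like:

```python
class ImmutableFrame:
    """Core storage: never modified after construction."""
    def __init__(self, columns):
        self._columns = dict(columns)

    def with_column(self, name, values):
        # Return a new frame instead of mutating this one.
        new_cols = dict(self._columns)
        new_cols[name] = list(values)
        return ImmutableFrame(new_cols)


class MutableFacade:
    """What the user sees: assignment works, but only swaps in a new core."""
    def __init__(self, frame):
        self._frame = frame

    def __setitem__(self, name, values):
        self._frame = self._frame.with_column(name, values)
```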
I also don't think that mutability makes sense in an interchange API like this. I put this here to "stir the pot" and elicit reactions from others.
I think when it comes to performance/memory usage, it's important for the users/libs to be able to modify values inplace. But I guess you could argue we could convert everything first to numpy arrays and then deal with that.
The main problem I see with this is that an implementer of the protocol/API may have to deal with converting from foreign input. For example:

    df[col] = data

If `data` is a NumPy array and `df` doesn't use NumPy arrays internally, then it's a can of worms to account for all of the different potential kinds of `data` that the user may pass in.
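A hedged illustration of that can of worms; the setter and the internal storage assumed here are hypothetical:

```python
import numpy as np

def set_column(df, name, data):
    """Hypothetical setter for a non-NumPy-backed implementation."""
    # Every kind of foreign input needs its own conversion path.
    if isinstance(data, np.ndarray):
        values = data.tolist()              # stand-in for a real conversion
    elif hasattr(data, "to_pylist"):        # e.g. a pyarrow.Array
        values = data.to_pylist()
    elif isinstance(data, (list, tuple)):
        values = list(data)
    else:
        raise TypeError(f"don't know how to store {type(data).__name__}")
    df[name] = values                       # assumes a dict-like internal store
```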
I think another downfall of inplace is that it is often deceptive and leaky. In pandas, the overwhelming majority of the time it is not really in place; it is copying at some point, so users who think they are being more efficient or using best practices are often not getting the desired effect under the hood. Being functional (`df.<op>` returns another dataframe) is better because external libraries should ideally not have to think about optimizing calls for performance or memory, but rather trust that those who implemented the library did a good job.
That may be true of pandas, but some simple arithmetic operations in NumPy show a different story. For instance, if `a` and `b` are random `float64` arrays of shape 10000x10000, we have:
In [12]: %timeit np.add(a, b, out=a)
116 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [13]: %timeit np.add(a, b)
483 ms ± 200 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This may also just indicate that, if we want to do arithmetic, we should preferably operate at a lower-level layer (like NumPy) rather than on the data frame implementations.
There's one more thing to consider I did not mention before: From a library perspective, editing input data inplace is very destructive to the user's environment.
Often there is a disconnect between users and those who maintain or create libraries. You will hear and see a lot of arguments about performance, but in reality users just do not care as much about performance as they do other things. It is easy for a library maintainer to focus on performance, it is a raw number we can try to optimize. However the user cares more about their own productivity and about prediction accuracy. A real world difference of a few seconds or even minutes is noise to a user who is interpreting results.
Also a minor note: your `timeit` example does not make sense in practice because each `In [12]` trial updates `a`; the original `a` values are gone after the first iteration. It's a minor point because it doesn't actually affect performance, but the outputs of `In [12]` and `In [13]` are not the same.
Ignoring this correctness issue for a moment, you can improve the performance of the `In [13]` line by almost a factor of 2 if you do the following (provided you would want to store the output in some variable `c`):

    %%timeit c = np.copy(a)
    np.add(a, b, out=c)
This gets to my point about performance. Optimizing for the behaviors of some other library that is out of your control is a dangerous business to be in, especially when you destroy a user's environment to do so. If you are the user destroying your own environment, that's perfectly fine, but the purpose of this interface is for libraries to avoid depending directly on pandas (forcing conversion into pandas). From a uniform interface perspective, we want to discourage in place updates because it is a leaky abstraction (should dataframe consumers really know about or rely on internal state?) and destructive to the user's environment.
Users may think they want this, but the same thing can be functionally achieved by `df = df.dont_update_inplace()`. The only argument to keep it is a performance argument, and I feel this is a very weak argument. If a dataframe library is not effectively managing memory or has performance issues, that is an issue the dataframe library should fix, not something to work around by manually changing internal state in place.
The performance is something we really care about in sklearn. We may not always be optimal, but we try, and when we do change things in place, it's not the user's data we're changing. It's an intermediate array we have in our methods. The optimizations we do even touch the level of an array being F-contiguous or C-contiguous, depending on the arithmetic we need to apply to the data. So at least for our use case, these details really do matter.
    # Classes representing a column in a DataFrame

    class Column(ABC):
What is the benefit of having a column-specific interface vs just a DataFrame with one column? In Modin, underneath everything is a DataFrame and Column/Series objects are views into the dataframe implemented at a higher layer. I'm interested to hear the motivation for a column-specific interface.
The semantics that most existing applications / users expect when they do `df[col_name]` is to get something that semantically references a 1-dimensional array, or something convertible to a 1-dimensional array. This seems to hold not just in Python but in other PLs that have data frames:
- In R, `df$col_name` yields a (1-D) R vector
- In Julia's DataFrames.jl, `df.col_name` yields an `Array` of some type
In all the Apache Arrow implementations there's a logical / semantic distinction between a RecordBatch / Table and a field of a RecordBatch / Table (which yield an Array or ChunkedArray, respectively)
Secondly, the semantics and API of conversion methods at the column level vs. at the data frame level may differ. For example, suppose we wanted to select some number of columns from a data frame and then convert them to a dict of NumPy arrays; it might make sense to write `df.select_columns(names).to_numpy_dict()` or similar. I'm interested to see what scikit-learn and matplotlib developers think.
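A hedged sketch of the two levels; the method names follow the example just above and the draft in this PR, and the `df` object is assumed to implement the protocol:

```python
# Column-level export: one name in, one 1-D array out.
sepal_length = df.column_by_name("sepal_length").to_numpy()

# Frame-level export: several names in, a dict of 1-D arrays out (illustrative names).
arrays = df.select_columns(["sepal_length", "sepal_width"]).to_numpy_dict()
```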
dataframe.py (outdated):

    return self.column_by_name(key)

    @abstractmethod
    def column_by_name(self, key: Hashable) -> Column:
Is there any plan for an interface that allows getting multiple columns at a time?
Yeah I think multiple-column-selection and row slicing should also be in the API. I didn't write anything down for those yet to see if there's agreement on the simplest concepts (getting access to the data in a single column)
Should the API define the behaviour when the column name is not unique vs. when the key is not found?
Good question, open to ideas. Though we should probably avoid polymorphic output types like in some pandas APIs.
I agree about avoiding polymorphic output types. In Modin this is solved by not having a Column type, so slicing/indexing always returns another DataFrame.
dataframe.py (outdated):

    # DataFrame: the main public API

    class DataFrame(ABC, Mapping):
Other things I would add to the `DataFrame` interface (a rough sketch follows the list):
- a `dtypes` equivalent: some mapping from column name to type
- some way to slice/subset the rows. See the overall comment for more on this.
- `to_numpy` and `to_arrow` at the DataFrame level
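A hedged sketch of how those additions might look on an abstract `DataFrame`; the names and signatures below are assumptions for discussion, not part of the actual draft:

```python
from abc import ABC, abstractmethod
from typing import Any, Mapping


class DataFrame(ABC):
    @property
    @abstractmethod
    def dtypes(self) -> Mapping[Any, "DataType"]:
        """Map each column name to its logical type."""

    @abstractmethod
    def select_rows(self, start: int, stop: int) -> "DataFrame":
        """Return a new DataFrame containing rows [start, stop)."""

    def to_numpy(self):
        """Convert the whole frame to a 2-D NumPy array, if supported."""
        raise NotImplementedError("Conversion to NumPy not available")

    def to_arrow(self):
        """Convert the whole frame to an Arrow table, if supported."""
        raise NotImplementedError("Conversion to Arrow not available")
```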
Is the intent that pandas be able to implement this interface? If so, the `.values` attribute may cause issues, since it's a property returning an ndarray.

We're set up to deprecate `DataFrame.values`, with the alternative `.to_numpy()` which has been available for a few releases now. But we haven't made the deprecation yet since it'll be a bit disruptive. This type of protocol would, I think, give enough justification for a gradual deprecation though.
@TomAugspurger -- not "implement" in the way you're describing. Instead:

    class PandasWrapper(dataframe_protocol.DataFrame):
        pass

    # in pandas
    class DataFrame:
        def __dataframe__(self):
            return PandasWrapper(self)
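A hedged sketch of how a consumer might then use such a wrapper, relying only on the protocol object; the method names follow the draft in this PR and the loop body is illustrative:

```python
def consume(obj):
    # Works for any producer exposing __dataframe__, pandas or otherwise.
    dfp = obj.__dataframe__()
    result = {}
    for name in dfp.column_names:                           # draft property on the protocol object
        result[name] = dfp.column_by_name(name).to_numpy()  # producer performs the conversion
    return result
```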
dataframe.py (outdated):

    @abstractmethod
    def iter_column_names(self) -> Iterable[Any]:
Why would we need both this and `column_names`? Can we leave it to the application to materialize and just have `column_names` return an `Iterable`?
+1 to @devin-petersohn
Maybe not surprisingly I think types and schemas are good, and libraries like pandas being a bit "loosey goosey" about type metadata has IMHO caused a lot of problems over the last 10 years. If the producer of a data frame can expose type metadata without requiring a potentially costly conversion, then it seems reasonable to me to permit this. If you know that a column contains integers, you might pass different arguments downstream.

As far as this interface having its "own system of types" -- what would be the alternative, to return the metadata of the internal data representation (e.g. a NumPy dtype)? I think the requirements for the type objects here are still open for discussion.

Like above I'm interested to see what consumer projects (that don't want to depend on pandas, say) think about this.
Regarding conversion between different data frames, how would libraries convert between the types? It'd be nice to have an API similar to:

    pd.DataFrame(xarray.DataArray(np.ndarray(...)))

Not sure how easy/challenging it is to make it happen though.
@adrinjalali that is effectively what we are discussing here (starting from https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267) -- i.e. thinking about what is the API of the object returned by a `__dataframe__` method:

    class XArray:
        ...
        def __dataframe__(self):
            # This class implements the API being discussed here
            return XArrayDataFrame(self)

EDIT: I just added some comments about this to the PR summary for clarity
With respect to a `__dataframe__` protocol, I'd be curious to know if there's value in a more generic interface that downstream code could program against.
This is exactly what is being proposed here and what I understood to be the spirit of the discussion in https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267. We need to determine what object is returned by `__dataframe__`. Both camps of users are served by this.
I'm still not sure how we can have a unified object returned by `__dataframe__`. The first category is a unified interface for libraries to depend on, which can be used to extract information from the dataframe; reading feature names, dtypes, etc. is in this category. The second one is enabling the ecosystem to easily work with one another in an efficient way.

That said, for the dataframes to work nicely with one another, they don't have to implement any of these specialized interfaces. I hope this makes it a bit clearer what I meant before.
To me it would simplify things if we split the discussion between:
- The dataframe interface (without going into the details of how to access the data of a column)
- The interface of a column
    @property
    @abstractmethod
    def num_columns(self):
Personally I would rather use `len(df.column_names)` than have this method; this API is going to be huge, and I don't think this adds much value.
I disagree strongly here. `df.column_names` is expected to return a Python Sequence, and getting the length of that Python object may not be free. I may want to call into an explicit function in C++ for this, for example.
FWIW, I still have `df.column_names` as `Sequence`. Maybe it should be `Iterator` instead.
@kkraus14 do you mind expanding on the use cases you have in mind?
I don't have a strong opinion on not having the number of columns as a property or method. It's more a preference for not making users remember one more method name in an API if it's not needed. So if this is useful for some cases, happy to have it.

What I've got in mind is that dataframes would have a known and somewhat reduced number of columns (let's say in the largest examples I can think of we're talking about a few thousand). If that was the case, I think `column_names` being simply a Python list would probably be good enough, and IMO would make things easier. Accessing the column names is not something I expect you would implement in a loop (materializing the list repeatedly if it's not stored internally as a Python list; of course looping over the list itself is expected). And computing its length would be trivial. Even if you prefer a C++ implementation, I guess you could create a list-like class which supports length, indexing, and iterating, and you could avoid having the whole list as Python objects.

But from your comment, I think I'm wrong in my assumptions. So it would be great to know better what you have in mind, and which use cases you have seen where having the column names as a Python list is a bad idea. If you have some code examples of how the protocol dataframe would be used, that's even better.
Thanks!
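A hedged sketch of that list-like idea: a lazy `Sequence` of column names that defers to a native backend instead of materializing Python objects up front (the `backend` object and its methods are hypothetical):

```python
from collections.abc import Sequence

class ColumnNames(Sequence):
    def __init__(self, backend):
        self._backend = backend

    def __len__(self):
        return self._backend.num_columns()     # cheap call into native code (hypothetical)

    def __getitem__(self, i):
        return self._backend.column_name(i)    # fetch one name on demand (hypothetical)
```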
    @property
    @abstractmethod
    def num_rows(self):
Same, I don't think we need an alias for `len(df)`; I think using `len()` is the Pythonic way, and it's enough.
There's a certain ambiguity to `len(df)`, though, because the length of a dict/mapping is the number of keys that it has (which here is the number of columns). We take for granted from pandas that `len(df)` returns the number of rows.

In R:
> df <- data.frame(a=c(1,2,3,4), b=c(5,6,7,8))
> length(df)
[1] 2
> nrow(df)
[1] 4
> ncol(df)
[1] 2
I agree on that. To me it makes more sense that `len(df)` is the number of rows, and not of columns. And I'd avoid making a dataframe behave like a dict of its columns. But that's just my preference, and of course that's a reasonable choice.

But my point here was more to have just one way of doing things, and not repeat `num_rows` or `num_columns` for what can be done with `len`. I think we should follow the recommendation from the Zen of Python: "There should be one-- and preferably only one --obvious way to do it." Makes things simple for users IMO.
I agree with @wesm here that `len(df)` is ambiguous and prefer the explicit `num_rows` property.
IMHO `len` should denote the length of `list(df)`, so you have to decide if it's an iterable of rows or columns.

...or just punt on it: let `len(df)` and `iter(df)` raise, and make the properties `rows(self) -> Iterable[Row]` and `columns(self) -> Iterable[Column]` return iterables (which themselves implement `__len__`).
Coming back to this, here are a couple of reasons why `len(df)` may not be appropriate (besides some people finding it ambiguous as to which dimension it should return).

CPython limits the dunder method `__len__` to an int64:
Python 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> class LenTooBig:
... def __len__(self):
... return 2 ** 64 + 1
...
>>> len(LenTooBig())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: cannot fit 'int' into an index-sized integer
That is many trillions of data points, which I'm not sure anyone will ever reach (surely not with in-memory representations), but it's still worth mentioning that the limitation doesn't exist for regular methods.
Another thing: with distributed systems in mind, does it make sense that this is a property (or `__len__`)? Personally, if I see:

    df.num_rows  # also applies to `len(df)`

I'm mostly assuming that `num_rows` is known / easily available. But for distributed systems that can be an expensive operation. Would something like `df.count_rows()` make more sense? Also, I personally find it more obvious that `df.count_rows()` can fail (and raise an exception) than `df.num_rows`, which I'd expect to always work, be fast, and not consume significant resources.
Making it a function makes sense to me for the reasons you listed. It's probably best in this utilitarian interface to not have `__len__`.
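A hedged sketch of the method-based spelling for a distributed implementation; all names here are illustrative:

```python
class DistributedDataFrame:
    def __init__(self, partitions):
        self._partitions = partitions   # hypothetical list of remote chunks

    def count_rows(self) -> int:
        # Explicitly a method: this may trigger remote work and can fail,
        # unlike a property that users expect to be cheap.
        return sum(part.count_rows() for part in self._partitions)
```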
dataframe.py (outdated):

    @abstractmethod
    def iter_column_names(self) -> Iterable[Any]:
+1 to @devin-petersohn
dataframe.py (outdated):

    # TODO: Should this be a method or property?
A property IMHO
    # TODO: Should this be a method or property?
    @property
    def row_names(self) -> Sequence[Any]:
I'm -1 on row names / indices. IMO that shouldn't be part of the standard (of course pandas and any other project can have this as a feature, but shouldn't be here).
Of course we could have this optional, but I think it's more a pandas thing, and we shouldn't add complexity to the standard.
R's DataFrame also has row labels, so it's not specific to pandas. Are there any dataframe implementations that don't have row labels?
The way I look at it, if some implementations return `None` while others return a `Sequence`, it is not very harmful.

As one can of worms, `row_names` may need to return an object with the same API as columns. So this might better return `Column`.
I think most dataframe implementations have row labels. But in the Python world I think that is because everybody is trying to replicate the pandas API for compatibility.
Having row labels is surely a good idea for many dataframe implementations. My point is that not having them is also a good idea to me. So, I would leave them out of this proposal, at least initially.
IIRC @maartenbreddels mentioned that he'd make Vaex behave more like SQL regarding indexing. If I had to implement a dataframe library from scratch I'd also have indices as regular columns that can be indexed. There would surely be some drawbacks compared to pandas, but I think the API for users would be much simpler, and I prefer that.
In any case, not trying to say that projects shouldn't have row labels. Just that my preference is to not include them here. It narrows the discussion, and it avoids biasing towards the R/pandas approach.
Do we want any kind of invariant around `np.asarray(dataframe)`? To me the most natural case is something like `np.column_stack([col for col in df.values()])`. But if you're storing your row labels "in band" (as part of the columns in the dataframe) then your row labels will be mixed with the values.
Both R and pandas data frames have row names, so I don't see a big downside of having an API that allows them to make their row names available to consumers. It does not mean that other libraries need to implement row names.

While it isn't required from the protocol perspective, people will end up writing code using these optional features unnecessarily, e.g. in pandas using `df.reset_index()[...]` versus `df.iloc[...]`. Instead of these features being used specifically when necessary, they end up getting used everywhere, which almost forces libraries' hands to support the feature.
If a consumer writes code that assumes there are always row names, even though the interface says that they are optional, isn't that the consumer's problem? The abstract API is a contract for producer and consumer, so it wouldn't be appropriate for consumers to de facto violate the contract by making an optional feature required.
It's the consumer's problem until there's enough people with the same problem (not wanting to rewrite code that assumes row names) that it becomes a problem for producers.
For what it's worth, I fought against having row indices in cuDF and ultimately lost the battle because of the overwhelming number of libraries we wanted to integrate with, and of user code, assuming row indices purely because they were built against Pandas originally; it became too large of a battle to continue fighting.
I'm still against row indices, but would rather them be required than optional.
Sorry, I guess that one's mostly my fault.
In R, row names are optional and it doesn't seem to cause problems. In pandas the `.index` has a functional purpose beyond purely communicating data; that may be part of why people are asking for it. In this interface, there is no computational functionality, purely one-way data transfer, so that seems less problematic.
> In pandas the `.index` has a functional purpose beyond purely communicating data; that may be part of why people are asking for it.

In our case an Index is just a Column around which we provide the syntactic sugar expected from an Index. If someone wants to grab rows by index value, we currently build a hash table and then probe that hash table each time (we have plans to memoize it in the future). No one has pointed this out or complained about performance of this thus far (outside of CUDA developers writing the libcudf C++ library 😆), so I don't think it has to do with computational functionality and is purely the data transfer aspects.
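A hedged sketch of that lookup strategy, treating the index as an ordinary column; this is illustrative, not cuDF's implementation:

```python
def rows_by_label(index_column, labels):
    # Build the hash table on demand and probe it for each requested label
    # (the comment above mentions plans to memoize this table).
    position_of = {value: pos for pos, value in enumerate(index_column)}
    return [position_of[label] for label in labels]

print(rows_by_label(["a", "b", "c"], ["b", "a"]))   # [1, 0]
```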
dataframe.py (outdated):

    raise NotImplementedError("row_names")

    def __getitem__(self, key: Hashable) -> Column:
My preference would be to force column names to be `str`. I think the advantages of simplifying this outweigh the limitation it imposes. Then `__getitem__` can unambiguously perform all reasonable operations: get one item (by name or position), get multiple, and indexing.
I think we should be careful about going down the route of having a heavily-overloaded `__getitem__` in this interface. It may be better to be as explicit and unambiguous as possible. For example, I am not even sure that `df[start:end]` should be available.
This interface should perhaps not even have a `__getitem__`, because it's really ambiguous and specific to implementations. It should be explicitly `get_columns` or `get_rows` to avoid ambiguity between implementations.
Agree with you. I'd surely avoid too much overloading, but what is too much is difficult to say. Maybe some examples can help us see what our preference is in practice:

    df.column_by_name('col2')
    df.column_by_position(1)
    df.columns_by_name('col2', 'col3')  # or receive a list instead of `*args`
    ...
    df['col2']
    df[1]
    df['col2', 'col3']  # note that the param of `__getitem__` is a tuple, not a list like in df[['col2', 'col3']]

Not sure what's the best option, maybe something in between, but I'm biased towards the second set. But surely avoiding any ambiguity or something too crazy. And I think requiring column names to be strings makes things easier. For example, if a tuple is accepted (it is in your code, as it's hashable), things become more complex and more prone to ambiguity (see the last line of my example).

So, the exact API surely requires some discussion, but my main point was that I'd personally limit `key` to be a string. I understand that many people will prefer the flexibility over the simplicity, but that's my preference.
> It should be explicitly `get_columns` or `get_rows` to avoid ambiguity between implementations.

I think this is a very good point. Do we want this PR to be both the API for end users and for downstream projects accepting generic dataframes? From @devin-petersohn's comment, maybe we should be discussing the API only for the latter (downstream projects), and leave the end user API to each project. If we do that, I'd avoid using any magic method, and prefix every method so there are no collisions with the public API of the implementations (e.g. `__dataframe_columns__`, `__dataframe_num_rows__`).
My understanding of the discussion so far is that objects providing data frame-like data should have a single "export" method `__dataframe__` that returns an implementation of a "data frame interface" object. I do not think that e.g. `pandas.DataFrame` should be a subclass of this object. This PR is intending to define some of the APIs that this object should contain.
I think we should be careful about going down the route of having a heavily-overloaded __getitem__ in this interface.
+1
    frame column. This metadata does not guarantee a specific underlying data
    representation
    """
    def __eq__(self, other: 'DataType'):
getting a mypy error here...
pandas\wesm\dataframe.py:19: error: Argument 1 of "__eq__" is incompatible with supertype "object"; supertype defines the
argument type as "object"
pandas\wesm\dataframe.py:19: note: It is recommended for "__eq__" to work with arbitrary objects, for example:
pandas\wesm\dataframe.py:19: note: def __eq__(self, other: object) -> bool:
pandas\wesm\dataframe.py:19: note: if not isinstance(other, DataType):
pandas\wesm\dataframe.py:19: note: return NotImplemented
pandas\wesm\dataframe.py:19: note: return <logic to compare two DataType instances>
Should we follow the mypy recommendation?
> Should we follow the mypy recommendation?
I see no reason not to
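For concreteness, a hedged sketch of what following that recommendation could look like for the draft `DataType` class; the comparison logic here is illustrative:

```python
class DataType:
    def __eq__(self, other: object) -> bool:
        if not isinstance(other, DataType):
            return NotImplemented
        # Illustrative: real DataType subclasses would compare their own attributes.
        return type(self) is type(other)
```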
    @property
    @abstractmethod
    def num_rows(self):
This returns the number of rows in the DataFrame (if known). Suggested change:

    - def num_rows(self):
    + def num_rows(self) -> Optional[int]:

Should the API use Optional types, or raise (maybe a custom Exception) if not known?
One thing that occurred to me is that the interface is completely column-oriented. But if you have a large file or database with lots of rows, and you read it in row by row, you won't have complete column vectors until you have read everything in, and then you might have some issues with memory. So my question is whether the interface should define a protocol for creating a DataFrame from row-oriented data.
This can be done separately, but we should consider adding an `.attrs` attribute to the interface as a way of propagating metadata about the dataset (cc @philippjfr). Pandas, xarray, and h5py all implement the same interface.
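For reference, a small illustration of the existing `.attrs` pattern in pandas (xarray and h5py expose the same idea):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["source"] = "sensor-42"   # arbitrary dataset-level metadata
print(df.attrs)                    # {'source': 'sensor-42'}
```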
    # TODO: Should this be a method or property?
    @property
    def row_names(self) -> Sequence[Any]:
Another note / leading question here: what's the argument for having `DataFrame.row_names` but not `Column.row_names`? I think if we want them for DataFrame then we also want them for Column (likewise if we don't want them).
I made another pass on this per feedback here. I removed all the dunder methods in the interest of being as conservative / explicit as possible. Take a look.
    def to_numpy(self):
        """
        Access column's data as a NumPy array. Recommended to return a view if
        able but not required
        """
        raise NotImplementedError("Conversion to NumPy not available")

    def to_arrow(self, **kwargs):
        """
        Access column's data in the Apache Arrow format as pyarrow.Array or
        ChunkedArray. Recommended to return a view if able but not required
        """
        raise NotImplementedError("Conversion to Arrow not available")
Could these APIs be generalized more to something like `to_array` and `to_masked_array`? I imagine users will use these APIs as building blocks, and if we're for example using a GPU we wouldn't want to either return a differently typed GPU-backed object or cause an unnecessary host --> device copy. For example, Pandas uses `.values` to return a numpy / Pandas array whereas in cuDF we return a cupy array to keep everything on the GPU as much as possible.
I don't understand your comment.

The idea of this interface is that the data representation is not known. So we will add various methods to `Column` (what is here is NOT exhaustive -- maybe that is where there is confusion) to allow the consumer to request the data be returned to them in the representation that they prefer. I put NumPy and Arrow as two example concrete representations.
I guess my point is that if we have `to_numpy` and `to_cupy`, the standard will become `to_numpy`, and anyone then looking to change to a different array implementation will have to go back and rework their code. If it was instead `to_array`, and the returned object were expected to support some form of array protocol, it would allow for writing more general code that can be run anywhere.

Maybe I was unclear as to whether `to_numpy` and `to_arrow` were placeholders for generic conversion functions returning somewhat generic objects, or if they were specific to returning numpy arrays and arrow arrays.
I see. There is no "standard" here, and this effort is not trying to establish a standard representation. The idea is that some projects are implemented against a certain data representation (e.g. NumPy arrays) -- so indeed these methods are here so that a producer can return their data to the consumer in the representation that they are looking for. matplotlib is an example -- it deals almost exclusively in NumPy arrays so rather than have matplotlib take responsibility for the details of converting from some foreign data frame object to a NumPy array, this conversion is the responsibility of the producer of this interface.
I guess my point is if we have to_numpy and to_cupy, the standard will become to_numpy where anyone then looking to change to a different array implementation will have to go back and rework their code.
Also for clarity, efforts to achieve duck-typing between array libraries is pretty orthogonal to this project -- if such a duck-typed array standard is developed and adopted / normalized, then this interface can add it. The use case here is strictly abstract data interchange that does not force producers to convert to a particular intermediate format like pandas.
Understood, that makes sense. My fear is that if protocols can't be relied on entirely, the default path consumers will take is to use `.to_numpy()`, where ideally they could use something more generic like `.to_array()` which produces something array-like that supports calling numpy functions against it. Without something like this, a GPU library like cuDF that will want to support this protocol would have to make an awful choice: either return something like a cupy array from the `to_numpy()` call to keep data on the GPU and do things efficiently, but possibly break for some people / libraries integrating downstream who were expecting specifically host arrays; or have `to_numpy()` explicitly copy the data to host, which then possibly throws away the performance we may have gained from using the GPU.
I see. I don't know that it makes sense to be opinionated in this way at this level. We can provide people with choices and they can decide what makes sense for them. I don't see a problem with having a `to_$something` method as long as that `$something` is documented and something that people can understand how to use. So you can provide both the "general array like" path and the "NumPy" path and people can choose. That you have a preference about which one you want people to use seems like it shouldn't factor into the interface.
The problem is if the "NumPy" path is included in the interface whereas the "general array like" path isn't, most producers will likely only implement the "NumPy" path which will make almost all consumers consume the "NumPy" path. This then becomes a problem for us trying to integrate efficiently of trying to convince everyone to change to the "general array like" path as opposed to that being the typical path.
Would you like to propose an API to add? In the PR description I wrote "The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later." What is written here is not exhaustive.
Apologies, I missed that. Will add a suggestion to kickoff discussions on adding the generic API I was talking about.
    # TODO: Should this be a method or property?
    @property
    def row_names(self) -> Sequence[Any]:
I'm also -1 on row names / indices. If it becomes an optional part of the protocol but say Pandas implements it, I imagine all of the other libraries will end up having their hands forced to support it from users wanting Pandas compatibility as opposed to building a new DataFrame standard.
dataframe.py (outdated):

    @abstractmethod
    def get_column_by_name(self, name: Hashable) -> Column:
I would love for name to be more strongly typed than `Hashable` to allow better interoperability with other languages. I would be happy to say name should be a `String`, but that has obvious downsides as well.
From the perspective of one library exporting data to another, it's unclear that we should prevent a library from exporting / making available its column names if they are not strings. Even Hashable may be too strict here, since even if the name is not hashable, it can be looked up by a linear equality search.
    def to_dict_of_numpy(self):
        """
        Convert DataFrame to a dict with column names as keys and values the
        corresponding columns converted to NumPy arrays
        """
        raise NotImplementedError("TODO")
Could this be generalized to something like `to_dict_of_array`? My concern is we don't want standard APIs that strongly expect a numpy object where we'd want to plug in a GPU object.
See my comment above
    Metadata for this column. Default implementation returns empty dict
    """
    return {}
Proposing an additional API here for a more generic array API. I think that the constraint of supporting the `__array_function__` protocol helps to prevent this from becoming too overloaded, but wanted to kick off discussion:

    def to_array_function_protocol_object(self):
        """
        Access column's data as an array object which supports the
        `__array_function__` protocol. Recommended to return a view if able but
        not required
        """
        raise NotImplementedError("Conversion to array not available")
Will open another suggestion around `to_dict_of_array` once this is settled.
Alright. From my perspective a critical matter is making sure that this doesn't get mixed up with APIs whose purpose is to provide data in a particular memory representation. So I think the term "array" in this context may confuse; `to_array_function_protocol` or something more explicit would be clearer (and `__array_function__` can be available also, which calls this function).
> From my perspective a critical matter is making sure that this doesn't get mixed up with APIs whose purpose is to provide data in a particular memory representation.

Agreed.

> So I think the term "array" in this context may confuse; `to_array_function_protocol` or something more explicit would be clearer (and `__array_function__` can be available also, which calls this function).

`to_array_function_protocol_object` is probably the most explicit, because we're exporting an object that supports the protocol as opposed to the protocol implementation.
> `to_array_function_protocol_object` is probably the most explicit

That's... incredibly cumbersome :(

If `to_array` is too ambiguous, how about `to_arraylike`?
The reason that I'm against `to_array` is that then you have this situation:

- `to_numpy` -> does data export to a concrete memory representation
- `to_arrow` -> does data export to a concrete memory representation
- `to_array` or `to_array_like` -> what does this do? Based on what I see above it's not doing data export

What is "array" and what is "array_like"? It's too ambiguous.
IMHO this project is first and foremost about data export at library boundaries (without one library having knowledge of the internals of the other). If you want to also expose object protocols (like `__array_function__`) that's fine, but that needs to not be commingled with the data export APIs.
`to_array_like` could export either a numpy array or a pyarrow array, or anything which implemented the array protocol. A consumer of this method could then write a generic function, agnostic to whatever the concrete representation was - so long as they stuck to methods implemented in the array protocol.
I think this goes beyond the scope of what problem is trying to be solved in this first project iteration. There is exporting "data" and exporting "interfaces". If we can't first agree about exporting "data" then I think this project will be kaputt and we should all disengage.
> `to_numpy` -> does data export to a concrete memory representation

From my perspective, in addition to exporting to a concrete memory representation, this takes a step too far and exports to a specific object container type. It would be great if we could generalize this to exporting any object that follows a concrete memory representation, as opposed to tightly confining it to a specific object. E.g. (ignore the cumbersome names for now):

- `to_array_interface_object` -> does data export to an object that supports the `__array_interface__` protocol
- `to_arrow_array_interface_object` -> does data export to an object that supports an `__arrow_array_interface__` protocol (I know this doesn't exist, just for example)
- `to_buffer_protocol_object` -> does data export to an object that exposes the Python buffer protocol
- `to_cuda_array_interface_object` -> does data export to an object that supports the `__cuda_array_interface__` protocol

These protocols are specifically defined surrounding memory representation, so it makes sense to use them from my perspective.
I've reached the limit of how much time I can invest in this for now, so I'll wait a few weeks or a month for others to express their opinions. For me the things I care most about are:

- Clear and unambiguous names (even if they are "cumbersome")
- Clear distinction between methods that yield objects implementing some other protocol / interface (like the buffer protocol) and methods that yield particular containers. If a consumer wants a `numpy.ndarray`, I think they should be able to ask for that.
Per my comment on the Discourse thread, I do think taking a step back and writing a requirements doc might be helpful to avoid diversions into discussions of non-goals that take up a lot of time and energy.
dataframe.py (outdated):

    @property
    @abstractmethod
    def column_names(self) -> Sequence[Any]:
How about a `columns` property that returns a `ColumnCollection` that itself has a `names` property, and similarly for `rows`?
See comment above. I am not sure this API should be in the business of syntactic sugar
As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed. For many applications I could probably get by with a restricted API, but I'd still want to code against a nice interface which used idiomatic Python constructs. The proposed interface here strikes me as being extremely cumbersome to use in practice. Just one viewpoint though; there may well be other considerations to take into account other than aesthetics.
> As a user I'd like to write generic code which would work regardless of the concrete type of DataFrame passed.
This is not a requirement / goal of this project (see the original discussion thread).
This is exactly the motivating use case at the start of that thread:

> Bigger picture: programming to an interface
If the discussion there has veered off from that course then it is probably a good idea to write a more formal proposal to capture the current state.
Take it up with people on Discourse. If it turns out that people don't want interchange / export APIs I'm going to walk away from this project. I've already spent a lot of time on it.
""" | ||
raise NotImplementedError("select_columns_by_name") | ||
|
||
def to_dict_of_numpy(self): |
Again, using type hints could avoid cumbersome method names:

    def to_dict(self) -> Dict[str, np.ndarray]:
        ...
-1, I think the method name needs to unambiguously describe what it's doing
""" | ||
Return the column at the indicated position | ||
""" | ||
pass |
From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.
If we instead extend the `columns(self) -> ColumnCollection` model we can have an API like:
In [79]: df = DataFrame('A', 'B', 'C')
In [80]: df
Out[80]: DataFrame(['A', 'B', 'C'])
In [81]: df.columns
Out[81]: ColumnCollection(['A', 'B', 'C'])
In [82]: df.columns.idx(0)
Out[82]: Column(name='A')
In [83]: df.columns.idx(0, 2)
Out[83]: DataFrame(['A', 'C'])
In [84]: df.columns['A']
Out[84]: Column(name='A')
In [85]: df.columns['A','C']
Out[85]: DataFrame(['A', 'C'])
...which, at least to me, seems a lot more natural.
Code:

    from __future__ import annotations

    from dataclasses import dataclass
    from typing import List, Union


    @dataclass
    class Column:
        name: str


    class ColumnCollection:
        _columns: List[Column]

        def __init__(self, *columns: Column):
            self._columns = list(columns)

        @property
        def names(self) -> List[str]:
            return [col.name for col in self._columns]

        def idx(self, *indices: int) -> Union[Column, DataFrame]:
            columns = [self._columns[i] for i in indices]
            if len(columns) == 1:
                return columns[0]
            return DataFrame(*(col.name for col in columns))

        def __getitem__(self, names) -> Union[Column, DataFrame]:
            # Note: a length-1 key (e.g. 'A') is treated as a single column name.
            columns = dict(zip(self.names, self._columns))
            if len(names) == 1:
                return columns[names[0]]
            return DataFrame(*names)

        def __len__(self):
            return len(self._columns)

        def __iter__(self):
            for col in self._columns:
                yield col

        def __repr__(self):
            columns = "', '".join(self.names)
            return f"ColumnCollection(['{columns}'])"


    class DataFrame:
        _columns: ColumnCollection

        def __init__(self, *columns):
            self._columns = ColumnCollection(*(Column(col) for col in columns))

        def __repr__(self):
            columns = "', '".join(self.columns.names)
            return f"DataFrame(['{columns}'])"

        @property
        def columns(self):
            return self._columns
From a Python dev perspective (who would like to use this functionality) the get/select methods seem a bit awkward.
I strongly feel that syntactic sugar / conveniences (which become subjective very quickly) belong at a higher level than this API, which is about library developers implementing unambiguous data export APIs.
I'm interested in this as a potential way to write analytics functions which are DataFrame-implementation agnostic, i.e. instead of coding to the pandas API I would code to this API, hence its usage would be front-and-centre in my code in the same way that pandas is now. Maybe this isn't supposed to cater to that end-user concern, though?
It's not. This was discussed at length in https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267
I relaxed hashability of column names and changed
        pass

    @property
    def attrs(self) -> Mapping:
Should this be something more precise, e.g. Mapping[str, str]?
At least in xarray attrs can be arbitrary metadata, so it should probably be Mapping[str, Any] (although it's usually strings in practice).
I've seen keys as bytes (e.g. Arrow's metadata is bytes -- though UTF-8 bytes would be preferred), so Mapping[Hashable, Any] might be the most general (maybe that's the default for Mapping, not sure).
If someone would like write access on this repository to help lead this effort, please let me know. I'm juggling a few too many projects, so I need to step away from this for a while.
""" | ||
|
||
def __init__(self, index_type: IntegerType, category_type: DataType, | ||
ordered: bool = False): |
I'd like to understand better the reason to have the types of the data and the categories in the type.
With an example, I think a category column could be represented with something equivalent to:
import pyarrow

class CategoryColumn:
    dtype = 'category'
    index = pyarrow.array([0, 1, 0, 2, 0, 1], type='int8')
    categories = pyarrow.array(['red', 'green', 'blue'], type='string')
So, when accessing the data, I can imagine that we may be interested in one of two options:
- The underlying data (to copy the structure from implementation to implementation for example)
- To access the actual data (forgetting about the internal representation)
For the first option, an example of code copying the data to a numpy array could be:
for source_column in dataframe:
    if source_column.dtype == 'category':
        copy_of_index = numpy.array(source_column.index, dtype=source_column.index.dtype)
The example is very simplistic, but it shows that we can get the type of the index (int8 in the previous example) from the underlying column (and not from the data type). It feels like if we have it again in the data type, it will just be a duplicate that will need to be kept in sync. The same would apply to the data type of the categories.
What would be the main advantage of this implementation, as opposed to having just a categorical type with only the ordered parameter (or two types, categorical and ordered_categorical)? I guess it decouples the schema from the data, but I'm unsure if it's worth the extra complexity. I guess we could just identify data types by their name as a string among implementations, instead of making them implement all these subclasses, if we don't have parameters, which would make things simpler.
Am I missing something? Does what I'm saying make sense?
It's pretty common for users to compare dtypes across columns. With this proposed change, a category column with int32 codes and, say, int64 categories would have the same dtype as one with int8 codes and string categories, which may lead to confusion / frustration for end users.
I'm a +1 to including the index / categories types in the dtype, as it makes things more explicit for end users.
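To make the comparison concrete, a hypothetical sketch; the CategoricalType name and its equality semantics below are assumptions for illustration, not taken from the draft:

from dataclasses import dataclass


# Hypothetical parametrized categorical dtype: equality accounts for the
# index/category types, mirroring the shape of
# CategoricalType(index_type, category_type, ordered) in the draft.
@dataclass(frozen=True)
class CategoricalType:
    index_type: str       # e.g. 'int8', 'int32'
    category_type: str    # e.g. 'string', 'int64'
    ordered: bool = False


a = CategoricalType('int32', 'int64')
b = CategoricalType('int8', 'string')
print(a == b)                    # False: the dtype carries the full schema

# With a single unparametrized marker, both columns report the same dtype
# even though their physical layouts differ.
print('category' == 'category')  # True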
Thanks, that's useful to know. Just to understand better, do you have an example of why users compare dtypes across columns?
There's lots of reasons to have access to the complete schema, from preallocating space for data (in-memory or on disk) to building query plans (including kernel selection) before data arrives. Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.
That said, if you have an application where the index or category type is simply not known, you could use an Any type to indicate that the actual type must be observed in the actual data.
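As an illustration of the point about schemas arriving before data, a consumer could size buffers for a categorical column from the parametrized dtype alone. The helper below is a hypothetical sketch and assumes index_type maps to a NumPy dtype string:

import numpy as np


def preallocate_categorical(dtype, num_rows, num_categories_hint=0):
    # Hypothetical: allocate space for the codes and categories of an
    # incoming categorical column before any data has been received,
    # using only the parametrized dtype (index_type / category_type).
    codes = np.empty(num_rows, dtype=dtype.index_type)        # e.g. 'int8'
    categories = np.empty(num_categories_hint, dtype=object)  # category values
    return codes, categories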
Thanks, that's useful to know. Just to understand better, do you have an example of why users compare dtypes across columns?
Unfortunately our users don't share the "why" too often, but one use case is checking the output of dtype inference from something like a read_csv call.
Not having access to the index type and category type would (IMHO) make a lot of things much more difficult.
Thanks @wesm, that makes sense. But I don't fully understand this part. Given a categorical column, there is access to the index and category types in my example (there was a typo in my previous code, maybe that's where I didn't make myself clear). It's just the place where it is found that is different (in the array instead of in the dtype).
In your proposal:
my_column.dtype.index_type
In my example it would be something like:
my_column.underlying_categorical_data.index.dtype
I understand that in my example there wouldn't be access to the underlying types if a dataframe schema is used without the actual dataframe. Then, with your proposal we could still know the internal types of the categorical, while in my example those would be undefined, since they live in the data, which doesn't exist. Is this what you have in mind? If it is, in what case are we considering that a schema can live without the data?
Accessing the actual data is potentially costly, I think? So that would make getting the type of the categories costly, while I assume accessing it from the dtype itself would not.
Accessing the actual data is potentially costly, I think?
Good point. In the PR, the type (aka dtype) is a property of the Column class, so accessing the index type would be:
class Column:
    @property
    def type(self):
        pass

my_column.type.index_type
The data is not specified, maybe that's the problem, but if we agree that the categorical data consists of two subcolumns, one for the index and one for the categories, then accessing the index type with an unparametrized categorical dtype would be:
my_column.index_subcolumn.type
I may be missing something, but I would say in both cases the type is metadata of the Column class that doesn't require accessing the actual data. Not sure if in certain architectures (distributed, GPU...) accessing a subcolumn could be expensive, but in pure Python both cases would be two lookups in the namespace.
I'm not sure what it would imply if we consider implementations using Arrow's Dict instead of the two arrays separately. Would that be costly?
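On the Arrow question: as far as I can tell, pyarrow exposes the index and value types on the DictionaryType itself, so inspecting them is a metadata lookup rather than a pass over the data. A quick sketch:

import pyarrow as pa

# Build a dictionary-encoded (categorical-like) array.
arr = pa.array(['red', 'green', 'red', 'blue']).dictionary_encode()

# The types are available from the type object alone.
print(arr.type.index_type)   # e.g. int32
print(arr.type.value_type)   # string

# The underlying pieces, if the physical layout is needed.
print(arr.indices)      # the codes
print(arr.dictionary)   # the unique category values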
Based on https://discuss.ossdata.org/t/a-dataframe-protocol-for-the-pydata-ecosystem/267, we are discussing a "protocol" method (potentially called __dataframe__) similar to __array__ for data frame-like data. The consensus so far is that this protocol should not force conversions to a particular data frame memory model like pandas. Instead, it may provide access to its metadata and data in the form desired by the user of the __dataframe__ protocol.
Questions:
- … to_numpy method (respectively, to_arrow method)?
The APIs proposed here are not intended to be to the exclusion of others -- more APIs can be proposed and added later.
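To illustrate the intended shape of the protocol, here is a minimal hypothetical sketch of a producer exposing __dataframe__ and a consumer programming against it; everything except the __dataframe__ name is an assumption made for the example:

import numpy as np


class MyDataFrame:
    """Toy producer: __dataframe__ returns an object implementing the
    interchange API (here, the object itself, for brevity)."""

    def __init__(self, data):
        self._data = dict(data)

    def __dataframe__(self):
        return self

    # A small subset of the proposed export API, for the sketch only.
    def column_names(self):
        return list(self._data)

    def to_dict_of_numpy(self):
        return {name: np.asarray(values) for name, values in self._data.items()}


def consume(obj):
    # Consumer code written against the protocol, not a concrete library.
    df = obj.__dataframe__()
    return df.to_dict_of_numpy()


arrays = consume(MyDataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}))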