Data exchange formats #29
Comments
I'm a very strong -1 to this approach. Once you have something like I pushed in wesm/dataframe-protocol#1 to move to using more protocols / standards to allow for more efficient interoperability. I.E. instead of From my perspective, we should define something like a |
Thanks for the feedback @kkraus14. Just to make sure I understand (what you say here, and in Wes' PR thread). You are happy with the general approach, but you'd want that instead of columns having a

    class StandardArray:
        """We should create a spec for this I guess."""
        def __array__(self):
            ...

    class Column:
        def to_array_like(self) -> StandardArray:
            ...

Am I understanding correctly? Sorry, it's a bit difficult to understand in detail what you propose without seeing code. |
To concur with Marc's question for clarification, @kkraus14 it is not fully clear to me what exactly you are objecting to. It seems that you are opposed to the specific So personally, I am +1 on this mechanism, and the exact methods that the returned proxy object will have are of course still to be discussed (as there are already other issues open about other aspects of the interface, like the number of rows / columns) |
My fear is that defining what's allowed / standard in this mechanism is going to be extremely problematic and lead people towards using Pandas / Numpy as the interop through the exchange, which we'd obviously like to avoid. Once there's Instead of this mechanism I'd like to see something akin to https://arrow.apache.org/docs/format/CDataInterface.html or https://numpy.org/doc/stable/reference/arrays.interface.html#python-side. You could imagine something along the lines of:

    class cuDFBuffer:
        """
        The public cuDF buffer class.
        """
        def __init__(self, data):
            self.data = data

        def __buffer__(self):
            """
            Produces a dictionary object following the buffer protocol spec to
            describe the memory of the buffer.
            This could likely piggyback off of an array protocol spec for data
            exchange.
            """
            return {
                "size": self.size,            # Number of bytes in the buffer
                "ptr": self.ptr,              # Pointer to the buffer as an integer
                "read_only": self.read_only,  # Whether the buffer is read only
                "version": 0                  # Version number of the protocol
            }

    class cuDFColumn:
        """
        The public cuDF column class.
        """
        def __init__(self, buffers):
            self.buffers = buffers

        def __column__(self):
            """
            Produces a dictionary object following the column protocol spec
            to describe the memory layout of the column.
            """
            return {
                "dtype": self.dtype,            # Format string of the dtype
                "name": self.name,              # Name of the column
                "length": self.size,            # Number of elements in the column
                "null_count": self.null_count,  # Number of null elements, optional
                "offset": self.offset,          # Number of elements to offset the column
                "buffers": self.buffers,        # Buffers underneath the column, each
                                                # object in this iterator must expose
                                                # buffer protocol
                "children": self.children,      # Children columns underneath the column,
                                                # each object in this iterator must
                                                # expose column protocol
                "version": 0                    # Version number of the protocol
            }

    class cuDFDataFrame:
        """
        The public cuDF dataframe class.
        """
        def __init__(self, columns):
            self.columns = columns

        def __dataframe__(self):
            """
            Produces a dictionary object following the dataframe protocol spec
            to describe the memory layout of the dataframe.
            """
            return {
                "name": self.name,        # Name of the dataframe
                "columns": self.columns,  # Columns underneath the dataframe, each
                                          # object in this iterator must expose
                                          # column protocol
                "version": 0              # Version number of the protocol
            }

Note that there's a lot of attributes that would need to be captured in these protocols that I'm not covering here, nor am I sure that Python dictionaries are the right approach to this protocol, but the idea of expressing a hierarchy of objects that eventually point down to memory is what I have in mind for a data exchange protocol. Then for dataframe libraries who want to implement a |
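To make the consumer side of that hierarchy concrete, here is a minimal sketch of walking dataframe -> columns -> buffers down to raw memory. Everything in it is an assumption for illustration: the `Buffer`, `Column` and `DataFrame` classes are hypothetical stand-ins backed by `ctypes` host memory, the dictionaries carry only a subset of the keys from the comment above, and the buffer hook is spelled `describe_buffer` rather than `__buffer__` because Python 3.12 later gave `__buffer__` its own meaning (PEP 688).

```python
import ctypes


class Buffer:
    """Hypothetical producer-side buffer wrapping a ctypes allocation."""

    def __init__(self, raw, read_only=False):
        self._raw = raw  # hold a reference so the memory stays alive
        self.read_only = read_only

    def describe_buffer(self):
        return {
            "size": ctypes.sizeof(self._raw),     # number of bytes
            "ptr": ctypes.addressof(self._raw),   # address as an integer
            "read_only": self.read_only,
            "version": 0,
        }


class Column:
    """Hypothetical column: a name, some buffers, optional child columns."""

    def __init__(self, name, buffers, length, offset=0, children=()):
        self.name, self.buffers, self.length = name, list(buffers), length
        self.offset, self.children = offset, list(children)

    def __column__(self):
        return {
            "name": self.name, "length": self.length, "offset": self.offset,
            "buffers": self.buffers, "children": self.children, "version": 0,
        }


class DataFrame:
    def __init__(self, columns):
        self.columns = list(columns)

    def __dataframe__(self):
        return {"columns": self.columns, "version": 0}


def total_bytes(df):
    """Walk the hierarchy and sum the sizes of every buffer it reaches."""
    def column_bytes(col):
        desc = col.__column__()
        own = sum(b.describe_buffer()["size"] for b in desc["buffers"])
        return own + sum(column_bytes(child) for child in desc["children"])
    return sum(column_bytes(col) for col in df.__dataframe__()["columns"])


def read_int32(col):
    """Read an int32 column straight from its pointer, honouring the offset."""
    desc = col.__column__()
    buf = desc["buffers"][0].describe_buffer()
    ptr = ctypes.cast(buf["ptr"], ctypes.POINTER(ctypes.c_int32))
    start = desc["offset"]
    return [ptr[i] for i in range(start, start + desc["length"])]


raw = (ctypes.c_int32 * 4)(1, 2, 3, 4)
col = Column("a", [Buffer(raw)], length=3, offset=1)
df = DataFrame([col])
print(total_bytes(df))   # 16
print(read_int32(col))   # [2, 3, 4]
```

The consumer never imports the producer's library; it only relies on the dictionary keys, which is the property the comment above is after.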
Thanks @kkraus14, it's very clear now. If we standardize into a single protocol as you describe, as opposed to multiple representations, do you think it could make sense to use Arrow? My understanding is that Arrow's goal is to solve the problem we're addressing here, and while your proposal makes sense, I'm not sure if we may be reinventing the wheel. Any thoughts on this? Do you have in mind any limitation in Arrow, or any reason why you'd prefer to use a custom implementation, and not rely on Arrow? For reference, Arrow implements what you've got in your example

    import numpy
    import pyarrow

    numpy_array = numpy.random.rand(100)
    buffer = pyarrow.serialize(numpy_array).to_buffer()
    print(buffer.address)
    print(buffer.size)
    print(buffer.is_mutable)

In the example, Arrow is copying the memory from the NumPy array, not sure if avoiding the copy is possible. |
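On the copy question, Python's own buffer protocol (mentioned elsewhere in this thread) can expose the same three facts, address, size and mutability, without copying. The sketch below uses only the standard library and an `array.array` as a stand-in for the NumPy array; it says nothing about what Arrow itself does.

```python
from array import array

values = array("d", [0.5, 1.5, 2.5])
view = memoryview(values)   # no copy: the view shares the array's memory

address, n_items = values.buffer_info()
print(address)              # integer address of the underlying storage
print(view.nbytes)          # total size in bytes
print(view.readonly)        # False, because the array is mutable

view[0] = 9.0               # writing through the view...
print(values[0])            # ...changes the original array
```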
From the cuDF perspective it would be good, but we are similarly a columnar dataframe and were built to be compatible with Arrow. For others it may not be as nice of a fit, especially if they don't follow the Arrow data model.
Being tied to Arrow as opposed to being an independent specification makes us tied to Arrow's implementation details. I.E. Arrow uses a single bit per value for booleans versus a byte per value, limited units for timestamps / durations, no support for bfloat types, no support for Python objects, etc. I don't know the appetite of the Arrow community to expand the specification to things that Arrow doesn't and potentially will not support. Additionally, Arrow only has a specification at the Array level, they don't currently have a specification for the Table level or down at the Buffer level. The existing specifications for the Buffer level: Python buffer protocol, |
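To illustrate the boolean point: a bit-packed layout stores eight values per byte, so a consumer expecting one byte per value cannot reinterpret the buffer in place. The following is a minimal standard-library sketch of the two layouts, assuming least-significant-bit-first packing; the helper names are made up for this example.

```python
def pack_bits(flags):
    """Pack booleans into bytes, least-significant bit first."""
    out = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_bits(packed, length):
    """Recover the first `length` booleans from a bit-packed buffer."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(length)]


flags = [True, False, True, True, False, False, False, False, True, False]
packed = pack_bits(flags)              # 2 bytes for 10 values
byte_per_value = bytes(flags)          # 10 bytes for 10 values

print(len(packed), len(byte_per_value))
print(unpack_bits(packed, len(flags)) == flags)
```

Going from one layout to the other always needs a conversion pass, which is the kind of implementation detail a shared specification has to pin down.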
Before considering doing something "alike Arrow C Data Interface but not exactly", I think we should have a better idea of use cases of such a protocol and where Arrow would not be sufficient. If there are important use cases for dataframe-like applications that are currently hindered by limitations in the Arrow data types or C interface, then I think the Arrow community will be very interested to hear them and to discuss this.
If we go with a low-level exchange protocol as you are outlining (and not relying on lazy conversion to a set of different possible formats such as numpy, pandas, arrow, object following array protocol, ..), then we need to choose some set of implementation details. Given the complexity of defining this (and given the goal of the Arrow project), I would personally say that re-using Arrow's implementation details is actually a plus and will avoid lots of discussion.
These are indeed all types not directly supported in Arrow at the moment. There are open issues about 8-bit booleans (ARROW-1674) and support for Python objects in Arrow IPC (ARROW-5931, but which is not the same as the C interface, to be clear). Re other units for timestamps, has there been any demand for this? (I am personally not aware of this being brought up in Arrow).
The C Data Interface is actually already used for Tables as well (not directly, but RecordBatch is supported using a StructArray, and Tables can be seen as one or more RecordBatches): Besides the "to Arrow or not to Arrow" arguments above, I personally still think there is value in the original proposal (based on wesm/dataframe-protocol#1). |
Can we focus on this point before going into the weeds on anything else? Do we think that it's valuable to include something like a
IMO, the target audience for this API standard won't fall into that trap. The entire point of the standard is to make things agnostic to the particular dataframe implementation (at the API level). The motivation for including something like a |
I'm very against this. Why should a protocol include specific implementations for certain containers? I.E. you could imagine a Say I'm an array library that can work on CPU/GPU/TPU/etc. and I want to handle being able to ingest a DataFrame from some random library, but I want to ensure that I keep the memory where it currently lives, how would I do that with this proposal? |
Thanks @datapythonista and @kkraus14, you obviously put a lot of thought into this.
What is the added value of doing df.to_array() first? E.g. what is the difference between:
In the end, if either of these is possible, we have to specify in the spec that this is possible, right? So in that case, what would favor Is your point (@kkraus14) that you'd like to see only In this case a default implementation of

    def to_numpy(self):
        return np.asarray(self)

And do we really need to have |
A point against using the array interface would be columns that have no memory representation. E.g. a virtual column in Vaex is not materialized, there is no memory anywhere that holds the data. A virtual column in vaex can be materialized to a numpy or arrow array, and a real column can be backed by a numpy or arrow array, so I'm not sure how this would fit into the array protocols (but they can even be something else...). From Vaex' point of view, implementing The So what I am basically asking is: Do array protocols make sense for dataframes that have no materialized/real columns. |
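A small sketch may help show why a pointer-based protocol is awkward here. The `VirtualColumn` class below is hypothetical (it is not Vaex's API): the column is only an expression over other columns, so there is no address to hand out until someone asks for the values.

```python
class VirtualColumn:
    """Hypothetical column defined by an expression over other columns."""

    def __init__(self, func, *inputs):
        self.func = func
        self.inputs = inputs
        self.materialize_count = 0   # counts how often memory was allocated

    def __len__(self):
        # The length is known without computing any values.
        return len(self.inputs[0])

    def materialize(self):
        """Evaluate the expression; only now does the data exist in memory."""
        self.materialize_count += 1
        return [self.func(*row) for row in zip(*self.inputs)]


x = [1, 2, 3]
y = [10, 20, 30]
total = VirtualColumn(lambda a, b: a + b, x, y)

print(len(total), total.materialize_count)   # metadata only, nothing computed
print(total.materialize())                   # values are computed on request
```

Any array protocol attached to such a column would have to trigger `materialize()` implicitly, and the producer would have to pick the container it materializes into.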
Thanks @maartenbreddels, that's a very good point, thanks for bringing it up. I'm not quite understanding how both approaches are different regarding virtual columns. I guess the first thing to decide is what to do with them:
For 1, I don't see how the approach of multi formats is different to the approach of an actual protocol. I guess you'll have to materialize them first when For 2, I think it's a bit trickier. I guess for it to work we need that all dataframe libraries support them, or that the consumer knows how to materialize them given its formula. Assuming that is something we want to do, I guess the exchange will have to be implemented independently in both cases, since we'll be exchanging a formula, not a numpy array or a pointer. So, while it's a very good point to take into consideration, I don't fully understand why you think the numpy+arrow approach is better than exchanging a pointer and metadata (with arrow or a new protocol). Seems like it's independent of the approach we use. I guess I'm missing something. |
In pandas, we've found that the
|
@TomAugspurger Good point @datapythonista: The problem is not only limited to Vaex' virtual columns. The data could also live remote, in a database, or backed by dask. The point is that a column is not per se backed by an array, which makes exposing the array interface for a column unnatural. Of course, an implementation could lazily do this, by implementing Maybe I am starting to answer the question I posed to @kkraus14 about why we should have an explicit
I think we need both. AFAIK there is no |
Re the discussion on the call, @TomAugspurger said:
I think people will fall into this trap, but the only way to avoid it is documentation and communication, I don't think there's a technical solution.
I totally agree with that. What is the use-case of the proposed memory exchange protocol?
I'm pretty sure this is impossible to do by any technological way, unless there is no way to convert your structure to a numpy array. You can only ask downstream libraries not to do the expensive copy. And whether they do that or not is actually a really hard question, but I don't see how this relates to the syntax the downstream library uses to force a numpy array. @kkraus14 I'm really not sure what use-case you have in mind for your general non-copy protocol. [after more discussion, I understand the main use-case would be going from one cuda array to another cuda array and/or from one TPU library to another TPU array. I didn't think this was the problem we're trying to solve here, if it is, I don't see how it relates to the |
Say I have a dataframe library, Now, assume I have a Python library backed by a C/C++ library which ultimately needs to work with pointers, if it's backed by CPU memory I can potentially use Python buffer protocol or Numpy But say you have a library that uses the numpy C-API so it needs to guarantee a numpy array and it's being given an input from an XPU library. You could imagine adding an argument to the API of something like Given numpy is the standard, there's a lot of code out there that just calls It's still a trap, but it's a much more explicit opt-in trap as opposed to a relatively implicit trap. |
So you want a bridge from the dataframe-like API standard we are defining to the array-like API standard we are defining? Or to a different array standard that contains more meta-data? |
I don't want to put words in @kkraus14's mouth, but my interpretation of the problem is that with cuDF they do not ever want to copy to numpy (or any other object that must live on host memory), they would prefer to not be forced into doing that with this spec. (please correct me if I am wrong here) The solution proposed, Either way, if hypothetically some libraries will choose not to implement something because it is inefficient for them what is the best way forward @rgommers? |
We want any copy to numpy to be explicit by the user or library instead of implicit.
Dataframe libraries could still choose to implement the numpy array protocol if they want that to work. Having a specific memory exchange protocol and an associated argument to specify a device where you want the memory gives dataframe libraries the option to not support the numpy array protocol for example while still giving other library maintainers a more explicit path to copy from device --> host without forcing them to depend on the library directly.
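As a rough illustration of "explicit instead of implicit", the sketch below makes the consumer name the device it wants and opt in to a copy. The method name `to_memory`, the `allow_copy` flag and the `HostCopyRequired` exception are all assumptions invented for this example; plain lists stand in for device and host memory.

```python
class HostCopyRequired(Exception):
    """Raised when satisfying the request would need a device-to-host copy."""


class DeviceColumn:
    """Hypothetical column whose data lives on a named device."""

    def __init__(self, values, device):
        self._values = list(values)
        self.device = device

    def to_memory(self, device, allow_copy=False):
        if device == self.device:
            return self._values              # same memory, no copy
        if not allow_copy:
            raise HostCopyRequired(
                f"data lives on {self.device!r}; pass allow_copy=True "
                f"to copy it to {device!r}"
            )
        return list(self._values)            # stand-in for an explicit transfer


col = DeviceColumn([1.0, 2.0], device="cuda:0")

same = col.to_memory("cuda:0")                   # stays where it lives
copied = col.to_memory("cpu", allow_copy=True)   # explicit opt-in copy
try:
    col.to_memory("cpu")                         # implicit copy is refused
except HostCopyRequired as exc:
    print(exc)
```

It is still possible to ask for the copy, but the request shows up in the calling code instead of happening behind an array conversion.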
This came up in Wes's PR as well and the problem is you have landmine optional features. Something like |
I'm using |
@kkraus14 sorry then I'm still lost on the use-case, clearly I'm missing something. I totally agree about the need to be explicit when a copy is forced and to allow downstream libraries to reason about capabilities of the different dataframe libraries. I don't understand what the return type of |
There are two separate proposals here. Let's say proposal number 1 is Proposal number 2 is For library maintainers like Matplotlib, sklearn, etc. I think the ask is that instead of just doing |
I'm struggling with understanding proposal 1. If we have a dataframe object that implements the API we define, wouldn't |
What we are planning is to work on that, after finishing with your proposal 2.
Assuming we don't allow having one dataframe column in CPU and another in GPU, or other per-column configuration, would make sense to use something like

    {'col1': obj_implementing_an_array_protocol_with_the_requested_specs,
     'col2': ...
    }

Not proposing any details (returning a dict, the params...), just checking if I'm understanding your idea correctly. Unrelated to this, I checked |
Yes, but the original proposal at the top of this issue made it seem like part of the exchange protocol would be supporting numpy as well.
Sounds good to me.
Yes the idea would be to do something like that, though I think we'd need to iterate on the args a bit more. |
Regarding part 2 (the exchange protocol with actual specified memory layout):
@kkraus14 if the requirement for specifying the device is a blocking issue for being able to use the Arrow C Data Interface, I think it would be very useful if you (or someone else with the knowledge / background about this use case) could bring this up on the arrow dev mailing list. (even regardless of whether we would want to use this C interface in the consortium proposal or not) |
Sure, someone from my team or I will bring this up on the Arrow dev mailing list. For the folks here though, it would probably look something like how DLPack handles different devices: https://github.com/dmlc/dlpack/blob/master/include/dlpack/dlpack.h#L38-L74 |
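In rough terms, the DLPack idea is to pair a device type with a device id and carry that pair alongside the pointer. The sketch below shows how a buffer description could be tagged that way; the enum members, field names and helper functions are assumptions made for this example and do not reproduce the DLPack header.

```python
from dataclasses import dataclass
from enum import Enum


class DeviceType(Enum):
    """Hypothetical device kinds; a real spec would enumerate many more."""
    CPU = "cpu"
    CUDA = "cuda"


@dataclass(frozen=True)
class Device:
    device_type: DeviceType
    device_id: int = 0


def describe_buffer(size, ptr, device):
    """Buffer description extended with the device the pointer belongs to."""
    return {"size": size, "ptr": ptr, "device": device, "version": 0}


def host_can_dereference(desc):
    """A CPU-only consumer may only follow pointers into host memory."""
    return desc["device"].device_type is DeviceType.CPU


host_buf = describe_buffer(64, 0x1000, Device(DeviceType.CPU))
gpu_buf = describe_buffer(64, 0x2000, Device(DeviceType.CUDA, device_id=1))

print(host_can_dereference(host_buf))   # True
print(host_can_dereference(gpu_buf))    # False
```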
Summarizes the various discussions about and goals/non-goals and requirements for the `__dataframe__` data interchange protocol. The intended audience for this document is Consortium members and dataframe library maintainers who may want to support this protocol. @datapythonista will add a companion document that's a more gentle introduction/tutorial in a "from zero to a protocol" style. The aim is to keep updating this till we have captured all the requirements and answered all the FAQs, so we can actually design the protocol after and verify it meets all our requirements. Closes gh-29
I opened gh-30 to summarize this issue, previous issues and discussions in the weekly call we've had on this topic as concisely as I could. If there's any requirement or important question I missed, please point it out. I hope that gets us all on the same page, after which talking about proposed designs should be a lot easier. |
Thanks Ralf and Marc, for #30 and #31. Something that is not clear to me is whether we want Keith's interface proposal, where the |
Based on Ralf's comment at #30 (comment), I would assume we need some object that also has other methods (like to get the number of rows / columns, or to only get a specific column). |
Also, something that @datapythonista addressed is chunking. If data sources (the data behind the dataframe) are too large to fit into memory, we need some way to get out chunks in an efficient way, so that we can dump them to disk, or a database. One concern I have is how this chunking would play with the protocols we discussed, e.g. an API like this:

    for chunk in df['col'].chunked(max_length=1_000_000):
        ar = np.array(chunk)

does not know in advance that the materialized array should end up in a numpy container, while its default implementation may decide to materialize to an Arrow array (one could argue the copy of Arrow to Numpy is cheap, but that's not the case for null values/masked arrays). A very ugly API that would give as much information as possible in advance would be:

    for chunk in df['col'].chunked(max_length=1_000_000, protocol='__array_interface__'):
        ar = np.array(chunk)

A second concern I have is how that would deal with 'materializing' multiple columns at once, as this can be more efficient if there are inter-column dependencies (CPU cache, disk cache, file layout efficiency, reusing calculations).

    for chunks in df[['col1', 'col2']].chunked(max_length=1_000_000, protocol='__array_interface__'):
        arrays = [np.array(chunk) for chunk in chunks]

(And of course the mandatory async generators) |
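For what it's worth, the chunking behaviour itself is easy to sketch with generators. The helpers below are hypothetical, operate on plain lists rather than real columns, and ignore the `protocol` question entirely; they only show the single-column and aligned multi-column iteration described above.

```python
def chunked(values, max_length):
    """Yield consecutive slices of at most `max_length` items."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    for start in range(0, len(values), max_length):
        yield values[start:start + max_length]


def chunked_columns(columns, max_length):
    """Yield aligned chunks from several equal-length columns at once."""
    if max_length < 1:
        raise ValueError("max_length must be at least 1")
    n_rows = len(columns[0])
    for start in range(0, n_rows, max_length):
        yield [col[start:start + max_length] for col in columns]


col1 = [0, 1, 2, 3, 4]
col2 = ["a", "b", "c", "d", "e"]

print(list(chunked(col1, 2)))
print(list(chunked_columns([col1, col2], 2)))
```

Only one chunk per column is alive at a time, so the producer can materialize each piece lazily and in step across columns.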
Agreed, this is unclear right now. I suspect my last comment there is off-base, and we should clearly separate the data interchange from the object with a unified API. So I'd suggest we want |
But if that dict already includes as one of its keys the column's data, then it's not necessarily possible to only convert a subset of columns? |
Yes it depends. Let's turn that around: say it must be possible to convert only a subset of columns. Which then has implications for the details of the implementation. |
@kkraus14 gentle reminder for this (I could also start a thread, but I basically know nothing about the details, requirements, what pointers mean if there is a device keyword, etc) |
Thanks for the nudge @jorisvandenbossche. Talking internally now to get someone to act as the point person in engaging the mailing list to detail out wants vs needs for the information relevant to handling device data as opposed to just CPU data. |
@kkraus14 another ping for starting a discussion about device support / C Data Interface issue. |
Thanks for the ping, this slipped through the cracks on my end. Will do my best to push things forward. |
@kkraus14 just checking in on this device support in the Arrow protocol - was the discussion started? |
It has not. Given the amount of discussion in the DLPack issue for now, I think it makes sense to iron out the details there and have something to point to before proposing anything to the Arrow protocol. |
That makes sense to me, thanks Keith. |
Based on what is defined in wesm/dataframe-protocol#1, the idea is to not support a single format to exchange data, but to support multiple (e.g. arrow, numpy).
Using a code example here, to see what this approach implies.
1. Dataframe implementations should implement the `__dataframe__`, returning the exchange format we are defining
For example, let's assume Vaex is using Arrow, and it wants to offer its data in Arrow format to consumers:
Other implementations could use formats different from Arrow, for example, let's assume Modin wants to offer its data as numpy arrays:
2. Direct consumers should be able to understand all formats
For example, pandas could implement a `from_dataframe` function to create a pandas dataframe from different formats:
This would allow pandas users to load data from other formats:
Vaex, Modin and any other implementation could implement an equivalent function to load data from other
libraries into their formats.
3. Indirect consumers can pick an implementation, and use it to standardize its input
For example, Seaborn may want to accept any dataframe implementation, but wants to write its code in pandas (the access to the data). It could convert any dataframe to pandas, using `from_dataframe` from the previous section:
Are people happy with this approach?
CC: @rgommers
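The three steps above can be sketched roughly as follows. This is not the code from the original post: every class and function here is a hypothetical stand-in, plain dicts and lists take the place of Arrow tables, NumPy arrays and pandas dataframes, and the method names `to_arrow` and `to_numpy` are assumptions about what the exchange object might offer.

```python
class ArrowLikeExchange:
    """Hypothetical exchange object for a producer that offers Arrow data."""

    def __init__(self, columns):
        self._columns = columns

    def to_arrow(self):
        return dict(self._columns)      # stand-in for an Arrow table


class NumpyLikeExchange:
    """Hypothetical exchange object for a producer that offers NumPy data."""

    def __init__(self, columns):
        self._columns = columns

    def to_numpy(self):
        return {name: list(col) for name, col in self._columns.items()}


class ArrowBackedFrame:
    """Step 1: a producer exposes its data through __dataframe__."""

    def __init__(self, columns):
        self._columns = columns

    def __dataframe__(self):
        return ArrowLikeExchange(self._columns)


class NumpyBackedFrame:
    def __init__(self, columns):
        self._columns = columns

    def __dataframe__(self):
        return NumpyLikeExchange(self._columns)


def from_dataframe(obj):
    """Step 2: a direct consumer understands every offered format."""
    exchange = obj.__dataframe__()
    if hasattr(exchange, "to_arrow"):
        return exchange.to_arrow()
    if hasattr(exchange, "to_numpy"):
        return exchange.to_numpy()
    raise TypeError("object does not offer a supported exchange format")


def column_means(any_frame):
    """Step 3: an indirect consumer standardizes its input, then works on it."""
    local = from_dataframe(any_frame)
    return {name: sum(col) / len(col) for name, col in local.items()}


print(column_means(ArrowBackedFrame({"x": [1, 2, 3]})))
print(column_means(NumpyBackedFrame({"y": (4.0, 6.0)})))
```

The consumer has to know every format a producer might offer, which is the part of this approach the rest of the thread debates.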