WIP: Implement DataFrame Interchange Protocol #3727
Codecov Report
@@ Coverage Diff @@
## master #3727 +/- ##
===========================================
- Coverage 78.04% 31.07% -46.97%
===========================================
Files 448 453 +5
Lines 74199 74395 +196
===========================================
- Hits 57908 23118 -34790
- Misses 16291 51277 +34986
184f3ac to cbd4a38
838c304 to 93edaea
import enum
from typing import Tuple

# TODO: no numpy dep, pyarrow/pa.Array?
Should the Buffer class use pyarrow here? From what I can tell it doesn't implement the __array_interface__. So Buffer.ptr needs to be rolled here and I'm guessing that includes handling when arrays are chunked.
Another thought I've had working on this is that, unless I'm misunderstanding something, we're going to have to ensure contiguous memory for underlying data from the interchange. So I'd think we'd have to use copies if needed. Maybe I can be corrected here.
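To make the contiguity concern concrete, here is a minimal sketch of a buffer object that exposes ptr and bufsize and copies only when the data arrives in more than one chunk. It is an illustration under assumptions, not the PR's code: the class name ContiguousBuffer and the copied flag are hypothetical, and the standard library's array type stands in for a chunked Arrow array so the example runs without pyarrow.

```python
import ctypes
from array import array
from typing import List


class ContiguousBuffer:
    """Hypothetical stand-in for the protocol's Buffer (exposes ptr and bufsize)."""

    def __init__(self, chunks: List[array]) -> None:
        if not chunks:
            raise ValueError("at least one chunk is required")
        if len(chunks) == 1:
            # A single chunk is already contiguous, so no copy is needed.
            self._data = chunks[0]
            self.copied = False
        else:
            # Several chunks live at unrelated addresses; copy them into
            # one contiguous allocation so a single pointer is meaningful.
            merged = array(chunks[0].typecode)
            for chunk in chunks:
                merged.extend(chunk)
            self._data = merged
            self.copied = True

    @property
    def bufsize(self) -> int:
        """Buffer size in bytes."""
        return len(self._data) * self._data.itemsize

    @property
    def ptr(self) -> int:
        """Address of the first byte of the buffer, as an integer."""
        address, _length = self._data.buffer_info()
        return address


# Two chunks of signed 64-bit integers ("q" typecode).
chunks = [array("q", [1, 2, 3]), array("q", [4, 5])]
buf = ContiguousBuffer(chunks)

# A consumer only sees the pointer and size; read the values back through it.
values = list((ctypes.c_int64 * 5).from_address(buf.ptr))
```

The buffer object keeps a reference to the underlying memory, so the pointer stays valid for as long as the buffer itself is alive. A real implementation would also need to honour the consumer's allow_copy setting before taking the copying branch.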
c5c2aa5 to bad7450
bad7450 to c05edcc
After reading into this more, there are some areas maybe unexplored with an arrow-backed Originally I wanted to establish a private API for polars using the spec's objects I gutted the current refactored
Since I'm struggling to find the time to work on this, I'll close it for now and open it back up if I can get back to it.
Note: This is a draft and is currently in progress. Some details may change over time.
About
The DataFrame Interchange Protocol allows for the exchange of dataframes using a standard spec.
Some protocol documentation:
__dataframe__ protocol
tl;dr (but you really should read if you're interested)
Implements a public from_dataframe function for creating DataFrame objects using the __dataframe__ protocol and the __dataframe__ API itself.
Proposal
Objectives
__dataframe__ API
from_dataframe
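As a rough illustration of how the two objectives fit together, the sketch below pairs a toy producer that implements __dataframe__ with a toy from_dataframe consumer. It is a simplified model under stated assumptions, not the Polars implementation: ToyBuffer, ToyColumn, and ToyFrame are hypothetical names, only 64-bit integer columns are handled, validity and offsets buffers are omitted, and the consumer returns a plain dict rather than a DataFrame. The method names (column_names, get_column_by_name, size, get_buffers) and the dtype tuple layout follow the interchange spec.

```python
import ctypes
from array import array
from typing import Any, Dict, List, Tuple

INT_KIND = 0  # corresponds to DtypeKind.INT in the interchange spec


class ToyBuffer:
    """Hypothetical buffer: a contiguous block described by ptr and bufsize."""

    def __init__(self, data: array) -> None:
        self._data = data  # keep the memory alive while the buffer exists

    @property
    def bufsize(self) -> int:
        return len(self._data) * self._data.itemsize

    @property
    def ptr(self) -> int:
        return self._data.buffer_info()[0]


class ToyColumn:
    """Hypothetical column holding signed 64-bit integers only."""

    def __init__(self, values: List[int]) -> None:
        self._data = array("q", values)

    def size(self) -> int:
        return len(self._data)

    @property
    def dtype(self) -> Tuple[int, int, str, str]:
        # (kind, bit width, Arrow C format string, endianness)
        return (INT_KIND, 64, "l", "=")

    def get_buffers(self) -> Dict[str, Any]:
        return {
            "data": (ToyBuffer(self._data), self.dtype),
            "validity": None,
            "offsets": None,
        }


class ToyFrame:
    """Hypothetical producer exposing the __dataframe__ entry point."""

    def __init__(self, columns: Dict[str, List[int]]) -> None:
        self._columns = {name: ToyColumn(vals) for name, vals in columns.items()}

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True) -> "ToyFrame":
        return self

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> ToyColumn:
        return self._columns[name]


def from_dataframe(obj: Any, allow_copy: bool = True) -> Dict[str, List[int]]:
    """Toy consumer: rebuild column data from any object with __dataframe__."""
    if not hasattr(obj, "__dataframe__"):
        raise ValueError("`obj` does not support the __dataframe__ protocol")
    frame = obj.__dataframe__(allow_copy=allow_copy)
    result: Dict[str, List[int]] = {}
    for name in frame.column_names():
        column = frame.get_column_by_name(name)
        buffer, (kind, bit_width, _fmt, _endianness) = column.get_buffers()["data"]
        if kind != INT_KIND or bit_width != 64:
            raise NotImplementedError("this sketch only reads int64 columns")
        # Read the values straight out of the producer's memory.
        view = (ctypes.c_int64 * column.size()).from_address(buffer.ptr)
        result[name] = list(view)
    return result


frame = ToyFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
converted = from_dataframe(frame)
```

The consumer never inspects the producer's type; it relies only on the presence of __dataframe__ and on the objects that call returns, which is what lets one library read another's data without importing it.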
This is a Python protocol, so we can implement it in the Python package. Note that this is not the Array API. pandas implements this under the name "exchange". I propose we instead call this namespace "interchange" for correctness.
Reading through their PR it does look like it should be given a decent amount of attention and review. At the same time I want this to follow current Polars wrapper design patterns.
We can start with a similarly verbatim initial implementation that focuses on the core protocol requirements. This PR can include unit tests, but a follow-up PR should implement some compliance testing.
Additional resources
__array_interface__ Protocol
Other
I'm not planning on rushing through this, so if you're browsing and wondering if you can leapfrog me with an implementation, feel free!