Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Implement DataFrame Interchange Protocol #3727

Closed

Conversation

cnpryer
Copy link
Contributor

@cnpryer cnpryer commented Jun 18, 2022

Note: This is a draft and is currently in progress. Some details may change over time.

About

The DataFrame Interchange Protocol allows for the exchange of dataframes using a standard spec.

Some protocol documentation:

tl;dr (but you really should read if you're interested) Implements a public from_dataframe function for creating DataFrame objects using the __dataframe__ protocol and the __dataframe__ API itself.

Proposal

Objectives

  • Add __dataframe__ API
  • Add from_dataframe
  • Finalize tests for API and compliance

This is a Python protocol. So we can implement this in the Python package. Note that this is not the Array API. Using "exchange" is how Pandas implements this. I propose instead we call this namespace "interchange" for correctness.

Reading through their PR it does look like it should be given a decent amount of attention and review. At the same time I want this to follow current Polars wrapper design patterns.

We can start by focusing on a similarly verbatim startup that focuses on the core protocol requirements. With this PR we can include unit tests, but a followup PR should implement some compliance testing.

Additional resources

Other

I'm not planning on rushing through this, so if you're browsing and wondering if you can leapfrog me with an implementation feel free!

@github-actions github-actions bot added the python Related to Python Polars label Jun 18, 2022
@cnpryer cnpryer changed the title Implement DataFrame Interchange Protocol WIP: Implement DataFrame Interchange Protocol Jun 18, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jun 18, 2022

Codecov Report

Merging #3727 (c05edcc) into master (25b58e2) will decrease coverage by 46.96%.
The diff coverage is 58.67%.

@@             Coverage Diff             @@
##           master    #3727       +/-   ##
===========================================
- Coverage   78.04%   31.07%   -46.97%     
===========================================
  Files         448      453        +5     
  Lines       74199    74395      +196     
===========================================
- Hits        57908    23118    -34790     
- Misses      16291    51277    +34986     
Impacted Files Coverage Δ
py-polars/polars/interchange/dataframe.py 36.50% <36.50%> (ø)
py-polars/polars/interchange/buffer.py 48.14% <48.14%> (ø)
py-polars/polars/interchange/column.py 53.70% <53.70%> (ø)
py-polars/polars/interchange/_utils.py 96.00% <96.00%> (ø)
py-polars/polars/interchange/__init__.py 100.00% <100.00%> (ø)
py-polars/src/lazy/utils.rs 0.00% <0.00%> (-100.00%) ⬇️
py-polars/src/py_modules.rs 0.00% <0.00%> (-100.00%) ⬇️
polars/polars-io/src/tests.rs 0.00% <0.00%> (-100.00%) ⬇️
polars/polars-lazy/src/lib.rs 0.00% <0.00%> (-100.00%) ⬇️
polars/polars-time/src/lib.rs 0.00% <0.00%> (-100.00%) ⬇️
... and 369 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 25b58e2...c05edcc. Read the comment docs.

@cnpryer cnpryer force-pushed the dataframe-interchange-protocol branch 3 times, most recently from 184f3ac to cbd4a38 Compare June 19, 2022 03:00
@cnpryer cnpryer force-pushed the dataframe-interchange-protocol branch 4 times, most recently from 838c304 to 93edaea Compare June 25, 2022 17:33
import enum
from typing import Tuple

# TODO: no numpy dep, pyarrow/pa.Array?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the Buffer class use pyarrow here? From what I can tell it doesn't implement the __array_interface__. So Buffer.ptr needs to be rolled here and I'm guessing that includes handling when arrays are chunked.

@cnpryer
Copy link
Contributor Author

cnpryer commented Jun 27, 2022

Another thought I've had working on this is that, unless I'm misunderstanding something, we're going to have to ensure contiguous memory for underlying data from the interchange. So I'd think we'd have to use copies if needed. Maybe I can be corrected here.

@cnpryer cnpryer force-pushed the dataframe-interchange-protocol branch 2 times, most recently from c5c2aa5 to bad7450 Compare July 3, 2022 21:23
@cnpryer cnpryer force-pushed the dataframe-interchange-protocol branch from bad7450 to c05edcc Compare July 4, 2022 19:14
@cnpryer
Copy link
Contributor Author

cnpryer commented Jul 8, 2022

After reading into this more, there are some areas maybe unexplored with an arrow-backed Buffer.

Originally I wanted to establish a private API for polars using the spec's objects Buffer, Column, and DataFrame. After spending more time on this, starting with a public implementation for from_dataframe might provide more opportunity to learn about what a direction using arrow-backed buffers (or maybe even a polars data structure) could look like.

I gutted the current refactored interchange namespace to focus on a from_dataframe first approach here.

@cnpryer
Copy link
Contributor Author

cnpryer commented Jul 25, 2022

Since I'm struggling to find the time to work on this I'll close it for now and open it back up if I can get back to it.

@cnpryer cnpryer closed this Jul 25, 2022
@stinodego stinodego added the A-interchange Area: Python dataframe interchange protocol label Sep 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-interchange Area: Python dataframe interchange protocol python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants