ENH: Implement DataFrame interchange protocol #46141

Merged: 49 commits, Apr 27, 2022
Changes from 46 commits
Commits
ac58967
Vendor smoke tests from consortium
vnlitvinov Feb 21, 2022
fce881e
Vendor dataframe_protocol spec
vnlitvinov Feb 21, 2022
02946f8
Copy over the prototype and polish it a bit
vnlitvinov Feb 21, 2022
14fd478
Fix the protocol spec
vnlitvinov Feb 22, 2022
4515011
Enable pd.DataFrame.__dataframe__
vnlitvinov Feb 22, 2022
7d6fd5b
Align spec with existing implementations
vnlitvinov Feb 22, 2022
5d64c4a
Fix protocol tests
vnlitvinov Feb 22, 2022
b36fd46
Make DataFrame.__dataframe__ pass protocol tests
vnlitvinov Feb 22, 2022
d334b20
Explicitly mark abstract methods in spec
vnlitvinov Feb 24, 2022
014165d
Add more smoke tests
vnlitvinov Feb 24, 2022
def54ba
Implement column chunking
vnlitvinov Feb 24, 2022
8e6b882
Fix tests formatting
vnlitvinov Feb 24, 2022
282c85d
Start implementing chunk support in from_df
vnlitvinov Feb 24, 2022
9fbb58d
Test buffer contents if on CPU
vnlitvinov Feb 24, 2022
dd93625
Improve spec a bit
vnlitvinov Feb 24, 2022
07c8fae
Beautify spec whitespace
vnlitvinov Feb 24, 2022
b74c06e
Use constants from spec enums, beautify a bit
vnlitvinov Feb 24, 2022
6637a29
Format by black
vnlitvinov Feb 24, 2022
0883406
Format exchange tests by black
vnlitvinov Feb 24, 2022
49418d2
Respond to review - move files around
vnlitvinov Mar 30, 2022
78aebaa
Separate buffer and column implementations
vnlitvinov Mar 31, 2022
1b64ae2
Mimick what Modin did
vnlitvinov Mar 31, 2022
870ad21
Make spec tests pass
vnlitvinov Mar 31, 2022
edefc8f
Add tests for dtype_to_arrow_c_fmt
vnlitvinov Mar 31, 2022
7144cf2
Fix test declarations, some impl bugs remain
vnlitvinov Mar 31, 2022
0dc1e58
Fix .describe_categoricals and some tests
vnlitvinov Mar 31, 2022
0f7c654
Auto-fix some pre-commit checks
vnlitvinov Mar 31, 2022
522a66a
Fix more issues found by commit checks
vnlitvinov Mar 31, 2022
1525320
Fix categorical-related test failures
vnlitvinov Mar 31, 2022
0054c15
Add a whatsnew entry
vnlitvinov Mar 31, 2022
f8badc6
Fix rst linting
vnlitvinov Mar 31, 2022
86005d4
Fix DataFrame.__dataframe__ docstring
vnlitvinov Mar 31, 2022
9ab797b
Fix DataFrame.__dataframe__ docstring more
vnlitvinov Mar 31, 2022
7a54b20
Fix test_api::TestApi
vnlitvinov Mar 31, 2022
65a5370
Try to fix typecheck issues
vnlitvinov Mar 31, 2022
594ac53
Respond to review comments
vnlitvinov Apr 14, 2022
62c43af
Fix mypy error
vnlitvinov Apr 14, 2022
cacc9f1
Change check for dlpack
vnlitvinov Apr 14, 2022
804aa89
Address review comments
vnlitvinov Apr 18, 2022
d1c0d56
Remove dead elif branch
vnlitvinov Apr 19, 2022
5d98ebf
Fix tests broken by .column_names change
vnlitvinov Apr 19, 2022
60379e5
Add tests for datetime dtype
vnlitvinov Apr 20, 2022
497ca24
Fix from_dataframe docstring
vnlitvinov Apr 20, 2022
39f5a5c
Add tests for uint dtype
vnlitvinov Apr 21, 2022
d73558a
Handle string dtype better
vnlitvinov Apr 22, 2022
4ed35bf
Add test for mixed object dtype
vnlitvinov Apr 22, 2022
2fca3c0
Rename spec test for clarity
vnlitvinov Apr 23, 2022
f030d9f
Add missing test cases in test_dtype_to_arrow_c_fmt
vnlitvinov Apr 24, 2022
cc94e57
Add comments explaing magic dtype numbers
vnlitvinov Apr 24, 2022
1 change: 1 addition & 0 deletions doc/source/reference/frame.rst
@@ -391,3 +391,4 @@ Serialization / IO / conversion
   DataFrame.to_clipboard
   DataFrame.to_markdown
   DataFrame.style
   DataFrame.__dataframe__
7 changes: 7 additions & 0 deletions doc/source/reference/general_functions.rst
@@ -78,3 +78,10 @@ Hashing

   util.hash_array
   util.hash_pandas_object

Importing from other DataFrame libraries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. autosummary::
   :toctree: api/

   api.exchange.from_dataframe
18 changes: 18 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -14,6 +14,24 @@ including other versions of pandas.
Enhancements
~~~~~~~~~~~~

.. _whatsnew_150.enhancements.dataframe_exchange:

DataFrame exchange protocol implementation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Pandas now implements the DataFrame exchange API spec.
See the full details on the API at https://data-apis.org/dataframe-protocol/latest/index.html

The protocol consists of two parts:

- New method :meth:`DataFrame.__dataframe__`, which produces the exchange object.
  It effectively "exports" the Pandas dataframe as an exchange object, so
  any other library that implements the protocol can "import" that dataframe
  without knowing anything about the producer except that it makes an exchange object.
- New function :func:`pandas.api.exchange.from_dataframe`, which takes
  an arbitrary exchange object from any conformant library and constructs a
  Pandas DataFrame out of it.
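
The two parts described above can be sketched as a round trip. This is a minimal illustration, not part of the PR. It assumes a pandas version that ships the protocol, and it assumes the import path varies by version: this PR adds ``pandas.api.exchange``, while released versions are assumed to expose the same function under ``pandas.api.interchange``, so the sketch tries both.

```python
import pandas as pd

# Assumption: the module name depends on the installed pandas version.
# This PR introduces pandas.api.exchange; released versions are assumed
# to expose the same function as pandas.api.interchange.
try:
    from pandas.api.interchange import from_dataframe
except ImportError:
    from pandas.api.exchange import from_dataframe

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

# Producer side: "export" the frame as an exchange object.
exchange_obj = df.__dataframe__()
n_cols = exchange_obj.num_columns()
n_rows = exchange_obj.num_rows()
names = list(exchange_obj.column_names())

# Consumer side: "import" any object that exposes __dataframe__.
# The exchange object itself exposes that method, so it can be passed
# directly and goes through the buffer-based conversion path.
roundtrip = from_dataframe(exchange_obj)

print(n_rows, n_cols, names)
print(roundtrip)
```

Passing ``df`` itself to ``from_dataframe`` would also work, but a consumer normally receives an object from another library and only relies on it having ``__dataframe__``.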

.. _whatsnew_150.enhancements.styler:

Styler
1 change: 1 addition & 0 deletions pandas/api/__init__.py
@@ -1,5 +1,6 @@
""" public toolkit API """
from pandas.api import ( # noqa:F401
    exchange,
    extensions,
    indexers,
    types,
8 changes: 8 additions & 0 deletions pandas/api/exchange/__init__.py
@@ -0,0 +1,8 @@
"""
Public API for DataFrame exchange protocol.
"""

from pandas.core.exchange.dataframe_protocol import DataFrame
from pandas.core.exchange.from_dataframe import from_dataframe

__all__ = ["from_dataframe", "DataFrame"]
Empty file.
80 changes: 80 additions & 0 deletions pandas/core/exchange/buffer.py
@@ -0,0 +1,80 @@
from typing import (
    Optional,
    Tuple,
)

import numpy as np
from packaging import version

from pandas.core.exchange.dataframe_protocol import (
    Buffer,
    DlpackDeviceType,
)

_NUMPY_HAS_DLPACK = version.parse(np.__version__) >= version.parse("1.22.0")


class PandasBuffer(Buffer):
    """
    Data in the buffer is guaranteed to be contiguous in memory.
    """

    def __init__(self, x: np.ndarray, allow_copy: bool = True) -> None:
        """
        Handle only regular columns (= numpy arrays) for now.
        """
        if not x.strides == (x.dtype.itemsize,):
            # The protocol does not support strided buffers, so a copy is
            # necessary. If that's not allowed, we need to raise an exception.
            if allow_copy:
                x = x.copy()
            else:
                raise RuntimeError(
                    "Exports cannot be zero-copy in the case "
                    "of a non-contiguous buffer"
                )

        # Store the numpy array in which the data resides as a private
        # attribute, so we can use it to retrieve the public attributes
        self._x = x

    @property
    def bufsize(self) -> int:
        """
        Buffer size in bytes.
        """
        return self._x.size * self._x.dtype.itemsize

    @property
    def ptr(self) -> int:
        """
        Pointer to start of the buffer as an integer.
        """
        return self._x.__array_interface__["data"][0]

    def __dlpack__(self):
        """
        Represent this structure as DLPack interface.
        """
        if _NUMPY_HAS_DLPACK:
            return self._x.__dlpack__()
        raise NotImplementedError("__dlpack__")

    def __dlpack_device__(self) -> Tuple[DlpackDeviceType, Optional[int]]:
        """
        Device type and device ID for where the data in the buffer resides.
        """
        return (DlpackDeviceType.CPU, None)

    def __repr__(self) -> str:
        return (
            "PandasBuffer("
            + str(
                {
                    "bufsize": self.bufsize,
                    "ptr": self.ptr,
                    "device": self.__dlpack_device__()[0].name,
                }
            )
            + ")"
        )
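
For readers following the buffer logic, here is a standalone sketch of the three ideas `PandasBuffer` relies on: the stride check for contiguity, the size in bytes, and the raw pointer. It uses only NumPy and `ctypes` and does not import anything from this PR. The helper name `is_contiguous_1d` is hypothetical and exists only for this illustration.

```python
import ctypes

import numpy as np


def is_contiguous_1d(x: np.ndarray) -> bool:
    # Hypothetical helper mirroring the check in PandasBuffer.__init__:
    # a 1-D array is packed when its single stride equals the item size.
    return x.strides == (x.dtype.itemsize,)


base = np.arange(10, dtype="int64")
view = base[::2]  # every other element, so the stride is 16 bytes, not 8

base_ok = is_contiguous_1d(base)
view_ok = is_contiguous_1d(view)

# With allow_copy=True the buffer copies a strided array into packed memory.
packed = view.copy()

# Same arithmetic as the bufsize and ptr properties.
bufsize = packed.size * packed.dtype.itemsize
ptr = packed.__array_interface__["data"][0]

# A consumer that only knows (ptr, bufsize, dtype) can rebuild the values.
# string_at copies the bytes, so `restored` does not alias `packed`.
raw = ctypes.string_at(ptr, bufsize)
restored = np.frombuffer(raw, dtype=packed.dtype)

print(base_ok, view_ok, bufsize, restored)
```

The sketch copies the bytes before rebuilding the array to keep the example safe. A real consumer would typically wrap the pointer without copying and keep the producer object alive for as long as the memory is in use.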