
BUG: PyArrow dtypes were not supported in the interchange protocol #57764

Merged (14 commits) on Mar 20, 2024

Conversation

@MarcoGorelli MarcoGorelli changed the title Interchange pyarrow Support pyarrow dtypes in the interchange protocol Mar 7, 2024
@MarcoGorelli MarcoGorelli force-pushed the interchange-pyarrow branch 2 times, most recently from 616f83c to 49e3ec4 Compare March 8, 2024 07:18
@MarcoGorelli MarcoGorelli added the Interchange Dataframe Interchange Protocol label Mar 8, 2024
@MarcoGorelli MarcoGorelli force-pushed the interchange-pyarrow branch from 49e3ec4 to c5c108e Compare March 8, 2024 07:42
@MarcoGorelli MarcoGorelli changed the title Support pyarrow dtypes in the interchange protocol BUG: PyArrow dtypes were not supported in the interchange protocol Mar 8, 2024
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 13, 2024 19:41
@@ -298,13 +298,14 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

null_pos = None
if null_kind in (ColumnNullType.USE_BITMASK, ColumnNullType.USE_BYTEMASK):
assert buffers["validity"], "Validity buffers cannot be empty for masks"
MarcoGorelli (Member Author):
this isn't true for pyarrow dtypes right? they use bitmasks, but their validity buffer can indeed be None, whereas pandas nullables always seem to be set?

In [7]: pd.Series([1, 2, 3], dtype='Int64').array._mask
Out[7]: array([False, False, False])

In [8]: pd.Series([1, 2, 3], dtype='Int64[pyarrow]').array._pa_array.chunks[0].buffers()[0] is None
Out[8]: True

Review thread on pandas/core/interchange/buffer.py (outdated, resolved)
@@ -194,6 +211,13 @@ def describe_null(self):
column_null_dtype = ColumnNullType.USE_BYTEMASK
null_value = 1
return column_null_dtype, null_value
if isinstance(self._col.dtype, ArrowDtype):
if all(
Member:

Is there a real need to iterate the chunks like this and check null / not-null? Arrow leaves it implementation defined as to whether or not there is a bitmask

https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps

I just wonder if there is value in us trying to dictate that through the interchange protocol versus letting consumers handle that

MarcoGorelli (Member Author):

I've changed things around a bit to just rechunk upfront (if allow_copy allows), so once we get here, we just need to check whether buffers()[0] is None.

The issue with just returning ColumnNullType.USE_BITMASK, 0 in all cases, even if there's no validity mask, is that pyarrow.interchange.from_dataframe would then raise.
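A minimal sketch of that check (illustrative only, not the actual pandas code; the enum values follow the interchange-protocol spec, and chunk_buffers stands in for what a single post-rechunk pyarrow chunk's .buffers() returns):

```python
from enum import IntEnum

class ColumnNullType(IntEnum):
    # Values as defined by the dataframe interchange protocol.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

def describe_null(chunk_buffers):
    """chunk_buffers: buffer list of the single post-rechunk pyarrow chunk."""
    if chunk_buffers[0] is None:
        # No validity bitmap was allocated; advertising USE_BITMASK here
        # would send consumers like pyarrow.interchange.from_dataframe
        # looking for a buffer that does not exist.
        return ColumnNullType.NON_NULLABLE, None
    return ColumnNullType.USE_BITMASK, 0
```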

Review thread on pandas/core/interchange/column.py (outdated, resolved)
@MarcoGorelli MarcoGorelli marked this pull request as draft March 15, 2024 16:09
@MarcoGorelli (Member Author):

thanks for your review! I've updated and simplified a bit

@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 15, 2024 18:27
@WillAyd (Member) left a comment:

Looking good - nice work

Review threads:
pandas/core/interchange/column.py (resolved)
pandas/core/interchange/column.py (resolved)
pandas/core/interchange/column.py (outdated, resolved)
pandas/core/interchange/from_dataframe.py (resolved)
@WillAyd (Member) left a comment:

Nice work @MarcoGorelli - keep chipping away

@mroeschke mroeschke merged commit 710720e into pandas-dev:main Mar 20, 2024
46 checks passed
@mroeschke (Member):

Thanks @MarcoGorelli

@MarcoGorelli (Member Author):

thanks both!

> keep chipping away

😄 we're getting there

I didn't set a milestone; is it OK if I backport this to 2.2.x?

@MarcoGorelli (Member Author):

@meeseeksmachine please backport to 2.2.x

@MarcoGorelli (Member Author):

@meeseeksdev backport 2.2.x

lumberbot-app (bot) commented Mar 20, 2024:

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

1. Check out the backport branch and update it:
   git checkout 2.2.x
   git pull
2. Cherry-pick the first parent of this PR's merge commit on top of the older branch:
   git cherry-pick -x -m1 710720e6555c779a6539354ebae59b1a649cebb3
3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
   git commit -am 'Backport PR #57764: BUG: PyArrow dtypes were not supported in the interchange protocol'
4. Push to a named branch:
   git push YOURFORK 2.2.x:auto-backport-of-pr-57764-on-2.2.x
5. Create a PR against branch 2.2.x; I would have named this PR:

"Backport PR #57764 on branch 2.2.x (BUG: PyArrow dtypes were not supported in the interchange protocol)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mroeschke mroeschke added this to the 2.2.2 milestone Mar 20, 2024
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
…andas-dev#57764)

* fix pyarrow interchange

* reduce diff

* reduce diff

* start simplifying

* simplify, remove is_validity arg

* remove unnecessary branch

* doc maybe_rechunk

* mypy

* extra test

* mark _col unused, assert rechunking did not modify original df

* declare buffer: Buffer outside of if/else branch

(cherry picked from commit 710720e)
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
MarcoGorelli added a commit that referenced this pull request Mar 21, 2024
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels: Interchange Dataframe Interchange Protocol
4 participants