
BUG: PyArrow dtypes were not supported in the interchange protocol #57764

Merged (14 commits) on Mar 20, 2024

Conversation

@MarcoGorelli MarcoGorelli changed the title Interchange pyarrow Support pyarrow dtypes in the interchange protocol Mar 7, 2024
@MarcoGorelli MarcoGorelli force-pushed the interchange-pyarrow branch 2 times, most recently from 616f83c to 49e3ec4 Compare March 8, 2024 07:18
@MarcoGorelli MarcoGorelli added the Interchange Dataframe Interchange Protocol label Mar 8, 2024
@MarcoGorelli MarcoGorelli force-pushed the interchange-pyarrow branch from 49e3ec4 to c5c108e Compare March 8, 2024 07:42
@MarcoGorelli MarcoGorelli changed the title Support pyarrow dtypes in the interchange protocol BUG: PyArrow dtypes were not supported in the interchange protocol Mar 8, 2024
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 13, 2024 19:41
@@ -298,13 +298,14 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

null_pos = None
if null_kind in (ColumnNullType.USE_BITMASK, ColumnNullType.USE_BYTEMASK):
assert buffers["validity"], "Validity buffers cannot be empty for masks"
MarcoGorelli (Member Author):
this isn't true for pyarrow dtypes right? they use bitmasks, but their validity buffer can indeed be None, whereas pandas nullables always seem to be set?

In [7]: pd.Series([1, 2, 3], dtype='Int64').array._mask
Out[7]: array([False, False, False])

In [8]: pd.Series([1, 2, 3], dtype='Int64[pyarrow]').array._pa_array.chunks[0].buffers()[0] is None
Out[8]: True

Review thread on pandas/core/interchange/buffer.py (outdated, resolved)
@@ -194,6 +211,13 @@ def describe_null(self):
column_null_dtype = ColumnNullType.USE_BYTEMASK
null_value = 1
return column_null_dtype, null_value
if isinstance(self._col.dtype, ArrowDtype):
if all(
Member:

Is there a real need to iterate the chunks like this and check null / not-null? Arrow leaves it implementation defined as to whether or not there is a bitmask

https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps

I just wonder if there is value in us trying to dictate that through the interchange protocol versus letting consumers handle that

MarcoGorelli (Member Author):

I've changed things around a bit to just rechunk upfront (if allow_copy allows), so once we get here, we just need to check whether buffers()[0] is None.

The issue with just returning ColumnNullType.USE_BITMASK, 0 in all cases, even if there's no validity mask, is that pyarrow.interchange.from_dataframe would then raise.
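A minimal sketch of that check (illustrative only, not the actual pandas code; the enum values follow the interchange-protocol spec, and chunk_buffers stands in for what a single post-rechunk pyarrow chunk's .buffers() returns):

```python
from enum import IntEnum

class ColumnNullType(IntEnum):
    # Values as defined by the dataframe interchange protocol.
    NON_NULLABLE = 0
    USE_NAN = 1
    USE_SENTINEL = 2
    USE_BITMASK = 3
    USE_BYTEMASK = 4

def describe_null(chunk_buffers):
    """chunk_buffers: buffer list of the single post-rechunk pyarrow chunk."""
    if chunk_buffers[0] is None:
        # No validity bitmap was allocated; advertising USE_BITMASK here
        # would send consumers like pyarrow.interchange.from_dataframe
        # looking for a buffer that does not exist.
        return ColumnNullType.NON_NULLABLE, None
    return ColumnNullType.USE_BITMASK, 0
```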

Review thread on pandas/core/interchange/column.py (outdated, resolved)
@MarcoGorelli MarcoGorelli marked this pull request as draft March 15, 2024 16:09
@MarcoGorelli (Member Author):

thanks for your review! I've updated and simplified a bit

@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 15, 2024 18:27
@WillAyd (Member) left a comment:

Looking good - nice work

Review threads:
pandas/core/interchange/column.py (resolved)
pandas/core/interchange/column.py (resolved)
pandas/core/interchange/column.py (outdated, resolved)
pandas/core/interchange/from_dataframe.py (resolved)
@WillAyd (Member) left a comment:

Nice work @MarcoGorelli - keep chipping away

@mroeschke mroeschke merged commit 710720e into pandas-dev:main Mar 20, 2024
46 checks passed
@mroeschke (Member):

Thanks @MarcoGorelli

@MarcoGorelli (Member Author):

thanks both!

> keep chipping away

😄 we're getting there

I didn't set a milestone; is it OK if I backport this to 2.2.x?

@MarcoGorelli (Member Author):

@meeseeksmachine please backport to 2.2.x

@MarcoGorelli (Member Author):

@meeseeksdev backport 2.2.x

lumberbot-app (bot) commented Mar 20, 2024:

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict; please backport manually. Here are approximate instructions:

1. Check out the backport branch and update it:
   git checkout 2.2.x
   git pull
2. Cherry-pick the first parent of this PR's merge commit on top of the older branch:
   git cherry-pick -x -m1 710720e6555c779a6539354ebae59b1a649cebb3
3. You will likely have some merge/cherry-pick conflicts here; fix them and commit:
   git commit -am 'Backport PR #57764: BUG: PyArrow dtypes were not supported in the interchange protocol'
4. Push to a named branch:
   git push YOURFORK 2.2.x:auto-backport-of-pr-57764-on-2.2.x
5. Create a PR against branch 2.2.x; I would have named this PR:

"Backport PR #57764 on branch 2.2.x (BUG: PyArrow dtypes were not supported in the interchange protocol)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@mroeschke mroeschke added this to the 2.2.2 milestone Mar 20, 2024
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
…andas-dev#57764)

* fix pyarrow interchange

* reduce diff

* reduce diff

* start simplifying

* simplify, remove is_validity arg

* remove unnecessary branch

* doc maybe_rechunk

* mypy

* extra test

* mark _col unused, assert rechunking did not modify original df

* declare buffer: Buffer outside of if/else branch

(cherry picked from commit 710720e)
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
MarcoGorelli added a commit to MarcoGorelli/pandas that referenced this pull request Mar 21, 2024
MarcoGorelli added a commit that referenced this pull request Mar 21, 2024
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
Labels: Interchange Dataframe Interchange Protocol
4 participants