BUG: PyArrow dtypes were not supported in the interchange protocol #57764
@@ -298,13 +298,14 @@ def string_column_to_ndarray(col: Column) -> tuple[np.ndarray, Any]:

    null_pos = None
    if null_kind in (ColumnNullType.USE_BITMASK, ColumnNullType.USE_BYTEMASK):
        assert buffers["validity"], "Validity buffers cannot be empty for masks"
this isn't true for pyarrow dtypes right? they use bitmasks, but their validity buffer can indeed be `None`, whereas pandas nullables always seem to be set?
In [7]: pd.Series([1, 2, 3], dtype='Int64').array._mask
Out[7]: array([False, False, False])
In [8]: pd.Series([1, 2, 3], dtype='Int64[pyarrow]').array._pa_array.chunks[0].buffers()[0] is None
Out[8]: True
pandas/core/interchange/column.py
@@ -194,6 +211,13 @@ def describe_null(self):
            column_null_dtype = ColumnNullType.USE_BYTEMASK
            null_value = 1
            return column_null_dtype, null_value
        if isinstance(self._col.dtype, ArrowDtype):
            if all(
Is there a real need to iterate the chunks like this and check null / not-null? Arrow leaves it implementation defined as to whether or not there is a bitmask
https://arrow.apache.org/docs/format/Columnar.html#validity-bitmaps
I just wonder if there is value in us trying to dictate that through the interchange protocol versus letting consumers handle that
I've changed things round a bit to just rechunk upfront (if `allow_copy` allows), so once we get here, we just need to check if `buffers()[0]` is `None`.
The issue with just returning `ColumnNullType.USE_BITMASK, 0` in all cases, even if there's no validity mask, is that then `pyarrow.interchange.from_dataframe` would raise
thanks for your review! I've updated and simplified a bit
Looking good - nice work
Nice work @MarcoGorelli - keep chipping away
Thanks @MarcoGorelli
thanks both!
😄 we're getting there
I didn't set a milestone, is it OK if I backport this to 2.2.x?
@meeseeksmachine please backport to 2.2.x
@meeseeksdev backport 2.2.x
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations, you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement.
…andas-dev#57764)
* fix pyarrow interchange
* reduce diff
* reduce diff
* start simplifying
* simplify, remove is_validity arg
* remove unnecessary branch
* doc maybe_rechunk
* mypy
* extra test
* mark _col unused, assert rechunking did not modify original df
* declare buffer: Buffer outside of if/else branch
(cherry picked from commit 710720e)
…orted in the interchange protocol) (#57947)
fixes a few things:
'string[pyarrow]'
#57762