-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support arrow/parquet roundtrip for nullable integer / string extension dtypes #29483
ENH: Support arrow/parquet roundtrip for nullable integer / string extension dtypes #29483
Conversation
pandas/core/arrays/string_.py
Outdated
|
||
results = [] | ||
for arr in chunks: | ||
str_arr = StringArray(np.array(arr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger Is there a way to turn off validation when creating a StringArray? (in this case I know the elements will be strings or None)
I didn't directly see a private method on StringArray that does this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at the moment. In StringArray.__init__
we have
skip_validation = isinstance(values, type(self))
I'd be open to a public API for allowing the user / library to state that they have the right values.
One slight concern though: right now it's expect that the values are strings or nan
, and we may expect to change the NA value to change in the future. Libraries will have to adapt to that if they're claiming to be compatible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so actually the validation right now is needed here, even if it is just to convert None to nan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the None -> nan replacement happens in in _from_sequence
, though I have a TODO to move it to _validate
since that's already requires a pass over the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good catch, I switched to _from_sequence
for now, and added a test for it.
doc/source/user_guide/io.rst
Outdated
@@ -4717,6 +4717,7 @@ Several caveats. | |||
* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag. | |||
* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message | |||
on an attempt at serialization. | |||
* The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data type (requiring pyarrow >= 1.0.0). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this only preserve integer and string extension types or does it preserve all? I assume the former but somewhat unclear in note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this only preserve integer and string extension types or does it preserve all? I assume the former but somewhat unclear in note
It does preserve it for all dtypes that have implemented the needed protocol methods (__arrow_array__
, __from_arrow__
), which right now is integer and string for pandas. And potentially any external EA as well, in case those support those protocols.
Tried to clarify this in the docs.
"""Construct IntegerArray from passed pyarrow Array""" | ||
import pyarrow | ||
|
||
if isinstance(array, pyarrow.Array): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is this False?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The passed pyarrow values can be either a pyarrow.Array or pyarrow.ChunkedArray. Added a comment for this
The "red" CI is because of codecov (correctly) catching that the coverage of the changes is not good. This is because the tests only run with pyarrow master. We do not have a pyarrow dev build however (and not sure it is worth to do this for those limited number of tests, it would probably also need nightlies from pyarrow). |
Ignoring coverage is fine here. Not worth setting up nightly builds against pyarrow yet. You also have some linting failures in the docs: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=20694 |
The failure on Linux py36_minimum_versions seems unrelated (an unexpected resource warning in a parser test, is this already a know failing / flaky tests?) |
More feedback here? Or ready to merge? cc @jreback |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. minor comment on the test (if correct, no objections)
Updated to resolve conflicts. Unless more comments going to merge later today. |
by implementing two methods: ``ExtensionArray.__arrow_array__`` and | ||
``ExtensionDtype.__from_arrow__``. | ||
|
||
The ``ExtensionArray.__arrow_array__`` ensures that ``pyarrow`` knowns how |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: knowns -> knows
|
||
The ``ExtensionDtype.__from_arrow__`` method then controls the conversion | ||
back from pyarrow to a pandas ExtensionArray. This method receives a pyarrow | ||
``Array`` or ``ChunkedArray`` as only argument and is expected to return the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"as only" -> "as its only"
…tension dtypes (pandas-dev#29483) * Add __from_arrow__ support for IntegerArray, StringArray
…tension dtypes (pandas-dev#29483) * Add __from_arrow__ support for IntegerArray, StringArray
xref #20612
This implements the
__from_arrow__
method for integer/string extension types, so roundtripping to arrow/parquet fully works.