Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support arrow/parquet roundtrip for nullable integer / string extension dtypes #29483

Merged

Conversation

jorisvandenbossche
Copy link
Member

xref #20612

This implements the __from_arrow__ method for integer/string extension types, so roundtripping to arrow/parquet fully works.


results = []
for arr in chunks:
str_arr = StringArray(np.array(arr))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger Is there a way to turn off validation when creating a StringArray? (in this case I know the elements will be strings or None)
I didn't directly see a private method on StringArray that does this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at the moment. In StringArray.__init__ we have

skip_validation = isinstance(values, type(self))

I'd be open to a public API for allowing the user / library to state that they have the right values.

One slight concern though: right now it's expect that the values are strings or nan, and we may expect to change the NA value to change in the future. Libraries will have to adapt to that if they're claiming to be compatible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so actually the validation right now is needed here, even if it is just to convert None to nan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the None -> nan replacement happens in in _from_sequence, though I have a TODO to move it to _validate since that's already requires a pass over the data.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good catch, I switched to _from_sequence for now, and added a test for it.

@jorisvandenbossche jorisvandenbossche added Compat pandas objects compatability with Numpy or Python functions ExtensionArray Extending pandas with custom dtypes or arrays. IO Parquet parquet, feather labels Nov 8, 2019
@jorisvandenbossche jorisvandenbossche added this to the 1.0 milestone Nov 8, 2019
@@ -4717,6 +4717,7 @@ Several caveats.
* The ``pyarrow`` engine preserves the ``ordered`` flag of categorical dtypes with string types. ``fastparquet`` does not preserve the ``ordered`` flag.
* Non supported types include ``Period`` and actual Python object types. These will raise a helpful error message
on an attempt at serialization.
* The ``pyarrow`` engine preserves extension data types such as the nullable integer and string data type (requiring pyarrow >= 1.0.0).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only preserve integer and string extension types or does it preserve all? I assume the former but somewhat unclear in note

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only preserve integer and string extension types or does it preserve all? I assume the former but somewhat unclear in note

It does preserve it for all dtypes that have implemented the needed protocol methods (__arrow_array__, __from_arrow__), which right now is integer and string for pandas. And potentially any external EA as well, in case those support those protocols.

Tried to clarify this in the docs.

"""Construct IntegerArray from passed pyarrow Array"""
import pyarrow

if isinstance(array, pyarrow.Array):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is this False?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The passed pyarrow values can be either a pyarrow.Array or pyarrow.ChunkedArray. Added a comment for this

@jorisvandenbossche
Copy link
Member Author

The "red" CI is because of codecov (correctly) catching that the coverage of the changes is not good. This is because the tests only run with pyarrow master. We do not have a pyarrow dev build however (and not sure it is worth to do this for those limited number of tests, it would probably also need nightlies from pyarrow).

@TomAugspurger
Copy link
Contributor

Ignoring coverage is fine here. Not worth setting up nightly builds against pyarrow yet.

You also have some linting failures in the docs: https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=20694

@jorisvandenbossche
Copy link
Member Author

The failure on Linux py36_minimum_versions seems unrelated (an unexpected resource warning in a parser test, is this already a know failing / flaky tests?)

@jorisvandenbossche
Copy link
Member Author

More feedback here? Or ready to merge?

cc @jreback

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. minor comment on the test (if correct, no objections)

@jorisvandenbossche
Copy link
Member Author

jorisvandenbossche commented Nov 19, 2019

Updated to resolve conflicts. Unless more comments going to merge later today.

by implementing two methods: ``ExtensionArray.__arrow_array__`` and
``ExtensionDtype.__from_arrow__``.

The ``ExtensionArray.__arrow_array__`` ensures that ``pyarrow`` knowns how
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: knowns -> knows


The ``ExtensionDtype.__from_arrow__`` method then controls the conversion
back from pyarrow to a pandas ExtensionArray. This method receives a pyarrow
``Array`` or ``ChunkedArray`` as only argument and is expected to return the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"as only" -> "as its only"

@TomAugspurger TomAugspurger merged commit fe1803d into pandas-dev:master Nov 19, 2019
@jorisvandenbossche jorisvandenbossche deleted the int-string-from-arrow branch November 19, 2019 15:19
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…tension dtypes (pandas-dev#29483)

* Add __from_arrow__ support for IntegerArray, StringArray
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
…tension dtypes (pandas-dev#29483)

* Add __from_arrow__ support for IntegerArray, StringArray
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Compat pandas objects compatability with Numpy or Python functions ExtensionArray Extending pandas with custom dtypes or arrays. IO Parquet parquet, feather
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants