[ArrayManager] Enable read_parquet to not create 2D blocks when using ArrayManager #40303

Merged
merged 7 commits into from
Apr 26, 2021

jorisvandenbossche
Member

xref #39146

I was exploring the Parquet IO, and pyarrow has an option to not create consolidated blocks. If we use this option when we want to create an ArrayManager, we can reduce the memory usage. It's a bit slower, though, because there is still the overhead of creating more blocks (that's something that would need to be changed in pyarrow).

Would still need to add a test that checks the option is honored.
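
For readers following along, here is a minimal sketch of the idea described above, not the code in this PR. It assumes the pyarrow option in question is the `split_blocks` keyword of `Table.to_pandas`; the helper name `build_to_pandas_kwargs` and its `data_manager` argument are hypothetical and only stand in for pandas' `mode.data_manager` option.

```python
from typing import Any, Dict


def build_to_pandas_kwargs(data_manager: str) -> Dict[str, Any]:
    """Return keyword arguments intended for pyarrow's ``Table.to_pandas``.

    ``data_manager`` is a stand-in for the value of pandas'
    ``mode.data_manager`` option ("block" or "array"). This helper is
    hypothetical and is not part of the pandas API.
    """
    kwargs: Dict[str, Any] = {}
    if data_manager == "array":
        # Assumption: split_blocks=True asks pyarrow to create one block
        # per column instead of consolidating same-dtype columns into
        # 2D blocks.
        kwargs["split_blocks"] = True
    return kwargs


# Intended use (needs pyarrow installed, so it is left as a comment):
#   table = pyarrow.parquet.read_table(path)
#   df = table.to_pandas(**build_to_pandas_kwargs("array"))
```

With the default block manager the helper returns an empty dict, so the existing consolidated behaviour is left untouched.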

@jorisvandenbossche jorisvandenbossche added Internals Related to non-user accessible pandas implementation IO Parquet parquet, feather labels Mar 8, 2021
@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review April 1, 2021 08:10
@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Apr 21, 2021
Contributor

@jreback jreback left a comment

looks pretty reasonable. cc @jbrockmendel


# setup engines & skips
@pytest.fixture(
    params=[
        pytest.param(
            "fastparquet",
            marks=pytest.mark.skipif(
-               not _HAVE_FASTPARQUET, reason="fastparquet is not installed"
+               not _HAVE_FASTPARQUET or get_option("mode.data_manager") == "array",
+               reason="fastparquet is not installed or ArrayManager is used",
Member

is this a "for now" or a "ever"?

Member Author

If your question is "will ArrayManager be supported with the fastparquet engine?", that's probably a question for the fastparquet package. Since ArrayManager is only optional for now, there is still time to discuss that with them.

Member

ok, so not actionable on our end, thanks
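
As a rough illustration of the skip condition discussed in this thread, the sketch below restates it as a plain function so the logic can be read without the fixture around it. The function name `fastparquet_skip_reason` and its arguments are hypothetical; in the test suite the inputs come from `_HAVE_FASTPARQUET` and `get_option("mode.data_manager")`.

```python
from typing import Optional


def fastparquet_skip_reason(have_fastparquet: bool, data_manager: str) -> Optional[str]:
    """Return a skip reason for the fastparquet engine, or None to run it.

    Hypothetical helper mirroring the ``skipif`` condition in the diff:
    skip when fastparquet is missing or when ArrayManager is active.
    """
    if not have_fastparquet or data_manager == "array":
        return "fastparquet is not installed or ArrayManager is used"
    return None
```

The fastparquet tests would therefore run only when the package is installed and the default block manager is in use.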

@jbrockmendel
Member

small question, LGTM

@jorisvandenbossche jorisvandenbossche merged commit 58181b1 into pandas-dev:master Apr 26, 2021
@jorisvandenbossche jorisvandenbossche deleted the am-parquet branch April 26, 2021 09:44
yeshsurya pushed a commit to yeshsurya/pandas that referenced this pull request May 6, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Internals Related to non-user accessible pandas implementation IO Parquet parquet, feather

3 participants