Avoid Upcasting to pa.large_binary() #803

Closed
sungwy wants to merge 1 commit from sy-arrow-rm-large

Conversation

@sungwy (Collaborator) commented Jun 9, 2024

Fixes #791

Parquet seems to have a 2GB upper limit on the size of a single page within each column. Attempting to write data larger than 2GB in a single cell/page results in the errors observed in the issue linked above.

This PR changes schema_to_pyarrow to cast to the regular (non-large) types, which restrict Arrow data representations to 2GB and therefore align with Parquet's limitation.
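For illustration only, here is a minimal sketch of the kind of downcasting this describes; `to_regular_type` / `to_regular_schema` are hypothetical helpers, not the PR's actual code, and the real conversion lives inside pyiceberg's `schema_to_pyarrow` visitor:

```python
import pyarrow as pa

def to_regular_type(dtype: pa.DataType) -> pa.DataType:
    # Hypothetical helper: swap the 64-bit-offset large_* types for their
    # regular 32-bit-offset counterparts, which cap an array's data buffer at 2GB.
    if dtype == pa.large_string():
        return pa.string()
    if dtype == pa.large_binary():
        return pa.binary()
    return dtype

def to_regular_schema(schema: pa.Schema) -> pa.Schema:
    # Rebuild the schema with every large_* field replaced by its regular variant.
    return pa.schema([field.with_type(to_regular_type(field.type)) for field in schema])

original = pa.schema([
    pa.field("name", pa.large_string()),
    pa.field("blob", pa.large_binary()),
])
print(to_regular_schema(original))  # name: string, blob: binary
```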

EDIT:

Based on the discussion on #791, it looks like we'd benefit from decoupling the motivation to support larger Arrow types from the size limitations of Parquet.

#807 resolves the schema inconsistency issue by always casting to the large_* types instead.
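As a rough, hedged illustration of that direction (the schema below is made up, not pyiceberg's actual output): regular string/binary columns cast losslessly to their large_* counterparts, so a reader that always reports large_* types stays consistent regardless of how the data was written.

```python
import pyarrow as pa

# Illustrative only: a table with a regular (32-bit offset) string column can be
# cast losslessly to the large_* variant, so always emitting large_* types keeps
# read schemas consistent.
table = pa.table({"name": pa.array(["a", "b"], type=pa.string())})
large_schema = pa.schema([pa.field("name", pa.large_string())])
print(table.cast(large_schema).schema)  # name: large_string
```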

@sungwy changed the title from "avoid upcasting to large_binary" to "Avoid Upcasting to pa.large_binary()" on Jun 9, 2024
@sungwy marked this pull request as draft on June 10, 2024 21:48
@sungwy closed this on Jun 14, 2024
@sungwy deleted the sy-arrow-rm-large branch on June 14, 2024 22:01
Linked issue: Upcasting and Downcasting inconsistencies with PyArrow Schema