Use arrow library for checkpoint parquet file read/write #20

Merged
merged 3 commits into rivian:main on Jul 24, 2023

Conversation

chelseajonesr
Collaborator

@chelseajonesr chelseajonesr commented Jul 24, 2023

Use the Arrow library for checkpoint read/write.

Note: this is already much faster and more memory-efficient at loading large checkpoints than the previous version, but it still uses a very large amount of memory on very large checkpoints (73 GB for a 24-million-row checkpoint). An optimization branch is in progress.

github.com/segmentio/encoding v0.3.5 // indirect
golang.org/x/sys v0.5.0 // indirect
)

replace github.com/segmentio/parquet-go v0.0.0-20230427215636-d483faba23a5 => github.com/chelseajonesr/parquet-go v0.0.3
replace github.com/apache/arrow/go/v13 => github.com/chelseajonesr/arrow/go/v13 v13.0.0-20230711200800-c7890b0a2007
Collaborator Author

This redirection is for a minor patch that lets Arrow skip serializing/deserializing records to Arrow when the tag is "-", similar to JSON. It was accepted into the Arrow 14 branch as of today, so I'll likely remove this redirection in the optimization branch that is in progress.
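For readers unfamiliar with the JSON convention mentioned here, a minimal sketch using only the standard library's `encoding/json` is shown below. It illustrates the JSON side of the analogy only; the `checkpointRow` type, its fields, and the `encodeRow` helper are hypothetical names made up for this example, and the Arrow-side tag handling from the patch is not reproduced.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpointRow is a hypothetical record type used only for illustration;
// it is not taken from this repository.
type checkpointRow struct {
	Path  string `json:"path"`
	Size  int64  `json:"size"`
	Cache []byte `json:"-"` // tag "-": encoding/json skips this field entirely
}

// encodeRow marshals a row to JSON. The Cache field never appears in the output.
func encodeRow(r checkpointRow) (string, error) {
	b, err := json.Marshal(r)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	out, err := encodeRow(checkpointRow{
		Path:  "part-0001.parquet",
		Size:  1024,
		Cache: []byte("scratch"),
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"path":"part-0001.parquet","size":1024}
}
```

The same idea applies on decode: `json.Unmarshal` leaves a field tagged "-" untouched, so in-memory-only state can live on the struct without being written to or read from the serialized form.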

@jshiv jshiv merged commit 5d36c22 into rivian:main Jul 24, 2023
@chelseajonesr chelseajonesr deleted the arrow-library branch August 11, 2023 06:49