ParquetFileArrowReader::get_record_reader[_by_columns] batch_size overallocates #2321

Closed
crepererum opened this issue Aug 4, 2022 · 0 comments · Fixed by #2511
Labels
bug, good first issue, help wanted, parquet

Comments

@crepererum
Contributor

Describe the bug
The batch_size passed to ParquetFileArrowReader::get_record_reader[_by_columns] causes buffers for that many records to be allocated in memory, even when the file contains fewer rows. This is unfortunate (or even dangerous) because the parameter is hard to estimate. In a system that reads and writes parquet files, you may assume that the files you wrote contain only a reasonable amount of data (in bytes), but you do not know how many rows they hold. Even inspecting the parquet file and its file-level metadata tells you that it is OK to read everything, so you optimistically pass a very high batch_size... and OOM your process.
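
To make the risk concrete, here is a rough back-of-the-envelope sketch in plain Rust. It assumes, as a simplification, that the reader reserves batch_size slots per fixed-width column up front; the real allocation pattern in the parquet crate may differ (level buffers, variable-length data). The helper `estimated_reservation_bytes` is hypothetical and not part of any crate.

```rust
/// Rough estimate of the bytes reserved up front when a reader
/// pre-allocates `batch_size` slots for every fixed-width column.
///
/// Hypothetical helper for illustration only; not part of the parquet crate.
/// `value_widths` holds the assumed width in bytes of one value per column.
/// Returns `None` if the estimate overflows `usize`.
fn estimated_reservation_bytes(batch_size: usize, value_widths: &[usize]) -> Option<usize> {
    value_widths.iter().try_fold(0usize, |acc, &width| {
        // Bytes for one column, then add to the running total,
        // bailing out with `None` on overflow at either step.
        batch_size
            .checked_mul(width)
            .and_then(|column_bytes| acc.checked_add(column_bytes))
    })
}
```

Under these assumptions, a file with two small columns (8-byte and 4-byte values) and a batch_size of 1,000,000,000 would ask for about 12 GB before a single row is decoded, regardless of how few rows the file actually holds.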

To Reproduce
No isolated code yet, but it roughly goes as follows:

  1. Create a reasonably sized parquet file with a single record batch and some columns.
  2. Use a really large batch_size when reading the file.

Expected behavior
The row counts are known at least within the file-level parquet metadata (and probably other places), so they should be applied as a limit before allocating buffers.
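
A minimal sketch of the expected behavior, written in plain Rust without the parquet crate so that it stays self-contained. The type `FileMetadataSketch` and the functions `effective_batch_size` and `allocate_column_buffer` are hypothetical stand-ins, not the crate's API; the point is only that the requested batch_size is capped by the known row count before any buffer is sized.

```rust
/// Hypothetical stand-in for file-level metadata: one row count per row group.
/// The real parquet crate exposes row counts through its own metadata types.
struct FileMetadataSketch {
    row_group_rows: Vec<usize>,
}

impl FileMetadataSketch {
    /// Total number of rows across all row groups in the file.
    fn total_rows(&self) -> usize {
        self.row_group_rows.iter().sum()
    }
}

/// Cap the caller's requested batch size at the number of rows the file holds.
fn effective_batch_size(requested: usize, metadata: &FileMetadataSketch) -> usize {
    requested.min(metadata.total_rows())
}

/// Size a column buffer from the capped value rather than the raw request,
/// so an optimistic `requested` cannot trigger a huge allocation.
fn allocate_column_buffer(requested: usize, metadata: &FileMetadataSketch) -> Vec<i64> {
    Vec::with_capacity(effective_batch_size(requested, metadata))
}
```

With this cap in place, passing `usize::MAX` as the requested size for a 150-row file would reserve space for 150 values only, while smaller requests pass through unchanged.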

Additional context
Occurs with arrow version 19.
