ParquetFileArrowReader::get_record_reader[_by_columns] batch_size overallocates #2321
Describe the bug
The batch_size passed to ParquetFileArrowReader::get_record_reader[_by_columns] results in allocating that many records in memory, even when the file contains less data. This is a bit unfortunate (or dangerous) because this parameter is really hard to estimate. In a system that reads and writes parquet files, you may assume that the files written only contain a reasonable amount of data (in bytes), but you don't know how many rows there are. Even looking at the parquet file and its file-level metadata will tell you that it's OK to read everything, so you optimistically just pass a very high batch_size ... and OOM your process.
To Reproduce
No isolated code yet, but it roughly goes as follows: pass a very large batch_size when reading a parquet file that contains far fewer rows than that; see the sketch below.
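Something along these lines should show it. This is only an untested sketch, assuming the parquet 19 API (ArrowWriter, SerializedFileReader, ParquetFileArrowReader::new and ArrowReader::get_record_reader); the file name, column, and row values are made up for illustration:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a tiny parquet file: one Int32 column with three rows.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let mut writer = ArrowWriter::try_new(File::create("small.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read it back with an enormous batch_size. The reader sizes its buffers
    // for the requested batch_size rather than the three rows actually present.
    let file_reader = SerializedFileReader::new(File::open("small.parquet")?)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    let reader = arrow_reader.get_record_reader(1_000_000_000)?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```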
Expected behavior
The row counts are known at least within the file-level parquet metadata (and probably other places), so they should be applied as a limit before allocating buffers.
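For illustration, the row count is already exposed through the file-level metadata before any buffers are built, so a caller can clamp the batch size themselves today. A minimal sketch under the same parquet 19 API assumptions as above, reusing the hypothetical small.parquet file:

```rust
use std::{fs::File, sync::Arc};

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let requested_batch_size: usize = 1_000_000_000;

    let file_reader = SerializedFileReader::new(File::open("small.parquet")?)?;

    // The file-level metadata already reports the total row count, so cap the
    // batch size at that value instead of allocating for the full request.
    let num_rows = file_reader.metadata().file_metadata().num_rows() as usize;
    let batch_size = requested_batch_size.min(num_rows.max(1));

    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    for batch in arrow_reader.get_record_reader(batch_size)? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```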
Additional context
Occurs with arrow version 19.