ParquetFileArrowReader::get_record_reader[_by_columns] batch_size overallocates #2321
Describe the bug
The batch_size passed to ParquetFileArrowReader::get_record_reader[_by_columns] results in allocating that many records in memory, even when the file contains less data. This is a bit unfortunate (or dangerous) because this parameter is really hard to estimate. In a system that reads and writes parquet files, you may assume that the files written only contain a reasonable amount of data (in bytes), but you don't know how many rows there are. Even looking at the parquet file and its file-level metadata will tell you that it's OK to read everything, so you optimistically just pass a very high batch_size ... and OOM your process.
To Reproduce
No isolated code yet, but it roughly goes as follows: pass a very large batch_size when reading a parquet file that contains far fewer rows than that; see the sketch below.
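Something along these lines should show it. This is only an untested sketch, assuming the parquet 19 API (ArrowWriter, SerializedFileReader, ParquetFileArrowReader::new and ArrowReader::get_record_reader); the file name, column, and row values are made up for illustration:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::{ArrowReader, ArrowWriter, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a tiny parquet file: one Int32 column with three rows.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let mut writer = ArrowWriter::try_new(File::create("small.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // Read it back with an enormous batch_size. The reader sizes its buffers
    // for the requested batch_size rather than the three rows actually present.
    let file_reader = SerializedFileReader::new(File::open("small.parquet")?)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    let reader = arrow_reader.get_record_reader(1_000_000_000)?;
    for batch in reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```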
Expected behavior
The row counts are known at least within the file-level parquet metadata (and probably other places), so they should be applied as a limit before allocating buffers.
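For illustration, the row count is already exposed through the file-level metadata before any buffers are built, so a caller can clamp the batch size themselves today. A minimal sketch under the same parquet 19 API assumptions as above, reusing the hypothetical small.parquet file:

```rust
use std::{fs::File, sync::Arc};

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let requested_batch_size: usize = 1_000_000_000;

    let file_reader = SerializedFileReader::new(File::open("small.parquet")?)?;

    // The file-level metadata already reports the total row count, so cap the
    // batch size at that value instead of allocating for the full request.
    let num_rows = file_reader.metadata().file_metadata().num_rows() as usize;
    let batch_size = requested_batch_size.min(num_rows.max(1));

    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));
    for batch in arrow_reader.get_record_reader(batch_size)? {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}
```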
Additional context
Occurs with arrow version 19.