parquet_derive: support reading selected columns from parquet file #6268

double-free · 2024-08-18T14:34:47Z

Feature Description

I'm using parquet_derive in my project, and I have run into two inconvenient constraints:

The ParquetRecordReader requires the struct's fields to be in exactly the same order as the columns in the parquet file.
The ParquetRecordReader requires the struct to parse all fields in the parquet file. "All" may be an exaggeration, but it is what happens if you want the last column, even if it is the only column you want.

As described in its documentation:

Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with #[derive(ParquetRecordReader)] and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.

In my use cases (and I believe these are common requests), users should be able to read a pruned parquet file, and they should be free to reorder the fields in the decoded struct.

My Solution

I introduced a HashMap that maps each field name to its index. This assumes field names are unique, which always holds because the current parquet_derive macro applies only to a flat struct without nesting.
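The idea can be sketched with the standard library alone. This is a minimal illustration, not the code from the actual change: `column_index_map` and `resolve_fields` are hypothetical helper names, and plain string slices stand in for the parquet schema.

```rust
use std::collections::HashMap;

/// Build a name -> index map from the file's column names, given in file order.
/// Hypothetical helper; assumes column names are unique (flat schema).
fn column_index_map(file_columns: &[&str]) -> HashMap<String, usize> {
    file_columns
        .iter()
        .enumerate()
        .map(|(idx, name)| (name.to_string(), idx))
        .collect()
}

/// Resolve each struct field to the index of the file column with the same name.
/// The struct may list fields in any order and may omit columns.
/// Returns an error naming the first field that has no matching column.
fn resolve_fields(
    struct_fields: &[&str],
    file_columns: &[&str],
) -> Result<Vec<usize>, String> {
    let by_name = column_index_map(file_columns);
    struct_fields
        .iter()
        .map(|field| {
            by_name
                .get(*field)
                .copied()
                .ok_or_else(|| format!("column '{field}' not found in file"))
        })
        .collect()
}
```

With file columns `a`, `b`, `c`, a struct declaring only `c` and then `a` resolves to indices 2 and 0, so column `b` is never touched.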

Pros and Cons

Obviously, removing those two constraints makes parquet_derive a handier tool.

But it has some implied changes:

Previously, since the ParquetRecordReader relied only on the index of fields, a field named abc could be implicitly renamed to bcd in the struct. After this change, the user must guarantee that every field name in the ParquetRecordReader struct exists among the parquet columns.
- I think it is more intuitive and more natural to constrain the field name rather than the index if we use ParquetRecordReader to derive a decoder.
Allowing partial reads of a parquet file may improve performance for some users, but introducing a HashMap in the parser may slow the function down a bit.
- When num_records in a single parsing call is large enough, the cost of the HashMap lookup is negligible.
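Both trade-offs can be seen in a small sketch. It is only an illustration under stated assumptions: `Column`, `read_field`, and `sample_columns` are hypothetical names, and in-memory vectors stand in for decoded parquet columns. The name lookup happens once per call, so its cost is spread over all `num_records` rows, and a field name that matches no column becomes an error instead of silently binding by position.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one decoded column of i64 values.
type Column = Vec<i64>;

/// Read up to `num_records` values of the column named `field`.
/// The HashMap lookup runs once per call, not once per record.
fn read_field(
    columns: &HashMap<String, Column>,
    field: &str,
    num_records: usize,
) -> Result<Vec<i64>, String> {
    let column = columns
        .get(field)
        .ok_or_else(|| format!("no column named '{field}'"))?;
    // Per-record work is a plain copy; no further name lookups are needed.
    Ok(column.iter().take(num_records).copied().collect())
}

/// Small two-column fixture used for illustration only.
fn sample_columns() -> HashMap<String, Column> {
    HashMap::from([
        ("abc".to_string(), vec![1, 2, 3]),
        ("xyz".to_string(), vec![10, 20, 30]),
    ])
}
```

Here reading `xyz` works without reading `abc` first, while asking for `bcd` fails because no column has that name.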

Both implied changes seem to have more positive than negative impact. Please review whether this is a reasonable feature request.


alamb · 2024-08-31T13:34:21Z

label_issue.py automatically added labels {'parquet-derive'} from #6269

double-free added the enhancement label Aug 18, 2024

double-free mentioned this issue Aug 18, 2024

parquet_derive: Match fields by name, support reading selected fields rather than all #6269

Merged

alamb closed this as completed in #6269 Aug 31, 2024

alamb added the parquet-derive label Aug 31, 2024
