Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parquet_derive: support reading selected columns from parquet file #6268

Closed
double-free opened this issue Aug 18, 2024 · 1 comment · Fixed by #6269
Closed

parquet_derive: support reading selected columns from parquet file #6268

double-free opened this issue Aug 18, 2024 · 1 comment · Fixed by #6269
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet-derive

Comments

@double-free
Copy link
Contributor

double-free commented Aug 18, 2024

Feature Description

I'm effectively using parquet_derive in my project, and I found that there are two inconvenient constraints:

  1. The ParquetRecordReader enforces the struct to organize fields exactly in the same order in the parquet file.
  2. The ParquetRecordReader enforces the struct to parse all fields in the parquet file. "all" might be exaggerating, but it is what happens if you want to get the last column, even only the last column.

As describe in its document:

Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with #[derive(ParquetRecordReader)] and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.

In my use cases (and I believe these are common requests), user should be able to read pruned parquet file, and they should have the freedom to re-organize fields' ordering in decoded struct.

My Solution

I introduced a HashMap to map field name to its index. Of course, it assumes field name is unique, and this is always true since the current parquet_derive macro is applied to a flat struct without nesting.

Pros and Cons

Obviously removing those two constraints makes parquet_derive a more handy tool.

But it has some implied changes:

  • previously, since the ParquetRecordReader relies only on the index of fields, it allows that a field is named as abc to implicitly rename itself to bcd in the encoded struct. After this change, user must guarantee that the field name in ParquetRecordReader to exist in parquet columns.
    • I think it is more intuitive and more natural to constrain the "field name" rather than "index", if we use ParquetRecordReader to derive a decoder macro.
  • allowing reading partial parquet file may improve the performance for some users, but introducing a HashMap in the parser may slowdown the function a bit.
    • when the num_records in a single parsing call is large enough, the cost of HashMap lookup is negligible.

Both implied changes seem to have a more positive impact than negative impact. Please review if this is a reasonable feature request.

@alamb
Copy link
Contributor

alamb commented Aug 31, 2024

label_issue.py automatically added labels {'parquet-derive'} from #6269

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet-derive
Projects
None yet
2 participants