Feature Description

I'm effectively using `parquet_derive` in my project, and I found two inconvenient constraints:
1. `ParquetRecordReader` requires the struct to declare its fields in exactly the same order as the columns in the Parquet file.
2. `ParquetRecordReader` requires the struct to read all fields in the Parquet file. "All" might be an exaggeration, but it is effectively what happens if you want the last column, even only the last column: every column before it must be declared and decoded as well.
As described in its documentation:

> Derive flat, simple RecordReader implementations. Works by parsing a struct tagged with `#[derive(ParquetRecordReader)]` and emitting the correct writing code for each field of the struct. Column readers are generated in the order they are defined.
In my use cases (and I believe these are common requests), users should be able to read a pruned Parquet file, and they should have the freedom to reorder fields in the decoded struct.
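To make the pain points concrete, here is a hypothetical example. The schema (columns `id`, `name`, `score`, in that order) and all struct names are invented for illustration:

```rust
use parquet_derive::ParquetRecordReader;

// Suppose the file's columns are, in this order: id (i64), name (utf8), score (f64).

// Works today: the struct mirrors the file's column order exactly.
#[derive(ParquetRecordReader)]
struct FullRow {
    id: i64,
    name: String,
    score: f64,
}

// Compiles, but misbehaves today: column readers are generated positionally,
// so the `score` reader is pointed at column 0 (`id`) and either fails on the
// type mismatch or silently decodes the wrong column.
#[derive(ParquetRecordReader)]
struct ReorderedRow {
    score: f64,
    id: i64,
    name: String,
}

// Not expressible today: reading only the last column would point the single
// generated reader at column 0 instead of `score`.
#[derive(ParquetRecordReader)]
struct ScoreOnly {
    score: f64,
}
```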
My Solution
I introduced a `HashMap` that maps each field name to its column index. Of course, this assumes field names are unique, which is always true since the current `parquet_derive` macro applies only to flat structs without nesting.
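Here is a minimal sketch of the idea, assuming the standard `parquet` crate reader APIs; it is not the actual macro output, and the helper names `column_indices` and `reader_for` are hypothetical:

```rust
use std::collections::HashMap;

use parquet::column::reader::ColumnReader;
use parquet::errors::ParquetError;
use parquet::file::reader::RowGroupReader;

// Build a map from column name to its index in the row group's schema.
// Field names are unique in a flat struct, so the map is unambiguous.
fn column_indices(row_group: &dyn RowGroupReader) -> HashMap<String, usize> {
    let schema = row_group.metadata().schema_descr();
    (0..schema.num_columns())
        .map(|i| (schema.column(i).name().to_string(), i))
        .collect()
}

// The generated code would resolve each struct field's column by name
// instead of by declaration position, so a missing column becomes an
// explicit error rather than a silent positional mismatch.
fn reader_for(
    row_group: &dyn RowGroupReader,
    indices: &HashMap<String, usize>,
    field: &str,
) -> Result<ColumnReader, ParquetError> {
    let idx = *indices
        .get(field)
        .ok_or_else(|| ParquetError::General(format!("column `{field}` not found")))?;
    row_group.get_column_reader(idx)
}
```

With name-based resolution, the struct's field order no longer needs to match the file, and columns that the struct does not mention are simply never read.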
Pros and Cons
Obviously, removing those two constraints makes `parquet_derive` a handier tool, but it has some implied changes:
- Previously, since `ParquetRecordReader` relied only on field indices, a column named `abc` could be implicitly renamed to `bcd` in the decoded struct. After this change, users must guarantee that every field name in the `ParquetRecordReader` struct exists among the Parquet columns. I think it is more intuitive and natural to constrain field names rather than indices when `ParquetRecordReader` is used to derive a decoder.
- Allowing partial reads of a Parquet file may improve performance for some users, but introducing a `HashMap` into the parser may slow it down slightly. Since the lookup happens once per column per read call rather than once per record, when `num_records` in a single call is large enough, the cost of the `HashMap` lookups is negligible.
Both implied changes seem to have more positive impact than negative. Please review whether this is a reasonable feature request.