Avoid Allocating When Decoding Thrift #5777
};
use thrift::transport::TReadTransport;

pub trait TInputProtocolRef<'de>: TInputProtocol {
This trait is what allows the reader to borrow from the input slice instead of allocating. Fortunately, #4892 already did the heavy lifting here.
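To illustrate the borrowing idea in the comment above, here is a minimal sketch. It is not the PR's code: `BorrowingInput`, `SliceInput`, `read_bytes_ref`, and `read_binary` are hypothetical names, and the one-byte length prefix is a simplification rather than the Thrift compact encoding. The point is only that a `'de` lifetime on the trait lets returned slices outlive the reader call and point straight into the input buffer.

```rust
use std::borrow::Cow;

/// Hypothetical, simplified stand-in for a borrowing input protocol.
/// The `'de` lifetime ties every returned slice to the input buffer,
/// not to the `&mut self` borrow of the reader.
trait BorrowingInput<'de> {
    fn read_bytes_ref(&mut self, len: usize) -> Option<&'de [u8]>;
}

/// Hypothetical reader over an in-memory slice.
struct SliceInput<'de> {
    buf: &'de [u8],
}

impl<'de> BorrowingInput<'de> for SliceInput<'de> {
    fn read_bytes_ref(&mut self, len: usize) -> Option<&'de [u8]> {
        // Copy the shared reference out so the split halves keep lifetime `'de`.
        let buf: &'de [u8] = self.buf;
        if len > buf.len() {
            return None;
        }
        let (head, tail) = buf.split_at(len);
        self.buf = tail;
        Some(head)
    }
}

/// Reads a one-byte length prefix followed by that many bytes (an assumed,
/// simplified framing) and returns the payload without copying it.
fn read_binary<'de>(input: &mut impl BorrowingInput<'de>) -> Option<Cow<'de, [u8]>> {
    let len = input.read_bytes_ref(1)?[0] as usize;
    input.read_bytes_ref(len).map(Cow::Borrowed)
}
```

Returning `Cow<'de, [u8]>` rather than `&'de [u8]` keeps the signature usable for a reader that cannot borrow (for example one wrapping a stream), which would return `Cow::Owned` instead.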
@@ -78,12 +78,22 @@ pub fn decode_metadata(buf: &[u8]) -> Result<ParquetMetaData> {
row_groups.push(RowGroupMetaData::from_thrift(schema_descr.clone(), rg)?);
}
let column_orders = parse_column_orders(t_file_metadata.column_orders, &schema_descr);
let kv_metadata = t_file_metadata.key_value_metadata.map(|x| {
This formulation is a little awkward; we'd likely want to do something to make it cleaner.
@@ -210,6 +211,12 @@ impl<'a> From<&'a [u8]> for ByteArray {
}
}

impl<'a> From<Cow<'a, [u8]>> for ByteArray {
It is worth highlighting that this conversion does perform an allocation, which means that when decoding into the Rust versions of statistics, etc., we still allocate. However, we now only do this for ByteArray types, whereas previously all columns would have associated allocations. Theoretically, the reader could also perform projection pushdown at this point.
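The trade-off described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `OwnedBytes` is a hypothetical stand-in for an owning container such as `ByteArray`, and `BorrowedStats` and `materialize` are invented names showing how a reader could defer the copy until it knows a column is projected.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an owning byte container such as `ByteArray`.
#[derive(Debug, PartialEq)]
struct OwnedBytes(Vec<u8>);

impl<'a> From<Cow<'a, [u8]>> for OwnedBytes {
    fn from(value: Cow<'a, [u8]>) -> Self {
        // `into_owned` copies a non-empty `Cow::Borrowed` into a fresh `Vec`
        // (an allocation) and moves a `Cow::Owned` without copying.
        OwnedBytes(value.into_owned())
    }
}

/// Hypothetical decoded statistics that still borrow from the input buffer.
struct BorrowedStats<'a> {
    min: Cow<'a, [u8]>,
    max: Cow<'a, [u8]>,
}

/// Converts to owned values only when the column is projected, so skipped
/// columns never pay for the copy.
fn materialize(stats: BorrowedStats<'_>, projected: bool) -> Option<(OwnedBytes, OwnedBytes)> {
    if projected {
        Some((OwnedBytes::from(stats.min), OwnedBytes::from(stats.max)))
    } else {
        None
    }
}
```

In this sketch the allocation happens at the borrowed-to-owned boundary only, which is why fixed-width values that are decoded by value would not need one at all.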
We don't have much benchmark coverage of metadata parsing; #5770 will hopefully help address this, but what we have shows a non-trivial performance uplift.
I'm confident it will be even more pronounced for wider schemas.
This is a proof of concept to show that this approach is possible.
Which issue does this PR close?
Closes #5775
Rationale for this change
I haven't had time to benchmark this properly, but in a quick test it at least halved the number of allocations associated with reading a Parquet file.
What changes are included in this PR?
Are there any user-facing changes?