-
Notifications
You must be signed in to change notification settings - Fork 20
Conversation
@sunchao could you review the final version of record reader? It has not changed much, but I made some comment updates and minor tweaks. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sadikovi ! this is awesome work! I left a few comments.
src/bin/read-file.rs
Outdated
fn main() { | ||
let args: Vec<String> = env::args().collect(); | ||
if args.len() != 2 && args.len() != 3 { | ||
println!("Usage: read-file <file-path> <num-records>"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change <num-records>
to [num-records]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, will change.
src/bin/read-file.rs
Outdated
@@ -0,0 +1,43 @@ | |||
extern crate parquet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the file name read-file
is still not specific enough - after people do cargo install
. Perhaps we should add prefix such as parquet
or something to differentiate these executables (same for dump-schema
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can rename them to parquet-read
and parquet-schema
. Will that work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should be better 👍
src/file/reader.rs
Outdated
/// Get full iterator of `Row` from a file (over all row groups). | ||
/// Projected schema can be a subset of or equal to the file schema, when it is None, | ||
/// full file schema is assumed. | ||
fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<FileRowIter>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a single type of iterator for both FileReader::get_row_iter
and RowGroupReader::get_row_iter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should be able to, will change.
src/record/api.rs
Outdated
PhysicalType::BYTE_ARRAY => { | ||
match logical_type { | ||
LogicalType::UTF8 | LogicalType::ENUM | LogicalType::JSON => { | ||
Row::Str(String::from_utf8(value.data().to_vec()).unwrap()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can change from_utf8
to from_utf8_unchecked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is marked as unsafe. I can add it like this:
let value = unsafe { String::from_utf8_unchecked(value.data().to_vec()) };
Row::Str(value)
Will it be okay?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I think it should be OK.
src/record/reader.rs
Outdated
|
||
/// Returns true if repeated type is an element type for the list. | ||
/// Used to determine legacy list types. | ||
/// This method is copied from Spark Parquet reader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Perhaps good to also add the link for reference: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
@sunchao I addressed your comments. Could you do another pass on this PR? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM!
The latest build failed because Thrift failed to compile with the latest nightly. I've filed Thrift-4536 to fix it. In the meanwhile, we can update the travis.yml to use an earlier nightly version (e.g., nightly-2018-03-14). |
Thanks for the following up on the build error! As you mentioned on JIRA, it looks like we need to fix |
NP @sadikovi . Yes I'm working on a PR for the |
@sunchao Thanks. I updated the version. We need to revert it once |
Actually the fix is more involved than I expected. I think it might be wiser to wait until there's a conclusion in the Rust thread. It is breaking multiple libraries. |
It failed again (not sure why since it succeeded on my machine) - can you try |
@sunchao I updated, thanks! |
Merged! Thanks @sadikovi for this awesome contribution! |
@sunchao Thanks for merging! I really appreciate your review and help throughout working on read support. I will open another pull request with README update, including how to read schema and/or files. |
This PR adds new API for row-based reading of Parquet files.
This is one of the layers of API that will be finally provided (with a prospect of dropping record API in favour of Arrow, once it is integrated):
Most of the code is in
src/record
module:api.rs
contains all necessary API for mapping toRow
, which is our internal representation of records, supports most of the primitive types (easy to add new ones) and all complex types, e.g. structs, lists, maps.triplet.rs
contains internal triplet iterator that provides access to(value, def level, rep level)
tuple, performs necessary buffering and spacing of values.reader.rs
contains the code to assemble a tree of readers and to traverse the tree. Also containsRecordIter
andRowIter
iterators of records.API for
FileReader
andRowGroupReader
is also updated to return an iterator ofRow
s:fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<RowIter>;
fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<RowIter>;
These methods (especially for
FileReader
) can take optional projected schema to read only certain columns from a file. WhenNone
is provided, full schema is implied.RowIter
will automatically load the next row group, as iterator progresses. ForRowGroupReader
row iterator will only load data from that row group.I also added
parquet-read.rs
file inbin
directory, to read Parquet files. This allows to quickly inspect files similar toparquet-mr/parquet-tools
. Now we have two binaries - one is for inspecting schemaparquet-schema
and another one for reading dataparquet-read
.There are also some minor updates, like making certain methods public so they can be used in reader and moving common test methods into
test_common.rs
.Added the following test files:
nonnullable.impala.parquet
(Impala test file)nullable.impala.parquet
(Impala test file)nulls.snappy.parquet
(file that contains nulls only)nested_lists.snappy.parquet
(file with nested lists)nested_maps.snappy.parquet
(file with nested maps)Closes #41
Closes #42