This repository has been archived by the owner on Jan 11, 2021. It is now read-only.

Row-based read support #68

Merged: 17 commits, Mar 29, 2018

Collaborator

@sadikovi sadikovi commented Mar 26, 2018

This PR adds a new API for row-based reading of Parquet files.

This is one of the API layers that will eventually be provided (with the prospect of dropping the record API in favour of Arrow once it is integrated):

  • Low-level column readers API (already exists)
  • Record API (this PR)
  • Arrow API (TODO)

Most of the code is in src/record module:

  • api.rs contains the API for mapping to Row, our internal representation of records. It supports most primitive types (new ones are easy to add) and all complex types, e.g. structs, lists, maps.
  • triplet.rs contains an internal triplet iterator that provides access to (value, def level, rep level) tuples and performs the necessary buffering and spacing of values.
  • reader.rs contains the code to assemble a tree of readers and to traverse it. It also contains the RecordIter and RowIter iterators over records.
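
To illustrate what "spacing of values" means for the triplet iterator: Parquet stores only non-null values, and definition levels indicate which record slots actually hold one. The sketch below is a simplified illustration for a flat optional column (no repetition levels). The name `space_values` is hypothetical and this is not the code in triplet.rs.

```rust
/// Expands a densely packed column of non-null values into one slot per
/// record, using definition levels to decide where the nulls go.
///
/// Hypothetical helper: assumes a flat (non-repeated) optional column, so a
/// slot holds a value exactly when its definition level equals the maximum.
fn space_values<T: Clone>(
    values: &[T],
    def_levels: &[i16],
    max_def_level: i16,
) -> Vec<Option<T>> {
    let mut next = values.iter();
    def_levels
        .iter()
        .map(|&level| {
            if level == max_def_level {
                // Fully defined: consume the next stored value.
                next.next().cloned()
            } else {
                // Not fully defined: this slot is null.
                None
            }
        })
        .collect()
}
```

For values `[10, 20, 30]` and definition levels `[1, 0, 1, 1, 0]` with a maximum level of 1, this yields `[Some(10), None, Some(20), Some(30), None]`.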

The API for FileReader and RowGroupReader is also updated to return an iterator of Rows:

  • FileReader: fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<RowIter>;
  • RowGroupReader: fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<RowIter>;

These methods (especially the one on FileReader) can take an optional projected schema to read only certain columns from a file. When None is provided, the full schema is implied.
For FileReader, RowIter automatically loads the next row group as the iterator progresses. For RowGroupReader, the row iterator only loads data from that row group.
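
The behaviour described above (advance to the next row group when the current one is exhausted, and keep only projected columns) can be sketched with toy types. Everything here is an assumption made for illustration: `ToyRow`, `ToyRowGroup` and `ToyRowIter` are invented names, and a plain list of column names stands in for `SchemaType`. It does not use the crate's real readers.

```rust
/// Toy row: (column name, value) pairs. A stand-in for the PR's `Row` type.
type ToyRow = Vec<(String, i64)>;

/// Toy row group holding fully materialised rows.
struct ToyRowGroup {
    rows: Vec<ToyRow>,
}

/// Iterator that walks every row of every row group, moving on to the next
/// row group once the current one is exhausted.
struct ToyRowIter<'a> {
    groups: &'a [ToyRowGroup],
    group_idx: usize,
    row_idx: usize,
    /// `None` means "keep all columns", mirroring `projection: None`.
    projection: Option<Vec<String>>,
}

impl<'a> ToyRowIter<'a> {
    fn new(groups: &'a [ToyRowGroup], projection: Option<Vec<String>>) -> Self {
        ToyRowIter { groups, group_idx: 0, row_idx: 0, projection }
    }
}

impl<'a> Iterator for ToyRowIter<'a> {
    type Item = ToyRow;

    fn next(&mut self) -> Option<ToyRow> {
        loop {
            // No row groups left: the iterator is finished.
            let group = self.groups.get(self.group_idx)?;
            if let Some(row) = group.rows.get(self.row_idx) {
                self.row_idx += 1;
                let projected = match &self.projection {
                    None => row.clone(),
                    Some(cols) => row
                        .iter()
                        .filter(|(name, _)| cols.contains(name))
                        .cloned()
                        .collect(),
                };
                return Some(projected);
            }
            // Current row group is exhausted (or empty): move to the next one.
            self.group_idx += 1;
            self.row_idx = 0;
        }
    }
}
```

Passing a single-element slice of row groups gives the row-group-scoped behaviour with the same iterator type.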

I also added a parquet-read.rs file in the bin directory to read Parquet files. It allows quick inspection of files, similar to parquet-mr/parquet-tools. We now have two binaries: parquet-schema for inspecting the schema and parquet-read for reading data.

There are also some minor updates, such as making certain methods public so they can be used in the reader, and moving common test methods into test_common.rs.

Added the following test files:

  • nonnullable.impala.parquet (Impala test file)
  • nullable.impala.parquet (Impala test file)
  • nulls.snappy.parquet (file that contains nulls only)
  • nested_lists.snappy.parquet (file with nested lists)
  • nested_maps.snappy.parquet (file with nested maps)

I think it is a good idea to update the main README with the available APIs - will do it after this PR is merged.

Closes #41
Closes #42

@sadikovi
Collaborator Author

@sunchao could you review the final version of the record reader? It has not changed much, but I made some comment updates and minor tweaks.

Thanks!

@coveralls

Coverage Status

Coverage increased (+0.3%) to 92.773% when pulling c0b0d5b on sadikovi:read-support-final into d9710c8 on sunchao:master.

@coveralls

coveralls commented Mar 26, 2018

Coverage Status

Coverage increased (+0.2%) to 92.666% when pulling aefb3f4 on sadikovi:read-support-final into f2f0993 on sunchao:master.

Owner

@sunchao sunchao left a comment

Thanks @sadikovi ! this is awesome work! I left a few comments.

fn main() {
let args: Vec<String> = env::args().collect();
if args.len() != 2 && args.len() != 3 {
println!("Usage: read-file <file-path> <num-records>");
Owner

Change <num-records> to [num-records]?

Collaborator Author

Yes, will change.
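
A minimal sketch of how the optional `[num-records]` argument might be handled, shown only to make the usage string concrete. The function name `parse_args`, the error messages, and the use of `parquet-read` in the usage text are assumptions for illustration, not the code in this PR.

```rust
/// Parses `<file-path> [num-records]` from the full argument list
/// (including the program name at index 0).
///
/// Returns the file path and, if given, the number of records to print.
fn parse_args(args: &[String]) -> Result<(String, Option<usize>), String> {
    // Hypothetical usage text; square brackets mark the optional argument.
    const USAGE: &str = "Usage: parquet-read <file-path> [num-records]";
    match args {
        // Only the file path: no limit on the number of records.
        [_, path] => Ok((path.clone(), None)),
        // File path plus a record count that must be a non-negative integer.
        [_, path, num] => {
            let n = num
                .parse::<usize>()
                .map_err(|e| format!("invalid num-records '{}': {}\n{}", num, e, USAGE))?;
            Ok((path.clone(), Some(n)))
        }
        // Anything else: wrong number of arguments.
        _ => Err(USAGE.to_string()),
    }
}
```

A `main` function would typically call this with `std::env::args().collect::<Vec<String>>()` and print the error before exiting on failure.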

@@ -0,0 +1,43 @@
extern crate parquet;
Owner

I think the file name read-file is still not specific enough - after people do cargo install. Perhaps we should add prefix such as parquet or something to differentiate these executables (same for dump-schema)?

Collaborator Author

I can rename them to parquet-read and parquet-schema. Will that work?

Owner

Yes, that should be better 👍

/// Get full iterator of `Row` from a file (over all row groups).
/// Projected schema can be a subset of or equal to the file schema, when it is None,
/// full file schema is assumed.
fn get_row_iter(&self, projection: Option<SchemaType>) -> Result<FileRowIter>;
Owner

Can we use a single type of iterator for both FileReader::get_row_iter and RowGroupReader::get_row_iter?

Collaborator Author

Yes, we should be able to, will change.

PhysicalType::BYTE_ARRAY => {
match logical_type {
LogicalType::UTF8 | LogicalType::ENUM | LogicalType::JSON => {
Row::Str(String::from_utf8(value.data().to_vec()).unwrap())
Owner

you can change from_utf8 to from_utf8_unchecked

Collaborator Author

@sadikovi sadikovi Mar 28, 2018

It is marked as unsafe. I can add it like this:

let value = unsafe { String::from_utf8_unchecked(value.data().to_vec()) };
Row::Str(value)

Will it be okay?

Owner

Yes I think it should be OK.
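
For reference, the trade-off discussed in this thread can be sketched as follows. `String::from_utf8` validates the bytes and returns a `Result`, while `String::from_utf8_unchecked` skips validation and is therefore `unsafe`: the caller has to guarantee the bytes are valid UTF-8. The wrapper function names below are hypothetical and exist only for illustration.

```rust
use std::string::FromUtf8Error;

/// Checked conversion: validates the bytes and reports invalid UTF-8 as an
/// error instead of panicking.
fn bytes_to_string_checked(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    String::from_utf8(bytes.to_vec())
}

/// Unchecked conversion: skips validation.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8. Passing invalid
/// bytes breaks `String`'s invariants and can lead to undefined behaviour.
unsafe fn bytes_to_string_unchecked(bytes: &[u8]) -> String {
    // SAFETY: validity of `bytes` is delegated to the caller (see above).
    unsafe { String::from_utf8_unchecked(bytes.to_vec()) }
}
```

The unchecked variant avoids a validation pass over every value, at the cost of trusting that data annotated as UTF8 was written correctly. A malformed file would silently produce an invalid `String` rather than an error.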


/// Returns true if repeated type is an element type for the list.
/// Used to determine legacy list types.
/// This method is copied from Spark Parquet reader.
Owner

@sadikovi
Collaborator Author

@sunchao I addressed your comments. Could you do another pass on this PR? Thanks!

Owner

@sunchao sunchao left a comment

This LGTM!

@sunchao
Owner

sunchao commented Mar 29, 2018

The latest build failed because Thrift failed to compile with the latest nightly. I've filed Thrift-4536 to fix it. In the meantime, we can update the travis.yml to use an earlier nightly version (e.g., nightly-2018-03-14).

@sadikovi
Collaborator Author

Thanks for following up on the build error!

As you mentioned on JIRA, it looks like we need to fix the try_from crate. I suggest we do that, otherwise the build will fail for other people who use the latest nightly Rust. Let me know if you are planning to open a PR in the try_from repository, or if you want, I could try fixing it.

@sunchao
Owner

sunchao commented Mar 29, 2018

NP @sadikovi . Yes I'm working on a PR for the try_from repo - will post soon :)
Meanwhile you can change the travis.yml to use nightly-2018-03-14 to unblock the merge.

@sadikovi
Collaborator Author

@sunchao Thanks. I updated the version. We need to revert it once try_from is fixed!

@sunchao
Owner

sunchao commented Mar 29, 2018

Actually the fix is more involved than I expected. I think it might be wiser to wait until there's a conclusion in the Rust thread. It is breaking multiple libraries.

@sunchao
Owner

sunchao commented Mar 29, 2018

It failed again (not sure why since it succeeded on my machine) - can you try nightly-2018-03-26?

@sadikovi
Collaborator Author

@sunchao I updated, thanks!

@sunchao sunchao merged commit 67636b0 into sunchao:master Mar 29, 2018
@sunchao
Owner

sunchao commented Mar 29, 2018

Merged! Thanks @sadikovi for this awesome contribution!

@sadikovi sadikovi deleted the read-support-final branch March 29, 2018 23:02
@sadikovi
Collaborator Author

@sunchao Thanks for merging! I really appreciate your review and help throughout working on read support.

I will open another pull request with a README update, including how to read the schema and/or files.

3 participants