Refactor top-level interface #42

Closed
Jefffrey opened this issue Nov 14, 2023 · 1 comment · Fixed by #53

Comments

@Jefffrey
Collaborator

(Here, "top-level interface" refers to how DataFusion will use this library to read ORC files, since that is the main purpose of the crate.)

Since we want this library to integrate with DataFusion, we should try to provide a cleaner interface for reading ORC files as record batches.

This is how it currently works:

fn new_arrow_reader(path: &str, fields: &[&str]) -> ArrowReader<File> {
    let f = File::open(path).expect("no file found");
    let reader = Reader::new(f).unwrap();
    let cursor = Cursor::new(reader, fields).unwrap();
    ArrowReader::new(cursor, None)
}

  • Create reader (our struct) from file
  • Create cursor (our struct) from reader
  • Create arrow reader (our struct) from cursor

The same applies to the async version.
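As a rough illustration, the three steps could be folded into a single entry point with a builder. Everything in this sketch is hypothetical: the `ArrowReaderBuilder` name, its methods, the default batch size, and the "empty projection means all columns" convention are assumptions for discussion, not the crate's actual API, and ORC decoding is left out entirely.

```rust
use std::io::Read;

/// Hypothetical default; the real crate may pick a different value.
const DEFAULT_BATCH_SIZE: usize = 8192;

/// Stand-in for the crate's record-batch reader (decoding omitted).
pub struct ArrowReader<R> {
    source: R,
    projection: Vec<String>,
    batch_size: usize,
}

impl<R> ArrowReader<R> {
    /// Column names selected for reading; empty means "all columns" here.
    pub fn projection(&self) -> &[String] {
        &self.projection
    }

    /// Maximum number of rows per record batch.
    pub fn batch_size(&self) -> usize {
        self.batch_size
    }

    /// Hands back the underlying byte source.
    pub fn into_inner(self) -> R {
        self.source
    }
}

/// Hypothetical builder replacing the reader -> cursor -> arrow reader chain.
pub struct ArrowReaderBuilder<R> {
    source: R,
    projection: Vec<String>,
    batch_size: usize,
}

impl<R: Read> ArrowReaderBuilder<R> {
    /// Starts from any readable source, such as an opened `File`.
    pub fn new(source: R) -> Self {
        Self {
            source,
            projection: Vec::new(),
            batch_size: DEFAULT_BATCH_SIZE,
        }
    }

    /// Restricts reading to the named columns.
    pub fn with_projection(mut self, fields: &[&str]) -> Self {
        self.projection = fields.iter().map(|f| f.to_string()).collect();
        self
    }

    /// Overrides the number of rows per record batch.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = batch_size;
        self
    }

    /// Consumes the builder and produces the reader.
    pub fn build(self) -> ArrowReader<R> {
        ArrowReader {
            source: self.source,
            projection: self.projection,
            batch_size: self.batch_size,
        }
    }
}
```

With something like this, `new_arrow_reader` would shrink to `ArrowReaderBuilder::new(f).with_projection(fields).build()`, and the intermediate reader and cursor structs would no longer be part of the public surface.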

We can take inspiration from how parquet does it:

@Jefffrey
Collaborator Author

I will work on simplifying the Reader/Cursor part a bit, perhaps by replicating what parquet does here: https://github.com/apache/arrow-rs/blob/master/parquet/src/file/reader.rs#L40-L68
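The linked lines are not reproduced in this thread, so the following is only a sketch of the general idea: a trait that abstracts over byte sources, loosely in the style of parquet's `ChunkReader`. The method names and signatures below are illustrative assumptions, not copied from parquet or from this crate.

```rust
use std::fs::File;
use std::io::{self, Cursor, Read, Seek, SeekFrom};

/// Hypothetical abstraction over anything that can serve byte ranges.
pub trait ChunkReader {
    /// Reader type returned for a given starting offset.
    type T: Read;

    /// Total length of the source in bytes.
    fn byte_len(&self) -> io::Result<u64>;

    /// Returns a reader positioned at `offset`.
    fn get_read(&self, offset: u64) -> io::Result<Self::T>;

    /// Reads exactly `length` bytes starting at `offset`.
    fn get_bytes(&self, offset: u64, length: u64) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.get_read(offset)?.take(length).read_to_end(&mut buf)?;
        if buf.len() as u64 != length {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "chunk extends past end of source",
            ));
        }
        Ok(buf)
    }
}

impl ChunkReader for File {
    type T = File;

    fn byte_len(&self) -> io::Result<u64> {
        Ok(self.metadata()?.len())
    }

    fn get_read(&self, offset: u64) -> io::Result<File> {
        // Note: a cloned handle shares its cursor with the original,
        // so interleaved reads through several clones are not independent.
        let mut f = self.try_clone()?;
        f.seek(SeekFrom::Start(offset))?;
        Ok(f)
    }
}

/// In-memory source, handy for tests.
impl ChunkReader for Vec<u8> {
    type T = Cursor<Vec<u8>>;

    fn byte_len(&self) -> io::Result<u64> {
        Ok(self.len() as u64)
    }

    fn get_read(&self, offset: u64) -> io::Result<Self::T> {
        if offset > self.len() as u64 {
            return Err(io::Error::new(
                io::ErrorKind::UnexpectedEof,
                "offset past end of buffer",
            ));
        }
        // Copies the tail of the buffer; fine for a sketch, wasteful in practice.
        Ok(Cursor::new(self[offset as usize..].to_vec()))
    }
}
```

The appeal of this shape is that the stripe and footer parsing code only needs `get_bytes`-style access, so the same reader could work over files, in-memory buffers, or (with an async counterpart) object stores.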
