Commit

chore: remove donated ORC format related parts (#138)
waynexia authored Oct 30, 2024
1 parent 883c892 commit 4fd3453
Showing 200 changed files with 226 additions and 13,581 deletions.
8 changes: 0 additions & 8 deletions .github/workflows/ci.yml
@@ -189,11 +189,3 @@ jobs:
RUST_BACKTRACE: 1
CARGO_INCREMENTAL: 0
UNITTEST_LOG_DIR: "__unittest_logs"
- name: Codecov upload
uses: codecov/codecov-action@v2
with:
token: ${{ secrets.CODECOV_TOKEN }}
files: ./lcov.info
flags: rust
fail_ci_if_error: false
verbose: true
80 changes: 13 additions & 67 deletions Cargo.toml
@@ -16,7 +16,7 @@
# under the License.

[package]
name = "orc-rust"
name = "datafusion-orc"
version = "0.4.1"
edition = "2021"
homepage = "https://github.com/datafusion-contrib/datafusion-orc"
@@ -32,26 +32,17 @@ rust-version = "1.73"
all-features = true

[dependencies]
arrow = { version = "52", features = ["prettyprint", "chrono-tz"] }
bytemuck = { version = "1.18.0", features = ["must_cast"] }
arrow = { version = "53", features = ["prettyprint", "chrono-tz"] }
async-trait = { version = "0.1.77" }
bytes = "1.4"
chrono = { version = "0.4.37", default-features = false, features = ["std"] }
chrono-tz = "0.9"
fallible-streaming-iterator = { version = "0.1" }
flate2 = "1"
lz4_flex = "0.11"
lzokay-native = "0.1"
num = "0.4.1"
prost = { version = "0.12" }
snafu = "0.8"
snap = "1.1"
zstd = "0.12"

# async support
async-trait = { version = "0.1.77", optional = true }
futures = { version = "0.3", optional = true, default-features = false, features = ["std"] }
futures-util = { version = "0.3", optional = true }
tokio = { version = "1.28", optional = true, features = [
datafusion = { version = "42.0" }
datafusion-expr = { version = "42.0" }
datafusion-physical-expr = { version = "42.0" }
futures = { version = "0.3", default-features = false, features = ["std"] }
futures-util = { version = "0.3" }
object_store = { version = "0.11" }
orc-rust = { version = "0.5", features = ["async"] }
tokio = { version = "1.28", features = [
"io-util",
"sync",
"fs",
@@ -60,61 +51,16 @@ tokio = { version = "1.28", optional = true, features = [
"rt-multi-thread",
] }

# cli
anyhow = { version = "1.0", optional = true }
clap = { version = "4.5.4", features = ["derive"], optional = true }

# opendal
opendal = { version = "0.48", optional = true, default-features = false }

# datafusion support
datafusion = { version = "39.0.0", optional = true }
datafusion-expr = { version = "39.0.0", optional = true }
datafusion-physical-expr = { version = "39.0.0", optional = true }
object_store = { version = "0.10.1", optional = true }

[dev-dependencies]
arrow-ipc = { version = "52.0.0", features = ["lz4"] }
arrow-json = "52.0.0"
arrow-ipc = { version = "53.0.0", features = ["lz4"] }
arrow-json = "53.0.0"
criterion = { version = "0.5", default-features = false, features = ["async_tokio"] }
opendal = { version = "0.48", default-features = false, features = ["services-memory"] }
pretty_assertions = "1.3.0"
proptest = "1.0.0"
serde_json = { version = "1.0", default-features = false, features = ["std"] }

[features]
default = ["async"]

async = ["async-trait", "futures", "futures-util", "tokio"]
cli = ["anyhow", "clap"]
datafusion = ["async", "dep:datafusion", "datafusion-expr", "datafusion-physical-expr", "object_store"]
# Enable opendal support.
opendal = ["dep:opendal"]

[[bench]]
name = "arrow_reader"
harness = false
required-features = ["async"]
# Some issue when publishing and path isn't specified, so adding here
path = "./benches/arrow_reader.rs"

[profile.bench]
debug = true

[[example]]
name = "datafusion_integration"
required-features = ["datafusion"]
# Some issue when publishing and path isn't specified, so adding here
path = "./examples/datafusion_integration.rs"

[[bin]]
name = "orc-metadata"
required-features = ["cli"]

[[bin]]
name = "orc-export"
required-features = ["cli"]

[[bin]]
name = "orc-stats"
required-features = ["cli"]
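
For a downstream project, the manifest change above suggests a dependency split along the following lines. This is only a sketch under stated assumptions, not something taken from the commit: the `[dependencies]` table belongs to a hypothetical consumer crate, the git URL is the homepage listed under `[package]`, and the version numbers simply mirror the ones that appear in this diff.

```toml
# Hypothetical downstream Cargo.toml (illustration only; not part of this commit).

[dependencies]
# Core ORC reader, now pulled from its own crate. The "async" feature mirrors
# what datafusion-orc itself enables in this diff.
orc-rust = { version = "0.5", features = ["async"] }

# DataFusion integration only. The git source is an assumption based on the
# homepage field in [package].
datafusion-orc = { git = "https://github.com/datafusion-contrib/datafusion-orc" }

# Assumed to need to line up with the DataFusion version datafusion-orc builds against.
datafusion = "42.0"
```

Code that previously imported from `orc_rust::datafusion` would then import from `datafusion_orc`, as the change to `examples/datafusion_integration.rs` in this commit shows.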
123 changes: 2 additions & 121 deletions README.md
@@ -1,124 +1,5 @@
[![test](https://github.com/datafusion-contrib/datafusion-orc/actions/workflows/ci.yml/badge.svg)](https://github.com/datafusion-contrib/datafusion-orc/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/WenyXu/orc-rs/branch/main/graph/badge.svg?token=2CSHZX02XM)](https://codecov.io/gh/WenyXu/orc-rs)
[![Crates.io](https://img.shields.io/crates/v/orc-rust)](https://crates.io/crates/orc-rust)
[![Crates.io](https://img.shields.io/crates/d/orc-rust)](https://crates.io/crates/orc-rust)

# orc-rust

A native Rust implementation of the [Apache ORC](https://orc.apache.org) file format,
providing API's to read data into [Apache Arrow](https://arrow.apache.org) in-memory arrays.

See the [documentation](https://docs.rs/orc-rust/latest/orc_rust/) for examples on how to use this crate.

## Supported features

This crate currently only supports reading ORC files into Arrow arrays. Write support is planned
(see [Roadmap](#roadmap)). The below features listed relate only to reading ORC files.
At this time, we aim to support the [ORCv1](https://orc.apache.org/specification/ORCv1/) specification only.

- Read synchronously & asynchronously (using Tokio)
- All compression types (Zlib, Snappy, Lzo, Lz4, Zstd)
- All ORC data types
- All encodings
- Rudimentary support for retrieving statistics
- Retrieving user metadata into Arrow schema metadata

## Roadmap

The long term vision for this crate is to be feature complete enough to be donated to the
[arrow-rs](https://github.com/apache/arrow-rs) project.

The following lists the rough roadmap for features to be implemented, from highest to lowest priority.

- Performance enhancements
- DataFusion integration
- Predicate pushdown
- Row indices
- Bloom filters
- Write from Arrow arrays
- Encryption

A non-Arrow API interface is not planned at the moment. Feel free to raise an issue if there is such
a use case.

## Version compatibility

No guarantees are provided about stability across versions. We will endeavour to keep the top level API's
(`ArrowReader` and `ArrowStreamReader`) as stable as we can, but other API's provided may change as we
explore the interface we want the library to expose.

Versions will be released on an ad-hoc basis (with no fixed schedule).

## Mapping ORC types to Arrow types

The following table lists how ORC data types are read into Arrow data types:

| ORC Data Type | Arrow Data Type | Notes |
| ----------------- | -------------------------- | ----- |
| Boolean | Boolean | |
| TinyInt | Int8 | |
| SmallInt | Int16 | |
| Int | Int32 | |
| BigInt | Int64 | |
| Float | Float32 | |
| Double | Float64 | |
| String | Utf8 | |
| Char | Utf8 | |
| VarChar | Utf8 | |
| Binary | Binary | |
| Decimal | Decimal128 | |
| Date | Date32 | |
| Timestamp | Timestamp(Nanosecond, None) | ¹ |
| Timestamp instant | Timestamp(Nanosecond, UTC) | ¹ |
| Struct | Struct | |
| List | List | |
| Map | Map | |
| Union | Union(_, Sparse) | ² |

¹: `ArrowReaderBuilder::with_schema` allows configuring different time units or decoding to
`Decimal128(38, 9)` (i128 of non-leap nanoseconds since UNIX epoch).
Overflows may happen while decoding to a non-Seconds time unit, and results in `OrcError`.
Loss of precision may happen while decoding to a non-Nanosecond time unit, and results in `OrcError`.
`Decimal128(38, 9)` avoids both overflows and loss of precision.

²: Currently only supports a maximum of 127 variants

## Contributing

All contributions are welcome! Feel free to raise an issue if you have a feature request, bug report,
or a question. Feel free to raise a Pull Request without raising an issue first, as long as the Pull
Request is descriptive enough.

Some tools we use in addition to the standard `cargo` that require installation are:

- [taplo](https://taplo.tamasfe.dev/)
- [typos](https://crates.io/crates/typos)

```shell
cargo install typos-cli
cargo install taplo-cli
```

```shell
# Building the crate
cargo build

# Running the test suite
cargo test

# Simple benchmarks
cargo bench

# Formatting TOML files
taplo format

# Detect any typos in the codebase
typos
```

To regenerate/update the [proto.rs](src/proto.rs) file, execute the [regen.sh](regen.sh) script.

```shell
./regen.sh
```
# datafusion-orc

Experimental ORC file reader for DataFusion based on [orc-rust](https://crates.io/crates/orc-rust).
70 changes: 0 additions & 70 deletions benches/arrow_reader.rs

This file was deleted.

2 changes: 1 addition & 1 deletion examples/datafusion_integration.rs
@@ -17,7 +17,7 @@

use datafusion::error::Result;
use datafusion::prelude::*;
use orc_rust::datafusion::{OrcReadOptions, SessionContextOrcExt};
use datafusion_orc::{OrcReadOptions, SessionContextOrcExt};

#[tokio::main]
async fn main() -> Result<()> {