ArrowArc is an experimental data transport library that uses Apache Arrow for high-performance data movement. It is designed to be zero-code, zero-config, and zero-maintenance.
I'll add more benchmarks as I stabilize the library.
Transport 4 million records from Postgres to Parquet in under 3 seconds.
```json
{
  "StartTime": "2024-08-31T13:10:54-05:00",
  "EndTime": "2024-08-31T13:10:57-05:00",
  "RecordsProcessed": 4000000,
  "Throughput": "1337039.12 records/second",
  "ThroughputBytes": "172.16 MB/second",
  "TotalBytes": "515.05 MB",
  "TotalDuration": "2.992s"
}
```
You have several options to use ArrowArc:
- Use the command line utilities to transport data.
- Use the library in your Go program.
- Use a YAML configuration file to define your data pipelines.
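As a sketch of the YAML option, a pipeline definition could take a shape like the one below. The field names are purely illustrative and are not ArrowArc's actual configuration schema; consult the repository for the real format.

```yaml
# Hypothetical pipeline definition; field names are illustrative only.
pipeline:
  source:
    type: postgres
    dsn: postgres://user:pass@localhost:5432/mydb
    table: events
  sink:
    type: parquet
    path: ./events.parquet
```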
Use the `arrowarc` command to get started. It will display a help menu with available commands, including demos and benchmarks.
Example of setting up a pipeline to transport data from BigQuery to DuckDB:
```go
// Set up the BigQuery client and reader.
// (ctx, projectID, datasetID, tableID, and dbFilePath are assumed to be defined.)
bq, err := integrations.NewBigQueryReadClient(ctx)
if err != nil {
	log.Fatalf("Failed to create BigQuery client: %v", err)
}
reader, err := bq.NewBigQueryReader(ctx, projectID, datasetID, tableID)
if err != nil {
	log.Fatalf("Failed to create BigQuery reader: %v", err)
}

// Set up the DuckDB client and writer
duck, err := integrations.OpenDuckDBConnection(ctx, dbFilePath)
if err != nil {
	log.Fatalf("Failed to open DuckDB connection: %v", err)
}
writer, err := integrations.NewDuckDBRecordWriter(ctx, duck, tableID)
if err != nil {
	log.Fatalf("Failed to create DuckDB writer: %v", err)
}

// Create and start the data pipeline
p, err := pipeline.NewDataPipeline(reader, writer)
if err != nil {
	log.Fatalf("Failed to create pipeline: %v", err)
}
if err := p.Start(ctx); err != nil {
	log.Fatalf("Failed to start pipeline: %v", err)
}

// Wait for the pipeline to finish
if pipelineErr := <-p.Done(); pipelineErr != nil {
	log.Fatalf("Pipeline encountered an error: %v", pipelineErr)
}

// Print the transport report
fmt.Println(p.Report())
```
You can expect a report similar to this:
```json
{
  "start_time": "2024-08-31T10:22:23-05:00",
  "end_time": "2024-08-31T10:22:26-05:00",
  "records_processed": 4000000,
  "total_size": "0.63 GB",
  "total_duration": "3.34s",
  "throughput": "1197492.21 records/s",
  "throughput_size": "194.11 MB/s"
}
```
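If you are consuming the report programmatically, the JSON above can be decoded into a plain Go struct. The struct below is an illustration based solely on the fields shown in the sample report; it is not a type exported by ArrowArc.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TransportReport mirrors the fields of the sample report above.
// Illustrative only; ArrowArc may expose its own report type.
type TransportReport struct {
	StartTime        string `json:"start_time"`
	EndTime          string `json:"end_time"`
	RecordsProcessed int64  `json:"records_processed"`
	TotalSize        string `json:"total_size"`
	TotalDuration    string `json:"total_duration"`
	Throughput       string `json:"throughput"`
	ThroughputSize   string `json:"throughput_size"`
}

// parseReport decodes a JSON transport report.
func parseReport(data []byte) (TransportReport, error) {
	var r TransportReport
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	sample := []byte(`{
	  "start_time": "2024-08-31T10:22:23-05:00",
	  "end_time": "2024-08-31T10:22:26-05:00",
	  "records_processed": 4000000,
	  "total_size": "0.63 GB",
	  "total_duration": "3.34s",
	  "throughput": "1197492.21 records/s",
	  "throughput_size": "194.11 MB/s"
	}`)
	r, err := parseReport(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d records in %s\n", r.RecordsProcessed, r.TotalDuration)
}
```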
| Utility | Status |
|---|---|
| Transport Table | ✅ |
| Rewrite Parquet | ✅ |
| Generate Parquet | ✅ |
| Generate IPC | ✅ |
| Avro To Parquet | ✅ |
| CSV To Parquet | ✅ |
| CSV To JSON | ✅ |
| JSON To Parquet | ✅ |
| Parquet To CSV | ✅ |
| Parquet To JSON | ✅ |
| Flight Server | ✅ |
| Sync Table | ❌ |
| Validate Table | ❌ |
Legend: ✅ supported · 🚧 in progress · ❌ not yet supported

| Database | Extraction | Ingestion |
|---|---|---|
| PostgreSQL | ✅ | 🚧 |
| BigQuery | ✅ | ✅ |
| DuckDB | ✅ | ✅ |
| Spanner | ✅ | ❌ |
| CockroachDB | ✅ | 🚧 |
| MySQL | 🚧 | ❌ |
| Oracle | ❌ | ❌ |
| Snowflake | ❌ | ❌ |
| SQLite | ❌ | ❌ |
| Flight | ❌ | ❌ |
| Provider | Extraction | Ingestion |
|---|---|---|
| Google Cloud Storage (GCS) | ✅ | ✅ |
| Amazon S3 | ❌ | ❌ |
| Azure Blob Storage | ❌ | ❌ |
| Format | Extraction | Ingestion |
|---|---|---|
| Parquet | ✅ | ✅ |
| Avro | ✅ | ❌ |
| CSV | ✅ | ✅ |
| JSON | ✅ | ✅ |
| IPC | ✅ | ✅ |
| Iceberg | ✅ | ❌ |
All contributions are welcome; please see the Code of Conduct before getting started. See the LICENSE file for licensing details.