ArrowArc is an experimental data transport library that uses Apache Arrow for high-performance data movement. It is designed to be zero-code, zero-config, and zero-maintenance.
I'll add more benchmarks as I stabilize the library.
Transport 4 million records from Postgres to Parquet in under 3 seconds.
```json
{
  "StartTime": "2024-08-31T13:10:54-05:00",
  "EndTime": "2024-08-31T13:10:57-05:00",
  "RecordsProcessed": 4000000,
  "Throughput": "1337039.12 records/second",
  "ThroughputBytes": "172.16 MB/second",
  "TotalBytes": "515.05 MB",
  "TotalDuration": "2.992s"
}
```
You have several options to use ArrowArc:
- Use the command line utilities to transport data.
- Use the library in your Go program.
- Use a YAML configuration file to define your data pipelines.
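As a sketch of the YAML option, a pipeline definition could take a shape like the one below. The field names are purely illustrative and are not ArrowArc's actual configuration schema; consult the repository for the real format.

```yaml
# Hypothetical pipeline definition; field names are illustrative only.
pipeline:
  source:
    type: postgres
    dsn: postgres://user:pass@localhost:5432/mydb
    table: events
  sink:
    type: parquet
    path: ./events.parquet
```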
Use the `arrowarc` command to get started. It will display a help menu with available commands, including demos and benchmarks.
Example of setting up a pipeline to transport data from BigQuery to DuckDB:
```go
// Set up the BigQuery client and reader.
// (ctx, projectID, datasetID, tableID, and dbFilePath are assumed to be defined.)
bq, err := integrations.NewBigQueryReadClient(ctx)
if err != nil {
	log.Fatalf("Failed to create BigQuery client: %v", err)
}
reader, err := bq.NewBigQueryReader(ctx, projectID, datasetID, tableID)
if err != nil {
	log.Fatalf("Failed to create BigQuery reader: %v", err)
}

// Set up the DuckDB client and writer
duck, err := integrations.OpenDuckDBConnection(ctx, dbFilePath)
if err != nil {
	log.Fatalf("Failed to open DuckDB connection: %v", err)
}
writer, err := integrations.NewDuckDBRecordWriter(ctx, duck, tableID)
if err != nil {
	log.Fatalf("Failed to create DuckDB writer: %v", err)
}

// Create and start the data pipeline
p, err := pipeline.NewDataPipeline(reader, writer)
if err != nil {
	log.Fatalf("Failed to create pipeline: %v", err)
}
if err := p.Start(ctx); err != nil {
	log.Fatalf("Failed to start pipeline: %v", err)
}

// Wait for the pipeline to finish
if pipelineErr := <-p.Done(); pipelineErr != nil {
	log.Fatalf("Pipeline encountered an error: %v", pipelineErr)
}

// Print the transport report
fmt.Println(p.Report())
```
You can expect a report similar to this:
```json
{
  "start_time": "2024-08-31T10:22:23-05:00",
  "end_time": "2024-08-31T10:22:26-05:00",
  "records_processed": 4000000,
  "total_size": "0.63 GB",
  "total_duration": "3.34s",
  "throughput": "1197492.21 records/s",
  "throughput_size": "194.11 MB/s"
}
```
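If you are consuming the report programmatically, the JSON above can be decoded into a plain Go struct. The struct below is an illustration based solely on the fields shown in the sample report; it is not a type exported by ArrowArc.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TransportReport mirrors the fields of the sample report above.
// Illustrative only; ArrowArc may expose its own report type.
type TransportReport struct {
	StartTime        string `json:"start_time"`
	EndTime          string `json:"end_time"`
	RecordsProcessed int64  `json:"records_processed"`
	TotalSize        string `json:"total_size"`
	TotalDuration    string `json:"total_duration"`
	Throughput       string `json:"throughput"`
	ThroughputSize   string `json:"throughput_size"`
}

// parseReport decodes a JSON transport report.
func parseReport(data []byte) (TransportReport, error) {
	var r TransportReport
	err := json.Unmarshal(data, &r)
	return r, err
}

func main() {
	sample := []byte(`{
	  "start_time": "2024-08-31T10:22:23-05:00",
	  "end_time": "2024-08-31T10:22:26-05:00",
	  "records_processed": 4000000,
	  "total_size": "0.63 GB",
	  "total_duration": "3.34s",
	  "throughput": "1197492.21 records/s",
	  "throughput_size": "194.11 MB/s"
	}`)
	r, err := parseReport(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d records in %s\n", r.RecordsProcessed, r.TotalDuration)
}
```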
| Utility | Status |
|---|---|
| Transport Table | ✅ |
| Rewrite Parquet | ✅ |
| Generate Parquet | ✅ |
| Generate IPC | ✅ |
| Avro To Parquet | ✅ |
| CSV To Parquet | ✅ |
| CSV To JSON | ✅ |
| JSON To Parquet | ✅ |
| Parquet To CSV | ✅ |
| Parquet To JSON | ✅ |
| Flight Server | ✅ |
| Sync Table | ❌ |
| Validate Table | ❌ |
Legend: ✅ supported · 🚧 in progress · ❌ not yet supported

| Database | Extraction | Ingestion |
|---|---|---|
| PostgreSQL | ✅ | 🚧 |
| BigQuery | ✅ | ✅ |
| DuckDB | ✅ | ✅ |
| Spanner | ✅ | ❌ |
| CockroachDB | ✅ | 🚧 |
| MySQL | 🚧 | ❌ |
| Oracle | ❌ | ❌ |
| Snowflake | ❌ | ❌ |
| SQLite | ❌ | ❌ |
| Flight | ❌ | ❌ |
| Provider | Extraction | Ingestion |
|---|---|---|
| Google Cloud Storage (GCS) | ✅ | ✅ |
| Amazon S3 | ❌ | ❌ |
| Azure Blob Storage | ❌ | ❌ |
| Format | Extraction | Ingestion |
|---|---|---|
| Parquet | ✅ | ✅ |
| Avro | ✅ | ❌ |
| CSV | ✅ | ✅ |
| JSON | ✅ | ✅ |
| IPC | ✅ | ✅ |
| Iceberg | ✅ | ❌ |
All contributions are welcome; please see the Code of Conduct before getting started. See the LICENSE file for licensing details.