A lightweight, fully Python-based workflow for building an Apache Iceberg table from a pure Parquet dataset using non-JVM tools.
This repository demonstrates how to:
- Ingest and Partition Data: Use Daft or Polars to read a Parquet file and write partitioned Parquet files based on a designated column (e.g., group); see the sketch after this list.
- Create an Iceberg Table: Dynamically infer the table schema from the partitioned data and create an Apache Iceberg table using PyIceberg with either a SQLite or PostgreSQL catalog.
- Append Data & Track Snapshots: Append partitioned data to the Iceberg table and view snapshot history.
- Upsert Data: Merge new data with the existing table data through an upsert operation.
- Query the Table: Use DuckDB’s Iceberg extension to query the Iceberg table directly.
- Utilize Distributed Processing: Integrate Daft and Polars with Ray for distributed execution.
- Use PostgreSQL as a Catalog: Store Iceberg metadata in a PostgreSQL database instead of SQLite.
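The partitioning step from the first item can be as small as the following sketch, which reads the source file with Polars and writes a Hive-partitioned copy with PyArrow. The output directory (`partitioned_data`) is an assumption and may differ from what the repository scripts actually use.

```python
# Minimal sketch, not the repository's exact code: read the source Parquet file
# with Polars and write it back out partitioned by the "group" column.
import polars as pl
import pyarrow.parquet as pq

df = pl.read_parquet("large_dataset.parquet")

# PyArrow lays out one subdirectory per distinct value of "group"
# (Hive-style partitioning: partitioned_data/group=<value>/...).
pq.write_to_dataset(
    df.to_arrow(),
    root_path="partitioned_data",   # assumed output directory
    partition_cols=["group"],
)
```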
- Python 3.11+
- Install dependencies using:
pip install -r requirements.txt
The key dependencies include:
- getdaft[all]
- getdaft[ray]
- pyarrow
- duckdb
- pyiceberg
- pyiceberg[sql-sqlite]
- pyiceberg[sql-postgres]
- polars
Ensure you have a Parquet file named large_dataset.parquet in the repository root.
Execute the main pipeline to convert your Parquet dataset into an Iceberg table.
With a SQLite catalog:
python main-daft.py
or
python main-polars.py
With Ray for distributed execution:
python main-daft-ray.py
or
python main-polars-ray.py
With a PostgreSQL catalog:
python main-daft-psql.py
or
python main-polars-psql.py
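For the Ray variants, the switch to distributed execution in Daft amounts to selecting the Ray runner before building the DataFrame. A minimal sketch follows; the Ray address is an assumption, and main-daft-ray.py may configure things differently.

```python
# Minimal sketch of running Daft on Ray.
import daft

# Point Daft at an existing Ray cluster (e.g. the one from docker-compose);
# calling set_runner_ray() with no address starts a local Ray instance instead.
daft.context.set_runner_ray(address="ray://localhost:10001")  # assumed address

df = daft.read_parquet("large_dataset.parquet")
print(df.count_rows())  # executed on the Ray cluster
```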
This will:
- Read and partition the Parquet file.
- Dynamically infer the schema.
- Create an Iceberg table using either SQLite or PostgreSQL as a catalog.
- Append the partitioned data.
- Print the table schema and snapshot history.
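The Iceberg side of these steps boils down to a handful of PyIceberg calls. Below is a minimal sketch with a SQLite-backed SQL catalog; the catalog name, namespace, table name, and paths are assumptions, not necessarily those used by the scripts.

```python
# Minimal sketch, assuming a SQLite-backed SQL catalog and a local warehouse directory.
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    type="sql",
    uri="sqlite:///pyiceberg_catalog.db",  # catalog metadata stored in SQLite
    warehouse="file:///tmp/warehouse",     # where table data/metadata files land
)

# Infer the schema from the partitioned Parquet data and create the table.
arrow_table = pq.read_table("partitioned_data")  # assumed partition directory
catalog.create_namespace("demo")                 # assumed namespace
table = catalog.create_table("demo.events", schema=arrow_table.schema)

# Each append produces a new Iceberg snapshot.
table.append(arrow_table)
print(table.schema())
```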
A separate upsert script is provided to merge new data into the existing Iceberg table. This operation reads all partitioned files, merges them with new records based on a key (e.g., id), and then overwrites the table with the merged result.
Run the upsert process with:
python upsert.py
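Under the read-merge-overwrite approach described above, the core of the upsert might look like the following sketch. The key column id, catalog settings, table name, and input file are assumptions; upsert.py's actual logic may differ.

```python
# Minimal sketch of a read-merge-overwrite upsert keyed on "id".
import polars as pl
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", type="sql", uri="sqlite:///pyiceberg_catalog.db")
table = catalog.load_table("demo.events")             # assumed table name

existing = pl.from_arrow(table.scan().to_arrow())
new_records = pl.read_parquet("new_records.parquet")  # assumed input file

# Assumes both frames share the same schema; new rows win on duplicate ids.
merged = pl.concat([existing, new_records]).unique(
    subset=["id"], keep="last", maintain_order=True
)

# Overwrite the table contents with the merged result (creates a new snapshot).
table.overwrite(merged.to_arrow())
```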
A variant that works with Parquet files is also available:
python upsert_parquet.py
To view the table’s metadata location and snapshot history, run:
python read_history.py
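The essence of read_history.py can be reproduced with a short PyIceberg snippet along these lines (catalog and table names are assumed):

```python
# Minimal sketch: load the table from the catalog and print metadata and history.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default", type="sql", uri="sqlite:///pyiceberg_catalog.db")
table = catalog.load_table("demo.events")  # assumed table name

print("Metadata location:", table.metadata_location)
for entry in table.history():
    print(entry.snapshot_id, entry.timestamp_ms)
```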
Query the table using DuckDB’s Iceberg extension with the latest snapshot metadata:
python query_iceberg_duckdb.py
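At its core, the DuckDB query goes through the iceberg extension's iceberg_scan function, pointed at the latest metadata JSON file. A minimal sketch follows; the metadata path is illustrative and depends on your warehouse location and snapshot version.

```python
# Minimal sketch of querying the Iceberg table with DuckDB's iceberg extension.
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")

# Path to the latest metadata file written by PyIceberg (assumed location).
metadata = "/tmp/warehouse/demo.db/events/metadata/v2.metadata.json"
count = con.execute(
    f"SELECT COUNT(*) FROM iceberg_scan('{metadata}')"
).fetchone()[0]
print("rows:", count)
```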
- main-daft.py: Converts the Parquet dataset into an Apache Iceberg table using Daft and a SQLite catalog.
- main-daft-psql.py: Uses Daft with a PostgreSQL catalog.
- main-daft-ray.py: Runs Daft with Ray for distributed processing.
- main-polars.py: Uses Polars with a SQLite catalog.
- main-polars-psql.py: Uses Polars with a PostgreSQL catalog.
- main-polars-ray.py: Runs Polars with Ray for distributed processing.
- upsert.py: Handles upserts to the Iceberg table.
- upsert_parquet.py: Upserts new data using Parquet files.
- read_history.py: Loads the table from the catalog and prints metadata and snapshot history.
- query_iceberg_duckdb.py: Uses DuckDB's Iceberg extension to query the Iceberg table.
A docker-compose.yml file is included to run PostgreSQL and Ray for the distributed processing setup.
- To start the PostgreSQL and Ray cluster:
docker-compose up -d
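With the containers running, the only change on the PyIceberg side is the catalog URI. A minimal sketch, assuming credentials and a database name that match your docker-compose.yml:

```python
# Minimal sketch of a PostgreSQL-backed SQL catalog; the connection string,
# credentials, and database name are assumptions, not values from this repository.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    type="sql",
    uri="postgresql+psycopg2://iceberg:iceberg@localhost:5432/iceberg_catalog",
    warehouse="file:///tmp/warehouse",
)
print(catalog.list_namespaces())
```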
This repository provides a modern Python-based data lakehouse solution leveraging Daft or Polars, PyIceberg (with a SQLite or PostgreSQL catalog), and DuckDB, without relying on JVM-based tools. By integrating Ray, it enables efficient distributed processing of large-scale datasets.