Update README.md for PyDala2 with enhanced documentation and examples; bump version to 0.9.3 in pyproject.toml

# PyDala2

<p align="center">
  <img src="logo.jpeg" width="400" alt="PyDala2">
</p>

[![PyPI version](https://badge.fury.io/py/pydala2.svg)](https://badge.fury.io/py/pydala2)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Overview 📖

PyDala2 is a high-performance Python library for managing Parquet datasets with powerful metadata capabilities. Built on Apache Arrow, it provides an efficient, user-friendly interface for handling large-scale data operations.

## ✨ Key Features

- 📦 **Smart Dataset Management**: Efficient Parquet handling with metadata optimization
- 🔄 **Robust Caching**: Built-in support for faster data access
- 🔌 **Seamless Integration**: Works with Polars, PyArrow, and DuckDB
- 🔍 **Advanced Querying**: SQL-like filtering with predicate pushdown
- 🛠️ **Schema Management**: Automatic validation and tracking

## 🚀 Quick Start

### Installation

```bash
pip install pydala2
```

### 📊 Creating a Dataset

```python
from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",           # Hive-style partitioning
    timestamp_column="timestamp",  # For time-based operations
    cached=True,                   # Enable performance caching
)
```

### 💾 Writing Data

```python
from datetime import date

import polars as pl

# Create sample time-series data: 1,000 daily observations
df = pl.DataFrame({
    "timestamp": pl.date_range(date(2023, 1, 1), date(2025, 9, 26), "1d", eager=True),
    "value": range(1000),
})

# Write with smart partitioning and compression
dataset.write_to_dataset(
    data=df,                         # Polars/pandas DataFrame, Arrow Table, Dataset,
                                     # RecordBatch, or DuckDB result
    mode="overwrite",                # Options: "overwrite", "append", "delta"
    row_group_size=250_000,          # Optimize chunk size
    compression="zstd",              # High-performance compression
    partition_by=["year", "month"],  # Auto-partition by time
    unique=True,                     # Ensure data uniqueness
)
```

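The `"delta"` mode is useful for incremental loads. A minimal sketch, assuming delta mode writes only rows that are not already present in the dataset (the `new_rows` frame is hypothetical):

```python
# Hypothetical incremental load: with mode="delta", re-running an
# ingest job is assumed not to duplicate rows that already exist.
new_rows = pl.DataFrame({
    "timestamp": pl.date_range(date(2025, 9, 27), date(2025, 10, 6), "1d", eager=True),
    "value": range(1000, 1010),
})

dataset.write_to_dataset(
    data=new_rows,
    mode="delta",        # Write only rows not already in the dataset
    compression="zstd",
)
```
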
### 📥 Reading & Converting Data

```python
dataset.load(update_metadata=True)

# Flexible data format conversion
pt = dataset.t              # PyDala Table
df_polars = pt.to_polars()  # Convert to Polars
df_pandas = pt.to_pandas()  # Convert to Pandas
df_arrow = pt.to_arrow()    # Convert to Arrow
rel_ddb = pt.to_ddb()       # Convert to a DuckDB relation

# and many more...
```

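Once you have a DuckDB relation, the standard DuckDB relational API applies. For example, assuming the `value` column from the write example above:

```python
# Filter and aggregate with DuckDB's relational API, then pull the
# result back into pandas for inspection.
result = (
    rel_ddb
    .filter("value > 500")
    .aggregate("count(*) AS n, avg(value) AS avg_value")
    .df()
)
print(result)
```
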
### 🔍 Smart Querying

```python
# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")

# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        pl.col("column_name").str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)

# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")

# Access matching files
matching_files = dataset.scan_files
```

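Filter strings read like a SQL `WHERE` clause, so compound predicates should work as expected. A sketch, assuming `AND`/`OR` combine as in SQL:

```python
# Compound predicate: pushdown uses the Parquet metadata to skip
# files and row groups that cannot match.
pt_recent_high = dataset.filter("timestamp > '2023-01-01' AND value > 500")
df_recent_high = pt_recent_high.to_polars()
```
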
### 🔄 Metadata Management

```python
# Incremental metadata update
dataset.load(update_metadata=True)  # Update for new files

# Full metadata refresh
dataset.load(reload_metadata=True)  # Reload all metadata

# Repair schema/metadata
dataset.repair_schema()
```

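These calls slot naturally into an ingest routine. A minimal sketch using only the calls shown above (the `ingest_batch` helper is hypothetical):

```python
def ingest_batch(dataset, batch_df):
    """Hypothetical ingest step: append a batch, then refresh the
    metadata so subsequent filters and scans see the new files."""
    dataset.write_to_dataset(data=batch_df, mode="append")
    dataset.load(update_metadata=True)
```
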
### ⚡ Performance Optimization Tools

```python
# Optimize storage types
dataset.opt_dtypes()  # Automatic type optimization

# Smart file management
dataset.compact_by_rows(max_rows=100_000)           # Combine small files
dataset.repartition(partitioning_columns=["date"])  # Optimize partitions
dataset.compact_by_timeperiod(interval="1d")        # Time-based optimization
dataset.compact_partitions()                        # Partition structure optimization
```

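For example, a periodic maintenance job might pair compaction with a metadata refresh. A sketch built from the calls above; the schedule and the row threshold are assumptions:

```python
def nightly_maintenance(dataset):
    """Hypothetical nightly job: merge the small files left by frequent
    appends, then refresh metadata so queries pick up the new layout."""
    dataset.compact_by_rows(max_rows=100_000)  # Merge small files
    dataset.load(update_metadata=True)         # Re-index the compacted files
```
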
## ⚠️ Important Notes

- Type optimization involves a full dataset rewrite
- Choose a compaction strategy based on your access patterns
- Regular metadata updates ensure optimal query performance

## 📚 Documentation

For advanced usage and complete API documentation, visit our docs.

## 🤝 Contributing

Contributions welcome! See our contribution guidelines.

## 📝 License

[MIT License](LICENSE)