Commit 2755482 (1 parent 6ae926e) — legout committed Dec 5, 2024

Update README.md for PyDala2 with enhanced documentation and examples; bump version to 0.9.3 in pyproject.toml

Showing 3 changed files with 134 additions and 2 deletions.

134 changes: 133 additions & 1 deletion README.md
@@ -1 +1,133 @@
# pydala
# PyDala2

<p align="center">
<img src="logo.jpeg" width="400" alt="PyDala2">
</p>

[![PyPI version](https://badge.fury.io/py/pydala2.svg)](https://badge.fury.io/py/pydala2)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)



## Overview 📖
PyDala2 is a high-performance Python library for managing Parquet datasets with powerful metadata capabilities. Built on Apache Arrow, it provides an efficient, user-friendly interface for handling large-scale data operations.

## ✨ Key Features
- 📦 Smart Dataset Management: Efficient Parquet handling with metadata optimization
- 🔄 Robust Caching: Built-in support for faster data access
- 🔌 Seamless Integration: Works with Polars, PyArrow, and DuckDB
- 🔍 Advanced Querying: SQL-like filtering with predicate pushdown
- 🛠️ Schema Management: Automatic validation and tracking

## 🚀 Quick Start
### Installation
```bash
pip install pydala2
```

### 📊 Creating a Dataset
```python
from pydala.dataset import ParquetDataset

dataset = ParquetDataset(
    path="path/to/dataset",
    partitioning="hive",           # Hive-style partitioning
    timestamp_column="timestamp",  # For time-based operations
    cached=True,                   # Enable performance caching
)
```
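The `partitioning="hive"` option refers to the directory layout popularized by Hive, where partition column values are encoded directly in the path. A minimal stdlib sketch of that naming scheme (an illustration of the layout only, not pydala's internals):

```python
def hive_partition_path(root, partitions, filename):
    """Build a hive-style path: partition values become col=value directories."""
    parts = [f"{col}={val}" for col, val in partitions.items()]
    return "/".join([root, *parts, filename])

path = hive_partition_path("dataset", {"year": "2023", "month": "01"}, "part-0.parquet")
# → "dataset/year=2023/month=01/part-0.parquet"
```

Readers and query engines that understand this convention can prune whole directories before opening a single file.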

### 💾 Writing Data
```python
from datetime import date

import polars as pl

# Create sample time-series data: one row per day of 2023
df = pl.DataFrame({
    "timestamp": pl.date_range(date(2023, 1, 1), date(2023, 12, 31), "1d", eager=True),
    "value": range(365),
})

# Write with smart partitioning and compression
dataset.write_to_dataset(
    data=df,                        # Polars/Pandas DataFrame, Arrow Table/Dataset/RecordBatch, or DuckDB relation
    mode="overwrite",               # Options: "overwrite", "append", "delta"
    row_group_size=250_000,         # Optimize chunk size
    compression="zstd",             # High-performance compression
    partition_by=["year", "month"], # Auto-partition by time
    unique=True,                    # Ensure data uniqueness
)
```
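The `unique=True` flag ensures rows already present in the dataset are not written twice. The idea can be sketched in plain Python (a simplification for illustration; pydala's actual deduplication logic may differ):

```python
def drop_duplicates(existing_rows, new_rows, key):
    """Keep only incoming rows whose key is not already in the dataset (illustrative)."""
    seen = {row[key] for row in existing_rows}
    unique_new = []
    for row in new_rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_new.append(row)
    return unique_new

existing = [{"timestamp": "2023-01-01", "value": 1}]
incoming = [
    {"timestamp": "2023-01-01", "value": 1},  # duplicate, dropped
    {"timestamp": "2023-01-02", "value": 2},  # new, kept
]
result = drop_duplicates(existing, incoming, key="timestamp")
# → [{"timestamp": "2023-01-02", "value": 2}]
```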

### 📥 Reading & Converting Data
```python
dataset.load(update_metadata=True)

# Flexible data format conversion
pt = dataset.t # PyDala Table
df_polars = pt.to_polars() # Convert to Polars
df_pandas = pt.to_pandas() # Convert to Pandas
df_arrow = pt.to_arrow() # Convert to Arrow
rel_ddb = pt.to_ddb() # Convert to DuckDB relation

# and many more...
```

### 🔍 Smart Querying
```python
# Efficient filtered reads with predicate pushdown
pt_filtered = dataset.filter("timestamp > '2023-01-01'")

# Chaining operations
df_filtered = (
    dataset
    .filter("column_name > 100")
    .pl.with_columns(
        pl.col("column_name").str.slice(0, 5).alias("new_column_name")
    )
    .to_pandas()
)

# Fast metadata-only scans
pt_scanned = dataset.scan("column_name > 100")

# Access files matched by the scan
matching_files = dataset.scan_files
```

### 🔄 Metadata Management
```python
# Incremental metadata update
dataset.load(update_metadata=True) # Update for new files

# Full metadata refresh
dataset.load(reload_metadata=True) # Reload all metadata

# Repair schema/metadata
dataset.repair_schema()
```
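The difference between `update_metadata` and `reload_metadata` is incremental versus full: an incremental update only inspects files not yet recorded, while a reload rebuilds everything. A stdlib sketch of the incremental case (illustrative only):

```python
def incremental_update(known_metadata, files_on_disk):
    """Collect metadata only for files not seen before (illustrative)."""
    new_files = [f for f in files_on_disk if f not in known_metadata]
    for f in new_files:
        known_metadata[f] = {"inspected": True}  # stand-in for real footer stats
    return new_files

metadata = {"part-0.parquet": {"inspected": True}}
added = incremental_update(metadata, ["part-0.parquet", "part-1.parquet"])
# → ["part-1.parquet"]; only the new file's footer is read
```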

### ⚡ Performance Optimization Tools
```python
# Optimize storage types
dataset.opt_dtypes() # Automatic type optimization

# Smart file management
dataset.compact_by_rows(max_rows=100_000) # Combine small files
dataset.repartition(partitioning_columns=["date"]) # Optimize partitions
dataset.compact_by_timeperiod(interval="1d") # Time-based optimization
dataset.compact_partitions() # Partition structure optimization
```
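`compact_by_rows` merges many small files into fewer files of up to `max_rows` rows each, reducing per-file open/seek overhead on reads. The grouping involved can be sketched as a greedy pass over file row counts (an illustration, not pydala's actual algorithm):

```python
def plan_compaction(file_row_counts, max_rows):
    """Greedily group files so each output group holds at most max_rows rows."""
    groups, current, current_rows = [], [], 0
    for name, rows in file_row_counts:
        if current and current_rows + rows > max_rows:
            groups.append(current)          # close the full group
            current, current_rows = [], 0
        current.append(name)
        current_rows += rows
    if current:
        groups.append(current)
    return groups

files = [("a.parquet", 40_000), ("b.parquet", 50_000), ("c.parquet", 30_000)]
plan = plan_compaction(files, max_rows=100_000)
# → [["a.parquet", "b.parquet"], ["c.parquet"]]
```

Each resulting group would then be rewritten as a single output file.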

## ⚠️ Important Notes
- Type optimization involves a full dataset rewrite
- Choose a compaction strategy based on your access patterns
- Regular metadata updates ensure optimal query performance

## 📚 Documentation
For advanced usage and complete API documentation, visit our docs.

## 🤝 Contributing
Contributions welcome! See our contribution guidelines.

## 📝 License
[MIT License](LICENSE)
Binary file added logo.jpeg
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -23,7 +23,7 @@ name = "pydala2"
readme = "README.md"
repository = "https://github.com/legout/pydala2"
requires-python = ">= 3.10"
version = "0.9.2.3"
version = "0.9.3"

[project.optional-dependencies]
legacy = ["polars-lts-cpu>=0.20.4"]
