Add partitioning strategies for S3 storage source (#4805)
This PR introduces configurable partitioning strategies for S3 input sources, enabling distributed job executions to efficiently process subsets of S3 objects. When a job is created with multiple executions (N > 1), each execution is assigned a unique partition index (0 to N-1) and processes only its designated subset of objects, as determined by the configured partitioning strategy.

## Motivation

- Enable parallel processing of large S3 datasets across multiple job executions
- Allow users to control how objects are distributed based on their data organization patterns
- Provide deterministic object distribution for reproducible results

## Features

- Multiple partitioning strategies:
  - `none`: no partitioning; all objects are available to all executions (default)
  - `object`: partition by the complete object key using consistent hashing
  - `regex`: partition using regex pattern matches from object keys
  - `substring`: partition based on a specific portion of object keys
  - `date`: partition based on dates found in object keys
- Hash-based partitioning using FNV-1a ensures (see the sketch at the end of this description):
  - Deterministic assignment of objects to partitions
  - Distribution based on the chosen strategy and input data patterns
- Robust handling of edge cases:
  - Fallback to partition 0 for unmatched objects
  - Proper handling of directories and empty paths
  - Unicode support for substring partitioning

## Example Usage

Basic object partitioning:

```yaml
source:
  type: s3
  params:
    bucket: mybucket
    key: data/*
    partition:
      type: object
```

Regex partitioning with capture groups:

```yaml
source:
  type: s3
  params:
    bucket: mybucket
    key: data/*
    partition:
      type: regex
      pattern: 'data/(\d{4})/(\d{2})/.*\.csv'
```

Date-based partitioning:

```yaml
source:
  type: s3
  params:
    bucket: mybucket
    key: logs/*
    partition:
      type: date
      dateFormat: "2006-01-02"
```

## Testing

- Unit tests covering all partitioning strategies
- Integration tests with actual S3 storage
- Edge-case handling and error scenarios
- Distribution analysis with various input patterns

## Summary by CodeRabbit

*Auto-generated release notes by coderabbit.ai.*

- **New Features**
  - Added S3 Object Partitioning system with support for multiple partitioning strategies (Object, Regex, Substring, Date)
  - Enhanced storage and compute modules to support execution-level context
- **Improvements**
  - Refined method signatures across multiple packages to include execution context
  - Updated error handling and message formatting in various storage and compute modules
  - Improved flexibility in resource calculation and bidding strategies
- **Bug Fixes**
  - Updated volume size calculation methods to handle more complex input scenarios
  - Enhanced validation for storage and partitioning configurations
- **Documentation**
  - Added comprehensive documentation for the S3 Object Partitioning system
  - Improved inline documentation for new features and method changes
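For illustration, below is a minimal Go sketch of the hash-to-partition assignment described above. The function names (`partitionForKey`, `partitionForRegex`) and their signatures are hypothetical and do not reflect the PR's actual code; the sketch only shows the idea: FNV-1a over the chosen portion of the key, taken modulo the number of executions, with partition 0 as the fallback for keys that do not match.

```go
// Illustrative sketch only: hypothetical helpers, not the PR's actual API.
package main

import (
	"fmt"
	"hash/fnv"
	"regexp"
)

// partitionForKey hashes an object key with FNV-1a and maps it to one of
// totalPartitions buckets, giving a deterministic assignment.
func partitionForKey(key string, totalPartitions int) int {
	if totalPartitions <= 1 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(totalPartitions))
}

// partitionForRegex hashes only the concatenated capture groups, so objects
// sharing the same matched groups land in the same partition. Keys that do
// not match the pattern fall back to partition 0.
func partitionForRegex(key string, pattern *regexp.Regexp, totalPartitions int) int {
	m := pattern.FindStringSubmatch(key)
	if m == nil {
		return 0 // fallback for unmatched objects
	}
	combined := ""
	for _, group := range m[1:] { // skip the full match, keep capture groups
		combined += group
	}
	return partitionForKey(combined, totalPartitions)
}

func main() {
	keys := []string{
		"data/2023/01/orders.csv",
		"data/2023/02/orders.csv",
		"logs/2023-01-02/app.log", // does not match the pattern below
	}
	pattern := regexp.MustCompile(`data/(\d{4})/(\d{2})/.*\.csv`)

	for _, k := range keys {
		fmt.Printf("%-28s object=%d regex=%d\n",
			k,
			partitionForKey(k, 4),
			partitionForRegex(k, pattern, 4))
	}
}
```

Because FNV-1a is deterministic, re-running the same job with the same object set and execution count yields the same assignment, which is what makes the distribution reproducible across runs.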