Add partitioning strategies for S3 storage source #4805

wdbaruni · 2025-01-12T11:32:02Z

This PR introduces configurable partitioning strategies for S3 input sources, enabling distributed job executions to efficiently process subsets of S3 objects. When a job is created with multiple executions (N > 1), each execution is assigned a unique partition index (0 to N-1) and will only process its designated subset of objects based on the configured partitioning strategy.

Motivation

Enable parallel processing of large S3 datasets across multiple job executions
Allow users to control how objects are distributed based on their data organization patterns
Provide deterministic object distribution for reproducible results

Features

Multiple partitioning strategies:
- none: No partitioning, all objects available to all executions (default)
- object: Partition by complete object key using consistent hashing
- regex: Partition using regex pattern matches from object keys
- substring: Partition based on a specific portion of object keys
- date: Partition based on dates found in object keys
Hash-based partitioning using FNV-1a ensures:
- Deterministic assignment of objects to partitions
- Distribution based on the chosen strategy and input data patterns
Robust handling of edge cases:
- Fallback to partition 0 for unmatched objects
- Proper handling of directories and empty paths
- Unicode support for substring partitioning
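
The hash-based assignment described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the helper name `partitionIndex` is hypothetical, and the choice of the 32-bit FNV-1a variant is an assumption (the description only says FNV-1a).

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionIndex maps a key to a partition in [0, totalPartitions) using FNV-1a.
// Hypothetical helper for illustration; the 32-bit variant is an assumption.
func partitionIndex(key string, totalPartitions int) int {
	if totalPartitions <= 0 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(key)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(totalPartitions))
}

func main() {
	keys := []string{"data/a.csv", "data/b.csv", "data/c.csv", "data/d.csv"}
	const total = 3
	for _, k := range keys {
		fmt.Printf("%s -> partition %d\n", k, partitionIndex(k, total))
	}
}
```

Because the index depends only on the key and the partition count, every execution computes the same assignment without coordinating with the others.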

Example Usage

Basic object partitioning:

  source:
    type: s3
    params:
      bucket: mybucket
      key: data/*
      partition:
        type: object
Regex partitioning with capture groups:

  source:
    type: s3
    params:
      bucket: mybucket
      key: data/*
      partition:
        type: regex
        pattern: 'data/(\d{4})/(\d{2})/.*\.csv'

Date-based partitioning:

  source:
    type: s3
    params:
      bucket: mybucket
      key: logs/*
      partition:
        type: date
        dateFormat: "2006-01-02"
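
A minimal sketch of date-based assignment, assuming the date sits at the start of the key being partitioned and that the layout is fixed-width (true for `2006-01-02`). The helper `datePartition` is hypothetical; the `dateFormat` value is a standard Go `time` layout string.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// datePartition parses a date from the start of key using a Go time layout
// and hashes the normalized date. Keys without a parsable date fall back to 0.
// Hypothetical helper; assumes a fixed-width layout such as "2006-01-02".
func datePartition(key, layout string, totalPartitions int) int {
	if totalPartitions <= 0 || len(key) < len(layout) {
		return 0
	}
	t, err := time.Parse(layout, key[:len(layout)])
	if err != nil {
		return 0 // fallback partition for keys without a valid date
	}
	h := fnv.New32a()
	h.Write([]byte(t.Format(layout)))
	return int(h.Sum32() % uint32(totalPartitions))
}

func main() {
	for _, k := range []string{"2024-01-15-a.csv", "2024-01-15-b.csv", "2024-01-16-a.csv", "nodate.csv"} {
		fmt.Printf("%s -> partition %d\n", k, datePartition(k, "2006-01-02", 4))
	}
}
```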

Testing

Unit tests covering all partitioning strategies
Integration tests with actual S3 storage
Edge case handling and error scenarios
Distribution analysis with various input patterns

Summary by CodeRabbit

Based on the comprehensive summary of changes, here are the release notes:

Release Notes

New Features
- Added S3 Object Partitioning system with support for multiple partitioning strategies (Object, Regex, Substring, Date)
- Enhanced storage and compute modules to support execution-level context
Improvements
- Refined method signatures across multiple packages to include execution context
- Updated error handling and message formatting in various storage and compute modules
- Improved flexibility in resource calculation and bidding strategies
Bug Fixes
- Updated volume size calculation methods to handle more complex input scenarios
- Enhanced validation for storage and partitioning configurations
Documentation
- Added comprehensive documentation for S3 Object Partitioning system
- Improved inline documentation for new features and method changes

linear · 2025-01-12T11:32:06Z

ENG-520 Partitioned S3 input source

coderabbitai · 2025-01-12T11:32:12Z

Walkthrough

The pull request introduces a comprehensive refactoring across multiple packages, primarily focusing on modifying method signatures to include an *models.Execution parameter. This change affects storage providers, compute capacity calculators, and bidding mechanisms. The modifications aim to enhance context handling by introducing an execution-centric approach, replacing previous job-centric implementations. Additionally, a new S3 object partitioning system has been introduced, providing flexible strategies for distributing and processing S3 objects.

Changes

File Path | Change Summary
----------|---------------
`pkg/compute/bidder.go` | Updated method signatures from `models.Job` to `models.Execution` for bidding methods
`pkg/compute/capacity/calculators.go` | Modified `Calculate` method signatures in usage calculators to use `*models.Execution`
`pkg/storage/*/storage.go` | Updated `GetVolumeSize` and related methods across multiple storage providers to include `*models.Execution`
`pkg/s3/partitioning.go` | New file introducing S3 object partitioning with multiple strategies (object, regex, substring, date)
`pkg/s3/types.go` | Added `ObjectSummary` struct and modified `SourceSpec` to include partition configuration

Sequence Diagram

sequenceDiagram
    participant Client
    participant Bidder
    participant ExecutionContext
    participant ResourceCalculator
    participant StorageProvider

    Client->>Bidder: RunBidding(context, execution)
    Bidder->>ExecutionContext: Extract Job Details
    Bidder->>ResourceCalculator: Calculate(context, execution)
    ResourceCalculator-->>Bidder: Calculated Resources
    Bidder->>StorageProvider: GetVolumeSize(context, execution)
    StorageProvider-->>Bidder: Volume Size
    Bidder->>Client: Bidding Result

Poem

🐰 Hopping through code with glee,
Execution's context sets data free!
From job to context, we evolve,
Partitioning S3, problems we solve
A rabbit's refactor, clean and bright! 🚀

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (17)

pkg/storage/s3/storage.go (1)

65-65: Update Function Documentation for New Parameter

The GetVolumeSize method now includes a new parameter execution *models.Execution. Please update the function's documentation to reflect the addition of this parameter and explain its purpose.

pkg/s3/partitioning.go (1)

276-279: Redundant Check for totalPartitions

The check if totalPartitions <= 0 in getPartitionIndex might be redundant since totalPartitions is already validated to be greater than zero in PartitionObjects. Consider removing this check to simplify the code.

pkg/compute/capacity/disk/calculator.go (2)

36-38: Error Handling for Volume Size Calculation

When calculating volumeSize, if an error occurs, it's wrapped with a generic message. Consider providing more context about which input source caused the error to aid in debugging.

Modify the error handling:
-	return nil, bacerrors.Wrap(err, "error getting job disk space requirements")
+	return nil, bacerrors.Wrap(err, fmt.Sprintf("error getting disk space requirements for input source: %s", input.Source))

31-31: Optimize Disk Requirements Calculation

Initialize totalDiskRequirements as parsedUsage.Disk to include any pre-parsed disk usage. Then, accumulate the sizes of input sources to provide a complete disk requirement estimation.

Apply this change:
-	var totalDiskRequirements uint64 = 0
+	totalDiskRequirements := parsedUsage.Disk

pkg/storage/inline/storage_test.go (1)

31-31: Consider adding partition-specific test cases

While the tests are correctly updated to use mock.Execution(), consider adding test cases that verify behavior with different partition configurations, especially for S3 storage implementation.
 func TestPlaintextInlineStorage(t *testing.T) {
     // ... existing test code ...
+    t.Run("with_partition_config", func(t *testing.T) {
+        execution := mock.ExecutionWithPartition(0, 2) // Mock with partition index
+        size, err := storage.GetVolumeSize(context.Background(), execution, inputSource)
+        require.NoError(t, err)
+        // Add assertions for partitioned behavior
+    })
 }

Also applies to: 58-58

pkg/storage/ipfs/storage_test.go (1)

Line range hint 55-64: Consider adding test cases for execution-specific behavior

While the test has been updated to include the execution context, it doesn't verify if the execution context affects the volume size calculation.

Consider adding test cases that validate execution-specific scenarios, such as:

Different execution configurations

Edge cases with nil execution

pkg/storage/local_directory/storage.go (1)

Line range hint 51-64: Consider utilizing execution context for volume size calculation

The execution context is currently ignored (_), but it could be valuable for:

Implementing execution-specific path resolution

Adding execution-based access controls

Supporting volume size quotas per execution

pkg/s3/errors.go (1)

31-32: LGTM! Consider adding examples in comments.

The change to use variadic parameters for error message formatting is a good improvement. It allows for more flexible error messages, which will be useful for the new partitioning feature.

Consider adding examples in comments to show how to use the new format:
// Example usage:
// NewS3InputSourceError(BadRequestErrorCode, "invalid partition strategy: %s", strategy)
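
For readers unfamiliar with the pattern, a variadic formatting constructor can be sketched like this. The type `inputSourceError` and function `newInputSourceError` are hypothetical stand-ins, not the actual `NewS3InputSourceError` from `pkg/s3/errors.go`, whose signature and return type are not shown in this review.

```go
package main

import "fmt"

// inputSourceError is a hypothetical stand-in for the project's error type.
type inputSourceError struct {
	code    string
	message string
}

func (e *inputSourceError) Error() string {
	return e.code + ": " + e.message
}

// newInputSourceError formats the message with fmt.Sprintf, so callers can
// pass a format string plus arguments instead of pre-building the message.
func newInputSourceError(code, format string, args ...any) error {
	return &inputSourceError{code: code, message: fmt.Sprintf(format, args...)}
}

func main() {
	err := newInputSourceError("BadRequest", "invalid partition strategy: %s", "foo")
	fmt.Println(err)
}
```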

pkg/s3/types.go (1)

12-18: Consider using non-pointer types for required fields.

The ObjectSummary struct uses pointer types for string fields. While this allows for nil values, consider:

Are these fields truly optional? If not, using non-pointer types would be safer.

Document why pointer types are used (e.g., for JSON null handling).

pkg/storage/noop/noop.go (1)

72-72: LGTM! Consider documenting the unused parameter.

The execution parameter is marked as unused with _. Consider adding a comment explaining why this parameter is needed for interface compatibility but not used in the noop implementation.
// _ *models.Execution is unused in noop implementation but required for interface compatibility

pkg/storage/ipfs/storage.go (2)

52-52: LGTM! Consider documenting the unused parameter.

The execution parameter is marked as unused with _. Consider adding a comment explaining why this parameter is needed for interface compatibility but not used in the IPFS implementation.
// _ *models.Execution is unused in IPFS implementation but required for interface compatibility

Line range hint 55-56: Consider creating an issue for the TODO comment.

The TODO comment about timeout handling has been present for a while and should be addressed.

Would you like me to create a GitHub issue to track this TODO? The issue would focus on improving the timeout handling by:

Making the timeout configurable via parameters

Using context timeouts set by the caller

pkg/storage/s3/storage_test.go (1)

Line range hint 63-438: Consider improving test organization for better maintainability.

The test cases comprehensively cover various partitioning scenarios, but the test structure could be improved for better maintainability:

Consider grouping test cases by partitioning strategy using subtests

Extract common test setup into helper functions

Add comments explaining the expected behavior for complex test cases

Example structure:
func (s *StorageTestSuite) TestStorage() {
    s.Run("No Partitioning", func() {
        // Test cases for no partitioning
    })
    
    s.Run("Object Partitioning", func() {
        // Test cases for object partitioning
    })
    
    s.Run("Regex Partitioning", func() {
        // Test cases for regex partitioning
    })
    
    // ... other strategies
}

pkg/s3/partitioning_test.go (2)

1251-1268: Consider splitting the verification logic for better maintainability.

The verifyPartitioning function handles multiple verification tasks. Consider splitting it into smaller, focused functions:

Extract the partition generation loop into a separate function

Create a dedicated function for running all verifications

Add error context to verification failures

Example refactor:
func (s *PartitionTestSuite) generatePartitions(spec SourceSpec, objects []ObjectSummary, totalPartitions int) ([][]ObjectSummary, error) {
    partitions := make([][]ObjectSummary, totalPartitions)
    for i := 0; i < totalPartitions; i++ {
        partition, err := PartitionObjects(objects, totalPartitions, i, spec)
        if err != nil {
            return nil, fmt.Errorf("failed to generate partition %d: %w", i, err)
        }
        partitions[i] = partition
    }
    return partitions, nil
}

func (s *PartitionTestSuite) runVerifications(spec SourceSpec, objects []ObjectSummary, partitions [][]ObjectSummary, totalPartitions int, expected [][]string) {
    s.verifyNoDirectories(partitions)
    s.verifyComplete(objects, partitions)
    s.verifyConsistency(spec, objects, totalPartitions)
    if expected != nil {
        s.verifyPartitionContents(partitions, expected)
    }
}

func (s *PartitionTestSuite) verifyPartitioning(spec SourceSpec, objects []ObjectSummary, totalPartitions int, expected [][]string) {
    s.Require().NotNil(expected, "expected partition contents must not be nil")
    s.Require().Equal(totalPartitions, len(expected), "expected partition count must match totalPartitions")

    partitions, err := s.generatePartitions(spec, objects, totalPartitions)
    s.Require().NoError(err, "failed to generate partitions")
    
    s.runVerifications(spec, objects, partitions, totalPartitions, expected)
}

252-420: Enhance test case documentation for better clarity.

While the test cases are comprehensive, consider these improvements:

Add comments explaining the purpose of each test case

Use more descriptive test names that indicate the scenario being tested

Document the rationale behind expected distributions

Example improvements:
func (s *PartitionTestSuite) TestPartitionByObject() {
    tests := []struct {
        name            string
        paths           []string
        prefix          string
        totalPartitions int
        expected        [][]string
    }{
        {
            name: "evenly distributes files across partitions", // More descriptive name
            paths: []string{
                "file1.txt",
                "file2.txt",
                "dir/", // directory
                "file3.txt",
                "file4.txt",
            },
            prefix:          "",
            totalPartitions: 2,
            expected: [][]string{
                {"file1.txt", "file3.txt"}, // Explain: Files distributed based on hash
                {"file2.txt", "file4.txt"}, // Explain: Even distribution achieved
            },
        },
        // ... other test cases with similar improvements
    }
    // ... rest of the function
}

pkg/s3/partitioning.md (2)

165-182: Fix markdown code block formatting.

The code blocks are missing language specifiers. Add appropriate language specifiers for better syntax highlighting:
- ```
+ ```text
  Original Key                    | Trimmed Key (used for partitioning)
  -------------------------------|--------------------------------
  ...
Also applies to: 175-182, 185-191, 200-204, 209-213, 218-222

🧰 Tools

🪛 Markdownlint (0.37.0)

165-165: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

175-175: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

1-258: Enhance documentation with additional sections.

The documentation is comprehensive but could benefit from these additions:

Add a troubleshooting section covering common issues

Include performance benchmarks or guidelines

Add examples of combining multiple partitioning strategies

Include a section on monitoring and metrics

Example additions:
## Troubleshooting

### Common Issues
1. Uneven Distribution
   - Check if prefix trimming is working as expected
   - Verify that the partitioning strategy matches your data pattern

2. Performance Issues
   - Consider using simpler strategies for large datasets
   - Monitor regex pattern complexity

## Performance Guidelines

Strategy    | Overhead | Use Case
------------|----------|----------
Object      | Low      | Random distribution
Regex       | High     | Complex patterns
Substring   | Low      | Fixed formats
Date        | Medium   | Time series

## Monitoring
- Monitor partition sizes for evenness
- Track fallback occurrences
- Measure processing time per partition

🧰 Tools

🪛 LanguageTool

[uncategorized] ~18-~18: A period might be missing here.
Context: ...ng is not needed or when handling small datasets ### 2. Object (`PartitionKeyTypeObject...

(AI_EN_LECTOR_MISSING_PUNCTUATION_PERIOD)

[grammar] ~111-~111: This phrase is duplicated. You should probably use “Regex Partitioning” only once.
Context: ... configurations ## Usage Examples ### Regex Partitioning #### Regex Partitioning with Capture Groups ```go config := Par...

(PHRASE_REPETITION)

[uncategorized] ~138-~138: Possible missing comma found.
Context: ...rtition objects with keys starting with dates like "2024-01-15-data.csv" ### Substri...

(AI_HYDRA_LEO_MISSING_COMMA)

🪛 Markdownlint (0.37.0)

165-165: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

175-175: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

185-185: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

200-200: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

209-209: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

218-218: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 00b8981 and b241601.

📒 Files selected for processing (23)

pkg/compute/bidder.go (4 hunks)
pkg/compute/capacity/calculators.go (2 hunks)
pkg/compute/capacity/disk/calculator.go (1 hunks)
pkg/compute/capacity/types.go (1 hunks)
pkg/s3/errors.go (1 hunks)
pkg/s3/partitioning.go (1 hunks)
pkg/s3/partitioning.md (1 hunks)
pkg/s3/partitioning_test.go (1 hunks)
pkg/s3/types.go (4 hunks)
pkg/storage/inline/storage.go (1 hunks)
pkg/storage/inline/storage_test.go (2 hunks)
pkg/storage/ipfs/storage.go (1 hunks)
pkg/storage/ipfs/storage_test.go (2 hunks)
pkg/storage/local_directory/storage.go (1 hunks)
pkg/storage/local_directory/storage_test.go (1 hunks)
pkg/storage/noop/noop.go (1 hunks)
pkg/storage/s3/storage.go (9 hunks)
pkg/storage/s3/storage_test.go (10 hunks)
pkg/storage/s3/types.go (2 hunks)
pkg/storage/tracing/tracing.go (1 hunks)
pkg/storage/types.go (1 hunks)
pkg/storage/url/urldownload/storage.go (1 hunks)
pkg/storage/url/urldownload/storage_test.go (2 hunks)

🔇 Additional comments (25)

pkg/storage/s3/storage.go (7)

80-84: Proper Error Handling for Object Partitioning

The integration of s3helper.PartitionObjects enhances object partitioning based on execution parameters. Error handling is appropriately managed.

88-97: Robust Checks for Negative Size and Overflow

Good implementation of checks for negative object sizes and potential uint64 overflow. This ensures reliability when calculating the total volume size.

100-100: Type Conversion Consistency

The conversion of object.Size to uint64 is consistent after the negative size check. This maintains data integrity in the size calculation.

131-135: Effective Partitioning in PrepareStorage

The addition of object partitioning in the PrepareStorage method ensures that only relevant objects are processed for the current execution partition. Error handling is correctly implemented.

158-158: Updated Parameter Type for Object Summary

The downloadObject function now accepts object s3helper.ObjectSummary, aligning with the new object summary structure. This increases code maintainability by using a unified type.

189-194: Ensure Non-Nil Fields in GetObjectInput

While constructing s3.GetObjectInput, confirm that object.Key, object.VersionID, and object.ETag are not nil to avoid potential issues with the S3 API call.

162-162: ⚠️ Potential issue

Potential Nil Pointer Dereference

Before dereferencing object.Key with *object.Key, ensure that it is not nil to prevent a runtime panic.

Consider adding a nil check:
if object.Key == nil {
	return fmt.Errorf("object key is nil")
}
objectTokens := strings.Split(*object.Key, "/")

pkg/s3/partitioning.go (2)

70-75: Validate Substring Partitioning Indices

In the Validate method, ensure that StartIndex is less than EndIndex, and both are within acceptable bounds. The error messages are clear and help users correct their configurations.
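
A sketch of index validation combined with rune-aware slicing, which is one way to get the Unicode support mentioned in the PR description. The names `substringConfig`, `validate`, and `substringOf` are hypothetical, and clamping `EndIndex` to the key length for short keys is an assumption about edge-case behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// substringConfig is a hypothetical stand-in for the partition configuration.
type substringConfig struct {
	StartIndex int
	EndIndex   int
}

// validate rejects negative or inverted index ranges.
func (c substringConfig) validate() error {
	if c.StartIndex < 0 {
		return errors.New("start index must not be negative")
	}
	if c.EndIndex <= c.StartIndex {
		return errors.New("end index must be greater than start index")
	}
	return nil
}

// substringOf slices key by runes rather than bytes, so multi-byte characters
// are never split. It reports false when the key is too short to slice.
func substringOf(key string, c substringConfig) (string, bool) {
	r := []rune(key)
	if c.StartIndex >= len(r) {
		return "", false
	}
	end := c.EndIndex
	if end > len(r) {
		end = len(r) // clamp for short keys (assumed behavior)
	}
	return string(r[c.StartIndex:end]), true
}

func main() {
	cfg := substringConfig{StartIndex: 0, EndIndex: 2}
	if err := cfg.validate(); err != nil {
		panic(err)
	}
	fmt.Println(substringOf("日本語データ", cfg))
	fmt.Println(substringOf("ab", substringConfig{StartIndex: 5, EndIndex: 7}))
}
```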

270-272: SanitizeKeyForPatternMatching Function Logic

The sanitizeKeyForPatternMatching function trims the prefix and leading slash from objectKey. Ensure that this logic correctly handles keys where the prefix might not match or multiple slashes are present.
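
The trimming behavior in question might look like the sketch below. It is an approximation written for this discussion, not the function from `partitioning.go`: the name `sanitizeKey` is hypothetical, and stripping every leading slash (rather than just one) is an assumption that happens to cover the multiple-slash case raised above.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeKey removes prefix from the start of objectKey when present, then
// strips any leading slashes. Keys that do not start with prefix are left
// unchanged apart from the slash trimming. Hypothetical helper.
func sanitizeKey(objectKey, prefix string) string {
	trimmed := strings.TrimPrefix(objectKey, prefix)
	return strings.TrimLeft(trimmed, "/")
}

func main() {
	fmt.Println(sanitizeKey("data/2024/a.csv", "data"))  // prefix matches
	fmt.Println(sanitizeKey("other/a.csv", "data"))      // prefix does not match
	fmt.Println(sanitizeKey("data//a.csv", "data"))      // repeated slashes
}
```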

pkg/compute/capacity/calculators.go (2)

24-24: LGTM: Parameter change aligns with execution-centric design

The change from job to execution parameter provides necessary context for S3 partitioning while maintaining backward compatibility through the execution's job field.

43-46: Verify execution context propagation

The execution context is correctly propagated to child calculators. However, we should verify that all calculator implementations handle the new parameter correctly.

✅ Verification successful

Context propagation verified successfully

All calculator implementations (DefaultsUsageCalculator, ChainedUsageCalculator, and DiskUsageCalculator) correctly handle the context parameter with proper method signatures and context propagation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Find all implementations of Calculate method to verify parameter changes
ast-grep --pattern 'func $_($$) Calculate(ctx context.Context, execution *models.Execution, $$$) ($$$)'
Length of output: 105

Script:
#!/bin/bash
# Find calculator interface definition and implementations
echo "=== Looking for Calculator interface ==="
rg "type.*Calculator.*interface" -A 5

echo -e "\n=== Looking for Calculate method implementations ==="
rg "func.*Calculate.*$ctx context\.Context.*\*models\.Execution"
Length of output: 833

Script:
#!/bin/bash
# List all .go files in capacity directory
echo "=== Go files in capacity directory ==="
fd -e go . pkg/compute/capacity/

echo -e "\n=== Looking for calculator struct types ==="
rg "type.*Calculator.*struct" -A 2 pkg/compute/capacity/
Length of output: 2018

Script:
#!/bin/bash
echo "=== Calculate implementations in calculators.go ==="
rg "func \(.*$ Calculate$" -A 5 pkg/compute/capacity/calculators.go

echo -e "\n=== Calculate implementations in disk/calculator.go ==="
rg "func \(.*$ Calculate\(" -A 5 pkg/compute/capacity/disk/calculator.go
Length of output: 1245

pkg/storage/types.go (1)

19-19: Interface change impacts all storage implementations

The addition of the execution parameter to GetVolumeSize is necessary for S3 partitioning but requires careful verification of all implementations.

✅ Verification successful

All storage implementations updated correctly

All storage providers have been updated to match the new GetVolumeSize signature. Most implementations ignore the execution parameter (using _) since it's only needed for S3 partitioning, while the S3 and tracing implementations properly utilize it.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Find all Storage interface implementations to verify compliance
ast-grep --pattern 'type $_ struct {
  $$$
}

func ($$) GetVolumeSize(context.Context, *models.Execution, models.InputSource) (uint64, error)'
Length of output: 144

Script:
#!/bin/bash
# Find all GetVolumeSize implementations with any signature
echo "=== GetVolumeSize implementations ==="
ast-grep --pattern 'func $_GetVolumeSize($$$) (uint64, error)'

echo -e "\n=== Files potentially containing Storage implementations ==="
fd -e go | rg -l "Storage"

echo -e "\n=== Detailed view of GetVolumeSize implementations ==="
rg "func.*GetVolumeSize.*\(" -A 1
Length of output: 3120

pkg/storage/s3/types.go (2)

38-39: LGTM: Consistent error message formatting

The error message format is clear and consistent with the rest of the codebase.

75-76: LGTM: Error handling matches DecodeSourceSpec

The error handling is consistent between both decode functions, maintaining code symmetry.

pkg/compute/capacity/types.go (1)

42-42: Verify all implementers of UsageCalculator interface

The interface change requires updates to all implementing types. Let's verify the changes are consistent across the codebase.

✅ Verification successful

All UsageCalculator implementations are properly updated

Found three implementations (DefaultsUsageCalculator, ChainedUsageCalculator, and DiskUsageCalculator) all using the new signature with *models.Execution parameter. No outdated implementations detected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Find all implementations of UsageCalculator interface
ast-grep --pattern 'type $_ struct {
  $$$
}

func ($_ $_) Calculate(ctx context.Context, $_, $_ models.Resources) (*models.Resources, error) {
  $$$
}'
Length of output: 154

Script:
#!/bin/bash
# Find implementations of UsageCalculator with updated signature
ast-grep --pattern 'type $_ struct {
  $$$
}

func ($_ $_) Calculate(ctx context.Context, execution *models.Execution, parsedUsage models.Resources) (*models.Resources, error) {
  $$$
}'

# Also search for any potentially outdated implementations
ast-grep --pattern 'func ($_ $_) Calculate(ctx context.Context, $_, models.Resources) (*models.Resources, error)'
Length of output: 304

Script:
#!/bin/bash
# Find files containing Calculate method implementations
rg -l "Calculate" --type go

# Show Calculate method implementations with context
rg "func.*Calculate" --type go -B 5 -A 5
Length of output: 2645

pkg/storage/tracing/tracing.go (1)

41-45: LGTM! Clean implementation of the decorator pattern

The changes correctly propagate the execution context to the delegate storage implementation while maintaining the tracing functionality.

pkg/storage/ipfs/storage_test.go (1)

Line range hint 108-117: LGTM! Timeout test properly updated

The timeout test correctly includes the execution context while maintaining the original timeout verification logic.

pkg/storage/local_directory/storage.go (1)

Line range hint 66-82: LGTM! Clean implementation of PrepareStorage

The method signature has been updated while maintaining the existing functionality. The ignored execution context (_) aligns with the current implementation needs.

pkg/s3/types.go (1)

62-62: LGTM! Error message formatting is consistent.

The error message formatting in DecodeSourceSpec and DecodePreSignedResultSpec follows the new variadic parameter pattern introduced in NewS3InputSourceError.

Also applies to: 80-81

pkg/storage/inline/storage.go (1)

67-67: LGTM! Method signature updated to support execution context.

The change aligns with the broader refactoring to support S3 partitioning strategies. The execution parameter is correctly ignored as inline storage doesn't require execution context.

pkg/compute/bidder.go (1)

49-49: LGTM! Consistent refactoring to execution-centric approach.

The changes properly transition from job-centric to execution-centric approach while maintaining the original bidding logic. All job references are correctly accessed via the execution object.

Also applies to: 65-65, 69-69, 80-80, 88-88, 91-91, 128-128, 130-130, 135-135, 142-142

pkg/storage/local_directory/storage_test.go (1)

209-209: LGTM! Test updated to provide mock execution context.

The test correctly uses mock.Execution() while maintaining the original test coverage.

pkg/storage/url/urldownload/storage.go (1)

83-83: LGTM! Method signature updated while preserving functionality.

The change correctly adds the execution parameter to align with the interface changes. The parameter is properly ignored as URL download doesn't require execution context.

pkg/storage/url/urldownload/storage_test.go (2)

390-390: LGTM!

The test has been correctly updated to include the execution context parameter.

431-431: LGTM!

The error test case has been properly updated to include the execution context parameter.

Add partitioning strategies for S3 storage source

b241601

coderabbitai bot reviewed Jan 12, 2025

Resolved review threads:
- pkg/s3/partitioning.go
- pkg/compute/capacity/disk/calculator.go
- pkg/s3/types.go

wdbaruni merged commit 7bd50cc into main Jan 12, 2025
14 checks passed

wdbaruni deleted the eng-520-partitioned-s3-input-source branch January 12, 2025 12:15
