UDFs for Mapping Feed Ranges to Buckets #43092

Merged 6 commits on Jan 6, 2025

@tvaron3 tvaron3 commented Nov 26, 2024

Problem

Joins with Databricks tables can be inefficient because range partitioning is not available. As a result, customers who want to partition their Databricks tables on feed ranges currently cannot do so.

Description

Two UDFs were added: GetFeedRangesForContainer and GetOverlappingFeedRange. GetFeedRangesForContainer splits the full range of a container into the number of feed ranges the user asks for. If no number is specified, it returns the feed ranges corresponding to the physical partitions. GetOverlappingFeedRange returns the feed range that overlaps with the feed range a partition key maps to. That feed range, as a string, can then be used to partition a Databricks table on the feed ranges returned by the GetFeedRangesForContainer UDF.
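
To make the idea behind the two UDFs concrete, here is a minimal conceptual sketch in Python. It is not the SDK or Spark connector code: the integer hash space, the `toy_hash` function, and the names `split_full_range` and `overlapping_range` are all hypothetical stand-ins chosen for illustration. The first function plays the role of GetFeedRangesForContainer with an explicit count, and the second plays the role of GetOverlappingFeedRange.

```python
import zlib
from typing import List, Tuple

# Assumption: model the container's full range as the integers [0, 2**32).
# The real service uses effective partition key ranges, not plain integers.
FULL_MIN, FULL_MAX = 0, 2**32

Range = Tuple[int, int]  # (min inclusive, max exclusive)


def split_full_range(count: int) -> List[Range]:
    """Split the full range into `count` contiguous, non-overlapping ranges."""
    if count < 1:
        raise ValueError("count must be >= 1")
    width = FULL_MAX - FULL_MIN
    bounds = [FULL_MIN + (width * i) // count for i in range(count + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def toy_hash(partition_key: str) -> int:
    """Hypothetical stand-in for the service's partition key hashing."""
    return zlib.crc32(partition_key.encode("utf-8"))


def overlapping_range(ranges: List[Range], partition_key: str) -> Range:
    """Return the range from `ranges` that contains the key's hash."""
    hashed = toy_hash(partition_key)
    for lo, hi in ranges:
        if lo <= hashed < hi:
            return (lo, hi)
    raise ValueError("partition key hash is outside all ranges")


def range_as_string(r: Range) -> str:
    """String form of a range, usable as a bucket or partition column value."""
    return f"{r[0]:08X}-{r[1]:08X}"
```

In this sketch, every row with the same partition key gets the same range string, and the set of possible strings is fixed by the chosen count, which is what makes the value suitable as a table partition column.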

Testing

The Java SDK already had an API to split a feed range into multiple sub feed ranges. It explicitly checked whether the partition key was hierarchical and refused to split in that case; I removed this check and added tests for splitting feed ranges with hierarchical partition keys to confirm that the removal is safe. For the UDFs, I added tests with both normal and hierarchical partition keys. Each test compares the results of a query filtered by a feed range with the documents whose bucket corresponds to that feed range.
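
The shape of the UDF test described above can be sketched as follows. This is only an illustration in Python under stated assumptions, not the actual test code: the integer hash space, the CRC32 hash, and the helper names (`bucket_bounds`, `bucket_for`, `make_docs`, `buckets_match_ranges`) are hypothetical. The point is the invariant being checked: for every range, filtering documents by the range must select exactly the documents that were assigned the bucket for that range.

```python
import bisect
import zlib

SPACE = 2**32  # assumption: toy hash space, not the SDK's real key space


def bucket_bounds(count):
    """Boundaries of `count` contiguous buckets covering [0, SPACE)."""
    return [(SPACE * i) // count for i in range(count + 1)]


def bucket_for(bounds, hashed):
    """Index of the bucket [bounds[i], bounds[i + 1]) containing `hashed`."""
    return bisect.bisect_right(bounds, hashed) - 1


def make_docs(n):
    """Build n toy documents spread over 17 hypothetical partition keys."""
    docs = []
    for i in range(n):
        pk = f"tenant-{i % 17}"
        docs.append({"id": str(i), "pk": pk,
                     "hash": zlib.crc32(pk.encode("utf-8"))})
    return docs


def buckets_match_ranges(docs, bounds):
    """True if filtering by range and filtering by bucket always agree."""
    for b in range(len(bounds) - 1):
        lo, hi = bounds[b], bounds[b + 1]
        # Stand-in for "query filtered by a feed range".
        by_range = {d["id"] for d in docs if lo <= d["hash"] < hi}
        # Stand-in for "documents with the bucket for that feed range".
        by_bucket = {d["id"] for d in docs
                     if bucket_for(bounds, d["hash"]) == b}
        if by_range != by_bucket:
            return False
    return True
```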

Suggestions

Better names for the UDFs?
Edit: finalized with GetFeedRangesForContainer and GetOverlappingFeedRange

@azure-sdk
Collaborator

API change check

API changes are not detected in this pull request.

tvaron3 commented Dec 3, 2024

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 commented Dec 3, 2024

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 tvaron3 marked this pull request as ready for review December 3, 2024 23:31
@tvaron3 tvaron3 requested review from kirankumarkolli and a team as code owners December 3, 2024 23:31
@FabianMeiswinkel FabianMeiswinkel left a comment

LGTM

tvaron3 commented Dec 18, 2024

/azp run java - cosmos - tests

Azure Pipelines successfully started running 1 pipeline(s).

tvaron3 commented Dec 18, 2024

/azp run java - cosmos - spark

Azure Pipelines successfully started running 1 pipeline(s).

@tvaron3 tvaron3 merged commit 005ceb4 into Azure:main Jan 6, 2025
38 checks passed
3 participants