perf: coalesce continuous indices into ranges if possible #3513
Conversation
ACTION NEEDED: The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error, please inspect the "PR Title Check" action.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3513      +/-   ##
==========================================
+ Coverage   78.60%   78.66%   +0.05%
==========================================
  Files         254      254
  Lines       94999    95022      +23
  Branches    94999    95022      +23
==========================================
+ Hits        74677    74749      +72
+ Misses      17302    17251      -51
- Partials     3020     3022       +2

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
One test case failed in the CI:
I checked the test case, but I'm not sure whether this failure is caused by this PR. Could you please advise? Thanks. Additionally, is there any relevant existing benchmark in Lance that might be affected by this PR? We may need to check its results to assess the impact on existing benchmarks.
Thanks, this is a nice optimization!
We have some continuous benchmarking here (I think this is a public URL but let me know if you can't access it). We regrettably don't have any relevant benchmarks (the random access benchmark is unlikely to benefit from this optimization). We don't have automation in place to run benchmarks on PRs, and I wouldn't expect them to be affected by this change, so I'll just merge and we can check the results.
Yeah. In our table scans we fall back to filtered full scans for these cases when the data type is small, but for larger types (embeddings, etc.) we still use takes.
In `DecodeBatchScheduler`, when performing `schedule_take` with a given list of indices, the current implementation generates a list of ranges, each containing a single index. However, when the indices are continuous, we can merge them into a single range instead of emitting many single-element ranges. This optimization improves efficiency, particularly for dense queries that return most of the records in the dataset, as demonstrated in our benchmarks.
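For illustration, here is a minimal Rust sketch of the coalescing idea (not the actual Lance implementation; the function name `coalesce_indices` and its signature are hypothetical). Sorted indices are walked once, and each index that directly follows the previous one extends the current range instead of opening a new single-element range:

```rust
use std::ops::Range;

/// Coalesce a sorted list of row indices into contiguous half-open ranges.
/// A sketch of the idea behind this PR; names here are illustrative only.
fn coalesce_indices(indices: &[u64]) -> Vec<Range<u64>> {
    let mut ranges = Vec::new();
    let mut iter = indices.iter().copied();
    let Some(first) = iter.next() else {
        return ranges;
    };
    let mut start = first;
    let mut end = first + 1; // exclusive end of the current run
    for idx in iter {
        if idx == end {
            // Index extends the current contiguous run.
            end += 1;
        } else {
            // Gap found: close the current range and start a new one.
            ranges.push(start..end);
            start = idx;
            end = idx + 1;
        }
    }
    ranges.push(start..end);
    ranges
}

fn main() {
    // Continuous indices collapse into a single range; gaps split ranges.
    assert_eq!(coalesce_indices(&[3, 4, 5, 6]), vec![3..7]);
    assert_eq!(coalesce_indices(&[1, 2, 5, 6, 9]), vec![1..3, 5..7, 9..10]);
    println!("coalesced: {:?}", coalesce_indices(&[10, 11, 12, 20]));
}
```

For a dense take that touches most rows, this turns tens of thousands of one-element ranges into a handful of large ranges, so the scheduler issues far fewer (and larger) reads.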