Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: coalesce continuous indices into ranges if possible #3513

Merged
merged 2 commits into from
Mar 12, 2025

Conversation

niyue
Copy link
Contributor

@niyue niyue commented Mar 6, 2025

In DecodeBatchScheduler, when performing schedule_take with a given list of indices, the current implementation generates a list of ranges, each containing a single index. However, in cases where the indices are continuous, we can merge them into a single range instead of multiple separate ranges. This optimization improves efficiency, particularly benefiting dense queries that return most of the records in the dataset, as demonstrated in our benchmarks.

Copy link

github-actions bot commented Mar 6, 2025

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@niyue niyue changed the title Coalesce continuous indices into ranges if possible perf: coalesce continuous indices into ranges if possible Mar 6, 2025
@codecov-commenter
Copy link

codecov-commenter commented Mar 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.66%. Comparing base (644213b) to head (0b7ee7c).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3513      +/-   ##
==========================================
+ Coverage   78.60%   78.66%   +0.05%     
==========================================
  Files         254      254              
  Lines       94999    95022      +23     
  Branches    94999    95022      +23     
==========================================
+ Hits        74677    74749      +72     
+ Misses      17302    17251      -51     
- Partials     3020     3022       +2     
Flag Coverage Δ
unittests 78.66% <100.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@niyue
Copy link
Contributor Author

niyue commented Mar 6, 2025

One test case failed in the CI:

FAILED python/tests/torch_tests/test_bench_utils.py::test_ground_truth - assert False

I checked the test case, however, I’m not sure if this failure is caused by this PR. Could you please advise? Thanks.

Additionally, is there any relevant existing benchmark in Lance that might be affected by this PR? We may need to check its results to assess the impact on existing benchmarks.

@BubbleCal BubbleCal requested a review from westonpace March 7, 2025 11:15
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is a nice optimization!

@westonpace
Copy link
Contributor

Additionally, is there any relevant existing benchmark in Lance that might be affected by this PR? We may need to check its results to assess the impact on existing benchmarks.

We have some continuous benchmarking here (I think this is a public URL but let me know if you can't access it). We regrettably don't have any relevant benchmarks (the random access benchmark is unlikely to be able to benefit from this optimization). We don't have automation in place to run benchmarks on PRs and I wouldn't expect them to be affected by this change so I'll just merge and we can check the result.

@westonpace westonpace merged commit 9203377 into lancedb:main Mar 12, 2025
25 of 27 checks passed
@westonpace
Copy link
Contributor

particularly benefiting dense queries that return most of the records in the dataset

Yeah. In our table scans we fall back to filtered full scans for these cases when the data type is small. Still, for larger types (embeddings, etc.) we still use takes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants