feat: detect the drift and retrain the index if hit threshold #3489

Open
wants to merge 22 commits into
base: main

@BubbleCal BubbleCal (Contributor) commented Feb 28, 2025

Today we provide two ways for users to maintain the vector index:

  • create_index: builds a new index over the entire dataset
  • optimize: incrementally indexes only the rows that are not yet indexed

We recommend calling optimize because indexing takes less time, but the index can become less accurate as more rows are inserted.

This PR:

  • records the loss of each delta index
  • forces a retrain of the index on all rows if avg_loss > original_avg_loss * THRESHOLD
  • supports training KMeans starting from the existing centroids, which significantly improves indexing performance
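
To illustrate the last bullet, here is a minimal sketch of Lloyd's algorithm seeded with existing centroids instead of a fresh random initialization. It is not the code in rust/lance-linalg/src/kmeans.rs: the function names (`kmeans_from_centroids`, `average_loss`), the use of squared L2 distance as the loss, and the stopping rule are all assumptions made for the example. It also assumes `data` and `init` are non-empty.

```rust
/// Squared L2 distance between two vectors of equal length.
fn l2_sq(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Index of the nearest centroid and the squared distance to it.
fn nearest(v: &[f32], centroids: &[Vec<f32>]) -> (usize, f32) {
    let mut best = (0usize, f32::INFINITY);
    for (i, c) in centroids.iter().enumerate() {
        let d = l2_sq(v, c);
        if d < best.1 {
            best = (i, d);
        }
    }
    best
}

/// Mean squared distance from each vector to its nearest centroid.
/// (Assumed loss definition for this sketch.)
fn average_loss(data: &[Vec<f32>], centroids: &[Vec<f32>]) -> f32 {
    data.iter().map(|v| nearest(v, centroids).1).sum::<f32>() / data.len() as f32
}

/// Lloyd's iterations that start from `init` (e.g. the centroids of the
/// existing index) rather than from a random initialization.
/// Returns the refined centroids and the average loss they achieve.
fn kmeans_from_centroids(
    data: &[Vec<f32>],
    init: &[Vec<f32>],
    max_iters: usize,
) -> (Vec<Vec<f32>>, f32) {
    let dim = init[0].len();
    let mut centroids = init.to_vec();
    for _ in 0..max_iters {
        // Assignment step: accumulate per-cluster sums and counts.
        let mut sums = vec![vec![0.0f32; dim]; centroids.len()];
        let mut counts = vec![0usize; centroids.len()];
        for v in data {
            let (i, _) = nearest(v, &centroids);
            counts[i] += 1;
            for (s, x) in sums[i].iter_mut().zip(v) {
                *s += *x;
            }
        }
        // Update step: move each non-empty cluster's centroid to its mean.
        let mut changed = false;
        for (i, sum) in sums.into_iter().enumerate() {
            if counts[i] == 0 {
                continue; // an empty cluster keeps its previous centroid
            }
            let mean: Vec<f32> = sum.iter().map(|s| *s / counts[i] as f32).collect();
            if mean != centroids[i] {
                centroids[i] = mean;
                changed = true;
            }
        }
        if !changed {
            break; // converged: no centroid moved
        }
    }
    let loss = average_loss(data, &centroids);
    (centroids, loss)
}
```

For the one-dimensional points 0, 2, 10, 12 and starting centroids 0 and 10, the sketch converges in two iterations to centroids 1 and 11, and the average loss drops from 2.0 to 1.0. When the old centroids are already close to a good solution, few iterations are needed, which is the intuition behind the speedup claimed above.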

After this change, users no longer need to call create_index to replace an existing index: optimize checks the average loss and, when the threshold is exceeded, retrains the index in a more efficient way.
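
The decision that optimize makes can be sketched as follows. This is only an illustration of the rule avg_loss > original_avg_loss * THRESHOLD: the `THRESHOLD` value of 1.5, the `DeltaLoss` and `OptimizeAction` types, and the row-weighted way of averaging the recorded losses are hypothetical and are not taken from the PR.

```rust
/// Hypothetical multiplier; the PR description does not state the real value.
const THRESHOLD: f64 = 1.5;

/// What optimize should do next (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum OptimizeAction {
    /// Keep the existing centroids and index only the new rows.
    AppendDelta,
    /// Retrain the index on all rows.
    RetrainAll,
}

/// Loss recorded for one delta index (hypothetical layout).
struct DeltaLoss {
    total_loss: f64,
    num_rows: u64,
}

/// Row-weighted average loss across all delta indices,
/// or None when no rows have been indexed.
fn weighted_avg_loss(deltas: &[DeltaLoss]) -> Option<f64> {
    let rows: u64 = deltas.iter().map(|d| d.num_rows).sum();
    if rows == 0 {
        return None;
    }
    let total: f64 = deltas.iter().map(|d| d.total_loss).sum();
    Some(total / rows as f64)
}

/// Apply the rule: retrain when avg_loss > original_avg_loss * THRESHOLD.
fn decide(original_avg_loss: f64, deltas: &[DeltaLoss]) -> OptimizeAction {
    match weighted_avg_loss(deltas) {
        Some(avg) if avg > original_avg_loss * THRESHOLD => OptimizeAction::RetrainAll,
        _ => OptimizeAction::AppendDelta,
    }
}
```

With an original average loss of 1.0, deltas averaging 1.2 stay on the incremental path, while deltas averaging 2.0 trigger a full retrain under the assumed threshold.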

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement New feature or request label Feb 28, 2025
@BubbleCal BubbleCal changed the title from "feat: record loss for IVF and KMeans" to "feat: detect the drift and retrain the index if hit threshold" Feb 28, 2025
@github-actions github-actions bot added the java label Feb 28, 2025
@@ -45,9 +49,6 @@ pub struct IvfBuildParams {

pub shuffle_partition_concurrency: usize,

/// Use residual vectors to build sub-vector.
pub use_residual: bool,
BubbleCal (Contributor, Author) commented:

`use_residual` is never used, so this PR removes it.

codecov-commenter commented Mar 6, 2025

Codecov Report

Attention: Patch coverage is 82.93839% with 72 lines in your changes missing coverage. Please review.

Project coverage is 78.84%. Comparing base (0487ff5) to head (52b27c7).
Report is 8 commits behind head on main.

Files with missing lines                       Patch %   Lines
rust/lance/src/index/vector/ivf.rs             74.69%    11 Missing and 10 partials ⚠️
rust/lance-linalg/src/kmeans.rs                74.28%    9 Missing ⚠️
rust/lance/src/index/vector/ivf/v2.rs          95.50%    8 Missing ⚠️
rust/lance-index/src/vector/hnsw/index.rs      0.00%     6 Missing ⚠️
rust/lance/src/index/vector/builder.rs         89.65%    1 Missing and 5 partials ⚠️
rust/lance/src/index/vector/pq.rs              14.28%    6 Missing ⚠️
rust/lance/src/index/vector/fixture_test.rs    20.00%    4 Missing ⚠️
rust/lance-index/src/vector/ivf.rs             0.00%     3 Missing ⚠️
rust/lance-index/src/vector/ivf/storage.rs     75.00%    3 Missing ⚠️
rust/lance/src/session/index_extension.rs      0.00%     3 Missing ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3489      +/-   ##
==========================================
+ Coverage   78.66%   78.84%   +0.18%     
==========================================
  Files         254      254              
  Lines       95025    95690     +665     
  Branches    95025    95690     +665     
==========================================
+ Hits        74751    75447     +696     
+ Misses      17250    17124     -126     
- Partials     3024     3119      +95     
Flag Coverage Δ
unittests 78.84% <82.93%> (+0.18%) ⬆️

@BubbleCal BubbleCal marked this pull request as ready for review March 13, 2025 12:03
Labels: enhancement (New feature or request), java, python
2 participants