fix!: delta index fragment bitmaps contained previous index coverage #3377

Merged (9 commits) on Jan 20, 2025

Contributor

@wjones127 wjones127 commented Jan 14, 2025

BREAKING CHANGE: the fragment bitmap of a delta index now contains only the fragment ids covered by that delta, not those covered by the full index. To get the full bitmap, union the bitmaps of all index segments with the same name. Old datasets will still show the previous fragment ids until a write is done, which forces a migration. If corrupted fragment ids are present in a dataset, dataset.index_statistics() will return an error. Before calling dataset.index_statistics(), call dataset.validate() to check the dataset's integrity, and use dataset.delete("false") to force a migration.

Fixes #3374
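
The union described in the breaking-change note can be sketched roughly as follows. This is only an illustration, not Lance's actual code: `IndexSegment` and `full_coverage` are hypothetical names, and a `BTreeSet<u32>` stands in for whatever bitmap type the index metadata really uses.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the metadata of one index segment
/// (an assumption for illustration, not Lance's real type).
#[derive(Debug, Clone)]
pub struct IndexSegment {
    /// Index name; all deltas of one logical index share it.
    pub name: String,
    /// Fragment ids covered by this segment only.
    /// A BTreeSet stands in for a fragment bitmap here.
    pub fragment_ids: BTreeSet<u32>,
}

/// Union the fragment ids of every segment named `index_name`
/// to recover the coverage of the whole logical index.
pub fn full_coverage(segments: &[IndexSegment], index_name: &str) -> BTreeSet<u32> {
    segments
        .iter()
        .filter(|segment| segment.name == index_name)
        .flat_map(|segment| segment.fragment_ids.iter().copied())
        .collect()
}
```

Segments with a different name are ignored, and an index name with no segments yields an empty set.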

@github-actions github-actions bot added the bug (Something isn't working) label Jan 14, 2025

codecov-commenter commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 91.55844% with 13 lines in your changes missing coverage. Please review.

Project coverage is 78.69%. Comparing base (26eb471) to head (6e6d3ba).
Report is 11 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset.rs | 79.06% | 8 Missing and 1 partial ⚠️ |
| rust/lance/src/index.rs | 95.71% | 0 Missing and 3 partials ⚠️ |
| rust/lance/src/io/commit.rs | 96.42% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3377      +/-   ##
==========================================
+ Coverage   78.45%   78.69%   +0.23%     
==========================================
  Files         250      250              
  Lines       90189    90873     +684     
  Branches    90189    90873     +684     
==========================================
+ Hits        70758    71511     +753     
+ Misses      16525    16417     -108     
- Partials     2906     2945      +39     
| Flag | Coverage Δ |
|---|---|
| unittests | 78.69% <91.55%> (+0.23%) ⬆️ |


@wjones127 wjones127 changed the title from "fix: delta index fragment bitmaps contained previous index coverage" to "fix!: delta index fragment bitmaps contained previous index coverage" Jan 15, 2025
- let num_indexed_rows = num_indexed_rows_per_delta.iter().last().unwrap();
+ let num_indexed_rows: usize = num_indexed_rows_per_delta.iter().cloned().sum();
Contributor Author

@chebbyChefNEQ I saw you wrote .last() in #2979. Do you remember why this made sense to you at the time?

Contributor

num_indexed_rows_per_delta is a vec of the cumulative number of rows indexed, I think. So we get a vec like [100, 150, 200] instead of [100, 50, 50].

Contributor

so .last() is the total number of indexed rows

Contributor Author

Was it intentional that it was cumulative? I have been treating it as a bug in this PR

Contributor

I'm not sure. I don't think we ever made a contract around this. I agree that cumulative seems like a bug and we should record the number of indexed rows per delta
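
To make the two readings in this thread concrete, here is a small sketch using the example numbers above. The function names are hypothetical and this is not the code in the PR; it only shows why `.last()` is the right total for a cumulative vector and `.sum()` is the right total for per-delta counts.

```rust
/// Total when each entry is the row count of a single delta: add them up.
pub fn total_from_per_delta(counts: &[usize]) -> usize {
    counts.iter().sum()
}

/// Total when each entry is a running total: the last entry is the answer.
/// An empty slice is treated as zero indexed rows.
pub fn total_from_cumulative(counts: &[usize]) -> usize {
    counts.last().copied().unwrap_or(0)
}

/// Convert running totals into per-delta counts.
/// Assumes the input is non-decreasing; saturating_sub only guards
/// against underflow if that assumption is violated.
pub fn cumulative_to_per_delta(cumulative: &[usize]) -> Vec<usize> {
    let mut previous = 0;
    cumulative
        .iter()
        .map(|&running_total| {
            let delta = running_total.saturating_sub(previous);
            previous = running_total;
            delta
        })
        .collect()
}
```

Both readings give 200 for the example: `[100, 50, 50]` summed, or the last entry of `[100, 150, 200]`. Summing a vector that is actually cumulative would overcount (450 in this example), which is why the stored representation and the aggregation have to change together.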

@wjones127 wjones127 marked this pull request as ready for review January 15, 2025 23:06
Contributor

@westonpace westonpace left a comment

Thanks for working through this

@wjones127 wjones127 merged commit 2b784b3 into lancedb:main Jan 20, 2025
27 checks passed
@wjones127 wjones127 deleted the fix/delta-index-bitmap branch January 20, 2025 19:30
wjones127 added a commit that referenced this pull request Feb 1, 2025
Follow-up to #3377. That PR made
`index_statistics()` error by default. This ended up being a footgun for
some users who rely heavily on that method. So instead of forcing
users to do the migration themselves, we do it for them. It can be disabled
using an environment variable.
Successfully merging this pull request may close these issues.

Delta indices have wrong fragment bitmap
4 participants