-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix!: delta index fragment bitmaps contained previous index coverage #3377
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3377 +/- ##
==========================================
+ Coverage 78.45% 78.69% +0.23%
==========================================
Files 250 250
Lines 90189 90873 +684
Branches 90189 90873 +684
==========================================
+ Hits 70758 71511 +753
+ Misses 16525 16417 -108
- Partials 2906 2945 +39
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
let num_indexed_rows = num_indexed_rows_per_delta.iter().last().unwrap(); | ||
let num_indexed_rows: usize = num_indexed_rows_per_delta.iter().cloned().sum(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chebbyChefNEQ I saw you wrote .last()
in #2979. Do you remember why this made sense to you at the time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_indexed_rows_per_delta
is a vec of cumulative number of rows indexed I think. so we get a vec like
[100, 150, 200] instead of [100, 50, 50]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so .last()
is the total number of indexed rows
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was it intentional that it was cumulative? I have been treating it as a bug in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I don't think we ever made a contract around this. I agree that cumulative seems like a bug and we should record the number of indexed rows per delta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working through this
Follow up to #3377. That PR made `index_statistics()` error by default. This ended up being a footgun for some users who rely heavily on that method. So instead of forcing the user to do the migration themself, we do it for them. It can be disabled using an environment variable.
BREAKING CHANGE: delta index fragment bitmaps will now only contain the fragment ids covered by the delta, not the full index. To get the full bitmap, make sure to union with all index segments with the same name. Old datasets will still show previous fragment ids, until a write is done, which forces a migration. If corrupted fragment ids are present in a dataset, then the
dataset.index_statistics
will return an error. Before usingdataset.index_statistics()
, calldataset.validate()
to check the integrity and usedataset.delete("false")
to force a migration.Fixes #3374