-
Notifications
You must be signed in to change notification settings - Fork 265
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix!: correctly handle nulls in btree and bitmap indices #3211
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Although I think we might consider this a breaking change, since IIUC older versions won't be able to read the new scalar indices, right? Or will they?
rust/lance-index/src/scalar/btree.rs
Outdated
@@ -669,13 +674,13 @@ impl BTreeLookup { | |||
/// Note: this is very similar to the IVF index except we store the IVF part in a btree | |||
/// for faster lookup | |||
#[derive(Clone, Debug, DeepSizeOf)] | |||
pub struct BTreeIndex { | |||
pub struct BTreeIndex<const BATCH_SIZE: u64 = DEFAULT_BTREE_BATCH_SIZE> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the benefit of making BATCH_SIZE
a compile-time constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, it was less that I wanted to make it a compile time constant and more that I didn't have a good way of passing the batch size to BTreeIndex::load
(a static method). However, I realize now I can put it in the file schema. Let me do that real quick.
…. Address clippy warnings
They will not be able to. They would get a runtime error about "legacy method called on v2 file". However, that's more of a forward compatibility concern right? This is technically not a breaking change according to semver but I'm fine considering it as one. Otherwise A/B style upgades will break I will update the title. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3211 +/- ##
==========================================
- Coverage 78.59% 78.58% -0.01%
==========================================
Files 244 244
Lines 83957 84021 +64
Branches 83957 84021 +64
==========================================
+ Hits 65986 66032 +46
- Misses 15181 15193 +12
- Partials 2790 2796 +6
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There were a few issues with our null handling in scalar indices.
First, it appears I assumed earlier that
X < NULL
andX > NULL
would always be false. However, inarrow-rs
the ordering considersNULL
to be "the smallest value" and soX < NULL
always evaluated to true. This required some changes to the logic in the btree and bitmap indices.Second, the btree index was still using the v1 file format because it relied on the page size to keep track of the index's batch size. I've instead made the batch size a configurable property (configurable in code, not configurable by users) and made it so that btree can use the v2 file format.
Finally, related to the above, I changed it so we now write v2 files for all scalar indices, even if the dataset is a v1 dataset. I think that's a reasonable decision at this point.
The logic to fallback and read the old v1 files was already in place (I believe @BubbleCal added it back when working on inverted index) but I added a migration test just to be sure we weren't breaking our btree / bitmap support.
Users with existing bitmap indices will get the new correct behavior without any changes.
Users with existing btree indices will get some of the new correct behavior but will need to retrain their indices to get all of the correct behavior.
BREAKING CHANGE: Bitmap and btree indices will no longer be readable by older versions of Lance. This is not a "backwards compatibility change" (no APIs or code will stop working) but rather a "forwards compatibility change" (you need to be careful in a multi-verison deployment or if you roll back)