Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix!: correctly handle nulls in btree and bitmap indices #3211

Merged
merged 4 commits into from
Dec 6, 2024

Conversation

westonpace
Copy link
Contributor

@westonpace westonpace commented Dec 5, 2024

There were a few issues with our null handling in scalar indices.

First, it appears I assumed earlier that X < NULL and X > NULL would always be false. However, in arrow-rs the ordering considers NULL to be "the smallest value" and so X < NULL always evaluated to true. This required some changes to the logic in the btree and bitmap indices.

Second, the btree index was still using the v1 file format because it relied on the page size to keep track of the index's batch size. I've instead made the batch size a configurable property (configurable in code, not configurable by users) and made it so that btree can use the v2 file format.

Finally, related to the above, I changed it so we now write v2 files for all scalar indices, even if the dataset is a v1 dataset. I think that's a reasonable decision at this point.

The logic to fallback and read the old v1 files was already in place (I believe @BubbleCal added it back when working on inverted index) but I added a migration test just to be sure we weren't breaking our btree / bitmap support.

Users with existing bitmap indices will get the new correct behavior without any changes.
Users with existing btree indices will get some of the new correct behavior but will need to retrain their indices to get all of the correct behavior.

BREAKING CHANGE: Bitmap and btree indices will no longer be readable by older versions of Lance. This is not a "backwards compatibility change" (no APIs or code will stop working) but rather a "forwards compatibility change" (you need to be careful in a multi-verison deployment or if you roll back)

@github-actions github-actions bot added bug Something isn't working python labels Dec 5, 2024
Copy link
Contributor

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Although I think we might consider this a breaking change, since IIUC older versions won't be able to read the new scalar indices, right? Or will they?

@@ -669,13 +674,13 @@ impl BTreeLookup {
/// Note: this is very similar to the IVF index except we store the IVF part in a btree
/// for faster lookup
#[derive(Clone, Debug, DeepSizeOf)]
pub struct BTreeIndex {
pub struct BTreeIndex<const BATCH_SIZE: u64 = DEFAULT_BTREE_BATCH_SIZE> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of making BATCH_SIZE a compile-time constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, it was less that I wanted to make it a compile time constant and more that I didn't have a good way of passing the batch size to BTreeIndex::load (a static method). However, I realize now I can put it in the file schema. Let me do that real quick.

@westonpace
Copy link
Contributor Author

Although I think we might consider this a breaking change, since IIUC older versions won't be able to read the new scalar indices, right? Or will they?

They will not be able to. They would get a runtime error about "legacy method called on v2 file". However, that's more of a forward compatibility concern right? This is technically not a breaking change according to semver but I'm fine considering it as one. Otherwise A/B style upgades will break I will update the title.

@westonpace westonpace changed the title fix: correctly handle nulls in btree and bitmap indices fix!: correctly handle nulls in btree and bitmap indices Dec 5, 2024
@github-actions github-actions bot added the java label Dec 5, 2024
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 81.13208% with 30 lines in your changes missing coverage. Please review.

Project coverage is 78.58%. Comparing base (f21397d) to head (3a299eb).

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/flat.rs 62.06% 9 Missing and 2 partials ⚠️
rust/lance-index/src/scalar/bitmap.rs 61.90% 8 Missing ⚠️
rust/lance-index/src/scalar/btree.rs 85.96% 4 Missing and 4 partials ⚠️
rust/lance-index/src/scalar/lance_format.rs 94.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3211      +/-   ##
==========================================
- Coverage   78.59%   78.58%   -0.01%     
==========================================
  Files         244      244              
  Lines       83957    84021      +64     
  Branches    83957    84021      +64     
==========================================
+ Hits        65986    66032      +46     
- Misses      15181    15193      +12     
- Partials     2790     2796       +6     
Flag Coverage Δ
unittests 78.58% <81.13%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@westonpace westonpace merged commit 1e349cd into lancedb:main Dec 6, 2024
23 of 26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants