
feat: add support for ngram indices #3468

Merged: 7 commits into lancedb:main on Feb 25, 2025

Conversation

westonpace (Contributor):

Ngram indices are indices that can speed up various string filters. To start with they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query.

This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query; it only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we still need to apply the filter to the results of the index search. A good portion of this PR is adding support for this concept into the scanner.
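To make the narrow-then-recheck flow concrete, here is a toy sketch of the idea (illustration only; the actual implementation uses Roaring bitmaps and a tantivy tokenizer rather than hash sets):

```rust
use std::collections::{HashMap, HashSet};

/// Toy trigram index: maps each trigram to the set of row ids containing it.
struct TrigramIndex {
    postings: HashMap<String, HashSet<u64>>,
}

impl TrigramIndex {
    fn trigrams(text: &str) -> Vec<String> {
        let chars: Vec<char> = text.chars().collect();
        chars.windows(3).map(|w| w.iter().collect()).collect()
    }

    fn train(rows: &[(u64, &str)]) -> Self {
        let mut postings: HashMap<String, HashSet<u64>> = HashMap::new();
        for (row_id, text) in rows {
            for gram in Self::trigrams(text) {
                postings.entry(gram).or_default().insert(*row_id);
            }
        }
        Self { postings }
    }

    /// Returns a superset of the rows matching contains(col, substr).
    /// A row qualifies only if it contains every trigram of the query, but
    /// trigram co-occurrence does not imply adjacency, so the caller must
    /// still recheck each candidate against the real filter.
    fn candidates(&self, substr: &str) -> HashSet<u64> {
        Self::trigrams(substr)
            .iter()
            .map(|gram| self.postings.get(gram).cloned().unwrap_or_default())
            .reduce(|acc, ids| acc.intersection(&ids).cloned().collect())
            .unwrap_or_default()
    }
}

fn main() {
    let index = TrigramIndex::train(&[(0, "lance"), (1, "dance"), (2, "cedar")]);
    // Rows 0 and 1 both contain the trigrams "anc" and "nce", so both are
    // candidates; the recheck step then applies the real contains() filter.
    for row in index.candidates("ance") {
        println!("candidate row {row} must still be rechecked");
    }
}
```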

github-actions bot added the enhancement (New feature or request) and python labels on Feb 21, 2025.
codecov-commenter commented Feb 21, 2025:

Codecov Report

Attention: Patch coverage is 55.65611% with 392 lines in your changes missing coverage. Please review.

Project coverage is 78.46%. Comparing base (b185a27) to head (3d43c6c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/ngram.rs | 34.44% | 227 Missing and 9 partials ⚠️ |
| rust/lance-index/src/scalar/expression.rs | 37.03% | 65 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/inverted/index.rs | 15.00% | 17 Missing ⚠️ |
| rust/lance-index/src/scalar.rs | 69.23% | 11 Missing and 1 partial ⚠️ |
| rust/lance/src/index/scalar.rs | 71.87% | 6 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/btree.rs | 38.46% | 6 Missing and 2 partials ⚠️ |
| rust/lance-index/src/scalar/label_list.rs | 52.94% | 7 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/scanner.rs | 94.52% | 0 Missing and 8 partials ⚠️ |
| rust/lance-core/src/utils/mask.rs | 0.00% | 7 Missing ⚠️ |
| rust/lance/src/io/exec/scalar_index.rs | 40.00% | 3 Missing and 3 partials ⚠️ |

... and 6 more
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #3468      +/-   ##
==========================================
- Coverage   78.68%   78.46%   -0.22%     
==========================================
  Files         251      252       +1     
  Lines       93030    93792     +762     
  Branches    93030    93792     +762     
==========================================
+ Hits        73201    73597     +396     
- Misses      16853    17200     +347     
- Partials     2976     2995      +19     
```

| Flag | Coverage Δ |
|---|---|
| unittests | 78.46% <55.65%> (-0.22%) ⬇️ |

Flags with carried forward coverage won't be shown.

Comment on lines +244 to +298
```rust
if field.data_type() != DataType::Utf8 {
    return Err(Error::InvalidInput {
        source: "NGram index can only be created on Utf8 type columns".into(),
        location: location!(),
    });
}
```
Contributor:

Is it a TODO to support LargeString? If it's not too hard, it might be worth supporting from the start.

westonpace (Contributor Author):

I don't think it would be too hard but I don't know if it would be a good idea.

Well, if it's "I'm using large string because polars uses large string" then it's fine.

If it's "I'm using large string because my strings are several MB apiece" then I think an ngram index will be mostly ineffective, since most strings of that size will contain most trigrams anyway.

So it's a question between locking out users and opening up a footgun. I suppose the footgun case is probably rare enough that it might be better to be inclusive.

Contributor:

I'm fine with skipping for now. TBH I'm not sure what our policy should be for string type support. It seems nice to be agnostic about it. It would also be nice to future proof for more StringView support.

westonpace (Contributor Author):

Since we are primarily a storage layer, I think that, in a perfect world, we would store string, large_string, and string_view as simply "string" and allow the encoding to be chosen at read time.

westonpace (Contributor Author):

Deferring for now as this PR is getting quite complicated.

```rust
.filter(tantivy::tokenizer::AlphaNumOnlyFilter)
.filter(tantivy::tokenizer::AsciiFoldingFilter)
```
Contributor:

BTW have you tested that ascii folding works in combination with AlphaNumOnly? I wonder if we might need to switch the order so the folding happens before the filter.

Suggested change:

```diff
-.filter(tantivy::tokenizer::AlphaNumOnlyFilter)
-.filter(tantivy::tokenizer::AsciiFoldingFilter)
+.filter(tantivy::tokenizer::AsciiFoldingFilter)
+.filter(tantivy::tokenizer::AlphaNumOnlyFilter)
```

westonpace (Contributor Author):

I have not. I will add some tests for this and make sure I have the order correct (I have no idea).

westonpace (Contributor Author):

Added some unit tests to verify tokenizer behavior. You were correct, these were not in the right order.
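To illustrate the ordering issue without tantivy, here is a toy sketch (the two functions are hypothetical stand-ins for AsciiFoldingFilter and AlphaNumOnlyFilter, not the PR's actual tests):

```rust
// Toy stand-in for AsciiFoldingFilter: folds 'é' to 'e'.
fn ascii_fold(token: &str) -> String {
    token.chars().map(|c| if c == 'é' { 'e' } else { c }).collect()
}

// Toy stand-in for AlphaNumOnlyFilter: keeps only pure ASCII-alphanumeric tokens.
fn is_alpha_num_only(token: &str) -> bool {
    token.chars().all(|c| c.is_ascii_alphanumeric())
}

fn main() {
    let token = "café";
    // AlphaNumOnly first: 'é' is not ASCII alphanumeric, so the token is
    // dropped before folding ever gets a chance to rescue it.
    assert!(!is_alpha_num_only(token));
    // Folding first: 'é' becomes 'e' and the token survives the filter.
    let folded = ascii_fold(token);
    assert!(is_alpha_num_only(&folded));
    assert_eq!(folded, "cafe");
}
```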

Comment on lines +113 to +115
```rust
fn deep_size_of_children(&self, _: &mut deepsize::Context) -> usize {
    self.cache.weighted_size() as usize
}
```
Contributor:

Why is this the correct answer? Wouldn't it be 4 bytes + Size of NGramPostingList per entry?

TBH it's unclear what weighted size does.

westonpace (Contributor Author):

When we define the cache we say:

```rust
cache: Cache::builder()
    .max_capacity(*CACHE_SIZE as u64)
    .weigher(|_, posting: &Arc<NGramPostingList>| posting.deep_size_of() as u32)
    .build(),
```

So the weigher is set to deep_size_of for the NGramPostingList. In other words, this is a bytes-based cache (not an entry-based cache). As a result, the weighted_size should be a fast approximation of the total index size. (I may be wrong here, but this is my interpretation.)

Contributor:

Oh I missed the weigher. That is cool!

westonpace (Contributor Author):

Yes, but I can't take credit for it 😆 . I stole this pattern from @BubbleCal 😆

Comment on lines +392 to +439
```rust
fn process_batch(&mut self, batch: &RecordBatch) {
    let text_col = batch.column(0).as_string::<i32>();
    let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
```
Contributor:

Should we assert that there are zero nulls in both columns?

Contributor:

What happens if a user trains this on a string field that has nulls?

westonpace (Contributor Author):

Good question. Bad things. I will fix this and add tests.

westonpace (Contributor Author):

Nulls should now be handled correctly (we build a nulls bitmap). The index doesn't yet handle IS NULL but that's a parser problem and shouldn't be an index problem.
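A minimal sketch of the shape of that handling, assuming a Roaring treemap for the null rows (names are illustrative, not the PR's actual code):

```rust
use roaring::RoaringTreemap;

/// Partition rows into a null bitmap and the non-null (row_id, text)
/// pairs that proceed to tokenization.
fn split_nulls(rows: Vec<(u64, Option<String>)>) -> (RoaringTreemap, Vec<(u64, String)>) {
    let mut nulls = RoaringTreemap::new();
    let mut values = Vec::with_capacity(rows.len());
    for (row_id, text) in rows {
        match text {
            None => {
                // Nulls go into a dedicated bitmap rather than the postings,
                // which also leaves the door open for IS NULL support later.
                nulls.insert(row_id);
            }
            Some(text) => values.push((row_id, text)),
        }
    }
    (nulls, values)
}
```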

```rust
Self::validate_schema(schema.as_ref())?;

while let Some(batch) = data.try_next().await? {
    self.process_batch(&batch);
}
```
Contributor:

How easy would this be to parallelize? Would this be the place in the code to do that?

westonpace (Contributor Author):

It would be good to parallelize. Perhaps not trivially easy, because we have to build up a series of per-thread maps and then merge them at the end. We do this in the trainer for the inverted index. We also spill that index to disk, but I don't think we will need to do that here since the number of distinct trigrams is bounded.
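A sketch of the per-thread map-then-merge shape this suggests, using rayon (types are simplified; the real trainer works over record batches and Roaring bitmaps):

```rust
use rayon::prelude::*;
use std::collections::HashMap;

fn trigrams(text: &str) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Each rayon worker folds batches into its own postings map; the
/// per-worker maps are then merged pairwise at the end.
fn build_parallel(batches: Vec<Vec<(u64, String)>>) -> HashMap<String, Vec<u64>> {
    batches
        .into_par_iter()
        .fold(HashMap::new, |mut map: HashMap<String, Vec<u64>>, batch| {
            for (row_id, text) in batch {
                for gram in trigrams(&text) {
                    map.entry(gram).or_default().push(row_id);
                }
            }
            map
        })
        .reduce(HashMap::new, |mut merged, other| {
            for (gram, mut ids) in other {
                merged.entry(gram).or_default().append(&mut ids);
            }
            merged
        })
}
```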

westonpace (Contributor Author):

Deferring for follow-up to keep this PR constrained for now.


```rust
/// An ngram index
///
/// This index stores a mapping from ngrams to the row ids that contain them.
```
Contributor:

I think it might be useful to mention that this is implemented as an inverted index, even though it's not the same as an index_type=INVERTED in our code base.

westonpace (Contributor Author):

Revised the comment.

```rust
StringContains(String),
// TODO: In the future we should be able to do case-insensitive contains
// as well as partial matches (e.g. LIKE 'foo%') and potentially even
// some regular expressions
```
Contributor:

We should also be able to accelerate regular equality queries out of this index.

Since AFAIK we only support one index type per column, will it be annoying if users need to pick between faster "contains" and faster equality, by choosing between ngram and btree/bitmap?

westonpace (Contributor Author):

Yeah, my hope was that we could support multiple indices (at least multiple index types) on a single column. But it might be useful to do equality with an ngram index anyway, since it should be much smaller for big text.

westonpace (Contributor Author):

I'm going to add support for equality and IS NULL in a future PR.

```rust
(IndexExprResult::Exact(lhs), IndexExprResult::AtLeast(_)) => {
    // We could do better here, elements in both lhs and rhs are known
    // to be true and don't require a recheck. We only need to recheck
    // elements in lhs that are not in rhs
```
Contributor:

It seems like we almost want these IndexExprResults to be sets of enums. That would also let us handle cases like negation in the new ngram index, where

```sql
select * from foo where not contains(t, "abcde")
```

currently would get a full recheck, but if we recognized the negation and served it from the index, we'd only need to recheck the rows that had overlapping trigrams and could answer exactly for the others.

westonpace (Contributor Author):

Yes. Right now we return:

```rust
{
  block_list: RowIdTreeMap,
  allow_list: RowIdTreeMap,
}
```

I think we eventually want something like:

```rust
{
  definitely_false: RowIdTreeMap,
  likely_false: RowIdTreeMap,
  likely_true: RowIdTreeMap,
  definitely_true: RowIdTreeMap,
}
```

However, we don't really have the machinery to process that kind of result. I was thinking this could be tackled around the same time as the "index size too large for a take" scenario. Let's sync on this today.

```rust
///
/// If false is returned then the query still may be answered exactly but if true is returned
/// then the query must be answered exactly
fn can_answer_exact(&self, query: &dyn AnyQuery) -> bool;
```
Contributor:

Do we actually need this with the changes to ScalarIndexExpr?

westonpace (Contributor Author):

No. Right now we rely on AnyQuery::needs_recheck. I don't love that pattern, though; AnyQuery isn't really the right spot to make that check. It works today because there is only one index that can answer TextQuery, but there could (in theory) be indices in the future that can answer TextQuery exactly. Or maybe a more likely scenario is an index that can answer SargableQuery inexactly. I'd like to eventually get rid of AnyQuery::needs_recheck, but I couldn't quite work out how to use ScalarIndex::can_answer_exact, because today we need to make the decision early in the planning process, before we have access to the scalar index. So I'm leaving this in as the long-term goal.

```rust
let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
for (text, row_id) in text_col.iter().zip(row_id_col.values()) {
    let text = text.unwrap();
    let tokens = self.tokenizer.clone().tokenize(text);
```
Contributor:

Should this list be deduplicated? I don't see any indication that tantivy handles that for us.

Contributor:

I also wonder if during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g for words).

Contributor:

One more thing to pile on: I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?

westonpace (Contributor Author):

> I also wonder if during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g. for words).

I'm not sure I understand this point.

westonpace (Contributor Author):

> I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?

It could probably be substantial. I've removed the excess allocations.

> should this list be deduplicated? I don't see any indication tantivy handles that for us

Tantivy does not deduplicate. I'm not sure how much savings we would get in training. We'd have to go back to allocating the vector so that we could sort the tokens. We can test it later.

It might be better to deduplicate on search, as I think right now we end up doing a bitmap AND of a treemap with itself, which isn't trivial.
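A sketch of that search-side dedup (assuming the query's trigrams arrive as a Vec):

```rust
/// Dedup query trigrams before intersecting posting lists so we never
/// AND a bitmap with itself (e.g. "aaaa" yields the trigram "aaa" twice).
fn dedup_tokens(mut tokens: Vec<String>) -> Vec<String> {
    tokens.sort_unstable();
    tokens.dedup();
    tokens
}
```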

Contributor:

@westonpace I did a bit of experimentation with the patch on a dataset of reddit comments. My findings were:

- No savings from sort + dedup; it works out slower. I think inserting "duplicates" into the bitmap is probably very quick.
- ~25% speedup in the build from a caching strategy, where I split the input strings on spaces, tokenize each "word" individually, and maintain a cache of word tokenizations (hash map). I didn't verify that this results in the same tokenizations as tokenizing the full input in one shot, but I think it probably would? Might be something to explore (see the sketch below).

My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.
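A sketch of the word-level cache described above (helper names are illustrative; note that splitting on whitespace drops cross-word trigrams, which is why results may differ slightly from one-shot tokenization):

```rust
use std::collections::HashMap;

fn trigrams(word: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    chars.windows(3).map(|w| w.iter().collect()).collect()
}

/// Split on whitespace and memoize each word's trigrams so repeated
/// words are only tokenized once across the whole build.
fn tokenize_cached(text: &str, cache: &mut HashMap<String, Vec<String>>) -> Vec<String> {
    let mut out = Vec::new();
    for word in text.split_whitespace() {
        let grams = cache
            .entry(word.to_string())
            .or_insert_with(|| trigrams(word));
        out.extend(grams.iter().cloned());
    }
    out
}
```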

westonpace (Contributor Author):

> My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.

Yep. Just testing now on 200M large docs and it's pretty brutal. I'll try and get a follow-up with parallelization quickly.

```rust
Self {
    tokenizer,
    // Default capacity loosely based on case insensitive ascii ngrams with punctuation stripped
    tokens_map: HashMap::with_capacity(4 * 26 * 26 * 26),
```
Contributor:

Where does the 4 come from? Also, this assumes trigrams, not general ngrams, I think.

westonpace (Contributor Author):

I noticed some indices with more than 26 * 26 * 26 items but didn't chase it down. Now I've realized the problem was that I wasn't considering the digits (0-9). I've updated to 36 * 36 * 36 (26 letters + 10 digits = 36 characters, so 46,656 possible trigrams).

```diff
@@ -1876,6 +1865,12 @@ impl Scanner {
         (**self.dataset.fragments()).clone()
     };
 
+    // If this unwrap fails we have a bug because we shouldn't be using this function unless we've already
+    // checked that there is an index query
+    let index_expr = filter_plan.index_query.as_ref().unwrap();
```
Contributor:

Might be more of a philosophical question, but why should that crash the server? In a multi-tenant environment that would bring down all users, not just the one with the messed-up install / making the request.

westonpace (Contributor Author):

Discussed offline. Will summarize here:

I used unwrap because I'd prefer to get a stack trace (not just a line number) if this error occurred. Why was index_query unset? Was the user maybe using a table provider and we hit some weird corner case? Was this just some branch of create_plan I hadn't considered? Hopefully the user provides enough context but I'd rather not hope.

I also use unwrap because, if panics are bringing down a multi-tenant server, then that itself is the problem (catch_unwind can be used to deal with that).

Also, translating every panic into error makes the error return pretty much meaningless. Instead of "this will return error if X, Y, or Z happens" I have to say, at least, "this will return error if X, Y, Z, or something I don't know about happened". However, I will admit this ship has sailed. Users probably assume just about any Lance call can fail for just about any reason. There are just too many things that can go wrong in a database.

I will admit that panic is not great. Panics do still bring down applications. Internal error would at least record the line number, which is usually good enough.

Maybe we can make a simple macro / ext function like safe_expect which returns an internal error and also logs the current stack trace?
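A minimal sketch of what such a helper could look like (names here are hypothetical, and it assumes an Error::Internal-style variant with message and location fields like the one Lance reports below):

```rust
/// Sketch: turn a None into an internal error while logging a backtrace,
/// instead of panicking like unwrap() does.
macro_rules! safe_expect {
    ($opt:expr, $msg:expr) => {
        $opt.ok_or_else(|| {
            // force_capture works even when RUST_BACKTRACE is unset
            let backtrace = std::backtrace::Backtrace::force_capture();
            log::error!("internal invariant violated: {}\n{}", $msg, backtrace);
            Error::Internal {
                message: $msg.to_string(),
                location: location!(),
            }
        })
    };
}

// Hypothetical usage at the call site discussed above:
// let index_expr = safe_expect!(filter_plan.index_query.as_ref(), "index query must be set")?;
```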

westonpace (Contributor Author):

FWIW I just ran into this 😆

```
E       ValueError: LanceError(IO): Internal error: compute_projection: expected struct.
E       This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker, /home/pace/dev/lance/rust/lance/src/dataset/scanner.rs:1497:29
```

Fortunately, the string `compute_projection: expected struct` was unique enough to make up for the missing stack trace.

```rust
pub static ref TOKENS_FIELD: Field = Field::new(TOKENS_COL, DataType::Utf8, false);
pub static ref POSTINGS_FIELD: Field = Field::new(POSTING_LIST_COL, DataType::Binary, false);
pub static ref POSTINGS_SCHEMA: SchemaRef = Arc::new(Schema::new(vec![TOKENS_FIELD.clone(), POSTINGS_FIELD.clone()]));
/// Currently we ALWAYS use trigrams with ascii folding and lower casing. We may want to make this configurable in the future.
```
Contributor:

What's the benefit of ascii folding? Won't that cause a lower rate of success on recheck?

westonpace (Contributor Author):

I suppose I was thinking we might someday want an `ascii_folding_contains(prompt, 'cafe')`, but that would mean we'd need to write the equivalent UDF for the in-memory check.

Contributor:

I was wondering if it might be costing us time in the build as well

westonpace and others added 2 commits February 25, 2025 07:08
Co-authored-by: Will Jones <willjones127@gmail.com>
wkalt (Contributor) commented Feb 25, 2025:

This looks good to me. I have done some local testing with a 10M row dataset and the impact is significant, at least for selective searches. The index is slow to build and could benefit from parallelization, but that doesn't need to hold things up for now.

westonpace merged commit 59d6596 into lancedb:main on Feb 25, 2025. 27 checks passed.
Labels: enhancement (New feature or request), python

4 participants