feat: add support for ngram indices #3468
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##               main    #3468      +/-   ##
==========================================
- Coverage     78.68%   78.46%    -0.22%
==========================================
  Files           251      252        +1
  Lines         93030    93792      +762
==========================================
+ Hits          73201    73597      +396
- Misses        16853    17200      +347
- Partials       2976     2995       +19
if field.data_type() != DataType::Utf8 {
    return Err(Error::InvalidInput {
        source: "NGram index can only be created on Utf8 type columns".into(),
        location: location!(),
    });
}
Is it a todo to support LargeString? If it's not too hard, might be worth supporting it from the start.
I don't think it would be too hard but I don't know if it would be a good idea.
Well, if it's "I'm using large string because polars uses large string" then it's fine.
If it's "I'm using large string because my strings are several MB apiece" then I think an ngram index will be mostly ineffective since most strings at that size will have most trigrams in them anyway.
So it's a question between locking out users and opening up a footgun. I suppose the footgun case is probably rare enough that it might be better to be inclusive.
I'm fine with skipping for now. TBH I'm not sure what our policy should be for string type support. It seems nice to be agnostic about it. It would also be nice to future-proof for more StringView support.
Since we are primarily storage I think that, in a perfect world, we would store string, large_string, and string_view as simply "string" and allow the encoding to be chosen at read time.
Deferring for now as this PR is getting quite complicated.
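For the record, if LargeString support is revisited later the widened check is small. A minimal sketch, assuming a standalone validation helper (the PR as written only accepts Utf8):

```rust
use arrow_schema::DataType;

// Hypothetical widening of the check above: accept both 32-bit and 64-bit
// offset string columns. The merged PR only accepts Utf8.
fn validate_type(data_type: &DataType) -> Result<(), String> {
    match data_type {
        DataType::Utf8 | DataType::LargeUtf8 => Ok(()),
        other => Err(format!(
            "NGram index can only be created on string columns, got {other:?}"
        )),
    }
}
```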
rust/lance-index/src/scalar/ngram.rs
Outdated
.filter(tantivy::tokenizer::AlphaNumOnlyFilter)
.filter(tantivy::tokenizer::AsciiFoldingFilter)
BTW have you tested that ascii folding works in combination with AlphaNumOnly? I wonder if we might need to switch the order so the folding happens before the filter.
Suggested change:

-    .filter(tantivy::tokenizer::AlphaNumOnlyFilter)
-    .filter(tantivy::tokenizer::AsciiFoldingFilter)
+    .filter(tantivy::tokenizer::AsciiFoldingFilter)
+    .filter(tantivy::tokenizer::AlphaNumOnlyFilter)
I have not. I will add some tests for this and make sure I have the order correct (I have no idea)
Added some unit tests to verify tokenizer behavior. You were correct, these were not in the right order.
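For reference, a sketch of what such a test might look like, assuming tantivy's builder-style TextAnalyzer API (constructor signatures vary across tantivy versions):

```rust
use tantivy::tokenizer::{
    AlphaNumOnlyFilter, AsciiFoldingFilter, LowerCaser, NgramTokenizer, TextAnalyzer, TokenStream,
};

#[test]
fn ascii_folding_runs_before_alphanum_filter() {
    // Folding must come first: "Café" -> trigrams "Caf", "afé" -> lowercased
    // -> folded to "caf", "afe" -> both survive AlphaNumOnlyFilter. In the
    // reversed order, "afé" contains a non-ascii character and would be
    // dropped before it could be folded.
    let mut analyzer = TextAnalyzer::builder(NgramTokenizer::new(3, 3, false).unwrap())
        .filter(LowerCaser)
        .filter(AsciiFoldingFilter)
        .filter(AlphaNumOnlyFilter)
        .build();

    let mut stream = analyzer.token_stream("Café");
    let mut tokens = Vec::new();
    while stream.advance() {
        tokens.push(stream.token().text.clone());
    }
    assert_eq!(tokens, vec!["caf", "afe"]);
}
```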
fn deep_size_of_children(&self, _: &mut deepsize::Context) -> usize {
    self.cache.weighted_size() as usize
Why is this the correct answer? Wouldn't it be 4 bytes + the size of NGramPostingList per entry? TBH it's unclear what weighted size does.
When we define the cache we say:
cache: Cache::builder()
    .max_capacity(*CACHE_SIZE as u64)
    .weigher(|_, posting: &Arc<NGramPostingList>| posting.deep_size_of() as u32)
    .build(),
So the weigher is set to deep_size_of for the NGramPostingList. In other words, this is a bytes-based cache (not an entry-based cache). As a result, the weighted_size should be a fast approximation of the total index size (I may be wrong here but this is my interpretation)
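Condensed, the pattern looks something like this (a sketch assuming moka's sync Cache and the deepsize crate, which is what the names above suggest):

```rust
use std::sync::Arc;

use deepsize::DeepSizeOf;
use moka::sync::Cache;

#[derive(DeepSizeOf)]
struct NGramPostingList {
    row_ids: Vec<u64>,
}

struct PostingListCache {
    cache: Cache<u64, Arc<NGramPostingList>>,
}

impl PostingListCache {
    fn new(max_bytes: u64) -> Self {
        Self {
            // The weigher makes this a bytes-based cache: each entry "weighs"
            // its deep size in bytes, so max_capacity is a byte budget rather
            // than an entry count.
            cache: Cache::builder()
                .max_capacity(max_bytes)
                .weigher(|_, posting: &Arc<NGramPostingList>| posting.deep_size_of() as u32)
                .build(),
        }
    }

    // weighted_size() sums the entry weights, i.e. the total bytes cached,
    // which is why it doubles as a fast deep-size estimate.
    fn approx_size_bytes(&self) -> usize {
        self.cache.weighted_size() as usize
    }
}
```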
Oh I missed the weigher. That is cool!
Yes, but I can't take credit for it 😆. I stole this pattern from @BubbleCal 😆
fn process_batch(&mut self, batch: &RecordBatch) {
    let text_col = batch.column(0).as_string::<i32>();
    let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
Should we assert that there are zero nulls in both columns?
What happens if a user trains this on a string field that has nulls?
Good question. Bad things. I will fix this and add tests.
Nulls should now be handled correctly (we build a nulls bitmap). The index doesn't yet handle IS NULL but that's a parser problem and shouldn't be an index problem.
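Roughly, the shape of the fix (a hypothetical sketch; the real trainer records null row ids in a bitmap alongside the token map):

```rust
use arrow_array::{cast::AsArray, types::UInt64Type, RecordBatch};
use roaring::RoaringTreemap;

// Sketch: nulls are skipped by the tokenizer and recorded in a separate
// bitmap, so training no longer hits text.unwrap() on a null, and an
// IS NULL query could later be served from the bitmap.
fn process_batch(batch: &RecordBatch, nulls: &mut RoaringTreemap) {
    let text_col = batch.column(0).as_string::<i32>();
    let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
    for (text, row_id) in text_col.iter().zip(row_id_col.values()) {
        match text {
            Some(text) => {
                // tokenize `text` and add `row_id` to each token's posting list
                let _ = text;
            }
            None => {
                nulls.insert(*row_id);
            }
        }
    }
}
```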
Self::validate_schema(schema.as_ref())?;

while let Some(batch) = data.try_next().await? {
    self.process_batch(&batch);
How easy would this be to parallelize? Would this be the place in the code to do that?
It would be good to parallelize. Perhaps not trivially easy because we have to build up a series of per-thread maps and then merge them in the end. We do this in the trainer for the inverted index. We also spill that index to disk but I don't think we will need to do that here since the # of trigrams is bounded.
Deferring to a follow-up to keep this PR constrained for now.
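For reference, the per-thread-maps shape described above might look roughly like this with rayon (a hedged sketch for the follow-up, not code from this PR):

```rust
use std::collections::HashMap;

use rayon::prelude::*;
use roaring::RoaringTreemap;

// Sketch: tokenize batches in parallel, building one token map per rayon
// worker, then merge the maps. Unlike the inverted index trainer there is
// no spill to disk because the number of distinct trigrams is bounded.
fn train_parallel(
    batches: Vec<(Vec<String>, Vec<u64>)>, // (texts, row ids) per batch
    tokenize: impl Fn(&str) -> Vec<String> + Sync,
) -> HashMap<String, RoaringTreemap> {
    batches
        .par_iter()
        .map(|(texts, row_ids)| {
            let mut map: HashMap<String, RoaringTreemap> = HashMap::new();
            for (text, row_id) in texts.iter().zip(row_ids) {
                for token in tokenize(text) {
                    map.entry(token).or_default().insert(*row_id);
                }
            }
            map
        })
        .reduce(HashMap::new, |mut merged, other| {
            for (token, rows) in other {
                *merged.entry(token).or_default() |= rows;
            }
            merged
        })
}
```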
rust/lance-index/src/scalar/ngram.rs
Outdated
/// An ngram index
///
/// This index stores a mapping from ngrams to the row ids that contain them.
I think it might be useful to mention that this is implemented as an inverted index, even though it's not the same as an index_type=INVERTED in our code base.
Revised the comment
StringContains(String),
// TODO: In the future we should be able to do case-insensitive contains
// as well as partial matches (e.g. LIKE 'foo%') and potentially even
// some regular expressions
We should also be able to accelerate regular equality queries out of this index.
Since AFAIK we only support one index type per column, will it be annoying if users need to pick between faster "contains" and faster equality, by choosing between ngram and btree/bitmap?
Yeah, my hope was that we could support multiple indices (at least multiple index types) on a single column, but it might be useful to do equality with an ngram index since it should be much smaller for big text.
I'm going to add support for equality and IS NULL in a future PR.
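Conceptually, equality falls out of the same machinery, since col = 'lit' implies contains(col, 'lit'). A hedged sketch of the idea (hypothetical names, not part of this PR):

```rust
use std::collections::HashMap;

use roaring::RoaringTreemap;

// Sketch: intersecting the posting lists of the literal's trigrams yields an
// over-approximation of the rows equal to the literal; the recheck step then
// compares values exactly.
fn equality_candidates(
    literal: &str,
    tokenize: impl Fn(&str) -> Vec<String>,
    postings: &HashMap<String, RoaringTreemap>,
) -> RoaringTreemap {
    let mut result: Option<RoaringTreemap> = None;
    for token in tokenize(literal) {
        // A trigram with no posting list means no row can match at all.
        let rows = postings.get(&token).cloned().unwrap_or_default();
        result = Some(match result {
            Some(acc) => acc & rows,
            None => rows,
        });
    }
    // The caller must still recheck candidates against `literal` exactly.
    result.unwrap_or_default()
}
```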
(IndexExprResult::Exact(lhs), IndexExprResult::AtLeast(_)) => {
    // We could do better here, elements in both lhs and rhs are known
    // to be true and don't require a recheck. We only need to recheck
    // elements in lhs that are not in rhs
It seems like we almost want these IndexExprResults to be sets of enums. That would also let us handle cases like negation in the new ngram index, where
select * from foo where not contains(t, "abcde")
currently would get a full recheck, but if we recognized the negation and served it from the index, we'd only need to recheck the rows that had overlapping trigrams and could answer exactly for the others.
Yes. Right now we return:

{
    block_list: RowIdTreeMap,
    allow_list: RowIdTreeMap,
}

I think we eventually want something like:

{
    definitely_false: RowIdTreeMap,
    likely_false: RowIdTreeMap,
    likely_true: RowIdTreeMap,
    definitely_true: RowIdTreeMap,
}

However, we don't really have the machinery to process that kind of result. I was thinking this could be tackled around the same time as the "index size too large for a take" scenario. Let's sync on this today.
///
/// If false is returned then the query still may be answered exactly but if true is returned
/// then the query must be answered exactly
fn can_answer_exact(&self, query: &dyn AnyQuery) -> bool;
Do we actually need this with the changes to ScalarIndexExpr?
No. Right now we rely on AnyQuery::needs_recheck. I don't love that pattern though. AnyQuery isn't really the right spot to make that check. It works today because there is only one index that can answer TextQuery, but there could (in theory) be indices in the future that can answer TextQuery exactly. Or, maybe a more likely scenario, an index that can answer SargableQuery inexactly. I'd like to eventually get rid of AnyQuery::needs_recheck but I couldn't quite work it out to use ScalarIndex::can_answer_exact, because today we need to make the decision early in the planning process, before we have access to the scalar index. So I'm leaving this in as the long-term goal.
rust/lance-index/src/scalar/ngram.rs
Outdated
let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
for (text, row_id) in text_col.iter().zip(row_id_col.values()) {
    let text = text.unwrap();
    let tokens = self.tokenizer.clone().tokenize(text);
Should this list be deduplicated? I don't see any indication that tantivy handles that for us.
I also wonder if, during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g. for words).
One more thing to pile on: I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?
> I also wonder if, during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g. for words).

I'm not sure I understand this point
> I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?

It could probably be substantial. I've removed the excess allocations.

> Should this list be deduplicated? I don't see any indication that tantivy handles that for us.

Tantivy does not deduplicate. I'm not sure how much savings we would get in training. We'd have to go back to allocating the vector so that we could sort the tokens. We can test it later.

It might be better to deduplicate on search, as I think right now we end up doing a bitmap AND of a treemap with itself, which isn't trivial.
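A query-time dedup could be as small as this (hypothetical sketch):

```rust
use std::collections::HashSet;

// Sketch: deduplicate query trigrams before intersecting posting lists, so a
// query like contains(col, 'aaaa') (trigrams "aaa", "aaa") doesn't AND a
// treemap with itself.
fn dedup_tokens(tokens: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    tokens
        .into_iter()
        .filter(|token| seen.insert(token.clone()))
        .collect()
}
```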
@westonpace I did a bit of experimentation with the patch on a dataset of reddit comments. My findings were:
- No savings from sort + dedup - works out slower. I think probably inserting "duplicates" into the bitmap is very quick.
- ~25% speedup in the build from a caching strategy, where I split the input strings on spaces, tokenize each "word" individually, and maintain a cache of word tokenizations (hash map). I didn't verify that this results in the same tokenizations as tokenizing the full input in one shot but I think it probably would? Might be something to explore.
My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.
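For reference, the word-level caching from the second bullet looks roughly like this (a sketch of the experiment; equivalence with whole-string tokenization is unverified, though trigrams spanning a space would presumably be dropped by AlphaNumOnlyFilter anyway):

```rust
use std::collections::HashMap;

// Sketch: split on spaces, tokenize each distinct word once, and memoize the
// per-word tokenizations so repeated words skip the tokenizer entirely.
fn tokenize_with_word_cache(
    text: &str,
    tokenize: impl Fn(&str) -> Vec<String>,
    cache: &mut HashMap<String, Vec<String>>,
) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in text.split(' ') {
        let word_tokens = cache
            .entry(word.to_string())
            .or_insert_with(|| tokenize(word));
        tokens.extend_from_slice(word_tokens);
    }
    tokens
}
```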
> My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.

Yep. Just testing now on 200M large docs and it's pretty brutal. I'll try and get a follow-up with parallelization quickly.
rust/lance-index/src/scalar/ngram.rs
Outdated
Self {
    tokenizer,
    // Default capacity loosely based on case insensitive ascii ngrams with punctuation stripped
    tokens_map: HashMap::with_capacity(4 * 26 * 26 * 26),
Where does the 4 come from? Also this assumes trigrams, not ngrams, I think.
I noticed some indices with more than 26 * 26 * 26 items but didn't chase it down. Now I've realized the problem was that I wasn't considering the digits (0-9). I've updated to 36 * 36 * 36 (26 letters + 10 digits = 36 characters, so 36³ = 46,656 possible trigrams).
@@ -1876,6 +1865,12 @@ impl Scanner {
        (**self.dataset.fragments()).clone()
    };

    // If this unwrap fails we have a bug because we shouldn't be using this function unless we've already
    // checked that there is an index query
    let index_expr = filter_plan.index_query.as_ref().unwrap();
Might be more of a philosophical question, but why should that crash the server? In a multitenant environment that would bring down all users, not just the one with the messed-up install/making the request.
Discussed offline. Will summarize here:

I used unwrap because I'd prefer to get a stack trace (not just a line number) if this error occurred. Why was index_query unset? Was the user maybe using a table provider and we hit some weird corner case? Was this just some branch of create_plan I hadn't considered? Hopefully the user provides enough context, but I'd rather not hope.

I also use unwrap because if panics are bringing down a multi-tenant server then that isn't good (catch_unwind can be used to deal with that).

Also, translating every panic into an error makes the error return pretty much meaningless. Instead of "this will return an error if X, Y, or Z happens" I have to say, at least, "this will return an error if X, Y, Z, or something I don't know about happened". However, I will admit this ship has sailed. Users probably assume just about any Lance call can fail for just about any reason. There are just too many things that can go wrong in a database.

I will admit that panic is not great. Panics do still bring down applications. An internal error would at least record the line number, which is usually good enough.

Maybe we can make a simple macro / extension function like safe_expect which returns an internal error and also logs the current stack trace?
FWIW I just ran into this 😆

E   ValueError: LanceError(IO): Internal error: compute_projection: expected struct.
E   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker, /home/pace/dev/lance/rust/lance/src/dataset/scanner.rs:1497:29

Fortunately, the string compute_projection: expected struct was unique enough to make up for the missing stack trace.
pub static ref TOKENS_FIELD: Field = Field::new(TOKENS_COL, DataType::Utf8, false);
pub static ref POSTINGS_FIELD: Field = Field::new(POSTING_LIST_COL, DataType::Binary, false);
pub static ref POSTINGS_SCHEMA: SchemaRef = Arc::new(Schema::new(vec![TOKENS_FIELD.clone(), POSTINGS_FIELD.clone()]));
/// Currently we ALWAYS use trigrams with ascii folding and lower casing. We may want to make this configurable in the future.
What's the benefit of ascii folding? Won't that cause a lower rate of success on recheck?
I suppose I was thinking we might someday want an ascii_folding_contains(prompt, cafe), but I suppose that would mean we'd need to write the equivalent UDF for the in-memory check.
I was wondering if it might be costing us time in the build as well
add support for ngram indices: force-pushed from 511edc5 to 892c0e6
Co-authored-by: Will Jones <willjones127@gmail.com>
This looks good to me. I have done some local testing with a 10M row dataset and the impact is significant, at least for selective searches. The index is slow to build and could benefit from parallelization, but that doesn't need to hold things up for now.
Ngram indices are indices that can speed up various string filters. To start with, they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query.

This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query; it only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we still need to apply the filter to the results of the index search. A good portion of this PR is adding support for this concept into the scanner.