feat: add support for ngram indices #3468
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:
@@ Coverage Diff @@
##               main    #3468      +/-   ##
==========================================
- Coverage     78.68%   78.46%    -0.22%
==========================================
  Files           251      252        +1
  Lines         93030    93792      +762
==========================================
+ Hits          73201    73597      +396
- Misses        16853    17200      +347
- Partials       2976     2995       +19
if field.data_type() != DataType::Utf8 {
    return Err(Error::InvalidInput {
        source: "NGram index can only be created on Utf8 type columns".into(),
        location: location!(),
    });
}
Is it a todo to support LargeString? If it's not too hard, might be worth supporting it from the start.
I don't think it would be too hard but I don't know if it would be a good idea.
Well, if it's "I'm using large string because polars uses large string" then it's fine.
If it's "I'm using large string because my strings are several MB apiece" then I think an ngram index will be mostly ineffective since most strings at that size will have most trigrams in them anyway.
So it's a question between locking out users and opening up a footgun. I suppose the footgun case is probably rare enough that it might be better to be inclusive.
I'm fine with skipping for now. TBH I'm not sure what our policy should be for string type support. It seems nice to be agnostic about it. It would also be nice to future-proof for more StringView support.
Since we are primarily storage I think that, in a perfect world, we would store string, large_string, and string_view as simply "string" and allow the encoding to be chosen at read time.
Deferring for now as this PR is getting quite complicated.
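For the record, if LargeString support is revisited later the widened check is small. A minimal sketch, assuming a standalone validation helper (the PR as written only accepts Utf8):

```rust
use arrow_schema::DataType;

// Hypothetical widening of the check above: accept both 32-bit and 64-bit
// offset string columns. The merged PR only accepts Utf8.
fn validate_type(data_type: &DataType) -> Result<(), String> {
    match data_type {
        DataType::Utf8 | DataType::LargeUtf8 => Ok(()),
        other => Err(format!(
            "NGram index can only be created on string columns, got {other:?}"
        )),
    }
}
```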
rust/lance-index/src/scalar/ngram.rs
Outdated
.filter(tantivy::tokenizer::AlphaNumOnlyFilter)
.filter(tantivy::tokenizer::AsciiFoldingFilter)
BTW have you tested that ascii folding works in combination with AlphaNumOnly? I wonder if we might need to switch the order so the folding happens before the filter.
Suggested change:

-    .filter(tantivy::tokenizer::AlphaNumOnlyFilter)
-    .filter(tantivy::tokenizer::AsciiFoldingFilter)
+    .filter(tantivy::tokenizer::AsciiFoldingFilter)
+    .filter(tantivy::tokenizer::AlphaNumOnlyFilter)
I have not. I will add some tests for this and make sure I have the order correct (I have no idea)
Added some unit tests to verify tokenizer behavior. You were correct, these were not in the right order.
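For reference, a sketch of what such a test might look like, assuming tantivy's builder-style TextAnalyzer API (constructor signatures vary across tantivy versions):

```rust
use tantivy::tokenizer::{
    AlphaNumOnlyFilter, AsciiFoldingFilter, LowerCaser, NgramTokenizer, TextAnalyzer, TokenStream,
};

#[test]
fn ascii_folding_runs_before_alphanum_filter() {
    // Folding must come first: "Café" -> trigrams "Caf", "afé" -> lowercased
    // -> folded to "caf", "afe" -> both survive AlphaNumOnlyFilter. In the
    // reversed order, "afé" contains a non-ascii character and would be
    // dropped before it could be folded.
    let mut analyzer = TextAnalyzer::builder(NgramTokenizer::new(3, 3, false).unwrap())
        .filter(LowerCaser)
        .filter(AsciiFoldingFilter)
        .filter(AlphaNumOnlyFilter)
        .build();

    let mut stream = analyzer.token_stream("Café");
    let mut tokens = Vec::new();
    while stream.advance() {
        tokens.push(stream.token().text.clone());
    }
    assert_eq!(tokens, vec!["caf", "afe"]);
}
```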
fn deep_size_of_children(&self, _: &mut deepsize::Context) -> usize {
    self.cache.weighted_size() as usize
Why is this the correct answer? Wouldn't it be 4 bytes + the size of NGramPostingList per entry? TBH it's unclear what weighted size does.
When we define the cache we say:
cache: Cache::builder()
    .max_capacity(*CACHE_SIZE as u64)
    .weigher(|_, posting: &Arc<NGramPostingList>| posting.deep_size_of() as u32)
    .build(),
So the weigher is set to deep_size_of for the NGramPostingList. In other words, this is a bytes-based cache (not an entry-based cache). As a result, the weighted_size should be a fast approximation of the total index size (I may be wrong here but this is my interpretation)
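Condensed, the pattern looks something like this (a sketch assuming moka's sync Cache and the deepsize crate, which is what the names above suggest):

```rust
use std::sync::Arc;

use deepsize::DeepSizeOf;
use moka::sync::Cache;

#[derive(DeepSizeOf)]
struct NGramPostingList {
    row_ids: Vec<u64>,
}

struct PostingListCache {
    cache: Cache<u64, Arc<NGramPostingList>>,
}

impl PostingListCache {
    fn new(max_bytes: u64) -> Self {
        Self {
            // The weigher makes this a bytes-based cache: each entry "weighs"
            // its deep size in bytes, so max_capacity is a byte budget rather
            // than an entry count.
            cache: Cache::builder()
                .max_capacity(max_bytes)
                .weigher(|_, posting: &Arc<NGramPostingList>| posting.deep_size_of() as u32)
                .build(),
        }
    }

    // weighted_size() sums the entry weights, i.e. the total bytes cached,
    // which is why it doubles as a fast deep-size estimate.
    fn approx_size_bytes(&self) -> usize {
        self.cache.weighted_size() as usize
    }
}
```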
Oh I missed the weigher. That is cool!
Yes, but I can't take credit for it 😆. I stole this pattern from @BubbleCal 😆
fn process_batch(&mut self, batch: &RecordBatch) {
    let text_col = batch.column(0).as_string::<i32>();
    let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
Should we assert that there are zero nulls in both columns?
What happens if a user trains this on a string field that has nulls?
Good question. Bad things. I will fix this and add tests.
Nulls should now be handled correctly (we build a nulls bitmap). The index doesn't yet handle IS NULL but that's a parser problem and shouldn't be an index problem.
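Roughly, the shape of the fix (a hypothetical sketch; the real trainer records null row ids in a bitmap alongside the token map):

```rust
use arrow_array::{cast::AsArray, types::UInt64Type, RecordBatch};
use roaring::RoaringTreemap;

// Sketch: nulls are skipped by the tokenizer and recorded in a separate
// bitmap, so training no longer hits text.unwrap() on a null, and an
// IS NULL query could later be served from the bitmap.
fn process_batch(batch: &RecordBatch, nulls: &mut RoaringTreemap) {
    let text_col = batch.column(0).as_string::<i32>();
    let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
    for (text, row_id) in text_col.iter().zip(row_id_col.values()) {
        match text {
            Some(text) => {
                // tokenize `text` and add `row_id` to each token's posting list
                let _ = text;
            }
            None => {
                nulls.insert(*row_id);
            }
        }
    }
}
```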
Self::validate_schema(schema.as_ref())?;

while let Some(batch) = data.try_next().await? {
    self.process_batch(&batch);
How easy would this be to parallelize? Would this be the place in the code to do that?
It would be good to parallelize. Perhaps not trivially easy because we have to build up a series of per-thread maps and then merge them in the end. We do this in the trainer for the inverted index. We also spill that index to disk but I don't think we will need to do that here since the # of trigrams is bounded.
Deferring to a follow-up to keep this PR constrained for now.
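For reference, the per-thread-maps shape described above might look roughly like this with rayon (a hedged sketch for the follow-up, not code from this PR):

```rust
use std::collections::HashMap;

use rayon::prelude::*;
use roaring::RoaringTreemap;

// Sketch: tokenize batches in parallel, building one token map per rayon
// worker, then merge the maps. Unlike the inverted index trainer there is
// no spill to disk because the number of distinct trigrams is bounded.
fn train_parallel(
    batches: Vec<(Vec<String>, Vec<u64>)>, // (texts, row ids) per batch
    tokenize: impl Fn(&str) -> Vec<String> + Sync,
) -> HashMap<String, RoaringTreemap> {
    batches
        .par_iter()
        .map(|(texts, row_ids)| {
            let mut map: HashMap<String, RoaringTreemap> = HashMap::new();
            for (text, row_id) in texts.iter().zip(row_ids) {
                for token in tokenize(text) {
                    map.entry(token).or_default().insert(*row_id);
                }
            }
            map
        })
        .reduce(HashMap::new, |mut merged, other| {
            for (token, rows) in other {
                *merged.entry(token).or_default() |= rows;
            }
            merged
        })
}
```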
rust/lance-index/src/scalar/ngram.rs
Outdated
/// An ngram index
///
/// This index stores a mapping from ngrams to the row ids that contain them.
I think it might be useful to mention that this is implemented as an inverted index, even though it's not the same as an index_type=INVERTED in our code base.
Revised the comment
StringContains(String),
// TODO: In the future we should be able to do case-insensitive contains
// as well as partial matches (e.g. LIKE 'foo%') and potentially even
// some regular expressions
We should also be able to accelerate regular equality queries out of this index.
Since AFAIK we only support one index type per column, will it be annoying if users need to pick between faster "contains" and faster equality, by choosing between ngram and btree/bitmap?
Yeah, my hope was that we could support multiple indices (at least multiple index types) on a single column, but it might be useful to do equality with an ngram index since it should be much smaller for big text.
I'm going to add support for equality and IS NULL in a future PR.
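Conceptually, equality falls out of the same machinery, since col = 'lit' implies contains(col, 'lit'). A hedged sketch of the idea (hypothetical names, not part of this PR):

```rust
use std::collections::HashMap;

use roaring::RoaringTreemap;

// Sketch: intersecting the posting lists of the literal's trigrams yields an
// over-approximation of the rows equal to the literal; the recheck step then
// compares values exactly.
fn equality_candidates(
    literal: &str,
    tokenize: impl Fn(&str) -> Vec<String>,
    postings: &HashMap<String, RoaringTreemap>,
) -> RoaringTreemap {
    let mut result: Option<RoaringTreemap> = None;
    for token in tokenize(literal) {
        // A trigram with no posting list means no row can match at all.
        let rows = postings.get(&token).cloned().unwrap_or_default();
        result = Some(match result {
            Some(acc) => acc & rows,
            None => rows,
        });
    }
    // The caller must still recheck candidates against `literal` exactly.
    result.unwrap_or_default()
}
```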
(IndexExprResult::Exact(lhs), IndexExprResult::AtLeast(_)) => {
    // We could do better here, elements in both lhs and rhs are known
    // to be true and don't require a recheck. We only need to recheck
    // elements in lhs that are not in rhs
It seems like we almost want these IndexExprResults to be sets of enums. That would also let us handle cases like negation in the new ngram index, where
select * from foo where not contains(t, "abcde")
currently would get a full recheck, but if we recognized the negation and served it from the index, we'd only need to recheck the rows that had overlapping trigrams and could answer exactly for the others.
Yes. Right now we return:

{
    block_list: RowIdTreeMap,
    allow_list: RowIdTreeMap,
}

I think we eventually want something like:

{
    definitely_false: RowIdTreeMap,
    likely_false: RowIdTreeMap,
    likely_true: RowIdTreeMap,
    definitely_true: RowIdTreeMap,
}

However, we don't really have the machinery to process that kind of result. I was thinking this could be tackled around the same time as the "index size too large for a take" scenario. Let's sync on this today.
///
/// If false is returned then the query still may be answered exactly but if true is returned
/// then the query must be answered exactly
fn can_answer_exact(&self, query: &dyn AnyQuery) -> bool;
Do we actually need this with the changes to ScalarIndexExpr?
No. Right now we rely on AnyQuery::needs_recheck. I don't love that pattern though. AnyQuery isn't really the right spot to make that check. It works today because there is only one index that can answer TextQuery, but there could (in theory) be indices in the future that can answer TextQuery exactly. Or, maybe a more likely scenario, an index that can answer SargableQuery inexactly. I'd like to eventually get rid of AnyQuery::needs_recheck but I couldn't quite work it out to use ScalarIndex::can_answer_exact, because today we need to make the decision early in the planning process, before we have access to the scalar index. So I'm leaving this in as the long-term goal.
rust/lance-index/src/scalar/ngram.rs
Outdated
let row_id_col = batch.column(1).as_primitive::<UInt64Type>();
for (text, row_id) in text_col.iter().zip(row_id_col.values()) {
    let text = text.unwrap();
    let tokens = self.tokenizer.clone().tokenize(text);
Should this list be deduplicated? I don't see any indication that tantivy handles that for us.
I also wonder if, during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g. for words).
One more thing to pile on: I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?
> I also wonder if, during the build step, there would be any value in caching computed trigrams and potentially splitting the input to try and get better utility out of the cache (e.g. for words).

I'm not sure I understand this point
> I was curious how expensive the per-record allocations for token vectors are in the build phase. Does that consume meaningful time vs reusing a vector?

It could probably be substantial. I've removed the excess allocations.

> Should this list be deduplicated? I don't see any indication that tantivy handles that for us.

Tantivy does not deduplicate. I'm not sure how much savings we would get in training. We'd have to go back to allocating the vector so that we could sort the tokens. We can test it later.

It might be better to deduplicate on search, as I think right now we end up doing a bitmap AND of a treemap with itself, which isn't trivial.
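A query-time dedup could be as small as this (hypothetical sketch):

```rust
use std::collections::HashSet;

// Sketch: deduplicate query trigrams before intersecting posting lists, so a
// query like contains(col, 'aaaa') (trigrams "aaa", "aaa") doesn't AND a
// treemap with itself.
fn dedup_tokens(tokens: Vec<String>) -> Vec<String> {
    let mut seen = HashSet::new();
    tokens
        .into_iter()
        .filter(|token| seen.insert(token.clone()))
        .collect()
}
```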
@westonpace I did a bit of experimentation with the patch on a dataset of reddit comments. My findings were:
- No savings from sort + dedup - works out slower. I think probably inserting "duplicates" into the bitmap is very quick.
- ~25% speedup in the build from a caching strategy, where I split the input strings on spaces, tokenize each "word" individually, and maintain a cache of word tokenizations (hash map). I didn't verify that this results in the same tokenizations as tokenizing the full input in one shot but I think it probably would? Might be something to explore.
My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.
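For reference, the word-level caching from the second bullet looks roughly like this (a sketch of the experiment; equivalence with whole-string tokenization is unverified, though trigrams spanning a space would presumably be dropped by AlphaNumOnlyFilter anyway):

```rust
use std::collections::HashMap;

// Sketch: split on spaces, tokenize each distinct word once, and memoize the
// per-word tokenizations so repeated words skip the tokenizer entirely.
fn tokenize_with_word_cache(
    text: &str,
    tokenize: impl Fn(&str) -> Vec<String>,
    cache: &mut HashMap<String, Vec<String>>,
) -> Vec<String> {
    let mut tokens = Vec::new();
    for word in text.split(' ') {
        let word_tokens = cache
            .entry(word.to_string())
            .or_insert_with(|| tokenize(word));
        tokens.extend_from_slice(word_tokens);
    }
    tokens
}
```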
> My biggest concern with the patch is super long build times but I think the biggest win on that will probably be parallelization at this point.

Yep. Just testing now on 200M large docs and it's pretty brutal. I'll try and get a follow-up with parallelization quickly.
rust/lance-index/src/scalar/ngram.rs
Outdated
Self {
    tokenizer,
    // Default capacity loosely based on case insensitive ascii ngrams with punctuation stripped
    tokens_map: HashMap::with_capacity(4 * 26 * 26 * 26),
Where does the 4 come from? Also this assumes trigrams, not ngrams, I think.
I noticed some indices with more than 26 * 26 * 26 items but didn't chase it down. Now I've realized the problem was that I wasn't considering the digits (0-9). I've updated to 36 * 36 * 36 (26 letters + 10 digits = 36 characters, so 36³ = 46,656 possible trigrams).
@@ -1876,6 +1865,12 @@ impl Scanner {
        (**self.dataset.fragments()).clone()
    };

    // If this unwrap fails we have a bug because we shouldn't be using this function unless we've already
    // checked that there is an index query
    let index_expr = filter_plan.index_query.as_ref().unwrap();
Might be more of a philosophical question, but why should that crash the server? In a multitenant environment that would bring down all users, not just the one with the messed-up install/making the request.
Discussed offline. Will summarize here:

I used unwrap because I'd prefer to get a stack trace (not just a line number) if this error occurred. Why was index_query unset? Was the user maybe using a table provider and we hit some weird corner case? Was this just some branch of create_plan I hadn't considered? Hopefully the user provides enough context, but I'd rather not hope.

I also use unwrap because if panics are bringing down a multi-tenant server then that isn't good (catch_unwind can be used to deal with that).

Also, translating every panic into an error makes the error return pretty much meaningless. Instead of "this will return an error if X, Y, or Z happens" I have to say, at least, "this will return an error if X, Y, Z, or something I don't know about happened". However, I will admit this ship has sailed. Users probably assume just about any Lance call can fail for just about any reason. There are just too many things that can go wrong in a database.

I will admit that panic is not great. Panics do still bring down applications. An internal error would at least record the line number, which is usually good enough.

Maybe we can make a simple macro / extension function like safe_expect which returns an internal error and also logs the current stack trace?
FWIW I just ran into this 😆

E   ValueError: LanceError(IO): Internal error: compute_projection: expected struct.
E   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker, /home/pace/dev/lance/rust/lance/src/dataset/scanner.rs:1497:29

Fortunately, the string compute_projection: expected struct was unique enough to make up for the missing stack trace.
pub static ref TOKENS_FIELD: Field = Field::new(TOKENS_COL, DataType::Utf8, false);
pub static ref POSTINGS_FIELD: Field = Field::new(POSTING_LIST_COL, DataType::Binary, false);
pub static ref POSTINGS_SCHEMA: SchemaRef = Arc::new(Schema::new(vec![TOKENS_FIELD.clone(), POSTINGS_FIELD.clone()]));
/// Currently we ALWAYS use trigrams with ascii folding and lower casing. We may want to make this configurable in the future.
What's the benefit of ascii folding? Won't that cause a lower rate of success on recheck?
I suppose I was thinking we might someday want an ascii_folding_contains(prompt, cafe), but I suppose that would mean we'd need to write the equivalent UDF for the in-memory check.
I was wondering if it might be costing us time in the build as well
add support for ngram indices: force-pushed from 511edc5 to 892c0e6
Co-authored-by: Will Jones <willjones127@gmail.com>
This looks good to me. I have done some local testing with a 10M row dataset and the impact is significant, at least for selective searches. The index is slow to build and could benefit from parallelization, but that doesn't need to hold things up for now.
Ngram indices are indices that can speed up various string filters. To start with, they will be able to speed up contains(col, 'substr') filters. They work by creating a bitmap for each ngram (short sequence of characters) in a value. For example, consider an index of 1-grams. This would create a bitmap for each letter of the alphabet. Then, at query time, we can use this to narrow down which strings could potentially satisfy the query.

This is the first scalar index that requires a "recheck" step. It doesn't tell us exactly which rows satisfy the query; it only narrows down the list. Other indices that might behave like this are bloom filters and zone maps. This means that we still need to apply the filter to the results of the index search. A good portion of this PR is adding support for this concept into the scanner.