feat: support multivector type #2005
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
…ch 'main' of https://github.com/lancedb/lancedb into multivec Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
})?;

let mut is_binary = false;
if let arrow_schema::DataType::FixedSizeList(element, dim) = field.data_type() {
Lance already performs this check, so remove it.
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
python/python/tests/test_query.py
Outdated
def multivec_table(tmp_path) -> lancedb.table.Table:
    db = lancedb.connect(tmp_path)
Let's not create this on disk if we don't need it to be there
- def multivec_table(tmp_path) -> lancedb.table.Table:
-     db = lancedb.connect(tmp_path)
+ def multivec_table() -> lancedb.table.Table:
+     db = lancedb.connect("memory://")
python/python/tests/test_query.py
Outdated
async def multivec_table_async(tmp_path) -> AsyncTable:
    conn = await lancedb.connect_async(
        tmp_path, read_consistency_interval=timedelta(seconds=0)
    )
- async def multivec_table_async(tmp_path) -> AsyncTable:
-     conn = await lancedb.connect_async(
-         tmp_path, read_consistency_interval=timedelta(seconds=0)
-     )
+ async def multivec_table_async() -> AsyncTable:
+     conn = await lancedb.connect_async(
+         "memory://", read_consistency_interval=timedelta(seconds=0)
+     )
fixed
docs/src/search.md
Outdated
You can index on a column with multivector type and search on it, the query can be single vector or multiple vectors. If the query is multiple vectors `mq`, the similarity (distance) from it to any multivector `mv` in the dataset, is defined as:

**maxsim(mq, mv) = sum(max(sim(mq[i], mv[j])))**
It's unclear what the max and sum are over. I think you are missing that part of the formula, right?
It seems math formulas are not supported, so I will post an image here.
You could add the math plugin if you want: https://squidfunk.github.io/mkdocs-material/reference/math/#katex (I'd prefer KaTeX over MathJax, I think)
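For reference, a minimal sketch of how the formula quoted above is commonly read in late-interaction models such as ColBERT: the sum runs over the query vectors `mq[i]`, and the max runs over the vectors `mv[j]` of one stored multivector. That reading, the dot-product `sim`, and the helper names `dot` and `maxsim` are assumptions for illustration, not LanceDB's implementation.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length vectors (assumed sim)."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(mq, mv, sim=dot):
    """Sum over query vectors of the best similarity against any vector in mv."""
    return sum(max(sim(q, v) for v in mv) for q in mq)


# One stored item represented by two vectors.
mv = [[0.6, 0.8], [1.0, 0.0]]

# A query made of two vectors: each one picks its best match in mv.
multi_score = maxsim([[1.0, 0.0], [0.0, 1.0]], mv)  # 1.0 + 0.8

# A single-vector query is the special case with one term in the sum.
single_score = maxsim([[1.0, 0.0]], mv)  # best match only

print(multi_score, single_score)
```

With a single query vector the sum has one term, so the score reduces to the best match among the item's vectors.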
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
I think there are some improvements to make to the documentation so that it is clearer, but that can be done in a follow-up.
LanceDB supports multivector type, this is useful when you have multiple vectors for a single item (e.g. with ColBert and ColPali).

You can index on a column with multivector type and search on it, the query can be single vector or multiple vectors. If the query is multiple vectors `mq`, the similarity (distance) from it to any multivector `mv` in the dataset, is defined as:
Should we say similarity is 1 - distance?
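On the similarity versus distance question, a small sketch assuming the cosine metric, where the usual convention is distance = 1 - similarity. Whether the multivector docs should state this is for the maintainers to decide; the function names below are hypothetical and only show the conversion.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity, so smaller means closer."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_distance([1.0, 0.0], [2.0, 0.0])  # similarity 1
orthogonal = cosine_distance([1.0, 0.0], [0.0, 3.0])      # similarity 0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # similarity -1

print(same_direction, orthogonal, opposite)
```

Under this convention the distance ranges from 0 (same direction) to 2 (opposite direction).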
@@ -138,6 +138,36 @@ LanceDB supports binary vectors as a data type, and has the ability to search bi
    --8<-- "python/python/tests/docs/test_binary_vector.py:async_binary_vector"
    ```

## Multivector type

LanceDB supports multivector type, this is useful when you have multiple vectors for a single item (e.g. with ColBert and ColPali).
This is still missing an intuitive explanation of what this query type means, and how it's different from the single vector situation. Could you perhaps root this in an example? Is the main example chunked documents, where each row is a full document, and each vector made for a chunk?
=== "sync API"

    ```python
    --8<-- "python/python/tests/docs/test_multivector.py:imports"

    --8<-- "python/python/tests/docs/test_multivector.py:sync_multivector"
    ```

=== "async API"

    ```python
    --8<-- "python/python/tests/docs/test_multivector.py:imports"

    --8<-- "python/python/tests/docs/test_multivector.py:async_multivector"
    ```
I just realized these files are missing. Do you have a copy of them somewhere, @BubbleCal?
No description provided.