feat: support multivector type #2005

Merged
merged 10 commits into from
Jan 13, 2025

Conversation

BubbleCal
Contributor

No description provided.

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@github-actions github-actions bot added the enhancement, Python, and Rust labels Jan 9, 2025
…ch 'main' of https://github.com/lancedb/lancedb into multivec

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
})?;

let mut is_binary = false;
if let arrow_schema::DataType::FixedSizeList(element, dim) = field.data_type() {
Contributor Author

Lance already checks this, so it is removed here.

@BubbleCal BubbleCal marked this pull request as ready for review January 9, 2025 09:56
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Comment on lines 72 to 73
def multivec_table(tmp_path) -> lancedb.table.Table:
    db = lancedb.connect(tmp_path)
Contributor

@wjones127 wjones127 Jan 10, 2025

Let's not create this on disk if we don't need it to be there

Suggested change
-def multivec_table(tmp_path) -> lancedb.table.Table:
-    db = lancedb.connect(tmp_path)
+def multivec_table() -> lancedb.table.Table:
+    db = lancedb.connect("memory://")

Comment on lines 98 to 101
async def multivec_table_async(tmp_path) -> AsyncTable:
    conn = await lancedb.connect_async(
        tmp_path, read_consistency_interval=timedelta(seconds=0)
    )
Contributor

Suggested change
-async def multivec_table_async(tmp_path) -> AsyncTable:
-    conn = await lancedb.connect_async(
-        tmp_path, read_consistency_interval=timedelta(seconds=0)
-    )
+async def multivec_table_async() -> AsyncTable:
+    conn = await lancedb.connect_async(
+        "memory://", read_consistency_interval=timedelta(seconds=0)
+    )

Contributor Author

fixed

python/python/tests/test_query.py (resolved)

You can index on a column with multivector type and search on it, the query can be single vector or multiple vectors. If the query is multiple vectors `mq`, the similarity (distance) from it to any multivector `mv` in the dataset, is defined as:

**maxsim(mq, mv) = sum(max(sim(mq[i], mv[j])))**
Contributor

It's unclear what the max and sum are over. I think you are missing that part of the formula, right?

Contributor Author

@BubbleCal BubbleCal Jan 13, 2025

It seems math formulas are not supported here, so I will post an image instead.

Contributor

You could add the math plugin if you want: https://squidfunk.github.io/mkdocs-material/reference/math/#katex (I'd prefer the katex over mathjax, I think)
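
For readers following this thread, the indices the formula leaves implicit can be spelled out in code. The sketch below assumes the usual late-interaction reading (as in ColBERT): the max runs over the vectors `mv[j]` of one stored multivector, and the sum runs over the query vectors `mq[i]`. The helper names `dot` and `maxsim` are hypothetical, and using the inner product as `sim` is an assumption made for illustration; this is not LanceDB's implementation.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, standing in for sim(a, b) (an assumption of this sketch)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def maxsim(mq: Sequence[Vector], mv: Sequence[Vector]) -> float:
    """Sum over query vectors i of the max over stored vectors j of sim(mq[i], mv[j])."""
    return sum(max(dot(q, d) for d in mv) for q in mq)


# Toy inputs: two query vectors against a stored multivector with two vectors.
mq = [[1.0, 0.0], [0.0, 1.0]]
mv = [[1.0, 0.0], [0.5, 0.5]]
total = maxsim(mq, mv)          # best match per query vector: 1.0 and 0.5
single = maxsim([[1.0, 0.0]], mv)  # a single-vector query is the one-element case
```

Under this reading a single-vector query needs no special handling: the sum has one term, so the score reduces to the best match among the stored vectors.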

Signed-off-by: BubbleCal <bubble-cal@outlook.com>
Signed-off-by: BubbleCal <bubble-cal@outlook.com>
@BubbleCal BubbleCal requested a review from wjones127 January 13, 2025 13:17
@BubbleCal BubbleCal enabled auto-merge (squash) January 13, 2025 14:08
Contributor

@wjones127 wjones127 left a comment

I think there are some improvements to make to the documentation so that it is clearer, but that can be done in a follow-up.


LanceDB supports multivector type, this is useful when you have multiple vectors for a single item (e.g. with ColBert and ColPali).

You can index on a column with multivector type and search on it, the query can be single vector or multiple vectors. If the query is multiple vectors `mq`, the similarity (distance) from it to any multivector `mv` in the dataset, is defined as:
Contributor

Should we say similarity is 1 - distance?

@@ -138,6 +138,36 @@ LanceDB supports binary vectors as a data type, and has the ability to search bi
--8<-- "python/python/tests/docs/test_binary_vector.py:async_binary_vector"
```

## Multivector type

LanceDB supports multivector type, this is useful when you have multiple vectors for a single item (e.g. with ColBert and ColPali).
Contributor

This is still missing an intuitive explanation of what this query type means, and how it's different from the single vector situation. Could you perhaps root this in an example? Is the main example chunked documents, where each row is a full document, and each vector made for a chunk?
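
One way to make the difference from the single-vector case concrete is a toy ranking. The sketch below is illustrative only: the row layout, the names `rows`, `sim`, `score`, and `rank`, and the inner-product similarity are all assumptions, and it does not settle whether chunked documents are the intended use case. Each row holds several vectors, each query vector picks its best match within a row, and the picks are summed.

```python
# Hypothetical toy rows: each row stores a list of vectors rather than one vector.
rows = [
    {"id": "doc_a", "vector": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "doc_b", "vector": [[1.0, 0.0], [1.0, 0.0]]},
]


def sim(a, b):
    """Inner product as a stand-in similarity (an assumption for this sketch)."""
    return sum(x * y for x, y in zip(a, b))


def score(query_vectors, row_vectors):
    """Each query vector picks its best match in the row; the picks are summed."""
    return sum(max(sim(q, v) for v in row_vectors) for q in query_vectors)


def rank(query_vectors, table_rows):
    """Return row ids ordered from highest to lowest score."""
    ordered = sorted(
        table_rows,
        key=lambda r: score(query_vectors, r["vector"]),
        reverse=True,
    )
    return [r["id"] for r in ordered]


multi_query = [[1.0, 0.0], [0.0, 1.0]]  # two query vectors
single_query = [[1.0, 0.0]]             # a single vector is the one-element case

multi_scores = [score(multi_query, r["vector"]) for r in rows]
single_scores = [score(single_query, r["vector"]) for r in rows]
```

In this toy data the single-vector query scores both rows equally, while the two-vector query prefers `doc_a`, because only `doc_a` has a good match for the second query vector.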

@BubbleCal BubbleCal merged commit 66cbf6b into lancedb:main Jan 13, 2025
21 checks passed
Comment on lines +155 to +169
=== "sync API"

    ```python
    --8<-- "python/python/tests/docs/test_multivector.py:imports"

    --8<-- "python/python/tests/docs/test_multivector.py:sync_multivector"
    ```

=== "async API"

    ```python
    --8<-- "python/python/tests/docs/test_multivector.py:imports"

    --8<-- "python/python/tests/docs/test_multivector.py:async_multivector"
    ```
Contributor

I just realized these files are missing. Do you have a copy of them somewhere, @BubbleCal?
