core[minor]: Add BaseIndex abstraction #24206
Conversation
```python
@@ -421,29 +827,3 @@ async def adelete_keys(self, keys: Sequence[str]) -> None:
        keys: A list of keys to delete.
    """
    self.delete_keys(keys)


class UpsertResponse(TypedDict):
```
moved to top of file
```python
    f"{self.__class__.__name__} does not yet support delete_by_filter."
)

def get_by_filter(
```
Is there a way to filter and order by similarity? Would that be a `get_by_filter` with a `sort` based on the similarity? Or is that a similarity search with a filter? Something else?

The above makes me wonder if similarity search could be expressed as a sort, in which case basic similarity search is `filter = None, sort = Similarity`?
I was thinking about introducing another method to capture search semantics.

```python
def get_by_query(
    self,
    *,
    query: Q,  # e.g., text query or vector or BaseMedia (up to the implementation)
    filter: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    limit: Optional[int] = None,
    page: int = 0,  # pagination added
    sort: Optional[Sort] = None,
    **kwargs,
) -> List[Hit]:
    ...
```

Note that it returns a `List` rather than an `Iterable`, and a `Hit` rather than a `Document`, which could accommodate extra metadata like score, optional content, etc.

I agree that sort and similarity can be viewed as the same thing. I need to think a bit more to see if we can merge them.
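To make the "similarity is just a sort" idea concrete, here is a minimal sketch of what a `Hit` record and score-based ordering could look like. All names here (`Hit`, `rank_hits`, the `score` field) are hypothetical illustrations, not part of this PR:

```python
from typing import Any, Dict, List, Optional, TypedDict


class Hit(TypedDict, total=False):
    """Hypothetical hit record: an ID plus optional score/content/metadata."""
    id: str
    score: float
    content: Optional[str]
    metadata: Dict[str, Any]


def rank_hits(hits: List[Hit], limit: Optional[int] = None) -> List[Hit]:
    """Order hits by descending similarity score, i.e. sort = Similarity."""
    ranked = sorted(hits, key=lambda h: h.get("score", 0.0), reverse=True)
    return ranked if limit is None else ranked[:limit]
```

Under this framing, a plain similarity search is `filter=None` plus a score-descending sort, while `get_by_filter` is the same pipeline with the sort omitted.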
@bjchambers Would standardization of filters, the ability to query by vector, and the ability to retrieve vectors be sufficient to implement the search traversal patterns needed for GraphVectorStore?

At indexing time, the implementation can standardize fields into metadata; e.g., `namespace` (e.g., edge, node), `edge_type` (just for edges), `edge_data` (e.g., the value of an href).

At query time, the implementation can rely on the ability to query by vector and filter on the standardized fields. Is that sufficient to implement GraphVectorStore? Is there a way to limit traversal by depth correctly (to avoid re-exploring nodes)?
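As a rough illustration of the standardized-metadata idea, the records and an equality-only filter check could look like the sketch below. The field names (`namespace`, `edge_type`, `edge_data`) follow the comment above; the `matches` helper is invented for illustration and is not a langchain API:

```python
from typing import Any, Dict

# Hypothetical records with standardized metadata fields
node = {"id": "n1", "metadata": {"namespace": "node"}}
edge = {
    "id": "e1",
    "metadata": {
        "namespace": "edge",
        "edge_type": "link",
        "edge_data": "https://example.com",
    },
}


def matches(doc: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Tiny $eq-only filter evaluator over a document's metadata."""
    return all(doc["metadata"].get(k) == v for k, v in flt.items())
```

A graph store could then find all outgoing edges with something like `matches(doc, {"namespace": "edge", "edge_type": "link"})`, relying only on standardized filtering rather than a store-specific API.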
I'm wondering what interface would be needed so that, instead of inheriting from VectorStore, we can use a compositional pattern with a vectorstore and code against the vectorstore's interface.

```python
class VectorStoreGraphRag:
    vectorstore: VectorStore

    def traverse(...):
        # implementation relying on an instance of vectorstore
```
Good question. Some of it depends on the traversal. Currently, it does the following things:

- Retrieve the top-k nodes given a vector (we're working on also doing a version that supports filters). This would fit query-by-vector and query-by-vector-with-filter.
- For MMR traversals: retrieve the top-k nodes given a vector with a given incoming "tag". I believe we could support this if we put the incoming tags into the metadata and supported either `{"incoming_tags": {"$contains": tag}}` (for a single "contains") or `{"incoming_tags": {"$intersects": tags}}` (for multiple tags). This would be a query-by-vector-with-filter.
- For non-MMR traversals: retrieve all nodes with given incoming tags. This would be just query-by-filter.

Depth limits are handled at query time by tracking the IDs / depths at which they are discovered (e.g., the traversal process issues multiple underlying queries for adjacencies).
So, I think this would require:
- Query-by-vector
- Query-by-filter
- Query-by-vector-with-filter
- Filter for at least "list/set metadata field contains a given value", and maybe intersection (contains any of the given values).
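The "contains" and "intersects" filter semantics described above reduce to simple predicates over a list-valued metadata field. A sketch (operator names follow the comment; these helpers are illustrative, not an existing langchain API):

```python
from typing import Any, Dict, List


def matches_contains(metadata: Dict[str, Any], field: str, value: Any) -> bool:
    """"$contains": the list/set metadata field contains the given value."""
    return value in metadata.get(field, [])


def matches_intersects(metadata: Dict[str, Any], field: str, values: List[Any]) -> bool:
    """"$intersects": the field contains any of the given values."""
    return bool(set(metadata.get(field, [])) & set(values))
```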
Also "query-by-ID" to get all the nodes with the selected IDs, but that could be part of the above. Currently, it does the traversal with just the IDs and then fetches those IDs. So maybe also:

- query-ids-by-vector
- query-ids-by-filter
- query-ids-by-vector-and-filter
- query-by-ids
The logic being that each node may be reached many times during the traversal -- rather than pulling back all the data each time, it is likely better to just pull back the data necessary for traversing. That is currently (i) the ID used for identifying the reachable nodes, (ii) the embedding (used for the MMR score calculation), and (iii) the outgoing tags (used for finding the next set of edges). So maybe a version that accepts a list of which fields to actually fetch (e.g., we would only need the ID, the `metadata["outgoing"]`, and the embedding). But that could perhaps be a future optimization...
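The depth-limit bookkeeping described earlier (tracking visited IDs so nodes are not re-explored, issuing one adjacency query per frontier node) can be sketched as a breadth-first loop. Here `adjacent` stands in for a query-ids-by-filter call; the function and its signature are invented for illustration:

```python
from typing import Callable, List, Set


def traverse(
    seed_ids: List[str],
    adjacent: Callable[[str], List[str]],  # stands in for query-ids-by-filter
    max_depth: int,
) -> Set[str]:
    """Breadth-first traversal that tracks visited IDs to avoid re-exploring nodes."""
    visited: Set[str] = set(seed_ids)
    frontier: List[str] = list(seed_ids)
    for _ in range(max_depth):
        next_frontier: List[str] = []
        for node_id in frontier:
            for neighbor in adjacent(node_id):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited
```

After the ID-only traversal completes, a final query-by-ids would fetch the full documents for `visited`.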
I could see a case for filtering on mime type too. Maybe possible via the `filter` here, but thought I'd toss that out.
```python
    InMemoryRecordManager,
    RecordManager,
    UpsertResponse,
)

__all__ = [
    "aindex",
    "BaseIndex",
```
note: I find the "Index" part of this name confusing -- this might be because indexes in databases/programming are a different thing. Something like "Store" or "Collection" would be more intuitive to me.
```python
# Local scope to avoid circular imports
from langchain_core.vectorstores import VectorStore
```
As a rule of thumb, circular imports tend to cause subtle problems and are usually best avoided altogether.
```python
FilterOp = Literal[
    "$eq", "$ne", "$lt", "$lte", "$gt", "$gte", "$in", "$nin", "$and", "$not", "$or"
]
```
Any reason not to use an Enum here?
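For comparison, an `Enum` version of the same operators might look like this sketch (not the PR's code). Subclassing `str` keeps comparisons against the raw `"$eq"`-style strings working, so existing filter dictionaries would not need to change:

```python
from enum import Enum


class FilterOp(str, Enum):
    """Enum alternative to the Literal-based FilterOp (a sketch, not the PR's code)."""
    EQ = "$eq"
    NE = "$ne"
    LT = "$lt"
    LTE = "$lte"
    GT = "$gt"
    GTE = "$gte"
    IN = "$in"
    NIN = "$nin"
    AND = "$and"
    NOT = "$not"
    OR = "$or"
```

A trade-off: the `Literal` type checks string literals statically at zero runtime cost, while the `Enum` gives runtime validation (`FilterOp("$bad")` raises `ValueError`) and a single place to iterate over the supported operators.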
```python
from langchain_core.utils import abatch_iterate, batch_iterate


class Sort(TypedDict):
```
nit: I think in general, dataclasses are recommended over TypedDicts. Any reason to prefer a TypedDict here?
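A dataclass version of `Sort` might look like the sketch below. The field names are guesses, since the PR's `Sort` fields are not shown in this excerpt:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sort:
    """Dataclass variant of the proposed Sort TypedDict (field names are guesses)."""
    field: str
    ascending: bool = True
```

One reason to prefer a TypedDict in a public API is that callers can pass plain dictionaries (e.g. `{"field": "score", "ascending": False}`), which serialize to JSON directly; the dataclass gives attribute access, defaults, and immutability via `frozen=True`.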
```
if it is provided. If the ID is not provided, the upsert method is free
to generate an ID for the content.
```
Can you provide more strict guidance here? It's common to stick to RFC 2119 terms. Specifically -- is this "SHOULD" (recommended) or "MAY" (optional) behavior?
```
to generate an ID for the content.

When an ID is specified and the content already exists in the vectorstore,
the upsert method should update the content with the new data. If the content
```
Is there a way to insert without upsert?
Do you think it's necessary to be able to add without specifically having upsert semantics? At the moment, docs have an optional ID associated with them, so the upsert works like an add in that case.
From a relational database perspective, the conflict depends on what the primary key is. Is the assumption always that the ID is the primary key, or can you have primary keys on different fields? I might want to add a new entry, but not overwrite if there is an existing one.
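The add-without-overwrite semantics being asked about, reduced to dictionary form (purely illustrative; the helper name is invented):

```python
from typing import Any, Dict


def insert_if_absent(store: Dict[str, Any], key: str, value: Any) -> bool:
    """Add-only semantics: never overwrite an existing entry (contrast with upsert).

    Returns True if the value was inserted, False if the key already existed.
    """
    if key in store:
        return False
    store[key] = value
    return True
```

This is the dictionary analogue of SQL's `INSERT ... ON CONFLICT DO NOTHING`, whereas upsert corresponds to `ON CONFLICT DO UPDATE`.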
```python
"""

@beta(message="Added in 0.2.11. The API is subject to change.")
async def astreaming_upsert(
```
Should this have the same warning as `streaming_upsert`?
```
OR

bool: A boolean indicating whether the delete operation was successful.
```
What about partial successes? Should that be "true" or "false" if some items were removed?
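A response object sidesteps the partial-success ambiguity by reporting both outcomes explicitly. A sketch, with field names chosen to match the docstring quoted earlier ("IDs that were successfully deleted" / "IDs that failed"), though the PR's actual field names are not shown in this excerpt:

```python
from typing import Any, Dict, List, TypedDict


class DeleteResponse(TypedDict):
    """Sketch of a delete result that reports partial successes explicitly."""
    succeeded: List[str]
    failed: List[str]


def delete_many(store: Dict[str, Any], ids: List[str]) -> DeleteResponse:
    """Delete each ID independently instead of collapsing the result to one bool."""
    succeeded: List[str] = []
    failed: List[str] = []
    for id_ in ids:
        if id_ in store:
            del store[id_]
            succeeded.append(id_)
        else:
            failed.append(id_)
    return {"succeeded": succeeded, "failed": failed}
```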
```python
def delete(
    self, ids: Optional[List[str]] = None, **kwargs: Any
) -> Union[DeleteResponse, None, bool]:
```
In general, I would prefer introducing a new method with a different return type as opposed to trying to overload this one. Otherwise I don't get what I want from type checking here -- I have to do extra work to suss out the type. This also makes it easier to deprecate the function in the future.
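A sketch of this suggestion: keep the loosely-typed `delete` for compatibility and add a new method with a single concrete return type. The method name `delete_with_response` and the class body are invented for illustration:

```python
from typing import Any, List, Optional, TypedDict


class DeleteResponse(TypedDict):
    succeeded: List[str]
    failed: List[str]


class Index:
    """Sketch: legacy delete delegates to a new, precisely-typed method."""

    def delete(self, ids: Optional[List[str]] = None, **kwargs: Any):
        # Legacy signature returning Union[DeleteResponse, None, bool];
        # a future release could deprecate it in favor of the typed method.
        return self.delete_with_response(ids or [])

    def delete_with_response(self, ids: List[str]) -> DeleteResponse:
        # Hypothetical replacement with one well-defined return type.
        return {"succeeded": list(ids), "failed": []}
```

With this split, callers of `delete_with_response` get a `DeleteResponse` unconditionally, so no `isinstance` narrowing is needed at call sites.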
We're going to close this PR for now. The main reason is that it picks up too much complexity for functionality that, while it may be useful, isn't super critical. We do have the much simpler DocumentIndex abstraction right now (https://python.langchain.com/api_reference/core/indexing/langchain_core.indexing.base.DocumentIndex.html#langchain_core.indexing.base.DocumentIndex).
This PR introduces the `BaseIndex` abstraction.

Context
We're creating a base index abstraction that defines a standard interface w/ respect to BaseMedia, which is content that has an ID and metadata.
The interface is designed to support the following operations:
The interface does NOT support queries against the actual content of the BaseMedia -- this is left for more specialized interfaces like a vectorstore.
Goals:
- `describe` classmethod

We're creating a separate interface rather than updating VectorStore to allow other LangChain classes to also implement the same indexing API (e.g., ParentDocumentRetriever).

Existing vectorstores are largely unaffected: they will inherit from BaseIndex without any modifications, raising NotImplementedError for the new methods.

A test suite will be provided to test index functionality based on information in the `describe()` classmethod (which will cover vectorstores as well).

TODOs
- `BaseIndex` vs. `Index` (using `BaseIndex` to avoid collision with the `index` function)
- `delete` API