core[minor]: Add BaseIndex abstraction #24206
Conversation
```python
@@ -421,29 +827,3 @@ async def adelete_keys(self, keys: Sequence[str]) -> None:
        keys: A list of keys to delete.
    """
    self.delete_keys(keys)


class UpsertResponse(TypedDict):
```
moved to top of file
```python
    f"{self.__class__.__name__} does not yet support delete_by_filter."
)

def get_by_filter(
```
Is there a way to filter and order by similarity? Would that be a `get_by_filter` with a `sort` based on the similarity? Or is that a similarity search with a filter? Something else?

The above makes me wonder if similarity search could be expressed as a sort, in which case basic similarity search is `filter = None, sort = Similarity`?
I was thinking about introducing another method to capture search semantics.

```python
def get_by_query(
    self,
    *,
    query: Q,  # e.g., text query or vector or BaseMedia (up to the implementation)
    filter: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    limit: Optional[int] = None,
    page: int = 0,  # pagination added
    sort: Optional[Sort] = None,
    **kwargs,
) -> List[Hit]:
    ...
```

Note that it returns a `List` rather than an `Iterable`, and a `Hit` rather than a `Document`, which could accommodate extra metadata like score, optional content, etc.

I agree that sort and similarity can be viewed as the same thing. I need to think a bit more to see if we can merge them.
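To make the "similarity is just a sort" idea concrete, here is a minimal sketch of what a `Hit` record and score-based ordering could look like. All names here (`Hit`, `rank_hits`, the `score` field) are hypothetical illustrations, not part of this PR:

```python
from typing import Any, Dict, List, Optional, TypedDict


class Hit(TypedDict, total=False):
    """Hypothetical hit record: an ID plus optional score/content/metadata."""
    id: str
    score: float
    content: Optional[str]
    metadata: Dict[str, Any]


def rank_hits(hits: List[Hit], limit: Optional[int] = None) -> List[Hit]:
    """Order hits by descending similarity score, i.e. sort = Similarity."""
    ranked = sorted(hits, key=lambda h: h.get("score", 0.0), reverse=True)
    return ranked if limit is None else ranked[:limit]
```

Under this framing, a plain similarity search is `filter=None` plus a score-descending sort, while `get_by_filter` is the same pipeline with the sort omitted.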
@bjchambers Would standardization of filters, the ability to query by vector, and the ability to retrieve vectors be sufficient to implement the search traversal patterns needed for GraphVectorStore?

At indexing time, the implementation can standardize fields into metadata; e.g., `namespace` (e.g., edge, node), `edge_type` (just for edges), `edge_data` (e.g., the value of an href).

At query time, the implementation can rely on the ability to query by vector and filter on the standardized fields. Is that sufficient to implement GraphVectorStore? Is there a way to limit traversal by depth correctly (to avoid re-exploring nodes)?
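As a rough illustration of the standardized-metadata idea, the records and an equality-only filter check could look like the sketch below. The field names (`namespace`, `edge_type`, `edge_data`) follow the comment above; the `matches` helper is invented for illustration and is not a langchain API:

```python
from typing import Any, Dict

# Hypothetical records with standardized metadata fields
node = {"id": "n1", "metadata": {"namespace": "node"}}
edge = {
    "id": "e1",
    "metadata": {
        "namespace": "edge",
        "edge_type": "link",
        "edge_data": "https://example.com",
    },
}


def matches(doc: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Tiny $eq-only filter evaluator over a document's metadata."""
    return all(doc["metadata"].get(k) == v for k, v in flt.items())
```

A graph store could then find all outgoing edges with something like `matches(doc, {"namespace": "edge", "edge_type": "link"})`, relying only on standardized filtering rather than a store-specific API.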
I'm wondering what interface would be needed so that, instead of inheriting from VectorStore, we can use a compositional pattern with a vectorstore and code against the vectorstore's interface.

```python
class VectorStoreGraphRag:
    vectorstore: VectorStore

    def traverse(...):
        # implementation relying on an instance of vectorstore
```
Good question. Some of it depends on the traversal. Currently, it does the following things:

- Retrieve the top-k nodes given a vector (we're working on also doing a version that supports filters). This would fit query-by-vector and query-by-vector-with-filter.
- For MMR traversals: retrieve the top-k nodes given a vector with a given incoming "tag". I believe we could support this if we put the incoming tags into the metadata and supported either `{"incoming_tags": {"$contains": tag}}` (for a single "contains") or `{"incoming_tags": {"$intersects": tags}}` (for multiple tags). This would be a query-by-vector-with-filter.
- For non-MMR traversals: retrieve all nodes with given incoming tags. This would be just query-by-filter.

Depth limits are handled at query time by tracking the IDs / depths at which they are discovered (e.g., the traversal process issues multiple underlying queries for adjacencies).
So, I think this would require:
- Query-by-vector
- Query-by-filter
- Query-by-vector-with-filter
- Filter for at least "list/set metadata field contains a given value", and maybe intersection (contains any of the given values).
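The "contains" and "intersects" filter semantics described above reduce to simple predicates over a list-valued metadata field. A sketch (operator names follow the comment; these helpers are illustrative, not an existing langchain API):

```python
from typing import Any, Dict, List


def matches_contains(metadata: Dict[str, Any], field: str, value: Any) -> bool:
    """"$contains": the list/set metadata field contains the given value."""
    return value in metadata.get(field, [])


def matches_intersects(metadata: Dict[str, Any], field: str, values: List[Any]) -> bool:
    """"$intersects": the field contains any of the given values."""
    return bool(set(metadata.get(field, [])) & set(values))
```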
Also "query-by-ID" to get all the nodes with the selected IDs, but that could be part of the above. Currently, it does the traversal with just the IDs and then fetches those IDs. So maybe also:

- query-ids-by-vector
- query-ids-by-filter
- query-ids-by-vector-and-filter
- query-by-ids
The logic being that each node may be reached many times during the traversal -- rather than pulling back all the data each time, it is likely better to just pull back the data necessary for traversing. That is currently (i) the ID used for identifying the reachable nodes, (ii) the embedding (used for the MMR score calculation), and (iii) the outgoing tags (used for finding the next set of edges). So maybe a version that accepts a list of which fields to actually fetch (e.g., we would only need the ID, the `metadata["outgoing"]`, and the embedding). But that could perhaps be a future optimization...
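The depth-limit bookkeeping described earlier (tracking visited IDs so nodes are not re-explored, issuing one adjacency query per frontier node) can be sketched as a breadth-first loop. Here `adjacent` stands in for a query-ids-by-filter call; the function and its signature are invented for illustration:

```python
from typing import Callable, List, Set


def traverse(
    seed_ids: List[str],
    adjacent: Callable[[str], List[str]],  # stands in for query-ids-by-filter
    max_depth: int,
) -> Set[str]:
    """Breadth-first traversal that tracks visited IDs to avoid re-exploring nodes."""
    visited: Set[str] = set(seed_ids)
    frontier: List[str] = list(seed_ids)
    for _ in range(max_depth):
        next_frontier: List[str] = []
        for node_id in frontier:
            for neighbor in adjacent(node_id):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return visited
```

After the ID-only traversal completes, a final query-by-ids would fetch the full documents for `visited`.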
I could see a case for filtering on mime type too. Maybe possible via the `filter` here, but thought I'd toss that out.
```python
    InMemoryRecordManager,
    RecordManager,
    UpsertResponse,
)

__all__ = [
    "aindex",
    "BaseIndex",
```
note: I find the "Index" part of this name confusing -- this might be because indexes in databases/programming are a different thing. Something like "Store" or "Collection" would be more intuitive to me.
```python
# Local scope to avoid circular imports
from langchain_core.vectorstores import VectorStore
```
As a rule of thumb, circular imports tend to cause subtle problems and are usually best avoided altogether.
```python
FilterOp = Literal[
    "$eq", "$ne", "$lt", "$lte", "$gt", "$gte", "$in", "$nin", "$and", "$not", "$or"
]
```
Any reason not to use an Enum here?
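For comparison, an `Enum` version of the same operators might look like this sketch (not the PR's code). Subclassing `str` keeps comparisons against the raw `"$eq"`-style strings working, so existing filter dictionaries would not need to change:

```python
from enum import Enum


class FilterOp(str, Enum):
    """Enum alternative to the Literal-based FilterOp (a sketch, not the PR's code)."""
    EQ = "$eq"
    NE = "$ne"
    LT = "$lt"
    LTE = "$lte"
    GT = "$gt"
    GTE = "$gte"
    IN = "$in"
    NIN = "$nin"
    AND = "$and"
    NOT = "$not"
    OR = "$or"
```

A trade-off: the `Literal` type checks string literals statically at zero runtime cost, while the `Enum` gives runtime validation (`FilterOp("$bad")` raises `ValueError`) and a single place to iterate over the supported operators.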
```python
from langchain_core.utils import abatch_iterate, batch_iterate


class Sort(TypedDict):
```
nit: I think in general, dataclasses are recommended over TypedDicts. Any reason to prefer a TypedDict here?
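A dataclass version of `Sort` might look like the sketch below. The field names are guesses, since the PR's `Sort` fields are not shown in this excerpt:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sort:
    """Dataclass variant of the proposed Sort TypedDict (field names are guesses)."""
    field: str
    ascending: bool = True
```

One reason to prefer a TypedDict in a public API is that callers can pass plain dictionaries (e.g. `{"field": "score", "ascending": False}`), which serialize to JSON directly; the dataclass gives attribute access, defaults, and immutability via `frozen=True`.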
```
if it is provided. If the ID is not provided, the upsert method is free
to generate an ID for the content.
```
Can you provide more strict guidance here? It's common to stick to RFC 2119 terms. Specifically -- is this "SHOULD" (recommended) or "MAY" (optional) behavior?
```
to generate an ID for the content.

When an ID is specified and the content already exists in the vectorstore,
the upsert method should update the content with the new data. If the content
```
Is there a way to insert without upsert?
Do you think it's necessary to be able to add without specifically having upsert semantics? At the moment, docs have an optional ID associated with them, so the upsert works like an add in that case.
From a relational database perspective, the conflict depends on what the primary key is. Is the assumption always that the ID is the primary key, or can you have primary keys on different fields? I might want to add a new entry, but not overwrite if there is an existing one.
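The add-without-overwrite semantics being asked about, reduced to dictionary form (purely illustrative; the helper name is invented):

```python
from typing import Any, Dict


def insert_if_absent(store: Dict[str, Any], key: str, value: Any) -> bool:
    """Add-only semantics: never overwrite an existing entry (contrast with upsert).

    Returns True if the value was inserted, False if the key already existed.
    """
    if key in store:
        return False
    store[key] = value
    return True
```

This is the dictionary analogue of SQL's `INSERT ... ON CONFLICT DO NOTHING`, whereas upsert corresponds to `ON CONFLICT DO UPDATE`.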
```python
"""

@beta(message="Added in 0.2.11. The API is subject to change.")
async def astreaming_upsert(
```
Should this have the same warning as `streaming_upsert`?
```
OR

bool: A boolean indicating whether the delete operation was successful.
```
What about partial successes? Should that be "true" or "false" if some items were removed?
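A response object sidesteps the partial-success ambiguity by reporting both outcomes explicitly. A sketch, with field names chosen to match the docstring quoted earlier ("IDs that were successfully deleted" / "IDs that failed"), though the PR's actual field names are not shown in this excerpt:

```python
from typing import Any, Dict, List, TypedDict


class DeleteResponse(TypedDict):
    """Sketch of a delete result that reports partial successes explicitly."""
    succeeded: List[str]
    failed: List[str]


def delete_many(store: Dict[str, Any], ids: List[str]) -> DeleteResponse:
    """Delete each ID independently instead of collapsing the result to one bool."""
    succeeded: List[str] = []
    failed: List[str] = []
    for id_ in ids:
        if id_ in store:
            del store[id_]
            succeeded.append(id_)
        else:
            failed.append(id_)
    return {"succeeded": succeeded, "failed": failed}
```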
```python
def delete(
    self, ids: Optional[List[str]] = None, **kwargs: Any
) -> Union[DeleteResponse, None, bool]:
```
In general, I would prefer introducing a new method with a different return type as opposed to trying to overload this one. Otherwise I don't get what I want from type checking here -- I have to do extra work to suss out the type. This also makes it easier to deprecate the function in the future.
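A sketch of this suggestion: keep the loosely-typed `delete` for compatibility and add a new method with a single concrete return type. The method name `delete_with_response` and the class body are invented for illustration:

```python
from typing import Any, List, Optional, TypedDict


class DeleteResponse(TypedDict):
    succeeded: List[str]
    failed: List[str]


class Index:
    """Sketch: legacy delete delegates to a new, precisely-typed method."""

    def delete(self, ids: Optional[List[str]] = None, **kwargs: Any):
        # Legacy signature returning Union[DeleteResponse, None, bool];
        # a future release could deprecate it in favor of the typed method.
        return self.delete_with_response(ids or [])

    def delete_with_response(self, ids: List[str]) -> DeleteResponse:
        # Hypothetical replacement with one well-defined return type.
        return {"succeeded": list(ids), "failed": []}
```

With this split, callers of `delete_with_response` get a `DeleteResponse` unconditionally, so no `isinstance` narrowing is needed at call sites.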
We're going to close this PR for now. The main reason is that it picks up too much complexity for functionality that, while it may be useful, isn't super critical. We do have the much simpler DocumentIndex abstraction right now (https://python.langchain.com/api_reference/core/indexing/langchain_core.indexing.base.DocumentIndex.html#langchain_core.indexing.base.DocumentIndex).
This PR introduces the `BaseIndex` abstraction.

Context
We're creating a base index abstraction that defines a standard interface w/ respect to BaseMedia, which is content that has an ID and metadata.
The interface is designed to support the following operations:
The interface does NOT support queries against the actual content of the BaseMedia -- this is left for more specialized interfaces like a vectorstore.
Goals:
- `describe` classmethod

We're creating a separate interface rather than updating VectorStore to allow other LangChain classes to also implement the same indexing API (e.g., ParentDocumentRetriever).

Existing vectorstores are largely unaffected: they will inherit from BaseIndex without any modifications, raising NotImplementedError for the new methods.

A test suite will be provided to test index functionality based on information in the `describe()` classmethod (which will cover vectorstores as well).

TODOs
- `BaseIndex` vs. `Index` (using `BaseIndex` to avoid collision with the `index` function)
- `delete` API