Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core[minor]: Add BaseIndex abstraction #24206

Closed
wants to merge 19 commits into from

Conversation

eyurtsev
Copy link
Collaborator

@eyurtsev eyurtsev commented Jul 12, 2024

This PR introduces the BaseIndex abstraction.

Context

We're creating a base index abstraction that defines a standard interface w/ respect to BaseMedia, which is content that has an ID and metadata.

The interface is designed to support the following operations:

  1. Storing content in the index.
  2. Retrieving content by ID.
  3. Querying the content based on the metadata associated with the content.

The interface does NOT support queries against the actual content of the BaseMedia -- this is left for more specialized interfaces like a vectorstore.

Goals:

  1. Define precise semantics (e.g., upsert vs. add), no exception for get_by_ids when content is missing etc.
  2. Add functionality which is useful for administration (e.g., get_by_ids, delete_by_filter etc.)
  3. BaseIndex can support binary data via MediaBlob
  4. Define a standard interface for filtering
  5. Create a standard way for vectorstores to describe what functionality they support (describe classmethod)

We're creating a separate interface rather than updating VectorStore to allow other LangChain classes to also implement the same indexing API (ParentDocumentRetriever)

Existing vectorstores are largely unaffected will inherit from BaseIndex without any modifcations, raising NotImplementedErrors for the new methods.

A test suite will be provided to test index functionality based on information in the describe() classmethod (which will cover vectorstores as well)

TODOs

  • Decide name BaseIndex vs. Index (using BaseIndex to avoid collision with the index function)
  • Confirm return type of delete API
  • Confirm location of abstraction and names of methods / variables
  • Add missing async methods

Copy link

vercel bot commented Jul 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Skipped Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview Jul 12, 2024 9:14pm

@eyurtsev eyurtsev added 08 RFC Request for Comment. Solicit design input from other contributors Ɑ: core Related to langchain-core Ɑ: vector store Related to vector store module labels Jul 12, 2024
@@ -421,29 +827,3 @@ async def adelete_keys(self, keys: Sequence[str]) -> None:
keys: A list of keys to delete.
"""
self.delete_keys(keys)


class UpsertResponse(TypedDict):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to top of file

@eyurtsev eyurtsev requested review from cbornet and baskaryan July 12, 2024 21:13
@eyurtsev eyurtsev marked this pull request as ready for review July 12, 2024 21:19
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. 🤖:refactor A large refactor of a feature(s) or restructuring of many files labels Jul 12, 2024
f"{self.__class__.__name__} does not yet support delete_by_filter."
)

def get_by_filter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to filter and order by similarity? Would that be a get_by_filter with a sort based on the similarity? Or is that a similarity search with a filter? Something else?

The above makes me wonder if similarity search could be expressed as a sort, in which case basic similarity search is filter = None, sort = Similarity?

Copy link
Collaborator Author

@eyurtsev eyurtsev Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about introducing another method to capture search semantics.

def get_by_query(
        self,
        *,
        query: Q, # <-- e.g., text query or vector or BaseMedia (up to the implementation)
        filter: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
        limit: Optional[int] = None,
        page: int = 0, # <-- Pagination added
        sort: Optional[Sort] = None,
        **kwargs,
) -> List[Hit]:  # Note it's a List rather than Iterable, and a `hit` rather than `Document` which could accommodate extra metadata like score, optional content etc.

i agree that sort and similarity can be viewed as the same thing. I need to think a bit more to see if we can merge them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjchambers Would standardization of filters, the ability to query by vector and the ability to retrieve vectors be sufficient to implement the search traversal patterns needed for GraphVectorStore?

At indexing time, the implementation can standardized fields into metadata; e.g., namespace (e.g., edge, node), edge_type (just for edges), edge_data (e.g., the value of an href).

At query time, the implementation can rely on the ability to query by vector and filter on the standardized fields. Is that sufficient to implement GraphVectorStore? Is there a way to limit traversal by depth correctly (to avoid re-exploring nodes?)

Copy link
Collaborator Author

@eyurtsev eyurtsev Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering what interface would be needed so that instead of inheriting from vectorstore we can use compositional pattern with vectorstore and code against the vectorstore's interface.

class VectorStoreGraphRag:
   vectorstore: Vectorstore
   def traverse(...):
      # implementation relying on an instance of vectorstore

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Some of it depends on the traversal. Currently, it does the following things:

  1. Retrieve the top-k nodes given a vector (we're working on also doing a version that supports filters). This would fit the query-by-vector and query-by-vector-with-filter.
  2. For MMR-traversals: Retrieve the top-k nodes given a vector with a given incoming "tag". I believe we could support this if we put the incoming tags into the metadata, and supported either {"incoming_tags": {"$contains": tag } } (for a single "contains") or {"incoming_tags": { "$intersects": tags }} (for multiple tags). This would be a query-by-vector-with-filter.
  3. For non-MMR traversals: Retrieve all nodes with given incoming tags. This would be just query-by-filter.

Depth limits are handled at query time by tracking the IDs / depths at which they are discovered (eg., the traversal process issues multiple underlying queries for adjacencies).

So, I think this would require:

  1. Query-by-vector
  2. Query-by-filter
  3. Query-by-vector-with-filter
  4. Filter for at least "list/set metadata field contains a given value", and maybe intersection (contains any of the given values).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also "query-by-ID" to get all the nodes with the selected IDs, but that could be part of the above. Currently, it does the traversal with just the IDs and then fetches those IDs. So maybe also:

  1. query-ids-by-vector
  2. query-ids-by-filter
  3. query-ids-by-vector-and-filter
  4. query-by-ids

The logic being that each node may be reached many times during the traversal -- rather than pulling back all the data each time, it is likely better to just pull back the data necessary for traversing. That is currently (i) the ID used for identifying the reachable nodes, (ii) the embedding (used for the MMR score calculation), and (iii) the outgoing tags (used for finding the next set of edges). So maybe a version that accepts a list of which fields to actually fetch (eg., we would only need the ID, the metadata["outgoing"], and the embedding). But that could perhaps be a future optimization...

@bjchambers
Copy link
Contributor

bjchambers commented Jul 12, 2024 via email

InMemoryRecordManager,
RecordManager,
UpsertResponse,
)

__all__ = [
"aindex",
"BaseIndex",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: I find the "Index" part of this name confusing -- this might be because Index's in databases/programming are a different thing. Something like "Store" or "Collection" would be more intuitive to me.

Comment on lines +271 to +272
# Local scope to avoid circular imports
from langchain_core.vectorstores import VectorStore

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a rule of thumb, circular imports tend to cause subtle problems are are usually best avoided all together.

Comment on lines +44 to +46
FilterOp = Literal[
"$eq", "$ne", "$lt", "$lte", "$gt", "$gte", "$in", "$nin", "$and", "$not", "$or"
]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use an Enum here?

from langchain_core.utils import abatch_iterate, batch_iterate


class Sort(TypedDict):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think in general, Dataclasses are recommended over TypedDict. Any reason to prefer a TypedDict here?

Comment on lines +172 to +173
if it is provided. If the ID is not provided, the upsert method is free
to generate an ID for the content.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you provide more strict guidance here? It's common to stick to RFC2119 terms.

Specifically -- is this "SHOULD" (recommended) or "MAY" (Optional) behavior?

to generate an ID for the content.

When an ID is specified and the content already exists in the vectorstore,
the upsert method should update the content with the new data. If the content

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to insert without upsert?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's necessary to be able to add without specifically having upsert semantics? At the moment, docs have an optional ID associated with them, so the upsert works like an add in that case.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a relational database perspective, the conflict depends on what the primary key is. Is the assumption always the the ID is a primary key, or can you have primary keys on different fields? I might want to add a new entry, but not overwrite if their is an existing one.

"""

@beta(message="Added in 0.2.11. The API is subject to change.")
async def astreaming_upsert(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this have the same warning as streaming_upsert?


OR

bool: A boolean indicating whether the delete operation was successful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about partial successes? Should that be "true" or "false" is some items were removed?

Comment on lines +238 to +240
def delete(
self, ids: Optional[List[str]] = None, **kwargs: Any
) -> Union[DeleteResponse, None, bool]:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would prefer introducing a new method with a different return as opposed to trying to overload this one. Otherwise don't get what I want from type checking here -- I have to do extra work to sus out the type. This also makes it easier to deprecate the function in the future.

@eyurtsev
Copy link
Collaborator Author

We're going to close this PR for now. The main reason is that it picks up too much complexity for functionality that while may be useful isn't super critical.

We do have the much simpler DocumentIndexer abstraction right now. (https://python.langchain.com/api_reference/core/indexing/langchain_core.indexing.base.DocumentIndex.html#langchain_core.indexing.base.DocumentIndex)

@eyurtsev eyurtsev closed this Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
08 RFC Request for Comment. Solicit design input from other contributors Ɑ: core Related to langchain-core 🤖:refactor A large refactor of a feature(s) or restructuring of many files size:XL This PR changes 500-999 lines, ignoring generated files. Ɑ: vector store Related to vector store module
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

3 participants