Replace GraphVectorStore with Metadata-Based Graph Traversal Retrievers #29102

epinzur · 2025-01-08T18:06:45Z


Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Replace the GraphVectorStore interface and its implementations with retrievers that traverse relationships stored in document metadata. These retrievers enable graph traversal using the vector and metadata search capabilities already available in many vector stores. Additionally, contribute implementations of LazyGraphRAG and other techniques built on these retrievers.

Motivation

We have released the following implementations under the GraphVectorStore interface:

langchain-community.graph_vectorstores.Cassandra
langchain-astradb.AstraDBGraphVectorStore

These implementations rely on three primary interfaces, GraphVectorStore, Link, and Node, all located in langchain-community.graph_vectorstores.base.

However, we have been informed that further contributions to LangChain using the GraphVectorStore interface will not be accepted. This makes our ChromaDB and OpenSearch implementations non-viable for PRs.

It was suggested that we transition to the DocumentIndex interface. After evaluation, we believe the BaseRetriever interface is a better fit because it enables graph traversal as a lightweight layer on top of existing vector stores via metadata.

Example:

Document(page_content="Barack Obama won the Nobel Peace Prize in 2009 for his efforts to strengthen international diplomacy.", metadata={"person": "obama"})

Document(page_content="Barack is an avid basketball fan and often played pickup games during his time in the White House.", metadata={"person": "obama"})

Graph traversal between these documents can be achieved by querying on the shared metadata relationship {"person": "obama"}.
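As a minimal sketch of that idea, the snippet below treats a shared metadata value as an edge between two documents. `Doc` and `neighbors` are hypothetical stand-ins written for illustration, not LangChain APIs; with a real vector store the same lookup would be a metadata-filtered query.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def neighbors(doc, docs, edge_key):
    """Return the other docs that share `doc`'s value for `edge_key`."""
    value = doc.metadata.get(edge_key)
    if value is None:
        return []
    return [d for d in docs if d is not doc and d.metadata.get(edge_key) == value]


docs = [
    Doc("Barack Obama won the Nobel Peace Prize in 2009.", {"person": "obama"}),
    Doc("Barack is an avid basketball fan.", {"person": "obama"}),
    Doc("Hockey is played on ice.", {"topic": "hockey"}),
]

# Follow the "person" edge from the first document.
linked = neighbors(docs[0], docs, "person")
```

Here `linked` contains only the second document, because it is the only other one with `"person": "obama"`.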

Proposal (If applicable)

We propose the following changes:

Introduce Metadata-Based Graph Traversal Retrievers

Create retrievers that implement the BaseRetriever interface and traverse metadata relationships. These retrievers eliminate the need for Link and Node objects and focus directly on Document metadata for traversal.
We will start by replicating the current functionality of GraphVectorStore with two retrievers:
- GraphTraversalRetriever
- GraphMMRTraversalRetriever
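
To make the traversal idea concrete, here is a rough breadth-first sketch over an in-memory dictionary. It is not the proposed GraphTraversalRetriever: the `traverse` function, the `max_depth` parameter, and the dictionary-based store are all assumptions made for illustration, and the inner loop stands in for what would be a metadata-filtered query against a real store.

```python
from collections import deque


def traverse(docs, start_ids, edges, max_depth=2):
    """Breadth-first traversal over docs linked by shared metadata values.

    docs: dict mapping id -> metadata dict (hypothetical in-memory store).
    start_ids: ids of the initial hits (e.g. from a similarity search).
    edges: metadata keys to follow.
    """
    visited = list(start_ids)
    seen = set(start_ids)
    queue = deque((doc_id, 0) for doc_id in start_ids)
    while queue:
        doc_id, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand beyond the depth limit
        for key in edges:
            value = docs[doc_id].get(key)
            if value is None:
                continue
            # In a real store this scan would be a metadata-filtered query.
            for other_id, meta in docs.items():
                if other_id not in seen and meta.get(key) == value:
                    seen.add(other_id)
                    visited.append(other_id)
                    queue.append((other_id, depth + 1))
    return visited


store = {
    "a": {"topic": "hockey", "person": "gretzky"},
    "b": {"topic": "hockey"},
    "c": {"person": "gretzky", "team": "oilers"},
    "d": {"team": "oilers"},
    "e": {"topic": "chess"},
}

result = traverse(store, ["a"], edges=["topic", "person", "team"], max_depth=2)
```

Starting from "a", the first hop reaches "b" and "c", and the second hop reaches "d" through the "team" edge; "e" shares no metadata value with any of them and is never visited.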

Introduce VectorStoreAdapter Interfaces

To address gaps in the current VectorStore interface, we propose using VectorStoreAdapter interfaces. These will define the additional methods necessary for graph traversal.

Why adapters?
Adapters provide a flexible way to integrate graph traversal functionality without requiring immediate changes to the VectorStore interface.

Note that these adapters don't provide new functionality; they merely expose existing capabilities of most vector stores through a standard interface. That is why they are easy to write, and why we consider them "adapters" rather than full implementations.

In the future, if the VectorStore interface adds support for richer metadata filtering (as discussed in LangChain PR #24206) and vector embedding retrieval, the adapters can be phased out.

Adapter Implementation
- To start, we will provide adapters for AstraDB, Cassandra, ChromaDB, and OpenSearch.
- Users of other vector stores can implement their own adapters and contribute them back to the community.
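
As a hedged illustration of how thin such an adapter can be, the sketch below wraps a toy in-memory store. `InMemoryStore`, `InMemoryTraversalAdapter`, `search_by_vector`, and the `_embedding` metadata key are all hypothetical names chosen for this example; they are not part of LangChain or of the proposed interface.

```python
import math


class InMemoryStore:
    """Hypothetical toy vector store holding (text, vector, metadata) rows."""

    def __init__(self):
        self.rows = []

    def add(self, text, vector, metadata):
        self.rows.append((text, vector, metadata))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryTraversalAdapter:
    """Hypothetical adapter exposing filtered vector search plus embeddings."""

    def __init__(self, store):
        self.store = store

    def search_by_vector(self, embedding, k=4, filter=None):
        filter = filter or {}
        # Keep only rows whose metadata matches every filter entry.
        matches = [
            row
            for row in self.store.rows
            if all(row[2].get(key) == val for key, val in filter.items())
        ]
        matches.sort(key=lambda row: cosine(embedding, row[1]), reverse=True)
        # Attach each embedding to a copy of the metadata so callers can
        # reuse it without re-embedding the document.
        return [
            {"text": text, "metadata": {**meta, "_embedding": vec}}
            for text, vec, meta in matches[:k]
        ]


store = InMemoryStore()
store.add("puck drop", [1.0, 0.0], {"topic": "hockey"})
store.add("slap shot", [0.8, 0.6], {"topic": "hockey"})
store.add("opening gambit", [0.0, 1.0], {"topic": "chess"})

adapter = InMemoryTraversalAdapter(store)
hits = adapter.search_by_vector([1.0, 0.0], k=2, filter={"topic": "hockey"})
```

The adapter adds nothing the store could not already do; it only packages filtering, ranking, and the stored embedding behind one call.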

Refactor LinkExtractors into DocumentTransformers

The existing LinkExtractors will be converted into DocumentTransformers that add metadata to Document objects. This approach supports graph traversal while also enabling broader use cases, such as metadata-based filtering.
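
A transformer of this kind only needs to return documents with extra metadata. The sketch below uses plain keyword matching as a deliberately simple substitute for a real entity extractor; `Doc` and `KeywordTopicExtractor` are hypothetical names for illustration, with `transform_documents` chosen to mirror the method name used in the example API.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordTopicExtractor:
    """Hypothetical transformer that tags docs with topics found in their text."""

    def __init__(self, topics, metadata_key="topic"):
        self.topics = [t.lower() for t in topics]
        self.metadata_key = metadata_key

    def transform_documents(self, documents):
        out = []
        for doc in documents:
            text = doc.page_content.lower()
            found = [t for t in self.topics if t in text]
            # Build new docs rather than mutating the inputs.
            metadata = {**doc.metadata}
            if found:
                metadata[self.metadata_key] = found
            out.append(Doc(doc.page_content, metadata))
        return out


docs = [
    Doc("Hockey and curling are winter sports."),
    Doc("Chess is a board game."),
]
extractor = KeywordTopicExtractor(["hockey", "curling", "chess"])
tagged = extractor.transform_documents(docs)
```

Because the output is ordinary metadata, the same field can serve as a traversal edge or as a plain filter.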

Example API:

# use a document transformer to add metadata to documents
extractor = GLiNEREntityExtractor(['topic'])
documents = extractor.transform_documents(documents)

# add the documents to a standard vector store
store = AstraDBVectorStore(embedding_function=OpenAIEmbeddings())
store.add_documents(documents)

# create a graph retriever using an adapter. 
# traverse over edges using the 'topic' metadata key.
retriever = GraphTraversalRetriever(
    store=AstraTraversalAdapter(vector_store=store),
    edges=['topic'],
)

# invoke the retriever
retriever.invoke("hockey")

Benefits:

- Declarative Traversal: Traversal edges are defined at retriever initialization using metadata keys, allowing a single vector store to support multiple traversal strategies. This enables fast ingestion and defers detailed graph creation, as seen in Microsoft's LazyGraphRAG.
- Narrower Abstraction: The BaseRetriever interface is read-only, emphasizing that any vector store can be traversed as a graph without requiring special ingestion processes.
- Broad Compatibility: Adapters enable graph retrieval on any vector store that has an adapter, maximizing flexibility and extensibility.
- Metadata Efficiency: Avoids embedding Link objects in metadata, reducing the risk of hitting document size limits. This approach is also more idiomatic: existing metadata fields are used naturally, rather than requiring special Link fields.

Discussion Points:

- Why BaseRetriever, Not DocumentIndex? Graph traversal is a property of querying data, not writing or configuring it. Using BaseRetriever keeps the focus on reading and leaves ingestion to the vector store.
- Shared or Individual Adapter Definitions? Should there be a shared Adapter interface for all Graph Traversal Retrievers or individual definitions for each? Where should the Adapter code live?
- Location for Graph Algorithms? Where should graph algorithms like LazyGraphRAG, which can be implemented as a chain on top of a retriever, live within the codebase?

efriis (Maintainer) · 2025-01-09T22:58:36Z

Thanks for writing this up! A few questions:

- Could you share an interface for VectorStoreAdapter?
- Could you share the exposed interface for a "graph algorithm," and how does it differ from a retriever?
- Would you be open to implementing this in one of the existing packages (e.g. langchain-astradb) as a standalone first, to get feedback on how people use it before we introduce ones that are meant to be generalizable between services?


bjchambers Jan 10, 2025

> could you share the exposed interface for a "graph algorithm," and how does it differ from a retriever

As an example graph algorithm, LazyGraphRAG is implemented as a chain, which performs graph retrieval (using the retriever), then creates communities from the retrieved subgraph, extracts claims from each community, and then reranks those claims and uses them to answer the question. The interface is a chain that can be applied to any retriever and returns the top claims, allowing it to be used in a larger chain.

> would you be open to implementing this in one of the existing packages (e.g. langchain-astradb) as a standalone first to get feedback on how people use it before we introduce ones that are meant to be generalizable between services?

Our concern here is that graph techniques on top of metadata are generally useful. Implementing them in the existing store-specific packages makes it seem like they only work for those specific stores. Similarly, the GraphRAG implementations are chains that can be applied to any Documents and assemble an in-memory graph based on relationships in the metadata. Nothing about these is store-specific.
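
To illustrate the "in-memory graph from metadata" step in a store-agnostic way, here is a small sketch that groups retrieved documents into communities. It uses connected components via union-find as a simple stand-in; this is not the community detection LazyGraphRAG itself uses, and the `communities` function and dictionary input are assumptions made for the example. Metadata values are assumed to be hashable.

```python
from collections import defaultdict


def communities(docs, edges):
    """Group doc ids into connected components via shared metadata values.

    docs: dict mapping id -> metadata dict.
    edges: metadata keys whose shared values link two documents.
    """
    # Index: (key, value) -> ids of docs carrying that value.
    by_value = defaultdict(list)
    for doc_id, meta in docs.items():
        for key in edges:
            if key in meta:
                by_value[(key, meta[key])].append(doc_id)

    # Union-find over doc ids.
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Docs sharing a (key, value) pair end up in the same component.
    for ids in by_value.values():
        for other in ids[1:]:
            parent[find(other)] = find(ids[0])

    groups = defaultdict(list)
    for doc_id in docs:
        groups[find(doc_id)].append(doc_id)
    return sorted(sorted(group) for group in groups.values())


retrieved = {
    "a": {"topic": "hockey"},
    "b": {"topic": "hockey", "person": "gretzky"},
    "c": {"person": "gretzky"},
    "d": {"topic": "chess"},
}

groups = communities(retrieved, edges=["topic", "person"])
```

Nothing here depends on which store produced the documents; only their metadata is consulted. Changing `edges` changes the resulting communities without touching the stored data.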

epinzur (Author) · 2025-01-10T13:41:28Z

Below is our current thinking on the VectorStoreAdapter interface.

Beyond standard retrieval, implementations of this adapter inject each document's embedding into the retrieved document's metadata.

Additionally, the query-based methods return the embedded query vector. This is done to support server-side embedding.

from abc import ABC, abstractmethod
from typing import (
    Any,
    Dict,
    List,
    Optional,
    Sequence,
    Tuple,
)

from langchain_core.documents import Document
from langchain_core.embeddings import Embeddings
from langchain_core.runnables import run_in_executor
from langchain_core.vectorstores import VectorStore


# Inherit from ABC so the @abstractmethod markers below are actually enforced.
class VectorStoreAdapter(ABC):
    _base_vector_store: VectorStore

    @property
    def _safe_embedding(self) -> Embeddings:
        if not self._base_vector_store.embeddings:
            msg = "Missing embedding"
            raise ValueError(msg)
        return self._base_vector_store.embeddings

    def similarity_search_with_embedding(
        self,
        query: str,
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> Tuple[List[float], List[Document]]:
        """Returns docs (with embeddings) most similar to the query.

        Also returns the embedded query vector.

        Args:
            query: Input text.
            k: Number of Documents to return. Defaults to 4.
            filter: Filter on the metadata to apply.
            **kwargs: Additional keyword arguments.

        Returns:
            A tuple of:
                * The embedded query vector
                * List of Documents most similar to the query vector.
                  Documents should have their embedding added to
                  their metadata under the METADATA_EMBEDDING_KEY key.
        """
        query_embedding = self._safe_embedding.embed_query(text=query)
        docs = self.similarity_search_with_embedding_by_vector(
            embedding=query_embedding,
            k=k,
            filter=filter,
            **kwargs,
        )
        return query_embedding, docs

    async def asimilarity_search_with_embedding(
        self,
        query: str,
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> Tuple[List[float], List[Document]]:
        """Returns docs (with embeddings) most similar to the query.

        Also returns the embedded query vector.

        Args:
            query: Input text.
            k: Number of Documents to return. Defaults to 4.
            filter: Filter on the metadata to apply.
            **kwargs: Additional keyword arguments.

        Returns:
            A tuple of:
                * The embedded query vector
                * List of Documents most similar to the query vector.
                  Documents should have their embedding added to
                  their metadata under the METADATA_EMBEDDING_KEY key.
        """
        return await run_in_executor(
            None, self.similarity_search_with_embedding, query, k, filter, **kwargs
        )

    @abstractmethod
    def similarity_search_with_embedding_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Returns docs (with embeddings) most similar to the query vector.

        Args:
            embedding: Embedding to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            filter: Filter on the metadata to apply.
            **kwargs: Additional keyword arguments.

        Returns:
            List of Documents most similar to the query vector.
                Documents should have their embedding added to
                their metadata under the METADATA_EMBEDDING_KEY key.
        """

    async def asimilarity_search_with_embedding_by_vector(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[Dict[str, str]] = None,
        **kwargs: Any,
    ) -> List[Document]:
        """Returns docs (with embeddings) most similar to the query vector.

        Args:
            embedding: Embedding to look up documents similar to.
            k: Number of Documents to return. Defaults to 4.
            filter: Filter on the metadata to apply.
            **kwargs: Additional keyword arguments.

        Returns:
            List of Documents most similar to the query vector.
                Documents should have their embedding added to
                their metadata under the METADATA_EMBEDDING_KEY key.
        """
        return await run_in_executor(
            None,
            self.similarity_search_with_embedding_by_vector,
            embedding,
            k,
            filter,
            **kwargs,
        )

    @abstractmethod
    def get(
        self,
        ids: Sequence[str],
        /,
        **kwargs: Any,
    ) -> List[Document]:
        """Get documents by id.

        Fewer documents may be returned than requested if some IDs are not found or
        if there are duplicated IDs.

        Users should not assume that the order of the returned documents matches
        the order of the input IDs. Instead, users should rely on the ID field of the
        returned documents.

        This method should **NOT** raise exceptions if no documents are found for
        some IDs.

        Args:
            ids: List of IDs to get.
            kwargs: Additional keyword arguments. These are up to the implementation.

        Returns:
            List[Document]: List of documents that were found.
        """

    async def aget(
        self,
        ids: Sequence[str],
        /,
        **kwargs: Any,
    ) -> List[Document]:
        """Get documents by id.

        Fewer documents may be returned than requested if some IDs are not found or
        if there are duplicated IDs.

        Users should not assume that the order of the returned documents matches
        the order of the input IDs. Instead, users should rely on the ID field of the
        returned documents.

        This method should **NOT** raise exceptions if no documents are found for
        some IDs.

        Args:
            ids: List of IDs to get.
            kwargs: Additional keyword arguments. These are up to the implementation.

        Returns:
            List[Document]: List of documents that were found.
        """
        return await run_in_executor(
            None,
            self.get,
            ids,
            **kwargs,
        )
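
One plausible reason to return embeddings alongside documents (an assumption on our part, suggested by the planned GraphMMRTraversalRetriever) is that a retriever can then score diversity on the client without re-embedding anything. The sketch below is a generic maximal marginal relevance selection, not the retriever's actual implementation; `mmr_select` and its parameters are hypothetical names for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_embedding, candidates, k=2, lambda_mult=0.5):
    """Pick k candidates, balancing query relevance against redundancy.

    candidates: list of (doc_id, embedding) pairs.
    lambda_mult: 1.0 means pure relevance; lower values favor diversity.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(item):
            _, emb = item
            relevance = cosine(query_embedding, emb)
            # Similarity to the closest already-selected document.
            redundancy = max(
                (cosine(emb, chosen) for _, chosen in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


query = [1.0, 0.0]
candidates = [
    ("a", [1.0, 0.0]),
    ("b", [0.99, 0.14]),  # near-duplicate of "a"
    ("c", [0.6, 0.8]),  # less relevant but more distinct
]

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.3)
relevant = mmr_select(query, candidates, k=2, lambda_mult=1.0)
```

With `lambda_mult=0.3` the near-duplicate "b" is skipped in favor of "c", while `lambda_mult=1.0` reduces to plain similarity ranking and picks "a" and "b".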
