
Add in checkpoint saving whilst Sentence Transformer embeddings are generated #8

Open · agstephens opened this issue Jan 20, 2025 · 0 comments

@agstephens (Member):

Might be able to checkpoint the transformer model.

Yes. The easiest approach is to take advantage of Chroma’s built-in persistence and store your embeddings on disk. Then on subsequent runs, you can load from the persisted database instead of re-embedding all the documents. The key steps are:

  1. Embed Your Documents with a persist_directory
  2. Call persist() to Save
  3. Load the Persisted Database on the Next Run

Below is a simplified example of how you might do it:

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

def embed_and_persist_docs():
    # 1. Load documents
    loader = DirectoryLoader("./docs", glob="*.txt")
    docs = loader.load()

    # 2. Split documents into chunks
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(docs)

    # 3. Create and save the embeddings
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

    # IMPORTANT: specify `persist_directory`
    vectordb = Chroma.from_documents(
        documents=docs,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )

    # 4. Explicitly persist to disk (needed on older Chroma/LangChain
    # versions; newer Chroma clients persist automatically)
    vectordb.persist()

def load_existing_db():
    # On subsequent runs, you can load directly from the persisted DB
    embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
    vectordb = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embeddings
    )
    
    # Then you can query or do similarity searches without re-embedding
    query = "What is the doc about?"
    docs = vectordb.similarity_search(query)
    print(docs)
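
For clarity, a typical calling pattern might look like this (using the two functions defined above):

# First run: compute the embeddings once and write them to ./chroma_db
embed_and_persist_docs()

# Any later run: skip embedding and load straight from disk
load_existing_db()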

How This Solves Your Problem

  1. No Re-Embedding: Once the embeddings are computed and persisted to the directory (e.g. ./chroma_db), you can load them from disk in future runs. That way, you skip the embedding step entirely if nothing has changed.

  2. Checkpointing While Embedding:

    • A simple form of checkpointing is to do smaller batches of documents, then call persist() after each batch. If your script crashes or is interrupted, you can pick up from the last persisted batch rather than starting over.
    • Concretely, you might:
      1. Split all your documents into subsets (or even one document at a time).
      2. Embed each subset and add them to the Chroma database (via Chroma.add_documents(...)).
      3. Call vectordb.persist().
      4. Move on to the next subset.

    Here’s a sketch for batching:

    def batch_embedding(docs, batch_size=10):
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
    
        for i in range(0, len(docs), batch_size):
            batch_docs = docs[i : i + batch_size]
            vectordb.add_documents(batch_docs)
            vectordb.persist()
            print(f"Persisted batch {i // batch_size + 1}")

    If your script dies in the middle, it should already have persisted the previous batches. On restart, you can detect which batches were completed (e.g., by checking how many documents are in the DB or by keeping a simple batch counter file) and pick up from where you left off.
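
    Here's a minimal sketch of the counter-file variant. The function name batch_embedding_with_resume and the last_batch.txt path are illustrative, not part of LangChain or Chroma:

    import os

    def batch_embedding_with_resume(docs, batch_size=10,
                                    counter_file="./chroma_db/last_batch.txt"):
        embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
        vectordb = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

        # Resume from the last completed batch recorded in the counter file
        start_batch = 0
        if os.path.exists(counter_file):
            with open(counter_file) as f:
                start_batch = int(f.read().strip())

        n_batches = (len(docs) + batch_size - 1) // batch_size
        for b in range(start_batch, n_batches):
            vectordb.add_documents(docs[b * batch_size : (b + 1) * batch_size])
            vectordb.persist()
            # Record progress only after the batch is safely on disk
            with open(counter_file, "w") as f:
                f.write(str(b + 1))
            print(f"Persisted batch {b + 1} of {n_batches}")

    One caveat: if the script dies between persist() and the counter update, that single batch may be added twice on restart; passing explicit ids to add_documents is one way to make re-runs idempotent.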

Alternative: Caching Embedding Calls

If you simply want to avoid calling the embedding API for any text chunk that has already been embedded, LangChain provides a caching utility for LLM calls, and more recent releases also offer a CacheBackedEmbeddings wrapper for embedding models. Otherwise, you'd implement your own check for whether a text was already embedded (e.g., by storing a hash in a local SQLite or JSON file) before calling OpenAI again, as sketched below.
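
As a concrete sketch of that roll-your-own approach (the cached_embed function and embedding_cache.json path are illustrative; a JSON file stands in for whatever store you prefer):

import hashlib
import json
import os

def cached_embed(texts, embeddings, cache_path="embedding_cache.json"):
    # Load previously computed vectors, keyed by a hash of the text
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)

    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in cache:
            # Only never-before-seen texts trigger an API call
            cache[key] = embeddings.embed_query(text)
        vectors.append(cache[key])

    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return vectors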

But in most use cases, Chroma’s local persistence (as shown above) is sufficient to ensure you don’t re-embed the same documents every time you run your script. It’s often the easiest “one-line” solution for checkpointing your embeddings.
