A tidy approach to vector databases

flmnh-ai/tidyvec

tidyvec

pkgdown R-CMD-check Lifecycle: experimental CRAN status

Overview

tidyvec is a lightweight vector database for the tidyverse ecosystem. It enables you to:

  • Store and query vector embeddings alongside your data in tibbles
  • Generate embeddings for text and images
  • Find similar items using vector similarity search
  • Visualize embedding spaces
  • Seamlessly integrate with dplyr, ggplot2, and other tidyverse packages

Why tidyvec?

Specialized tools such as FAISS (a similarity-search library) and Pinecone (a managed vector database) offer high performance for large-scale applications, but using them often means leaving the familiar tidyverse workflow. tidyvec bridges this gap by:

  • Keeping it tidy: Store embeddings right in your tibbles
  • Familiar syntax: Use standard dplyr verbs before and after vector operations
  • Low friction: No need to switch contexts between data wrangling and similarity search
  • Multimodal support: Work with text, images, or any data type that can be embedded

Installation

You can install tidyvec from GitHub:

# install.packages("remotes")
remotes::install_github("flmnh-ai/tidyvec")

To use neural embedding models from Hugging Face, you'll also need Python with the required dependencies:

# Set up Python environment with required packages
tidyvec::setup_python()

Basic Usage

Text Embeddings and Search

library(tidyverse)
library(tidyvec)

# Create a collection of books
books <- tibble(
  title = c(
    "The Art of Data Science",
    "Advanced R Programming",
    "Tidy Data Visualization",
    "Statistical Learning Methods",
    "Machine Learning with R"
  ),
  description = c(
    "A comprehensive guide to data analysis using modern techniques",
    "Deep dive into R programming for advanced users",
    "Creating beautiful visualizations with ggplot2 and the tidyverse",
    "Introduction to statistical learning methods and their applications",
    "Practical machine learning approaches with R examples"
  )
)

# Create a TF-IDF embedder and embed the descriptions
embedder <- embedder_tfidf(books$description)
books_vec <- books %>%
  vec(embedding_fn = embedder) %>%
  embed(content_column = "description")

# Find similar books using the `%~%` operator
"data visualization techniques" %~% books_vec %>%
  select(title, similarity)
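
To make the idea behind TF-IDF search concrete, here is a minimal base R sketch of TF-IDF weighting followed by cosine similarity. It does not call tidyvec and is not a description of its internals; the tokenizer, the smoothed IDF formula, and the helper names (`tokenize`, `embed_query`, `cosine`) are assumptions chosen for illustration.

```r
# Toy corpus; tokenization is a simple lowercase split on non-alphanumerics.
docs <- c(
  "creating beautiful visualizations with ggplot2",
  "deep dive into r programming",
  "practical machine learning with r"
)

tokenize <- function(x) {
  tokens <- strsplit(tolower(x), "[^a-z0-9]+")
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

doc_tokens <- tokenize(docs)
vocab <- sort(unique(unlist(doc_tokens)))

# Term-frequency matrix: one row per document, one column per vocabulary term.
tf <- t(vapply(
  doc_tokens,
  function(tok) as.numeric(table(factor(tok, levels = vocab))),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Smoothed inverse document frequency (one common variant among several).
idf <- log((1 + nrow(tf)) / (1 + colSums(tf > 0))) + 1
tfidf <- sweep(tf, 2, idf, `*`)

# Embed a query with the same vocabulary and IDF weights;
# terms outside the vocabulary are ignored.
embed_query <- function(q) {
  tok <- tokenize(q)[[1]]
  as.numeric(table(factor(tok, levels = vocab))) * idf
}

cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

query_vec <- embed_query("r programming")
similarity <- apply(tfidf, 1, cosine, b = query_vec)

# Document indices ordered from most to least similar.
ranked <- order(similarity, decreasing = TRUE)
```

The second document shares both query terms and ranks first; the first document shares none, so its similarity is zero.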

Working with Images

# Create a CLIP embedder for images
clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal")

# Get paths to example images included with the package
img_paths <- c(
  cat = system.file("extdata/images", "cat.jpg", package = "tidyvec"),
  dog = system.file("extdata/images", "dog.jpg", package = "tidyvec"),
  beach = system.file("extdata/images", "beach.jpg", package = "tidyvec"),
  mountain = system.file("extdata/images", "mountain.jpg", package = "tidyvec"),
  city = system.file("extdata/images", "city.jpg", package = "tidyvec")
)

# Create an image collection
images <- tibble(
  id = names(img_paths),
  path = unname(img_paths),
  category = c("pet", "pet", "nature", "nature", "urban")
) %>%
  vec(embedding_fn = clip_embedder) %>%
  embed(content_column = "path")

# Find images similar to text
"a cat playing" %~% images %>%
  select(id, path, similarity)

# Find similar images and visualize them
"a dog on a beach" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"))

Key Features

Vector Collections

The vec() function transforms a tibble into a vector collection:

# Create a basic collection
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec()

# With a custom embedding function
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = my_custom_embedder)
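
`my_custom_embedder` is not defined above. As a rough sketch, a custom embedder could look like the function below, which hashes character trigrams into a fixed-length vector. The contract used here (character vector in, numeric matrix out, one row per input) is an assumption and not tidyvec's documented interface, so check the package reference before relying on it.

```r
# Hypothetical custom embedder: hashes character trigrams into a fixed-size
# count vector. The input/output contract is an assumption for illustration.
my_custom_embedder <- function(texts, dims = 64L) {
  embed_one <- function(text) {
    vec <- numeric(dims)
    chars <- utf8ToInt(tolower(text))
    if (length(chars) < 3) return(vec)
    for (i in seq_len(length(chars) - 2)) {
      # Simple polynomial hash of the trigram, mapped to a bucket in 1..dims.
      h <- (chars[i] * 31 * 31 + chars[i + 1] * 31 + chars[i + 2]) %% dims
      vec[h + 1] <- vec[h + 1] + 1
    }
    # L2-normalize so dot products between rows equal cosine similarities.
    vec / sqrt(sum(vec^2))
  }
  do.call(rbind, lapply(texts, embed_one))
}

emb <- my_custom_embedder(c("sample text", "another example"))
dim(emb)  # 2 64
```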

Embedding Generation

Generate embeddings using built-in or custom embedding functions:

# TF-IDF embeddings
documents <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")

# HuggingFace neural embeddings
comments <- tibble(text = c("I love this product", "Terrible experience")) %>%
  vec(embedding_fn = embedder_hf("sentence-transformers/all-MiniLM-L6-v2")) %>%
  embed(content_column = "text")

Similarity Search

Find similar items with the nearest() function or %~% operator:

# Find nearest neighbors
my_collection %>%
  nearest("query text", n = 5)

# Or using the similarity operator
"query text" %~% my_collection

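Conceptually, a nearest-neighbor query scores every stored embedding against the query embedding and keeps the top `n`. The base R sketch below does this by brute force with cosine similarity. `nearest_sketch` is a hypothetical helper, and tidyvec's own metric and implementation may differ.

```r
# Brute-force nearest neighbors by cosine similarity.
# `embeddings` is a numeric matrix (one row per item); `query` is a numeric vector.
nearest_sketch <- function(embeddings, query, n = 5) {
  row_norms <- sqrt(rowSums(embeddings^2))
  sims <- as.vector(embeddings %*% query) / (row_norms * sqrt(sum(query^2)))
  top <- head(order(sims, decreasing = TRUE), n)
  data.frame(index = top, similarity = sims[top])
}

embeddings <- rbind(
  c(1, 0, 0),
  c(0.9, 0.1, 0),
  c(0, 1, 0),
  c(0, 0, 1)
)

# Rows 1 and 2 point in nearly the same direction as the query.
result <- nearest_sketch(embeddings, query = c(1, 0, 0), n = 2)
```

Brute force is linear in the number of rows, which is usually fine for the tibble-sized collections this package targets.
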
Embedding Visualization

Visualize your embedding space:

# Assumes the collection has been embedded and has `id` and `category` columns
my_collection %>%
  viz_embeddings(method = "umap", labels = "id", color = "category")
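
For intuition about what a 2-D embedding plot shows, the sketch below projects synthetic embeddings to two dimensions. It uses PCA via `stats::prcomp()` as a simple linear stand-in for UMAP, not the method `viz_embeddings()` uses, and the data and category labels are made up for illustration.

```r
set.seed(42)

# Synthetic 10-dimensional embeddings for two made-up categories.
emb <- rbind(
  matrix(rnorm(20 * 10, mean = 0), ncol = 10),
  matrix(rnorm(20 * 10, mean = 3), ncol = 10)
)
category <- rep(c("a", "b"), each = 20)

# Project to two dimensions with PCA (a linear stand-in for UMAP).
pca <- prcomp(emb, center = TRUE, scale. = FALSE)
coords <- data.frame(
  x = pca$x[, 1],
  y = pca$x[, 2],
  category = category,
  stringsAsFactors = FALSE
)

# Points from the same category should land close together.
plot(coords$x, coords$y,
     col = ifelse(coords$category == "a", "steelblue", "tomato"),
     pch = 19, xlab = "PC1", ylab = "PC2")
```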

Advanced Examples

Combining with Tidyverse Operations

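# These examples assume books_vec also has a numeric `year` column
# (the books tibble in Basic Usage does not define one).

# Filter first, then search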
books_vec %>%
  filter(year >= 2020) %>%
  nearest("visualization techniques", n = 2)

# Search first, then filter results
books_vec %>%
  nearest("R programming", n = 10) %>%
  filter(similarity > 0.5) %>%
  arrange(desc(year))
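
The order of filtering and searching changes the result, not just the style. The base R sketch below uses made-up titles, years, and precomputed similarity scores (all hypothetical) to show that filtering first always ranks within the subset, while searching first can drop every match that falls outside the top `n`.

```r
items <- data.frame(
  title = c("A", "B", "C", "D"),
  year = c(2018, 2021, 2022, 2019),
  similarity = c(0.9, 0.4, 0.7, 0.8),  # made-up scores for a single query
  stringsAsFactors = FALSE
)

# Keep the n rows with the highest similarity.
top_n_rows <- function(df, n) {
  head(df[order(df$similarity, decreasing = TRUE), ], n)
}

# Filter first, then take the top 2: both results satisfy the filter.
pre <- top_n_rows(items[items$year >= 2020, ], 2)

# Take the top 2 first, then filter: recent items outside the top 2 are lost.
post <- top_n_rows(items, 2)
post <- post[post$year >= 2020, ]
```

Here `pre` contains titles C and B, while `post` is empty because the two best overall matches both predate 2020. When searching first, request a larger `n` than you ultimately need.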

Building a Simple RAG (Retrieval-Augmented Generation) System

# Split document into chunks
chunks <- c(
  "R is a programming language for statistical computing.",
  "The tidyverse is a collection of R packages for data science."
  # ... more chunks
)

# Build the collection; chunk IDs are derived from the number of chunks
document_chunks <- tibble(
  chunk_id = paste0("chunk", seq_along(chunks)),
  text = chunks,
  source = "R Documentation"
) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")

# Query relevant chunks
query_results <- document_chunks %>%
  nearest("How do I visualize data in R?", n = 3)

# Inspect the retrieved chunks; their text can be passed to an LLM as context
query_results %>%
  select(text, similarity)
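
The generation step is left open above. One possible way to turn retrieved chunks into an LLM prompt is sketched below in base R. `build_prompt` and the prompt wording are hypothetical, and the character vector stands in for the `text` column of `query_results`; sending the prompt to a model is out of scope here.

```r
# Stand-in for query_results$text (retrieved chunks, best match first).
retrieved_text <- c(
  "The tidyverse is a collection of R packages for data science.",
  "R is a programming language for statistical computing."
)

# Number each chunk and place it ahead of the question.
build_prompt <- function(question, chunks) {
  context <- paste0("[", seq_along(chunks), "] ", chunks, collapse = "\n")
  paste0(
    "Answer the question using only the context below.\n\n",
    "Context:\n", context, "\n\n",
    "Question: ", question
  )
}

prompt <- build_prompt("How do I visualize data in R?", retrieved_text)
cat(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which passage supports its answer.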

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This package is licensed under the MIT License; see the LICENSE.md file for details.
