tidyvec is a lightweight vector database for the tidyverse ecosystem. It enables you to:
- Store and query vector embeddings alongside your data in tibbles
- Generate embeddings for text and images
- Find similar items using vector similarity search
- Visualize embedding spaces
- Seamlessly integrate with dplyr, ggplot2, and other tidyverse packages
While specialized vector databases like FAISS and Pinecone offer high performance for large-scale applications, they often require leaving the familiar tidyverse workflow. tidyvec bridges this gap:
- Keeping it tidy: Store embeddings right in your tibbles
- Familiar syntax: Use standard dplyr verbs before and after vector operations
- Low friction: No need to switch contexts between data wrangling and similarity search
- Multimodal support: Work with text, images, or any data type that can be embedded
You can install tidyvec from GitHub:
# install.packages("remotes")
remotes::install_github("flmnh-ai/tidyvec")
For neural embedding models via HuggingFace, you'll also need Python with the required dependencies:
# Set up Python environment with required packages
tidyvec::setup_python()
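If you want to confirm which Python environment is active before embedding anything, reticulate (which the Python-backed embedders are assumed here to rely on) can report the current configuration:
# Check the Python interpreter and packages reticulate has picked up
# (assumes tidyvec's Python integration goes through reticulate)
reticulate::py_config()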
library(tidyverse)
library(tidyvec)
# Create a collection of books
books <- tibble(
  title = c(
    "The Art of Data Science",
    "Advanced R Programming",
    "Tidy Data Visualization",
    "Statistical Learning Methods",
    "Machine Learning with R"
  ),
  description = c(
    "A comprehensive guide to data analysis using modern techniques",
    "Deep dive into R programming for advanced users",
    "Creating beautiful visualizations with ggplot2 and the tidyverse",
    "Introduction to statistical learning methods and their applications",
    "Practical machine learning approaches with R examples"
  )
)
# Create a TF-IDF embedder and embed the descriptions
embedder <- embedder_tfidf(books$description)
books_vec <- books %>%
  vec(embedding_fn = embedder) %>%
  embed(content_column = "description")
# Find similar books using the `%~%` operator
"data visualization techniques" %~% books_vec %>%
  select(title, similarity)
# Create a CLIP embedder for images
clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal")
# Get paths to example images included with the package
img_paths <- c(
  cat = system.file("extdata/images", "cat.jpg", package = "tidyvec"),
  dog = system.file("extdata/images", "dog.jpg", package = "tidyvec"),
  beach = system.file("extdata/images", "beach.jpg", package = "tidyvec"),
  mountain = system.file("extdata/images", "mountain.jpg", package = "tidyvec"),
  city = system.file("extdata/images", "city.jpg", package = "tidyvec")
)
# Create an image collection
images <- tibble(
  id = names(img_paths),
  path = unname(img_paths),
  category = c("pet", "pet", "nature", "nature", "urban")
) %>%
  vec(embedding_fn = clip_embedder) %>%
  embed(content_column = "path")
# Find images similar to text
"a cat playing" %~% images %>%
  select(id, path, similarity)
# Find similar images and visualize them
"a dog on a beach" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"))
The `vec()` function transforms a tibble into a vector collection:
# Create a basic collection
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec()
# With a custom embedding function
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = my_custom_embedder)
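Here `my_custom_embedder` is a placeholder. As a minimal sketch, assuming the expected contract is simply a function that maps a character vector to a numeric matrix with one row per item, a hand-rolled embedder could look like this:
# Toy custom embedder: maps each text to a few numeric features
# (illustrative only; any function returning one row per input works
# under the assumed contract)
my_custom_embedder <- function(texts) {
  n_chars <- nchar(texts)
  n_words <- lengths(strsplit(texts, "\\s+"))
  cbind(
    n_chars = n_chars,
    n_words = n_words,
    avg_word_length = n_chars / pmax(n_words, 1)
  )
}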
Generate embeddings using built-in or custom embedding functions:
# TF-IDF embeddings
documents <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")
# HuggingFace neural embeddings
comments <- tibble(text = c("I love this product", "Terrible experience")) %>%
  vec(embedding_fn = embedder_hf("sentence-transformers/all-MiniLM-L6-v2")) %>%
  embed(content_column = "text")
Find similar items with the `nearest()` function or the `%~%` operator:
# Find nearest neighbors
my_collection %>%
  nearest("query text", n = 5)
# Or using the similarity operator
"query text" %~% my_collection
Visualize your embedding space:
# Assumes the collection includes `id` and `category` columns
my_collection %>%
  viz_embeddings(method = "umap", labels = "id", color = "category")
# Filter first, then search (assumes books_vec also has a `year` column)
books_vec %>%
  filter(year >= 2020) %>%
  nearest("visualization techniques", n = 2)
# Search first, then filter results
books_vec %>%
  nearest("R programming", n = 10) %>%
  filter(similarity > 0.5) %>%
  arrange(desc(year))
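Search results are ordinary tibbles, so the usual grammar of grouping and summarising applies too. A small sketch using the `images` collection from earlier (this assumes the result of `%~%` keeps the original columns alongside `similarity`):
# Average similarity per image category
"a sunny landscape" %~% images %>%
  group_by(category) %>%
  summarise(mean_similarity = mean(similarity)) %>%
  arrange(desc(mean_similarity))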
# Split document into chunks
document_chunks <- tibble(
  chunk_id = paste0("chunk", 1:10),
  text = c(
    "R is a programming language for statistical computing.",
    "The tidyverse is a collection of R packages for data science.",
    # ... more chunks
  ),
  source = "R Documentation"
) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")
# Query relevant chunks
query_results <- document_chunks %>%
  nearest("How do I visualize data in R?", n = 3)
# Inspect the retrieved chunks (these become the context for an LLM)
query_results %>%
  select(text, similarity)
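The retrieved chunks can then be stitched into a prompt for whichever LLM client you prefer; the sketch below only assembles the context string and leaves the actual model call to your tool of choice:
# Build a context block from the retrieved chunks (the LLM call itself
# is outside tidyvec and depends on your client of choice)
context <- query_results %>%
  arrange(desc(similarity)) %>%
  pull(text) %>%
  paste(collapse = "\n\n")
prompt <- paste0(
  "Answer the question using only the context below.\n\n",
  "Context:\n", context, "\n\n",
  "Question: How do I visualize data in R?"
)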
Contributions are welcome! Please feel free to submit a Pull Request.
This package is licensed under the MIT License - see the LICENSE file for details.