tidyvec is a lightweight vector database for the tidyverse ecosystem. It enables you to:
- Store and query vector embeddings alongside your data in tibbles
- Generate embeddings for text and images
- Find similar items using vector similarity search
- Visualize embedding spaces
- Seamlessly integrate with dplyr, ggplot2, and other tidyverse packages
While specialized vector databases like FAISS and Pinecone offer high performance for large-scale applications, they often require leaving the familiar tidyverse workflow. tidyvec bridges this gap:
- Keeping it tidy: Store embeddings right in your tibbles
- Familiar syntax: Use standard dplyr verbs before and after vector operations
- Low friction: No need to switch contexts between data wrangling and similarity search
- Multimodal support: Work with text, images, or any data type that can be embedded
You can install tidyvec from GitHub:
# install.packages("remotes")
remotes::install_github("flmnh-ai/tidyvec")
For neural embedding models via HuggingFace, you'll also need Python with the required dependencies:
# Set up Python environment with required packages
tidyvec::setup_python()
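If you want to confirm which Python environment is active before embedding anything, reticulate (which the Python-backed embedders are assumed here to rely on) can report the current configuration:
# Check the Python interpreter and packages reticulate has picked up
# (assumes tidyvec's Python integration goes through reticulate)
reticulate::py_config()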
library(tidyverse)
library(tidyvec)
# Create a collection of books
books <- tibble(
  title = c(
    "The Art of Data Science",
    "Advanced R Programming",
    "Tidy Data Visualization",
    "Statistical Learning Methods",
    "Machine Learning with R"
  ),
  description = c(
    "A comprehensive guide to data analysis using modern techniques",
    "Deep dive into R programming for advanced users",
    "Creating beautiful visualizations with ggplot2 and the tidyverse",
    "Introduction to statistical learning methods and their applications",
    "Practical machine learning approaches with R examples"
  )
)
# Create a TF-IDF embedder and embed the descriptions
embedder <- embedder_tfidf(books$description)
books_vec <- books %>%
  vec(embedding_fn = embedder) %>%
  embed(content_column = "description")
# Find similar books using the `%~%` operator
"data visualization techniques" %~% books_vec %>%
  select(title, similarity)
# Create a CLIP embedder for images
clip_embedder <- embedder_hf("openai/clip-vit-base-patch32", modality = "multimodal")
# Get paths to example images included with the package
img_paths <- c(
  cat = system.file("extdata/images", "cat.jpg", package = "tidyvec"),
  dog = system.file("extdata/images", "dog.jpg", package = "tidyvec"),
  beach = system.file("extdata/images", "beach.jpg", package = "tidyvec"),
  mountain = system.file("extdata/images", "mountain.jpg", package = "tidyvec"),
  city = system.file("extdata/images", "city.jpg", package = "tidyvec")
)
# Create an image collection
images <- tibble(
  id = names(img_paths),
  path = unname(img_paths),
  category = c("pet", "pet", "nature", "nature", "urban")
) %>%
  vec(embedding_fn = clip_embedder) %>%
  embed(content_column = "path")
# Find images similar to text
"a cat playing" %~% images %>%
  select(id, path, similarity)
# Find similar images and visualize them
"a dog on a beach" %~% images %>%
  viz_images(path_column = "path", label_columns = c("id", "category"))
The `vec()` function transforms a tibble into a vector collection:
# Create a basic collection
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec()
# With a custom embedding function
my_collection <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = my_custom_embedder)
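Here `my_custom_embedder` is a placeholder. As a minimal sketch, assuming the expected contract is simply a function that maps a character vector to a numeric matrix with one row per item, a hand-rolled embedder could look like this:
# Toy custom embedder: maps each text to a few numeric features
# (illustrative only; any function returning one row per input works
# under the assumed contract)
my_custom_embedder <- function(texts) {
  n_chars <- nchar(texts)
  n_words <- lengths(strsplit(texts, "\\s+"))
  cbind(
    n_chars = n_chars,
    n_words = n_words,
    avg_word_length = n_chars / pmax(n_words, 1)
  )
}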
Generate embeddings using built-in or custom embedding functions:
# TF-IDF embeddings
documents <- tibble(text = c("sample text", "another example")) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")
# HuggingFace neural embeddings
comments <- tibble(text = c("I love this product", "Terrible experience")) %>%
  vec(embedding_fn = embedder_hf("sentence-transformers/all-MiniLM-L6-v2")) %>%
  embed(content_column = "text")
Find similar items with the `nearest()` function or the `%~%` operator:
# Find nearest neighbors
my_collection %>%
  nearest("query text", n = 5)
# Or using the similarity operator
"query text" %~% my_collection
Visualize your embedding space:
# Assumes the collection includes `id` and `category` columns
my_collection %>%
  viz_embeddings(method = "umap", labels = "id", color = "category")
# Filter first, then search (assumes books_vec also has a `year` column)
books_vec %>%
  filter(year >= 2020) %>%
  nearest("visualization techniques", n = 2)
# Search first, then filter results
books_vec %>%
  nearest("R programming", n = 10) %>%
  filter(similarity > 0.5) %>%
  arrange(desc(year))
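Search results are ordinary tibbles, so the usual grammar of grouping and summarising applies too. A small sketch using the `images` collection from earlier (this assumes the result of `%~%` keeps the original columns alongside `similarity`):
# Average similarity per image category
"a sunny landscape" %~% images %>%
  group_by(category) %>%
  summarise(mean_similarity = mean(similarity)) %>%
  arrange(desc(mean_similarity))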
# Split document into chunks
document_chunks <- tibble(
  chunk_id = paste0("chunk", 1:10),
  text = c(
    "R is a programming language for statistical computing.",
    "The tidyverse is a collection of R packages for data science.",
    # ... more chunks
  ),
  source = "R Documentation"
) %>%
  vec(embedding_fn = embedder_tfidf(.$text)) %>%
  embed(content_column = "text")
# Query relevant chunks
query_results <- document_chunks %>%
  nearest("How do I visualize data in R?", n = 3)
# Inspect the retrieved chunks (these become the context for an LLM)
query_results %>%
  select(text, similarity)
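The retrieved chunks can then be stitched into a prompt for whichever LLM client you prefer; the sketch below only assembles the context string and leaves the actual model call to your tool of choice:
# Build a context block from the retrieved chunks (the LLM call itself
# is outside tidyvec and depends on your client of choice)
context <- query_results %>%
  arrange(desc(similarity)) %>%
  pull(text) %>%
  paste(collapse = "\n\n")
prompt <- paste0(
  "Answer the question using only the context below.\n\n",
  "Context:\n", context, "\n\n",
  "Question: How do I visualize data in R?"
)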
Contributions are welcome! Please feel free to submit a Pull Request.
This package is licensed under the MIT License - see the LICENSE file for details.