RAGXiv integrates arXiv data with semantic metadata, using tools such as Semantic Scholar and Qdrant along with machine learning models for text analysis. The project covers the full pipeline, from downloading and processing arXiv data to launching a web application for interactive exploration.
Requirements:

- Python 3.11
- Poetry
- Docker + docker-compose
- Ollama, LM Studio, or another LLM backend
Create a virtual environment:

```bash
python -m venv .venv
```

Activate the virtual environment:

- Windows:

  ```bash
  .venv\Scripts\activate
  ```

- Linux:

  ```bash
  source .venv/bin/activate
  ```

Install Poetry via pip:

```bash
pip install poetry
```

Install the dependencies specified in `pyproject.toml`:

```bash
poetry install
```

Pull the images and start the Docker containers:

```bash
docker-compose up -d
```
Create a `.env` file in the root directory with the following variables:

```
DATABASE_URL=$DATABASE_URL
QDRANT_API_KEY=$QDRANT_API_KEY
QDRANT_URL=http://localhost:6333
GEMINI_API_KEY=$GEMINI_API_KEY
LANGFUSE_PK=$LANGFUSE_PK
LANGFUSE_SK=$LANGFUSE_SK
LANGFUSE_HOST="https://cloud.langfuse.com"
```
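As a quick sanity check, here is a minimal sketch that loads these variables and verifies the Qdrant connection. It assumes the `python-dotenv` and `qdrant-client` packages; the script itself is illustrative and not part of the repository.

```python
# check_env.py -- illustrative sanity check, not part of the repository
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

# Load the variables defined in .env into the process environment.
load_dotenv()

# Connect to the Qdrant instance started by docker-compose.
client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)

# Listing collections fails fast if the URL or API key is wrong.
print(client.get_collections())
```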
Then run the pipeline:

- Download the JSON snapshot from Kaggle containing all papers.
- Run `src/data_processing/kaggle_data_processing.ipynb` to clean the data and create a sample dataset.
- Execute `src/data_processing/request_paper_metadata.py` to fetch metadata (authors, references, etc.) for the papers from Semantic Scholar (see the metadata sketch after this list).
- Run `src/data_processing/download_papers.py` to download the complete papers from arXiv (see the download sketch after this list).
- Execute `src/data_processing/eda.py` to visualize the data and perform exploratory data analysis.
- Run `src/scripts/index.py` to index the papers in Qdrant and store the metadata in the database (see the indexing sketch after this list).
- Start the web application with `streamlit run src/app.py` (a toy version of the search page appears below).
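For orientation, a minimal sketch of the kind of Semantic Scholar lookup that `src/data_processing/request_paper_metadata.py` performs. The Graph API endpoint is Semantic Scholar's public one; the `fetch_metadata` helper and the chosen fields are illustrative assumptions, not the script's actual code.

```python
# Illustrative Semantic Scholar lookup; fetch_metadata and the field
# selection are assumptions, not the repository's actual code.
import requests

S2_API = "https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"

def fetch_metadata(arxiv_id: str) -> dict:
    """Fetch title, authors, and reference titles for one arXiv paper."""
    response = requests.get(
        S2_API.format(arxiv_id=arxiv_id),
        params={"fields": "title,authors,references.title"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

print(fetch_metadata("1706.03762")["title"])
```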
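Likewise, the download step presumably boils down to fetching PDFs from arXiv; this standalone sketch shows one polite way to do it (the `download_pdf` helper is hypothetical, and the real `download_papers.py` may work differently).

```python
# Illustrative PDF download; download_pdf is a hypothetical helper.
import time
from pathlib import Path

import requests

def download_pdf(arxiv_id: str, out_dir: Path) -> Path:
    """Download a single paper's PDF from arXiv's export mirror."""
    out_dir.mkdir(parents=True, exist_ok=True)
    url = f"https://export.arxiv.org/pdf/{arxiv_id}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    path = out_dir / f"{arxiv_id.replace('/', '_')}.pdf"
    path.write_bytes(response.content)
    time.sleep(3)  # stay well within arXiv's rate expectations
    return path

download_pdf("1706.03762", Path("data/pdfs"))
```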
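The indexing step follows the usual embed-and-upsert pattern with `qdrant-client`. In this sketch the collection name, embedding model, and payload fields are assumptions; `src/scripts/index.py` is the authoritative version.

```python
# Illustrative embed-and-upsert; collection name, model, and payload
# fields are assumptions, not the repository's choices.
import os

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

client.recreate_collection(
    collection_name="papers",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

papers = [{"id": 1, "title": "Attention Is All You Need", "abstract": "..."}]
client.upsert(
    collection_name="papers",
    points=[
        PointStruct(
            id=p["id"],
            vector=model.encode(p["abstract"]).tolist(),
            payload={"title": p["title"]},
        )
        for p in papers
    ],
)
```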
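Finally, a toy Streamlit page showing the search pattern the app builds on; `src/app.py` is the real application, and the collection name and model here are the same assumptions as in the indexing sketch.

```python
# Toy semantic-search page; src/app.py is the real application.
import os

import streamlit as st
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)
model = SentenceTransformer("all-MiniLM-L6-v2")

st.title("RAGXiv search (sketch)")
query = st.text_input("Search papers")
if query:
    hits = client.search(
        collection_name="papers",
        query_vector=model.encode(query).tolist(),
        limit=5,
    )
    for hit in hits:
        st.write(f"{hit.payload.get('title')} (score={hit.score:.3f})")
```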
These steps take you from data acquisition to the running web application: data processing, metadata retrieval, paper downloading, exploratory analysis, indexing, and launching the Streamlit app.