RAGXiv integrates arXiv data with semantic metadata, using tools such as Semantic Scholar and Qdrant along with machine learning models for text analysis. The project covers the full pipeline, from downloading and processing arXiv data to launching a web application for interactive exploration.
Requirements:

- Python 3.11
- Poetry
- Docker + docker-compose
- Ollama, LM Studio, or another LLM backend
Create a virtual environment:

```bash
python -m venv .venv
```

Activate the virtual environment:

- Windows:

  ```bash
  .venv\Scripts\activate
  ```

- Linux:

  ```bash
  source .venv/bin/activate
  ```

Install Poetry via pip:

```bash
pip install poetry
```

Install the dependencies specified in `pyproject.toml`:

```bash
poetry install
```

Pull the images and start the Docker containers:

```bash
docker-compose up -d
```
Create a `.env` file in the root directory with the following variables:

```
DATABASE_URL=$DATABASE_URL
QDRANT_API_KEY=$QDRANT_API_KEY
QDRANT_URL=http://localhost:6333
GEMINI_API_KEY=$GEMINI_API_KEY
LANGFUSE_PK=$LANGFUSE_PK
LANGFUSE_SK=$LANGFUSE_SK
LANGFUSE_HOST="https://cloud.langfuse.com"
```
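As a quick sanity check, here is a minimal sketch that loads these variables and verifies the Qdrant connection. It assumes the `python-dotenv` and `qdrant-client` packages; the script itself is illustrative and not part of the repository.

```python
# check_env.py -- illustrative sanity check, not part of the repository
import os

from dotenv import load_dotenv
from qdrant_client import QdrantClient

# Load the variables defined in .env into the process environment.
load_dotenv()

# Connect to the Qdrant instance started by docker-compose.
client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)

# Listing collections fails fast if the URL or API key is wrong.
print(client.get_collections())
```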
Then run the pipeline:

- Download the JSON snapshot from Kaggle containing all papers.
- Run `src/data_processing/kaggle_data_processing.ipynb` to clean the data and create a sample dataset.
- Execute `src/data_processing/request_paper_metadata.py` to fetch metadata (authors, references, etc.) for the papers from Semantic Scholar (see the metadata sketch after this list).
- Run `src/data_processing/download_papers.py` to download the complete papers from arXiv (see the download sketch after this list).
- Execute `src/data_processing/eda.py` to visualize the data and perform exploratory data analysis.
- Run `src/scripts/index.py` to index the papers in Qdrant and store the metadata in the database (see the indexing sketch after this list).
- Start the web application with `streamlit run src/app.py` (a toy version of the search page appears below).
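For orientation, a minimal sketch of the kind of Semantic Scholar lookup that `src/data_processing/request_paper_metadata.py` performs. The Graph API endpoint is Semantic Scholar's public one; the `fetch_metadata` helper and the chosen fields are illustrative assumptions, not the script's actual code.

```python
# Illustrative Semantic Scholar lookup; fetch_metadata and the field
# selection are assumptions, not the repository's actual code.
import requests

S2_API = "https://api.semanticscholar.org/graph/v1/paper/arXiv:{arxiv_id}"

def fetch_metadata(arxiv_id: str) -> dict:
    """Fetch title, authors, and reference titles for one arXiv paper."""
    response = requests.get(
        S2_API.format(arxiv_id=arxiv_id),
        params={"fields": "title,authors,references.title"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

print(fetch_metadata("1706.03762")["title"])
```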
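Likewise, the download step presumably boils down to fetching PDFs from arXiv; this standalone sketch shows one polite way to do it (the `download_pdf` helper is hypothetical, and the real `download_papers.py` may work differently).

```python
# Illustrative PDF download; download_pdf is a hypothetical helper.
import time
from pathlib import Path

import requests

def download_pdf(arxiv_id: str, out_dir: Path) -> Path:
    """Download a single paper's PDF from arXiv's export mirror."""
    out_dir.mkdir(parents=True, exist_ok=True)
    url = f"https://export.arxiv.org/pdf/{arxiv_id}"
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    path = out_dir / f"{arxiv_id.replace('/', '_')}.pdf"
    path.write_bytes(response.content)
    time.sleep(3)  # stay well within arXiv's rate expectations
    return path

download_pdf("1706.03762", Path("data/pdfs"))
```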
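The indexing step follows the usual embed-and-upsert pattern with `qdrant-client`. In this sketch the collection name, embedding model, and payload fields are assumptions; `src/scripts/index.py` is the authoritative version.

```python
# Illustrative embed-and-upsert; collection name, model, and payload
# fields are assumptions, not the repository's choices.
import os

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim embeddings

client.recreate_collection(
    collection_name="papers",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

papers = [{"id": 1, "title": "Attention Is All You Need", "abstract": "..."}]
client.upsert(
    collection_name="papers",
    points=[
        PointStruct(
            id=p["id"],
            vector=model.encode(p["abstract"]).tolist(),
            payload={"title": p["title"]},
        )
        for p in papers
    ],
)
```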
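Finally, a toy Streamlit page showing the search pattern the app builds on; `src/app.py` is the real application, and the collection name and model here are the same assumptions as in the indexing sketch.

```python
# Toy semantic-search page; src/app.py is the real application.
import os

import streamlit as st
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(
    url=os.environ["QDRANT_URL"],
    api_key=os.environ.get("QDRANT_API_KEY"),
)
model = SentenceTransformer("all-MiniLM-L6-v2")

st.title("RAGXiv search (sketch)")
query = st.text_input("Search papers")
if query:
    hits = client.search(
        collection_name="papers",
        query_vector=model.encode(query).tolist(),
        limit=5,
    )
    for hit in hits:
        st.write(f"{hit.payload.get('title')} (score={hit.score:.3f})")
```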
These steps take you from data acquisition to the running web application: data processing, metadata retrieval, paper downloading, exploratory analysis, indexing, and launching the Streamlit app.