CUDA Documentation QA System

This project implements a Question Answering (QA) system for CUDA documentation. It crawls the NVIDIA CUDA documentation, processes the data, stores it in a vector database, and uses advanced retrieval techniques to answer user queries.

Features

Web crawling of NVIDIA CUDA documentation
Advanced data chunking based on semantic similarity
Vector embedding creation and storage in Milvus database
Query expansion for improved retrieval
Hybrid retrieval combining BM25 and BERT-based methods
Question answering using a Language Model

Setup Instructions

Prerequisites

Python 3.7+
pip (Python package installer)

Installation

Clone the repository:
Create a virtual environment (optional but recommended):
Install the required dependencies:

Dependencies

The main dependencies for this project are:

scrapy: For web crawling
sentence-transformers: For text embeddings
nltk: For natural language processing tasks
rank_bm25: For BM25 retrieval
torch and transformers: For working with transformer models
streamlit: For creating web applications
selenium and webdriver_manager: For web scraping
pymilvus: For interacting with the Milvus vector database

For a complete list of dependencies, refer to the requirements.txt file.

Running the System

Ensure that you have a Milvus server running. Refer to the Milvus documentation for installation and setup instructions.
Run the main script: 3. The system will start by crawling the CUDA documentation, processing the data, and storing it in the Milvus database. This initial setup may take some time.
Once the setup is complete, you can start asking questions about CUDA. The system will provide answers based on the retrieved information.
To exit the system, type 'quit' when prompted for a question.

Project Structure

main.py: The main script that orchestrates the entire process.
crawler/web_crawler.py: Contains the web crawling logic.
data_processing/chunking.py: Implements advanced data chunking techniques.
data_processing/embedding.py: Handles the creation of vector embeddings.
vector_db/milvus_db.py: Manages interactions with the Milvus database.
retrieval/query_expansion.py: Implements query expansion techniques.
retrieval/hybrid_retrieval.py: Contains the hybrid retrieval logic.
qa/llm_qa.py: Manages the question answering process using a language model.

Customization

You can adjust the embedding model by modifying the SentenceTransformer model in main.py.
The depth of web crawling can be adjusted in the crawl_data function (currently set to 5 levels).
The number of retrieved chunks for answering can be modified by changing the top_k parameter in the retrieve method call.

Troubleshooting

If you encounter any issues:

Ensure all dependencies are correctly installed.
Check that the Milvus server is running and accessible.
Verify that you have a stable internet connection for web crawling and model downloads.

For any persistent problems, please open an issue in the GitHub repository.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
crawler		crawler
data_processing		data_processing
utils		utils
vector_db		vector_db
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Documentation QA System

Features

Setup Instructions

Prerequisites

Installation

Dependencies

Running the System

Project Structure

Customization

Troubleshooting

About

Releases

Packages

Languages

vaibhavdangar09/CUDAQuest-Semantic-Crawl-to-Answer-Engine

Folders and files

Latest commit

History

Repository files navigation

CUDA Documentation QA System

Features

Setup Instructions

Prerequisites

Installation

Dependencies

Running the System

Project Structure

Customization

Troubleshooting

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages