This project implements a Question Answering (QA) system for CUDA documentation. It crawls the NVIDIA CUDA documentation, splits the text into chunks, stores their embeddings in a Milvus vector database, and answers user queries using query expansion, hybrid retrieval, and a language model.
- Web crawling of NVIDIA CUDA documentation
- Advanced data chunking based on semantic similarity
- Vector embedding creation and storage in Milvus database
- Query expansion for improved retrieval
- Hybrid retrieval combining BM25 and BERT-based methods
- Question answering using a Language Model
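
The idea behind chunking by semantic similarity can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `chunk_by_similarity`, the threshold value, and the bag-of-words vectors are assumptions. A real pipeline would use sentence embeddings (for example from sentence-transformers) in place of `bow_vector`. A new chunk starts whenever the similarity between adjacent sentences drops below the threshold.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9_]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_similarity(sentences, threshold=0.2):
    """Group consecutive sentences; break when adjacent similarity < threshold."""
    chunks = []
    current = []
    prev_vec = None
    for sentence in sentences:
        vec = bow_vector(sentence)
        # Close the current chunk when the topic appears to shift.
        if current and cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
        prev_vec = vec
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = [
    "A CUDA kernel runs on the GPU.",
    "Each CUDA kernel launch runs many threads on the GPU.",
    "Milvus stores vector embeddings.",
]
chunks = chunk_by_similarity(sentences)
for chunk in chunks:
    print(chunk)
```

The first two sentences share most of their words and stay together, while the third shares none and starts a new chunk.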
- Python 3.7+
- pip (Python package installer)
- Clone the repository:
- Create a virtual environment (optional but recommended):
- Install the required dependencies:
The main dependencies for this project are:
- scrapy: For web crawling
- sentence-transformers: For text embeddings
- nltk: For natural language processing tasks
- rank_bm25: For BM25 retrieval
- torch and transformers: For working with transformer models
- streamlit: For creating web applications
- selenium and webdriver_manager: For web scraping
- pymilvus: For interacting with the Milvus vector database
For a complete list of dependencies, refer to the `requirements.txt` file.
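
To confirm that the main dependencies are importable after installation, a small check like the one below may help. It is a sketch, not part of the project: `missing_modules` is a hypothetical helper, and the list uses import names, which can differ from package names (sentence-transformers is imported as `sentence_transformers`).

```python
import importlib.util

# Import names for the main dependencies of this project.
REQUIRED_MODULES = [
    "scrapy", "sentence_transformers", "nltk", "rank_bm25", "torch",
    "transformers", "streamlit", "selenium", "webdriver_manager", "pymilvus",
]


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All main dependencies are importable.")
```

`find_spec` only locates a top-level module without importing it, so the check stays fast even for large packages such as torch.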
- Ensure that you have a Milvus server running. Refer to the Milvus documentation for installation and setup instructions.
- Run the main script (`main.py`).
- The system will start by crawling the CUDA documentation, processing the data, and storing it in the Milvus database. This initial setup may take some time.
- Once the setup is complete, you can start asking questions about CUDA. The system will provide answers based on the retrieved information.
- To exit the system, type 'quit' when prompted for a question.
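
The question prompt described above can be pictured as a simple loop. The sketch below is an assumption about its shape, not the code in `main.py`: `run_qa_loop` and `answer_fn` are hypothetical names, and the prompt text is invented. The `read` and `write` parameters exist only so the loop can be driven by scripted input.

```python
def run_qa_loop(answer_fn, read=input, write=print):
    """Prompt for questions until the user types 'quit'.

    answer_fn is a hypothetical callable mapping a question string to an
    answer string; a real system would run retrieval and the LLM here.
    Returns the number of questions answered.
    """
    asked = 0
    while True:
        try:
            question = read("Ask a question about CUDA (or 'quit'): ").strip()
        except EOFError:
            break  # input stream closed
        if question.lower() == "quit":
            break
        if not question:
            continue  # ignore empty input
        write(answer_fn(question))
        asked += 1
    return asked


# Scripted demonstration: one question, one empty line, then 'quit'.
scripted = iter(["What is a warp?", "", "quit"])
outputs = []
count = run_qa_loop(
    lambda q: f"(answer to: {q})",
    read=lambda prompt: next(scripted),
    write=outputs.append,
)
```

Calling `run_qa_loop(answer_fn)` with the defaults reads from the terminal instead.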
- `main.py`: The main script that orchestrates the entire process.
- `crawler/web_crawler.py`: Contains the web crawling logic.
- `data_processing/chunking.py`: Implements advanced data chunking techniques.
- `data_processing/embedding.py`: Handles the creation of vector embeddings.
- `vector_db/milvus_db.py`: Manages interactions with the Milvus database.
- `retrieval/query_expansion.py`: Implements query expansion techniques.
- `retrieval/hybrid_retrieval.py`: Contains the hybrid retrieval logic.
- `qa/llm_qa.py`: Manages the question answering process using a language model.
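
As a rough picture of what a query expansion step does, the sketch below adds related terms to a query before retrieval. It does not reproduce `retrieval/query_expansion.py`: the `expand_query` function and the hand-written `RELATED_TERMS` table are assumptions for illustration, and a real implementation might draw related terms from a lexical resource or a model instead.

```python
import re

# Hypothetical, hand-written expansion table for illustration only.
RELATED_TERMS = {
    "gpu": ["graphics processing unit", "device"],
    "kernel": ["__global__ function"],
    "memory": ["shared memory", "global memory"],
}


def expand_query(query, related=RELATED_TERMS):
    """Return the original query followed by related terms for its words."""
    tokens = re.findall(r"[a-z0-9_]+", query.lower())
    expansions = [query]
    for token in tokens:
        for term in related.get(token, []):
            if term not in expansions:  # avoid duplicates
                expansions.append(term)
    return expansions


expanded = expand_query("How do I launch a kernel on the GPU?")
print(expanded)
```

Each returned string can be sent to the retriever, and the results merged, so that documents using different wording for the same concept are still found.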
- You can adjust the embedding model by modifying the `SentenceTransformer` model in `main.py`.
- The depth of web crawling can be adjusted in the `crawl_data` function (currently set to 5 levels).
- The number of retrieved chunks for answering can be modified by changing the `top_k` parameter in the `retrieve` method call.
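
To show how `top_k` interacts with hybrid scoring, here is a self-contained sketch that fuses a BM25 score with a dense-style similarity score and keeps the best `top_k` chunks. It is not the project's `retrieve` method: the signature, the `alpha` weight, and the min-max fusion are assumptions, BM25 is written out by hand rather than taken from rank_bm25, and character-trigram cosine similarity is only a stand-in for BERT-based embeddings.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query."""
    doc_tokens = [tokenize(d) for d in docs]
    n = len(docs)
    avg_len = (sum(len(t) for t in doc_tokens) / n) or 1.0
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def trigram_vector(text):
    """Stand-in for a dense embedding: character trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def retrieve(query, docs, top_k=3, alpha=0.5):
    """Return the top_k documents by a weighted sum of both scores."""
    sparse = min_max(bm25_scores(query, docs))
    q_vec = trigram_vector(query)
    dense = min_max([cosine(q_vec, trigram_vector(d)) for d in docs])
    combined = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense)]
    ranked = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [docs[i] for i in ranked[:top_k]]


docs = [
    "cudaMalloc allocates memory on the device.",
    "Streamlit builds web applications in Python.",
    "cudaMemcpy copies memory between host and device.",
]
results = retrieve("How do I allocate device memory?", docs, top_k=2)
for doc in results:
    print(doc)
```

A larger `top_k` gives the language model more context at the cost of a longer prompt, while `alpha` shifts the balance between keyword matching and the dense signal.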
If you encounter any issues:
- Ensure all dependencies are correctly installed.
- Check that the Milvus server is running and accessible.
- Verify that you have a stable internet connection for web crawling and model downloads.
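
A quick way to check whether the Milvus server is reachable is to test its TCP port. The sketch below assumes a local server on Milvus's default port 19530; `is_port_open` is a hypothetical helper, and a successful connection only shows that something is listening there, not that Milvus is healthy.

```python
import socket


def is_port_open(host="localhost", port=19530, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on localhost:19530.")
    else:
        print("Could not reach localhost:19530. Is the Milvus server running?")
```

Adjust `host` and `port` if the server runs elsewhere or uses a non-default configuration.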
For any persistent problems, please open an issue in the GitHub repository.