ResearchIQ is a Retrieval-Augmented Generation (RAG) application that helps users extract insights from documents through Question Answering (QnA) and Summarization. It combines Adobe PDF Services for document extraction, ChromaDB for retrieval, and a Groq-hosted Llama-70b model for answer and summary generation.
Users upload a document, and the application extracts its headings and content using Adobe PDF Services. After preprocessing, the result is stored as key-value pairs (Heading: Content).
- Handle emojis, slang, punctuation, and short forms
- Spelling Corrections
- Part-of-Speech (POS) Tagging
- Handling Pronouns and Special Characters
- Tokenization
- Convert text to lowercase and generate n-grams
- Remove Special Characters
- Remove Extra Whitespaces
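A few of the steps above (lowercasing, removing special characters, collapsing whitespace, tokenization, and n-gram generation) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the function name `preprocess` and its return shape are hypothetical, and the sketch keeps only ASCII letters and digits, which may be stricter than what the application does.

```python
import re


def preprocess(text: str, n: int = 2) -> dict:
    """Lowercase, strip special characters, collapse whitespace,
    tokenize on spaces, and build word n-grams (hypothetical helper)."""
    lowered = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace.
    no_special = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse runs of whitespace and trim both ends.
    cleaned = re.sub(r"\s+", " ", no_special).strip()
    tokens = cleaned.split()
    # Sliding window of n consecutive tokens; empty if the text is too short.
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return {"cleaned": cleaned, "tokens": tokens, "ngrams": ngrams}


if __name__ == "__main__":
    result = preprocess("Hello,   World!! RAG is #great.")
    print(result["cleaned"])
    print(result["ngrams"])
```

Steps such as spelling correction, POS tagging, and emoji handling typically need third-party libraries and are left out of this sketch.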
Users can ask questions about the uploaded document. The process includes:
- Converting the question into embeddings using Sentence Transformer.
- Fetching relevant content from the document using ChromaDB and cosine similarity.
- Using the Groq (Llama-70b) model to generate precise answers based on the top 5 matching data points.
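The retrieval step above can be illustrated with a small, dependency-free sketch of cosine-similarity ranking. In the application this work is done by ChromaDB over Sentence Transformer embeddings; the plain-Python version below is only a stand-in to show the idea, and the names `cosine_similarity` and `top_k`, as well as the toy 2-dimensional vectors, are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=5):
    """Return the headings of the k chunks most similar to the query.

    chunks is a list of (heading, embedding) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), heading)
              for heading, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [heading for _, heading in scored[:k]]


if __name__ == "__main__":
    # Toy embeddings; real ones would come from the embedding model.
    chunks = [("Intro", [1.0, 0.0]),
              ("Methods", [0.0, 1.0]),
              ("Results", [1.0, 1.0])]
    print(top_k([1.0, 0.1], chunks, k=2))
```

The headings returned by `top_k` would then select the content passed to the language model as context for the answer.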
- Title-wise Summarization: Generate summaries for specific headings extracted from the document.
- Whole Document Summarization: Summarize the entire document. Large documents are split into segments of up to 6000 tokens (roughly 24,000 characters, or about 4,500 words, per call) to stay within the Groq model's max token limit.
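The segment splitting used for whole-document summarization might look like the sketch below. It does not use the model's real tokenizer: token counts are estimated with a rough heuristic of about 4 characters per token, which is an assumption, and the names `estimate_tokens` and `split_into_segments` are hypothetical.

```python
import math


def estimate_tokens(word: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, math.ceil(len(word) / 4))


def split_into_segments(text: str, max_tokens: int = 6000) -> list[str]:
    """Greedily pack whitespace-separated words into segments whose
    estimated token count stays within max_tokens."""
    segments, current, used = [], [], 0
    for word in text.split():
        cost = estimate_tokens(word)
        # Start a new segment when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            segments.append(" ".join(current))
            current, used = [], 0
        # A single word larger than the budget still gets its own segment.
        current.append(word)
        used += cost
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    for segment in split_into_segments("aaaa bbbb cccc dddd eeee", max_tokens=2):
        print(segment)
```

Each segment would be summarized in its own model call, and the partial summaries combined afterwards.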
- Backend: Django or FastAPI (for speed)
- Frontend: Streamlit
- Vector Database: ChromaDB
- Language Model: Groq (Llama-70b)
- Embedding Creation: Sentence Transformer
- Document Processing: Adobe PDF Services
- Containerization: Docker
- Hashing: hashlib (to avoid redundant API calls for duplicate documents)
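The hashing idea can be sketched as follows: hash the uploaded file's bytes and reuse the stored result when the same hash is seen again. This is a minimal illustration under assumptions. The choice of SHA-256, the in-memory dictionary, and the names `document_key` and `extract_once` are all hypothetical; the project may use a different digest or storage.

```python
import hashlib

# In-memory cache keyed by content hash (assumption: a real deployment
# would more likely persist this in a database or on disk).
_cache: dict[str, dict] = {}


def document_key(data: bytes) -> str:
    """Hex digest that identifies a document by its content."""
    return hashlib.sha256(data).hexdigest()


def extract_once(data: bytes, extractor) -> dict:
    """Call extractor(data) only the first time this content is seen."""
    key = document_key(data)
    if key not in _cache:
        _cache[key] = extractor(data)
    return _cache[key]


if __name__ == "__main__":
    def fake_extractor(data: bytes) -> dict:
        # Stand-in for the extraction API call.
        print("calling extraction API")
        return {"Title": "placeholder content"}

    extract_once(b"same file", fake_extractor)
    extract_once(b"same file", fake_extractor)  # Served from the cache.
```

Because the key depends only on the bytes, re-uploading an identical file under a different name still hits the cache.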
| Endpoint | Description |
|---|---|
| EXTRACTOR_API_URL | Uploads and processes documents to extract headings and content. |
| QNA_API_URL | Handles user questions and returns answers using RAG. |
| SUMMARIZER_API_URL | Summarizes the entire document. |
| SUMMARIZER_API_HEADING_URL | Summarizes content under specific headings. |
| SUMMARIZER_API_TITLE_URL | Summarizes specific titles from the document. |
- Populate your env file as shown in the provided sample.
- Groq API Key: Get your API key here.
- Adobe PDF Services Credentials: Generate credentials here.
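A small startup check can catch a missing setting before the app runs. The sketch below assumes the five endpoint names listed in this README are read as environment variables, which their naming suggests but the README does not state outright; the function name `load_endpoints` and the example URLs are hypothetical.

```python
import os

# Endpoint names taken from this README (assumed to be environment variables).
REQUIRED_URLS = [
    "EXTRACTOR_API_URL",
    "QNA_API_URL",
    "SUMMARIZER_API_URL",
    "SUMMARIZER_API_HEADING_URL",
    "SUMMARIZER_API_TITLE_URL",
]


def load_endpoints(env=None) -> dict:
    """Return the endpoint URLs, or raise if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_URLS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_URLS}


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {name: "http://localhost:8000/placeholder" for name in REQUIRED_URLS}
    print(load_endpoints(sample)["QNA_API_URL"])
```

Passing a plain dictionary instead of `os.environ` keeps the check easy to test.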
- Clone the repository:

      git clone https://github.com/abhi526691/ResearchIQ
      cd ResearchIQ

- Install the required dependencies:

      pip install -r requirements.txt

- (Optional) Set up a virtual environment (do this before installing the dependencies):

      python -m venv env
      source env/bin/activate  # On Windows: .\env\Scripts\activate
- Clone the repository.
- Build and run the Docker container:

      docker-compose up --build
- Backend (Django):

      cd backend
      python manage.py runserver

- Frontend (Streamlit):

      cd frontend
      streamlit run app.py
While Adobe PDF Services is the primary extraction tool, the following alternatives are also available:
- AWS Textract
- Azure Form Recognizer
- PyMuPDF
- PyPDF
- Upload a Document:
  - The document is processed, and key-value pairs (Heading: Content) are extracted.
  - Preprocessing ensures clean and structured data.
- Ask Questions:
  - Use the QnA feature to get precise answers to your queries.
- Generate Summaries:
  - Choose between Heading-wise or Whole Document summarization.
Contributions are welcome! Please fork the repository and submit a pull request.
This project is licensed under the MIT License.