This notebook demonstrates an implementation of multimodal RAG (mRAG) using an academic article from the healthcare domain. mRAG systems extract information not only from text but also from other modalities such as images, video, and audio. This matters because images can carry a significant amount of information that is never mentioned in the text.
The main difference between RAG and mRAG is:
- Text-based RAG: Takes only text as input to the retriever. For a RAG implementation, please see my repo on Retrieval-Augmented Generation.
- Multimodal RAG: Takes other modalities, such as images, video, audio, code, and other media formats, as input in addition to text.
The benefits of mRAG include:
- Richer and more comprehensive knowledge: Processes both text and visual information.
- Improved reasoning capability: Visual cues give the model more context when reasoning across different data modalities.
A multimodal embedding model generates a vector for the input data, which can be text, image, or video data. Beyond retrieval, these vectors support use cases such as video content moderation or image classification.
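To make the role of these vectors concrete, here is a minimal sketch of retrieval over a mixed index of text and image items. It is not the code used later in this notebook: the `Item` class, the `retrieve` helper, the source labels, and the three-dimensional toy vectors are all assumptions made for illustration, and real embeddings would come from an embedding model rather than being typed in by hand. Restricting `modalities` to text mimics text-based RAG, while leaving it unset mimics mRAG.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Iterable, List, Optional


@dataclass
class Item:
    """One indexed chunk. `vector` is a hand-written stand-in for a real embedding."""
    source: str          # hypothetical label for where the chunk came from
    modality: str        # "text" or "image"
    vector: List[float]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector: List[float], items: List[Item], top_k: int = 2,
             modalities: Optional[Iterable[str]] = None) -> List[Item]:
    """Return the top_k items most similar to the query.

    If `modalities` is given, only items of those modalities are considered.
    """
    allowed = None if modalities is None else set(modalities)
    candidates = [i for i in items if allowed is None or i.modality in allowed]
    ranked = sorted(candidates,
                    key=lambda i: cosine_similarity(query_vector, i.vector),
                    reverse=True)
    return ranked[:top_k]


# Toy index: two text chunks and one chart, all in the same (assumed) vector space.
INDEX = [
    Item("page-2-paragraph", "text", [0.9, 0.1, 0.0]),
    Item("page-3-paragraph", "text", [0.1, 0.9, 0.1]),
    Item("figure-1-chart", "image", [0.8, 0.0, 0.6]),
]
QUERY = [0.7, 0.0, 0.7]  # stand-in for an embedded user question

if __name__ == "__main__":
    print("mRAG:", [i.source for i in retrieve(QUERY, INDEX)])
    print("text-only RAG:", [i.source for i in retrieve(QUERY, INDEX, modalities={"text"})])
```

With these toy vectors, the chart is the closest match to the query, so the text-only retriever never surfaces the most relevant item. That is the gap mRAG is meant to close.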
In this notebook I am using Google's Gemini 1.5 Pro LLM.
Gemini models are multimodal, which means they can process information from multiple modalities, including text, images, audio, and video. Reasoning is one of the tasks Gemini performs well at. Gemini is among the top 5 models in the LMSYS ranking.
I used a real-world study on breast cancer that was published in 2020. Here is the link. Because the model works better with high-resolution images, I have also downloaded the chart used in the article to aid performance.
This notebook uses Google Cloud Platform.
! pip3 install --upgrade --user google-cloud-aiplatform pymupdf rich
- Gemini LLM: Gemini 1.5 Pro is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio, and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.
Multimodal approaches also have limitations:
- Data dependency: Needs high-quality paired text and visuals.
- Computationally demanding: Processing multimodal data is resource-intensive.
- Domain-specific: Models trained on general data may not excel in specialized fields like medicine.
- Lack of explainability: Understanding how these models reach their outputs can be difficult, which hinders trust and adoption.