Multi-Modal Reasoning

This notebook demonstrates a multimodal RAG implementation using an academic article from the healthcare domain. Multimodal RAG (mRAG) systems extract information not only from text but also from other modalities such as images, video, and audio. They are powerful because a significant amount of information can be carried by figures and images that is never spelled out in the text.

The main difference between RAG and mRAG is:

  • Text-based RAG: takes only text as input to the retriever. For a text-only RAG implementation, please see my repo on Retrieval-Augmented Generation.
  • Multimodal RAG: additionally accepts other modalities such as images, video, audio, and code as retriever inputs (see the sketch after this list).
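
In other words, text chunks and figures are embedded into one shared vector space so that a single query can retrieve evidence of either kind. The sketch below is illustrative only and is not the notebook's code; embed_text and embed_image stand in for a multimodal embedding model (such as the one described under Multimodal Embeddings below).

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query, corpus, embed_text, embed_image, k=3):
    """corpus: list of {"kind": "text" | "image", "content": ...} items.
    embed_text / embed_image are placeholder callables returning vectors."""
    q = embed_text(query)
    scored = []
    for item in corpus:
        vec = embed_text(item["content"]) if item["kind"] == "text" else embed_image(item["content"])
        scored.append((cosine(q, vec), item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]  # top-k chunks/figures passed to the LLM
```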

Multimodal RAG Architecture

[Architecture diagram]

Advantages of Multimodal RAG:

  • Richer and more comprehensive knowledge: Processes both text and visual information.
  • Improved reasoning capability: visual cues give the model additional context to draw inferences across different data modalities.

Multimodal Embeddings:

A multimodal embedding is a vector representation of the input data, where the input can be text, images, or video. Typical use cases include video content moderation and image classification; here, embeddings allow text passages and figures from the article to be searched in a single vector space.
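
As an illustration, a minimal sketch of how such an embedding can be obtained with Vertex AI's multimodal embedding model is shown below; the image path, contextual text, and dimension are placeholders rather than values from the notebook.

```python
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Load the pretrained multimodal embedding model from Vertex AI.
model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

# Embed an article figure together with a short text description.
image = Image.load_from_file("figure_1.png")                # placeholder path
embeddings = model.get_embeddings(
    image=image,
    contextual_text="Chart from the breast cancer study",   # placeholder text
    dimension=1408,                                          # largest supported size
)

image_vector = embeddings.image_embedding  # list[float], used for retrieval
text_vector = embeddings.text_embedding
```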

In this notebook I am using Google's Gemini 1.5 Pro LLM.

Why Gemini?

These LLMs are multimodal, which means the models can process information from multiple modalities, including text, images, audio, and video. Reasoning is one of the tasks Gemini performs well at, and Gemini is among the top five models on the LMSYS leaderboard.
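
For reference, a multimodal call to Gemini 1.5 Pro through the Vertex AI SDK looks roughly like the sketch below; the model name, image file, and prompt are placeholders, not the notebook's exact code.

```python
from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-pro")

# Combine an image part and a text question in a single request.
with open("figure_1.png", "rb") as f:                  # placeholder image
    image_part = Part.from_data(f.read(), mime_type="image/png")

response = model.generate_content(
    [image_part, "What does this chart show about tumor size by age group?"]
)
print(response.text)
```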

Data

I used a real-world study on breast cancer published in 2020. Here is the link. The model works better with high-resolution images, so I also downloaded the chart used in the article to aid performance.

Installation

This notebook uses Google Cloud Platform.

! pip3 install --upgrade --user google-cloud-aiplatform pymupdf rich
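
Before calling any models, the SDK has to be initialised against a Google Cloud project. The snippet below is a minimal sketch; the project ID and region are placeholders for your own values.

```python
import vertexai

PROJECT_ID = "your-project-id"   # placeholder: your GCP project ID
LOCATION = "us-central1"         # placeholder: your Vertex AI region

vertexai.init(project=PROJECT_ID, location=LOCATION)
```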

Libraries

  • Gemini LLM: Gemini 1.5 Pro is a foundation model that performs well at a variety of multimodal tasks such as visual understanding, classification, summarization, and creating content from image, audio, and video. It's adept at processing visual and text inputs such as photographs, documents, infographics, and screenshots.
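
PyMuPDF (installed above as pymupdf) can be used to pull the article's text and embedded figures out of the PDF before embedding them; the rough sketch below uses a placeholder file name and is not the notebook's exact code.

```python
import fitz  # PyMuPDF

doc = fitz.open("breast_cancer_study.pdf")    # placeholder file name
text_chunks, figures = [], []

for page in doc:
    text_chunks.append(page.get_text())       # plain text of the page
    for img in page.get_images(full=True):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha >= 4:             # CMYK etc.: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        figures.append(pix.tobytes("png"))     # PNG bytes, ready for embedding
```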

Limitations:

  • Data dependency: Needs high-quality paired text and visuals.
  • Computationally demanding: Processing multimodal data is resource-intensive.
  • Domain-specific: Models trained on general data may not excel in specialized fields like medicine.
  • Lack of explainability: Understanding how these models work can be tricky, hindering trust and adoption.
