Yuan-ManX/ai-multimodal-timeline

AI Multimodal Timeline

Here we track the latest AI multimodal models, including Multimodal Foundation Models, LLM, Agent, Audio, Image, Video, Music and 3D content. 🔥

Table of Contents

Project List

Multimodal Model

Date Source Description Paper Model
2024-11 Oasis Oasis is an interactive world model developed by Decart and Etched. Based on diffusion transformers, Oasis takes in user keyboard input and generates gameplay in an autoregressive manner. Hugging Face
2024-10 Unbounded Unbounded: A Generative Infinite Game of Character Life Simulation. arXiv Website
2024-10 Janus Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. arXiv Hugging Face
2024-09 LLaVA-3D LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness. arXiv
2024-09 Emu3 Emu3: Next-Token Prediction is All You Need. Hugging Face
2024-09 Moshi Moshi: a speech-text foundation model for real-time dialogue. Hugging Face
2024-09 Qwen2-VL Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. Hugging Face
2024-08 Eagle Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders. arXiv
2024-08 Mini-Omni Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. arXiv Hugging Face
2024-08 GameNGen GameNGen - Diffusion Models Are Real-Time Game Engines. arXiv
2024-08 Sapiens Sapiens: Foundation for Human Vision Models. arXiv
2024-08 Show-o Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. arXiv
2024-08 LLaVA-OneVision LLaVA-OneVision: Easy Visual Task Transfer. arXiv Hugging Face
2024-08 AI Scientist The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. arXiv
2024-08 Mini-Monkey Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models. arXiv
2024-08 VITA VITA: Towards Open-Source Interactive Omni Multimodal LLM. arXiv
2024-08 Lumina-mGPT Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. arXiv
2024-07 Any2Point Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding. arXiv
2024-07 SOLO SOLO: A Single Transformer for Scalable Vision-Language Modeling. arXiv
2024-07 Kangaroo Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input. Hugging Face
2024-07 SEED-Story SEED-Story: Multimodal Long Story Generation with Large Language Model. arXiv Hugging Face
2024-07 VTA-LDM Video-to-Audio Generation with Hidden Alignment. arXiv Hugging Face
2024-07 Qwen2-Audio Qwen2-Audio chat & pretrained large audio language model proposed by Alibaba Cloud. arXiv
2024-07 Moshi Moshi is an experimental conversational AI. Website
2024-07 Anole Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation. Hugging Face
2024-06 Cambrian-1 A Fully Open, Vision-Centric Exploration of Multimodal LLMs. arXiv Hugging Face
2024-06 EVF-SAM EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. arXiv Hugging Face
2024-06 MINT-1T Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. arXiv
2024-06 OmniTokenizer A Joint Image-Video Tokenizer for Visual Generation. arXiv Website
2024-06 ml-4m A framework for training any-to-any multimodal foundation models. arXiv Website
2024-06 LongVA Long Context Transfer from Language to Vision. arXiv Hugging Face
2024-06 VideoLLaMA 2 Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. arXiv Hugging Face
2024-05 ManyICL Many-Shot In-Context Learning in Multimodal Foundation Models. arXiv
2024-05 Contrastive ALignment (CAL) Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. arXiv
2024-05 Groma Grounded Multimodal Large Language Model with Localized Visual Tokenization. arXiv Hugging Face
2024-05 CogVLM2 GPT4V-level open-source multi-modal model based on Llama3-8B. Hugging Face
2024-05 Chameleon Mixed-Modal Early-Fusion Foundation Models. arXiv
2024-05 Lumina-T2X Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. arXiv Hugging Face
2024-05 MiniCPM-Llama3-V 2.5 MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. Hugging Face
2024-05 Gemini Build with state-of-the-art generative models and tools to make AI helpful for everyone. API
2024-05 GPT-4o GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. API
2024-04 MyGO Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. arXiv
2024-04 InternLM-XComposer2 InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. arXiv Hugging Face
2024-02 AnyGPT AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. arXiv
2024-01 MMVP Exploring the Visual Shortcomings of Multimodal LLMs. arXiv
2023-12 V* Guided Visual Search as a Core Mechanism in Multimodal LLMs. arXiv
2023-12 Tokenize Anything Tokenize Anything via Prompting. arXiv Hugging Face
2023-12 VILA VILA: On Pre-training for Visual Language Models. arXiv Hugging Face
2023-11 LEO An Embodied Generalist Agent in 3D World. arXiv Website
2023-11 ShareGPT4V Improving Large Multi-Modal Models with Better Captions. arXiv Hugging Face
2023-11 Video-LLaVA Learning United Visual Representation by Alignment Before Projection. arXiv Hugging Face
2023-10 LanguageBind Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. arXiv Hugging Face
2023-07 Emu Emu: Generative Multimodal Models from BAAI. arXiv Hugging Face
2023-05 ImageBind One Embedding Space To Bind Them All. arXiv Website
2022-11 EVA EVA: Visual Representation Fantasies from BAAI. arXiv Hugging Face

^ Back to Contents ^

LLM

Date Source Description Paper Model
2024-08 LongWriter LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. arXiv Hugging Face
2024-07 DCLM DataComp for Language Models. arXiv Hugging Face
2024-07 Index-1.9B A SOTA lightweight multilingual LLM. Hugging Face
2024-06 Claude 3.5 Sonnet Claude 3.5 Sonnet API
2024-06 Nemotron-4 Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. arXiv Hugging Face
2024-06 Qwen2 Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud. Hugging Face
2024-04 Llama 3 Meta Llama 3 is the next generation of our state-of-the-art open source large language model. Hugging Face
2024-03 Claude 3 Talk with Claude, an AI assistant from Anthropic. API
2024-03 Grok-1 The weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1. Hugging Face
2023-11 Mixtral Open and portable generative AI for devs and businesses. arXiv Hugging Face
2023-09 Baichuan 2 A series of large language models developed by Baichuan Intelligent Technology. Hugging Face
2023-07 GPT-4 GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. API

^ Back to Contents ^

Agent

Date Source Description Paper Model
2024-10 TEN Agent TEN Agent is the world’s first real-time multimodal agent integrated with the OpenAI Realtime API and RTC, featuring weather checks, web search, vision, and RAG capabilities. Website
2024-08 Twitter Twitter Personality is a web application that analyzes your Twitter handle to create a personalized personality profile using Wordware AI Agent. Website
2024-08 MindSearch 🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT).
2024-08 MMRole MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents. arXiv
2024-08 Agent K An autoagentic AGI that is self-evolving and modular.
2024-08 LangGraph Studio LangGraph Studio offers a new way to develop LLM applications by providing a specialized agent IDE that enables visualization, interaction, and debugging of complex agentic applications.
2024-07 Llama Agentic System Agentic components of the Llama Stack APIs.
2024-07 TaskGen A Task-based agentic framework building on StrictJSON outputs by LLM agents.
2024-07 IoA An open-source framework for collaborative AI agents, enabling diverse, distributed agents to team up and tackle complex tasks through internet-like connectivity.
2024-07 OmAgent A multimodal agent framework for solving complex tasks. arXiv
2024-06 GraphRAG A modular graph-based Retrieval-Augmented Generation (RAG) system. Website
2024-06 Mixture of Agents (MoA) Mixture-of-Agents Enhances Large Language Model Capabilities. arXiv
2024-06 Buffer of Thoughts Thought-Augmented Reasoning with Large Language Models. arXiv
2024-06 Translation Agent Agentic translation using reflection workflow.
2024-06 Atomic Agents The Atomic Agents framework is designed to be modular, extensible, and easy to use.
2024-05 Pipecat Open Source framework for voice and multimodal conversational AI.
2024-02 V-IRL Grounding Virtual Intelligence in Real Life. arXiv

^ Back to Contents ^

Audio

Audio/Text-to-Speech

Date Source Description Paper Model
2024-07 CosyVoice Multilingual large voice generation model with full-stack support for inference, training and deployment.
2024-06 DEX-TTS Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. arXiv Website
2024-05 ChatTTS ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.
2023-06 StyleTTS 2 Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. arXiv Hugging Face

Audio/Automatic Speech Recognition

Date Source Description Paper Model
2024-07 SenseVoice SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). Hugging Face
2024-05 TeleSpeech-ASR A large speech model for multi-dialect ASR. Hugging Face
2022-12 Whisper Whisper is a general-purpose speech recognition model. arXiv API

Audio/Audio Generation

Date Source Description Paper Model
2024-07 FoleyCrafter FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. arXiv Hugging Face
2024-06 SEE-2-SOUND Zero-Shot Spatial Environment-to-Spatial Sound. arXiv
2024-05 Make-An-Audio 3 Transforming Text into Audio via Flow-based Large Diffusion Transformers. arXiv Hugging Face

^ Back to Contents ^

Image

Date Source Description Paper Model
2024-09 StoryMaker StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation. arXiv Hugging Face
2024-08 CSGO CSGO: Content-Style Composition in Text-to-Image Generation. arXiv
2024-08 FLUX This repo contains minimal inference code to run text-to-image and image-to-image with our Flux latent rectified flow transformers. Hugging Face
2024-08 Segment Anything Model 2 (SAM 2) SAM 2: Segment Anything in Images and Videos. arXiv Hugging Face
2024-07 CatVTON CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. arXiv Hugging Face
2024-07 UltraEdit UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. arXiv Hugging Face
2024-07 UltraPixel UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks. arXiv
2024-07 PaintsUndo PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings.
2024-07 Kolors Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis. Hugging Face
2024-06 Depth Anything V2 Depth Anything V2. arXiv Hugging Face
2024-06 AutoStudio Crafting Consistent Subjects in Multi-turn Interactive Image Generation. arXiv
2024-06 MimicBrush Zero-shot Image Editing with Reference Imitation. arXiv Hugging Face
2024-06 LlamaGen Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. arXiv Hugging Face
2024-05 Omost Omost is a project to convert an LLM's coding capability into image generation (or, more accurately, image composing) capability. Hugging Face
2024-05 Hunyuan-DiT A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. arXiv Hugging Face
2024-02 MIGC MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis. arXiv
2023-10 DALL·E 3 DALL·E is an AI system that can create realistic images and art from a description in natural language. API

^ Back to Contents ^

Video

Date Source Description Paper Model
2024-11 LTX-Video LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real time. Hugging Face
2024-09 MIMO MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling. arXiv Website
2024-09 DrawingSpinUp DrawingSpinUp: 3D Animation from Single Character Drawings. arXiv Website
2024-09 ViewCrafter ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis. arXiv Website
2024-08 CogVideoX CogVideoX is an open-source version of the video generation model that shares its origins with 清影 (QingYing). Hugging Face
2024-07 Tora Tora: Trajectory-oriented Diffusion Transformer for Video Generation. arXiv Website
2024-06 Diffutoon High-Resolution Editable Toon Shading via Diffusion Models. arXiv Website
2024-05 Video-MME The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
2024-05 Video-of-Thought Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. Website
2024-05 MOFA-Video MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model. arXiv Hugging Face
2024-05 MotionLLM Understanding Human Behaviors from Human Motions and Videos. arXiv
2024-05 Vidu Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. arXiv
2024-02 Sora Sora is an AI model that can create realistic and imaginative scenes from text instructions. Technical Report
2023-11 Pika Pika is the idea-to-video platform that sets your creativity in motion.
2023-03 Runway Runway is an applied AI research company shaping the next era of art, entertainment and human creativity.

^ Back to Contents ^

Music

Date Source Description Paper Model
2024-05 Diff-BGM A Diffusion Model for Video Background Music Generation. arXiv
2024-04 Udio Udio - AI Music Generator Website
2023-12 Suno Suno is building a future where anyone can make great music. Website
2023-12 Soundry AI Generative AI tools including text-to-sound and infinite sample packs. Website
2023-12 Sonauto Sonauto is an AI music editor that turns prompts, lyrics, or melodies into full songs in any style. Website

^ Back to Contents ^

3D

Date Source Description Paper Model
2024-11 Hunyuan3D-1.0 Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation. arXiv Hugging Face
2024-09 3DTopia-XL 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion. arXiv Website
2024-08 SF3D SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv Hugging Face
2024-07 HoloDreamer HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions. arXiv Website
2024-07 DreamCatalyst DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation. arXiv Website
2024-07 CharacterGen CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization. arXiv Website
2024-07 GALA3D GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting. arXiv Website
2024-06 Unique3D High-Quality and Efficient 3D Mesh Generation from a Single Image. arXiv Hugging Face
2024-06 DreamGaussian4D Generative 4D Gaussian Splatting. arXiv Hugging Face
2024-03 GaussCtrl GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing. arXiv
2024-03 GaussianCube A Structured and Explicit Radiance Representation for 3D Generative Modeling. arXiv Hugging Face
2024-03 TripoSR Fast 3D Object Reconstruction from a Single Image. arXiv Hugging Face

^ Back to Contents ^
