Here we track the latest AI multimodal models, including multimodal foundation models, LLMs, agents, audio, image, video, music, and 3D content. 🔥
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-11 | Oasis | Oasis is an interactive world model developed by Decart and Etched. Based on diffusion transformers, Oasis takes in user keyboard input and generates gameplay in an autoregressive manner. | Hugging Face | |
2024-10 | Unbounded | Unbounded: A Generative Infinite Game of Character Life Simulation. | arXiv | Website |
2024-10 | Janus | Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation. | arXiv | Hugging Face |
2024-09 | LLaVA-3D | LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness. | arXiv | |
2024-09 | Emu3 | Emu3: Next-Token Prediction is All You Need. | Hugging Face | |
2024-09 | Moshi | Moshi: a speech-text foundation model for real time dialogue. | Hugging Face | |
2024-09 | Qwen2-VL | Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. | Hugging Face | |
2024-08 | Eagle | Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders. | arXiv | |
2024-08 | Mini-Omni | Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming. | arXiv | Hugging Face |
2024-08 | GameNGen | GameNGen: Diffusion Models Are Real-Time Game Engines. | arXiv | |
2024-08 | Sapiens | Sapiens: Foundation for Human Vision Models. | arXiv | |
2024-08 | Show-o | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation. | arXiv | |
2024-08 | LLaVA-OneVision | LLaVA-OneVision: Easy Visual Task Transfer. | arXiv | Hugging Face |
2024-08 | AI Scientist | The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery. | arXiv | |
2024-08 | Mini-Monkey | Mini-Monkey: Multi-Scale Adaptive Cropping for Multimodal Large Language Models. | arXiv | |
2024-08 | VITA | VITA: Towards Open-Source Interactive Omni Multimodal LLM. | arXiv | |
2024-08 | Lumina-mGPT | Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining. | arXiv | |
2024-07 | Any2Point | Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding. | arXiv | |
2024-07 | SOLO | SOLO: A Single Transformer for Scalable Vision-Language Modeling. | arXiv | |
2024-07 | Kangaroo | Kangaroo: A Powerful Video-Language Model Supporting Long-context Video Input. | Hugging Face | |
2024-07 | SEED-Story | SEED-Story: Multimodal Long Story Generation with Large Language Model. | arXiv | Hugging Face |
2024-07 | VTA-LDM | Video-to-Audio Generation with Hidden Alignment. | arXiv | Hugging Face |
2024-07 | Qwen2-Audio | Qwen2-Audio: chat and pretrained large audio-language models proposed by Alibaba Cloud. | arXiv | |
2024-07 | Moshi | Moshi is an experimental conversational AI. | Website | |
2024-07 | Anole | Anole: An Open, Autoregressive and Native Multimodal Models for Interleaved Image-Text Generation. | Hugging Face | |
2024-06 | Cambrian-1 | A Fully Open, Vision-Centric Exploration of Multimodal LLMs. | arXiv | Hugging Face |
2024-06 | EVF-SAM | EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model. | arXiv | Hugging Face |
2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
2024-06 | LongVA | Long Context Transfer from Language to Vision. | arXiv | Hugging Face |
2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | Hugging Face | |
2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
2024-05 | MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. | Hugging Face | |
2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | API | |
2024-05 | GPT-4o | GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | API | |
2024-04 | MyGO | Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion. | arXiv | |
2024-04 | InternLM-XComposer2 | InternLM-XComposer2 is a groundbreaking vision-language large model (VLLM) excelling in free-form text-image composition and comprehension. | arXiv | Hugging Face |
2024-02 | AnyGPT | AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. | arXiv | |
2024-01 | MMVP | Exploring the Visual Shortcomings of Multimodal LLMs. | arXiv | |
2023-12 | V* | Guided Visual Search as a Core Mechanism in Multimodal LLMs. | arXiv | |
2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
2023-12 | VILA | VILA: On Pre-training for Visual Language Models. | arXiv | Hugging Face |
2023-11 | LEO | An Embodied Generalist Agent in 3D World. | arXiv | Website |
2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
2023-11 | Video-LLaVA | Learning United Visual Representation by Alignment Before Projection. | arXiv | Hugging Face |
2023-10 | LanguageBind | Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment. | arXiv | Hugging Face |
2023-07 | Emu | Emu: Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
2022-11 | EVA | EVA: Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-08 | LongWriter | LongWriter: Unleashing 10,000+ Word Generation from Long Context LLMs. | arXiv | Hugging Face |
2024-07 | DCLM | DataComp for Language Models. | arXiv | Hugging Face |
2024-07 | Index-1.9B | A SOTA lightweight multilingual LLM. | Hugging Face | |
2024-06 | Claude 3.5 Sonnet | Claude 3.5 Sonnet | API | |
2024-06 | Nemotron-4 | Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. | arXiv | Hugging Face |
2024-06 | Qwen2 | Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud. | Hugging Face | |
2024-04 | Llama 3 | Meta Llama 3 is the next generation of our state-of-the-art open source large language model. | Hugging Face | |
2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | API | |
2024-03 | Grok-1 | The weights and architecture of our 314 billion parameter Mixture-of-Experts model, Grok-1. | Hugging Face | |
2023-11 | Mixtral | Open and portable generative AI for devs and businesses. | arXiv | Hugging Face |
2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | Hugging Face | |
2023-07 | GPT-4 | GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses. | API | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-10 | TEN Agent | TEN Agent is the world’s first real-time multimodal agent integrated with the OpenAI Realtime API and RTC, featuring weather checks, web search, vision, and RAG capabilities. | Website | |
2024-08 | Twitter Personality | Twitter Personality is a web application that analyzes your Twitter handle to create a personalized personality profile using Wordware AI Agent. | Website | |
2024-08 | MindSearch | 🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT). | | |
2024-08 | MMRole | MMRole: A Comprehensive Framework for Developing and Evaluating Multimodal Role-Playing Agents. | arXiv | |
2024-08 | Agent K | An autoagentic AGI that is self-evolving and modular. | | |
2024-08 | LangGraph Studio | LangGraph Studio offers a new way to develop LLM applications by providing a specialized agent IDE that enables visualization, interaction, and debugging of complex agentic applications. | | |
2024-07 | Llama Agentic System | Agentic components of the Llama Stack APIs. | | |
2024-07 | TaskGen | A task-based agentic framework built on StrictJSON outputs from LLM agents. | | |
2024-07 | IoA | An open-source framework for collaborative AI agents, enabling diverse, distributed agents to team up and tackle complex tasks through internet-like connectivity. | | |
2024-07 | OmAgent | A multimodal agent framework for solving complex tasks. | arXiv | |
2024-06 | GraphRAG | A modular graph-based Retrieval-Augmented Generation (RAG) system. | Website | |
2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities. | arXiv | |
2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
2024-06 | Translation Agent | Agentic translation using reflection workflow. | | |
2024-06 | Atomic Agents | The Atomic Agents framework is designed to be modular, extensible, and easy to use. | | |
2024-05 | Pipecat | Open Source framework for voice and multimodal conversational AI. | | |
2024-02 | V-IRL | Grounding Virtual Intelligence in Real Life. | arXiv | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | CosyVoice | Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities. | | |
2024-06 | DEX-TTS | Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability. | arXiv | Website |
2024-05 | ChatTTS | ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | | |
2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | SenseVoice | SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED). | Hugging Face | |
2024-05 | TeleSpeech-ASR | A large speech model for ASR across many dialects. | Hugging Face | |
2022-12 | Whisper | Whisper is a general-purpose speech recognition model. | arXiv | API |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-07 | FoleyCrafter | FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. | arXiv | Hugging Face |
2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-09 | StoryMaker | StoryMaker: Towards Holistic Consistent Characters in Text-to-image Generation. | arXiv | Hugging Face |
2024-08 | CSGO | CSGO: Content-Style Composition in Text-to-Image Generation. | arXiv | |
2024-08 | FLUX | This repo contains minimal inference code to run text-to-image and image-to-image with our Flux latent rectified flow transformers. | Hugging Face | |
2024-08 | Segment Anything Model 2 (SAM 2) | SAM 2: Segment Anything in Images and Videos. | arXiv | Hugging Face |
2024-07 | CatVTON | CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models. | arXiv | Hugging Face |
2024-07 | UltraEdit | UltraEdit: Instruction-based Fine-Grained Image Editing at Scale. | arXiv | Hugging Face |
2024-07 | UltraPixel | UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks. | arXiv | |
2024-07 | PaintsUndo | PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings. | | |
2024-07 | Kolors | Kolors: Effective Training of Diffusion Model for Photorealistic Text-to-Image Synthesis. | Hugging Face | |
2024-06 | Depth Anything V2 | Depth Anything V2. | arXiv | Hugging Face |
2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
2024-05 | Omost | Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability. | Hugging Face | |
2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. | arXiv | Hugging Face |
2024-02 | MIGC | MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis. | arXiv | |
2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a description in natural language. | API | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-11 | LTX-Video | LTX-Video is the first DiT-based video generation model that can generate high-quality videos in real-time. | Hugging Face | |
2024-09 | MIMO | MIMO: Controllable Character Video Synthesis with Spatial Decomposed Modeling. | arXiv | Website |
2024-09 | DrawingSpinUp | DrawingSpinUp: 3D Animation from Single Character Drawings. | arXiv | Website |
2024-09 | ViewCrafter | ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis. | arXiv | Website |
2024-08 | CogVideoX | CogVideoX is the open-source version of the video generation model that shares its origin with 清影 (Qingying). | Hugging Face | |
2024-07 | Tora | Tora: Trajectory-oriented Diffusion Transformer for Video Generation. | arXiv | Website |
2024-06 | Diffutoon | High-Resolution Editable Toon Shading via Diffusion Models. | arXiv | Website |
2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | | |
2024-05 | Video-of-Thought | Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. | Website | |
2024-05 | MOFA-Video | MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model. | arXiv | Hugging Face |
2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
2024-05 | Vidu | Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
2024-02 | Sora | Sora is an AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
2023-11 | Pika | Pika is the idea-to-video platform that sets your creativity in motion. | | |
2023-03 | Runway | Runway is an applied AI research company shaping the next era of art, entertainment and human creativity. | | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-05 | Diff-BGM | A Diffusion Model for Video Background Music Generation. | arXiv | |
2024-04 | Udio | Udio - AI Music Generator | Website | |
2023-12 | Suno | Suno is building a future where anyone can make great music. | Website | |
2023-12 | Soundry AI | Generative AI tools including text-to-sound and infinite sample packs. | Website | |
2023-12 | Sonauto | Sonauto is an AI music editor that turns prompts, lyrics, or melodies into full songs in any style. | Website | |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-11 | Hunyuan3D-1.0 | Hunyuan3D-1.0: A Unified Framework for Text-to-3D and Image-to-3D Generation. | arXiv | Hugging Face |
2024-09 | 3DTopia-XL | 3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion. | arXiv | Website |
2024-08 | SF3D | SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. | arXiv | Hugging Face |
2024-07 | HoloDreamer | HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions. | arXiv | Website |
2024-07 | DreamCatalyst | DreamCatalyst: Fast and High-Quality 3D Editing via Controlling Editability and Identity Preservation. | arXiv | Website |
2024-07 | CharacterGen | CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization. | arXiv | Website |
2024-07 | GALA3D | GALA3D: Towards Text-to-3D Complex Scene Generation via Layout-guided Generative Gaussian Splatting. | arXiv | Website |
2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
2024-03 | GaussCtrl | GaussCtrl: Multi-View Consistent Text-Driven 3D Gaussian Splatting Editing. | arXiv | |
2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |