LLaVA/README.md at main · haotian-liu/LLaVA #628

Open

irthomasthomas opened this issue Feb 27, 2024 · 1 comment
Labels
Algorithms (Sorting, Learning or Classifying. All algorithms go here.) · MachineLearning (ML Models, Training and Inference) · Models (LLM and ML model repos and links) · Papers (Research papers) · Research (personal research notes for a topic)

Comments

@irthomasthomas (Owner)

LLaVA/README.md at main · haotian-liu/LLaVA

🌋 LLaVA: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

📢 LLaVA-NeXT Blog Project Page Demo Data Model Zoo

🤝Community Contributions: llama.cpp Colab 🤗Space Replicate AutoGen BakLLaVA

Improved Baselines with Visual Instruction Tuning Paper HF

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) Paper HF

Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

Release

  • [1/30] 🔥 LLaVA-NeXT (LLaVA-1.6) is out! With additional scaling to LLaVA-1.5, LLaVA-NeXT-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.
  • [11/10] LLaVA-Plus is released: Learning to Use Tools for Creating Multimodal Agents, with LLaVA-Plus (LLaVA that Plug and Learn to Use Skills). Project Page Demo Code Paper
  • [11/2] LLaVA-Interactive is released: Experience the future of human-AI multimodal interaction with an all-in-one demo for Image Chat, Segmentation, Generation and Editing. Project Page Demo Code Paper
  • [10/26] 🔥 LLaVA-1.5 with LoRA achieves comparable performance as full-model finetuning, with a reduced GPU RAM requirement (ckpts) (script). We also provide a doc on how to finetune LLaVA-1.5 on your own dataset with LoRA.
  • [10/12] Check out the Korean LLaVA (Ko-LLaVA), created by ETRI, who has generously supported our research! 🤗 Demo
  • [10/5] 🔥 LLaVA-1.5 is out! It achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA, uses only public data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo. The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here.
  • [9/26] LLaVA is improved with reinforcement learning from human feedback (RLHF) to strengthen fact grounding and reduce hallucination. Check out the new SFT and RLHF checkpoints at the LLaVA-RLHF project.
  • [9/22] LLaVA is accepted by NeurIPS 2023 as an oral presentation, and LLaVA-Med is accepted by the NeurIPS 2023 Datasets and Benchmarks Track as a spotlight presentation.
More
  • [11/6] Support Intel dGPU and CPU platforms. More details here.
  • [10/12] LLaVA is now supported in llama.cpp with 4-bit / 5-bit quantization support!
  • [10/11] The training data and scripts of LLaVA-1.5 are released here, and evaluation scripts are released here!
  • [10/10] Roboflow Deep Dive: First Impressions with LLaVA-1.5.
  • [9/20] We summarize our empirical study of training 33B and 65B LLaVA models in a note. Further, if you are interested in the comprehensive review, evolution and trend of multimodal foundation models, please check out our recent survey paper "Multimodal Foundation Models: From Specialists to General-Purpose Assistants".

  • [7/19] We release a major upgrade, including support for LLaMA-2, LoRA training, 4-/8-bit inference, higher resolution (336x336), and a lot more. We release LLaVA Bench for benchmarking open-ended visual chat with results from Bard and Bing-Chat. We also support and verify training with RTX 3090 and RTX A6000. Check out LLaVA-from-LLaMA-2, and our model zoo!
  • [6/26] CVPR 2023 Tutorial on Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4! Please check out the Slides, Notes, YouTube and Bilibili links.
  • [6/11] We released the preview for the most requested feature: DeepSpeed and LoRA support! Please see the documentation here.
  • [6/1] We released LLaVA-Med: Large Language and Vision Assistant for Biomedicine, a step towards building biomedical-domain large language and vision models with GPT-4 level capabilities. Check out the paper and page.
  • [5/6] We are releasing LLaVA-Lightning-MPT-7B-preview, based on MPT-7B-Chat! See here for more details.
  • [5/2] We are releasing LLaVA-Lightning! Train a lite, multimodal GPT-4 with just $40 in 3 hours! See here for more details.
  • [4/27] Thanks to the community effort, LLaVA-13B with 4-bit quantization allows you to run on a GPU with as few as 12GB VRAM! Try it out here.
  • [4/17] We released LLaVA: Large Language and Vision Assistant. We propose visual instruction tuning, towards building large language and vision models with GPT-4 level capabilities. Check out the paper and demo.

Code License

Usage and License Notices: This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the dataset and the specific licenses for base language models for checkpoints trained using the dataset (e.g. Llama community license for LLaMA-2 and Vicuna-v1.5). This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the dataset and checkpoints is in compliance with all applicable laws and regulations.

Contents

Suggested labels

@irthomasthomas added the Algorithms, MachineLearning, Models, Papers and Research labels on Feb 27, 2024
irthomasthomas commented Feb 27, 2024

### Related issues

#184: Robin: Multimodal (Visual-Language) Models.  - CERC-AAI Lab - Robin v1.0

### Details

Similarity score: 0.89

- [ ] [CERC-AAI Lab - Robin v1.0](https://sites.google.com/view/irinalab/blog/robin-v1-0)
The Robin team is proud to present Robin, a suite of  Multimodal (Visual-Language) Models. 

These models outperform, or perform on par with, the state of the art models of similar scale. 
In the ever-evolving realm of artificial intelligence, the intersection of language understanding and visual perception has paved the way for groundbreaking multimodal models. We study different components and methods for merging pretrained vision and language models, with the goal of building better visual language models.
As part of this first milestone, we release this LLaVA-fork enabling the Mistral-7B & Open-Hermes-2.5 language models to process images. We combine the pretrained LLMs (Vicuna, Mistral and OpenHermes 2.5) and Vision models (CLIP and SigLIP), and further enhance capabilities by finetuning the vision encoder.

Models detailed below are available here: https://huggingface.co/agi-collective
The code used is available here: https://github.com/AGI-Collective/Robin/releases/tag/v1.0.0
Also, some related work by our team on aligning multimodal models: https://arxiv.org/abs/2304.13765
LLaVA Architecture Overview
The LLaVA architecture, an acronym for Large Language and Vision Assistant, represents a multimodal Visual Language Model (VLM). At its core, LLaVA integrates a pretrained language model with a pretrained vision encoder, connected through a projection layer. In its original incarnation, the Vicuna model served as the language foundation, while the CLIP ViT-Large from OpenAI assumes the role of the vision encoder.
Building upon this foundation, as part of the first milestone we study the impact of different language models, vision encoders and the effect of finetuning the vision encoder on the performance of our multimodal model. Notably, our journey led us to experiment with the fusion of various versions of the Mistral AI LLM model and the DeepMind SigLip visual encoder.
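
To make the wiring concrete, here is a minimal sketch (not the authors' code) of the LLaVA-style connector described above: a frozen vision encoder's patch features are projected into the language model's token-embedding space and prepended to the text embeddings. Module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Toy LLaVA-style connector: project vision-encoder patch features
    into the LLM's embedding space so they can be prepended to text tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 uses a small MLP here; the original LLaVA used a single linear layer.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from CLIP/SigLIP
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Usage: concatenate projected image tokens with the embedded text prompt.
connector = VisionLanguageConnector()
image_feats = torch.randn(1, 576, 1024)   # e.g. CLIP ViT-L/14 at 336px -> 576 patches
text_embeds = torch.randn(1, 32, 4096)    # embedded prompt tokens
inputs_embeds = torch.cat([connector(image_feats), text_embeds], dim=1)
# inputs_embeds is then fed to the LLM's forward pass in place of plain token embeddings.
```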
Architecture Variations
Our model variations are best encapsulated in the table below, outlining the diverse combinations of language models, vision encoders and the fine-tuning strategy.

#459: llama2

### Details

Similarity score: 0.89

- [ ] [llama2](https://ollama.ai/library/llama2)

Llama 2

The most popular model for general use.

265.8K Pulls
Updated 4 weeks ago

Overview

Llama 2 is released by Meta Platforms, Inc. This model is trained on 2 trillion tokens, and by default supports a context length of 4096. Llama 2 Chat models are fine-tuned on over 1 million human annotations, and are made for chat.

CLI

Open the terminal and run

ollama run llama2

API

Example using curl:

curl -X POST http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?"
}'
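
For reference, the same request can be made from Python. This is a minimal sketch using the requests library against the endpoint shown in the curl example above; by default Ollama streams one JSON object per line, each carrying a "response" fragment.

```python
import json
import requests

# Same endpoint and payload as the curl example; Ollama streams the reply line by line.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Why is the sky blue?"},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
```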

API documentation

Memory requirements

  • 7b models generally require at least 8GB of RAM
  • 13b models generally require at least 16GB of RAM
  • 70b models generally require at least 64GB of RAM

If you run into issues with higher quantization levels, try using the q4 model or shut down any other programs that are using a lot of memory.

Model variants

  • Chat: fine-tuned for chat/dialogue use cases. These are the default in Ollama, and for models tagged with -chat in the tags tab.

    Example: ollama run llama2

  • Pre-trained: without the chat fine-tuning. This is tagged as -text in the tags tab.

    Example: ollama run llama2:text

By default, Ollama uses 4-bit quantization. To try other quantization levels, please use the other tags. The number after the q represents the number of bits used for quantization (i.e. q4 means 4-bit quantization). The higher the number, the more accurate the model is, but the slower it runs, and the more memory it requires.
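
As a rough back-of-envelope check of why higher-bit tags need more memory, the weights alone scale linearly with the bit width; this sketch ignores the KV cache and runtime overhead, which is why the guidance above recommends more RAM than the raw weight size.

```python
def approx_weight_gb(num_params_billion: float, bits: int) -> float:
    """Approximate size of the quantized weights alone, ignoring KV cache and overhead."""
    return num_params_billion * 1e9 * bits / 8 / 1024**3

for bits in (4, 8, 16):
    print(f"7b at {bits}-bit: ~{approx_weight_gb(7, bits):.1f} GB of weights")
# ~3.3 GB at 4-bit, ~6.5 GB at 8-bit, ~13.0 GB at 16-bit, which lines up with the
# "at least 8GB of RAM for 7b models" guidance once overhead is added.
```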

References

Suggested labels

{ "label-name": "llama2-model", "description": "A powerful text model for chat, dialogue, and general use.", "repo": "ollama.ai/library/llama2", "confidence": 91.74 }

#625: unsloth/README.md at main · unslothai/unsloth

### Details

Similarity score: 0.88

- [ ] [unsloth/README.md at main · unslothai/unsloth](https://github.com/unslothai/unsloth/blob/main/README.md?plain=1)

unsloth/README.md at main · unslothai/unsloth

Finetune Mistral, Gemma, Llama 2-5x faster with 70% less memory!

✨ Finetune for Free

All notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM or uploaded to Hugging Face.

| Unsloth supports | Free Notebooks | Performance | Memory use |
| --- | --- | --- | --- |
| Gemma 7b | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral 7b | ▶️ Start on Colab | 2.2x faster | 62% less |
| Llama-2 7b | ▶️ Start on Colab | 2.2x faster | 43% less |
| TinyLlama | ▶️ Start on Colab | 3.9x faster | 74% less |
| CodeLlama 34b A100 | ▶️ Start on Colab | 1.9x faster | 27% less |
| Mistral 7b 1xT4 | ▶️ Start on Kaggle | 5x faster* | 62% less |
| DPO - Zephyr | ▶️ Start on Colab | 1.9x faster | 19% less |

🦥 Unsloth.ai News

🔗 Links and Resources

| Type | Links |
| --- | --- |
| 📚 Wiki & FAQ | Read Our Wiki |
| 📜 Documentation | Read The Doc |
| 💾 Installation | unsloth/README.md |
| Twitter (aka X) | Follow us on X |
| 🥇 Benchmarking | Performance Tables |
| 🌐 Released Models | Unsloth Releases |
| ✍️ Blog | Read our Blogs |

⭐ Key Features

  • All kernels written in OpenAI's Triton language. Manual backprop engine.
  • 0% loss in accuracy - no approximation methods - all exact.
  • No change of hardware. Supports NVIDIA GPUs since 2018+. Minimum CUDA Capability 7.0 (V100, T4, Titan V, RTX 20, 30, 40x, A100, H100, L40, etc.). Check your GPU! GTX 1070 and 1080 work, but are slow.
  • Works on Linux and Windows via WSL.
  • Supports 4bit and 16bit QLoRA / LoRA finetuning via bitsandbytes.
  • Open source trains 5x faster - see Unsloth Pro for 30x faster training!
  • If you trained a model with 🦥Unsloth, you can use this cool sticker!  
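
To give a feel for what those notebooks do, here is a minimal QLoRA finetuning sketch with Unsloth. It assumes the FastLanguageModel API and a 4-bit model name along the lines of Unsloth's own examples, plus TRL's SFTTrainer; the model name, dataset, and hyperparameters are illustrative, not a recommendation.

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# Load a 4-bit base model (QLoRA via bitsandbytes); the model name is an example.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

# Tiny dataset slice purely for illustration.
dataset = load_dataset("yahma/alpaca-cleaned", split="train[:1%]")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="output",  # pick/format the text field to suit your own data
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        max_steps=60,
        learning_rate=2e-4,
    ),
)
trainer.train()
# The finetuned adapters/model can then be exported to GGUF, vLLM, or pushed to Hugging Face.
```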

🥇 Performance Benchmarking

| 1 A100 40GB | 🤗Hugging Face | Flash Attention | 🦥Unsloth Open Source | 🦥Unsloth Pro |
| --- | --- | --- | --- | --- |
| Alpaca | 1x | 1.04x | 1.98x | 15.64x |
| LAION Chip2 | 1x | 0.92x | 1.61x | 20.73x |
| OASST | 1x | 1.19x | 2.17x | 14.83x |
| Slim Orca | 1x | 1.18x | 2.22x | 14.82x |

| Free Colab T4 | Dataset | 🤗Hugging Face | Pytorch 2.1.1 | 🦥Unsloth | 🦥 VRAM reduction |
| --- | --- | --- | --- | --- | --- |
| Llama-2 7b | OASST | 1x | 1.19x | 1.95x | -43.3% |
| Mistral 7b | Alpaca | 1x | 1.07x | 1.56x | -13.7% |
| Tiny Llama 1.1b | Alpaca | 1x | 2.06x | 3.87x | -73.8% |
| DPO with Zephyr | Ultra Chat | 1x | 1.09x | 1.55x | -18.6% |

View on GitHub

Suggested labels

#494: Awesome-Efficient-LLM: A curated list for Efficient Large Language Models

### Details

Similarity score: 0.88

- [ ] [horseee/Awesome-Efficient-LLM: A curated list for Efficient Large Language Models](https://github.com/horseee/Awesome-Efficient-LLM#inference-acceleration)

Awesome-Efficient-LLM

A curated list for Efficient Large Language Models:


Inference Acceleration


Updates

  • Sep 27, 2023: Add tag for papers accepted at NeurIPS'23.
  • Sep 6, 2023: Add a new subdirectory project/ to organize those projects designed for developing a lightweight LLM.
  • July 11, 2023: Create a new subdirectory efficient_plm/ for papers that apply to PLMs (such as BERT, BART) but have yet to be verified for their effectiveness on LLMs.

Contributing

If you'd like to include your paper or need to update any details, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in generate_item.py and executing python generate_item.py. We warmly appreciate your contributions to this list. Alternatively, you can email me the links to your paper and code, and I will add your paper to the list at my earliest convenience.

Suggested labels

{ "label-name": "efficient-llm-acceleration", "description": "Inference acceleration techniques for efficient large language models.", "repo": "horseee/Awesome-Efficient-LLM", "confidence": 70.8 }

#317: streaming-llm: Efficient Streaming Language Models with Attention Sinks

### Details

Similarity score: 0.88

- [ ] [mit-han-lab/streaming-llm: Efficient Streaming Language Models with Attention Sinks](https://github.com/mit-han-lab/streaming-llm)

Usage

Environment Setup

conda create -yn streaming python=3.8
conda activate streaming

pip install torch torchvision torchaudio
pip install transformers==4.33.0 accelerate datasets evaluate wandb scikit-learn scipy sentencepiece

python setup.py develop

Run Streaming Llama Chatbot

CUDA_VISIBLE_DEVICES=0 python examples/run_streaming_llama.py --enable_streaming

FAQ

What does "working on infinite-length inputs" imply for LLMs?

Handling infinite-length text with LLMs presents challenges. Notably, storing all previous Key and Value (KV) states demands significant memory, and models might struggle to generate text beyond their training sequence length. StreamingLLM addresses this by retaining only the most recent tokens and attention sinks, discarding intermediate tokens. This enables the model to generate coherent text from recent tokens without a cache reset — a capability not seen in earlier methods.
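
A minimal sketch (not the repository's implementation) of the eviction policy described above: keep a handful of initial "attention sink" positions plus a rolling window of the most recent KV entries, and drop everything in between.

```python
from collections import deque

class SinkCache:
    """Toy StreamingLLM-style KV-cache policy: retain `num_sinks` initial
    positions plus the `window` most recent ones; evict the middle."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks = []                      # KV entries for the first few tokens
        self.recent = deque(maxlen=window)   # rolling window of recent KV entries

    def append(self, kv_entry):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)     # deque evicts the oldest "middle" entry

    def current(self):
        # What the model attends over at the current decoding step.
        return self.sinks + list(self.recent)

cache = SinkCache(num_sinks=4, window=8)
for t in range(20):
    cache.append(f"kv_{t}")
print(cache.current())
# ['kv_0', ..., 'kv_3', 'kv_12', ..., 'kv_19'] -- the cache size stays bounded at 12.
```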

Is the context window of LLMs expanded?

No. The context window remains unchanged. Only the most recent tokens and attention sinks are retained, discarding middle tokens. This means the model can only process the latest tokens. The context window remains constrained by its initial pre-training. For instance, if Llama-2 is pre-trained with a context window of 4096 tokens, then the maximum cache size for StreamingLLM on Llama-2 remains 4096.

Can I input an extensive text, like a book, into StreamingLLM for summarization?

While you can input a lengthy text, the model will only recognize the latest tokens. Thus, if a book is an input, StreamingLLM might only summarize the concluding paragraphs, which might not be very insightful. As emphasized earlier, we neither expand the LLMs' context window nor enhance their long-term memory. StreamingLLM's strength lies in generating fluent text from recent tokens without needing a cache refresh.

What is the ideal use case for StreamingLLM?

StreamingLLM is optimized for streaming applications, such as multi-round dialogues. It's ideal for scenarios where a model needs to operate continually without requiring extensive memory or dependency on past data. An example is a daily assistant based on LLMs. StreamingLLM would let the model function continuously, basing its responses on recent conversations without needing to refresh its cache. Earlier methods would either need a cache reset when the conversation length exceeded the training length (losing recent context) or recompute KV states from recent text history, which can be time-consuming.

@ShellLM mentioned this issue Nov 13, 2024