LLaVa-Next: Update docs with batched inference #30857

Merged 3 commits on May 20, 2024

Changes from 2 commits
docs/source/en/model_doc/llava_next.md (41 additions, 0 deletions)

The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/

## Usage example

### Single image inference

Here's how to load the model and perform inference in half-precision (`torch.float16`):

```python
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```
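For reference, a minimal end-to-end sketch of the single-image flow (assumptions: the llava-hf/llava-v1.6-mistral-7b-hf checkpoint and the Mistral-style "[INST] <image>\n... [/INST]" prompt format used in the example below; this is an illustrative sketch, not the exact docs text):

```python
import requests
import torch
from PIL import Image

from transformers import AutoProcessor, LlavaNextForConditionalGeneration

# Load the model and processor in half-precision (assumed checkpoint)
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
model = LlavaNextForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Fetch an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# A single-image prompt with one "<image>" placeholder
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate and decode
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```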

### Multi image inference

LLaVa-Next can perform inference with multiple images as input, where the images can belong to the same prompt or to different prompts (in batched inference). Here is how you can do it:

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaNextForConditionalGeneration

# Load the model in half-precision
model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf", torch_dtype=torch.float16, device_map="auto")
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

# Get three different images
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_stop = Image.open(requests.get(url, stream=True).raw)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image_cats = Image.open(requests.get(url, stream=True).raw)

url = "https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
image_snowman = Image.open(requests.get(url, stream=True).raw)

# Prepare a batched prompt, where the first one is a multi-turn conversation and the second is not
prompt = [
"[INST] <image>\nWhat is shown in this image? [/INST] There is a red stop sign in the image. [INST] <image>\nWhat about this image? How many cats do you see [/INST]",
"[INST] <image>\nWhat is shown in this image? [/INST]"
]

# We can simply feed the images in the order they are used in the text prompt
# Each "<image>" token consumes one image, leaving the rest for the subsequent "<image>" tokens
inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
```

Comment on lines +129 to +131

Collaborator:

This is actually quite surprising behaviour and doesn't match other models which take multiple images per prompt, e.g. Idefics2. I should have caught this in the #29850 PR. I would have expected the images to be in the structure [[image_stop, image_cats], [image_snowman]]. Can the processor accept both?

Member Author:

Hmm, did not know Idefics is different. No, LLaVa processors work the same way as all other image processors, so they do not accept nested lists of images.

Also, I did the same thing for Video-LLaVa, which simply aligns images on a rolling basis, replacing the special token.

Member Author (@zucchini-nlp, May 16, 2024):

Does that mean we should make it Idefics-style? I guess it can be done by flattening the list inside the image processor if we get a nested list. In other words, change `make_list_of_images` to accept a nested image list.

Collaborator:

> No, LLaVa processors work same way as all other image processors so they do not accept nested lists of images.

The best analogy is with the video image processors. The input to these can be:

  • A single image (frame)
  • A list of images (series of frames, with batch size 1): [image_0_0, image_0_1, image_0_2]
  • A nested list of images (series of frames with batch size b): [[image_0_0, image_0_1, image_0_2], [image_1_0, image_1_1, image_1_2]]

This is in effect a video-like input, where the number of frames can vary per sample.

It's also more analogous to the question-answering input format for tokenizers, where the structure for pairs of sentences to be tokenized is: [[text_a_0, text_a_1], [text_b_0, text_b_1]].

I'd rather the structure was more like this, as it makes the format explicit and removes ambiguity in the input format across different image processors.

Since it already accepts this format, what I would suggest is accepting either this flat format or the nested format. This way, users will be better able to seamlessly switch between different models when using the Auto classes.
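For illustration, a sketch of the two call styles under discussion, reusing the names from the example above (this assumes both formats end up supported by the LLaVa-Next processor; the nested form mirrors the video-processor convention):

```python
# Flat format: one list of images, consumed in order by the "<image>" tokens
inputs_flat = processor(
    text=prompt,
    images=[image_stop, image_cats, image_snowman],
    padding=True,
    return_tensors="pt",
)

# Nested format: one sub-list of images per prompt in the batch
inputs_nested = processor(
    text=prompt,
    images=[[image_stop, image_cats], [image_snowman]],
    padding=True,
    return_tensors="pt",
)

# If both are supported, the two calls would return tensors with the same shapes
```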

Member Author:

Done. Similar to the video processors, I added a `make_batch` function which flattens the list if it's nested.

From the user's perspective nothing changes, and the processor returns the same shapes if a nested list is passed. Added a test for that.
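A minimal sketch of what such a flattening helper might look like (the name `make_batch` follows the comment above; this is illustrative, not the actual transformers implementation):

```python
def make_batch(images):
    """Flatten a possibly nested list of images into a single flat list.

    Accepts either [img0, img1, img2] or [[img0, img1], [img2]] and
    returns [img0, img1, img2] in both cases.
    """
    if images and isinstance(images, (list, tuple)) and isinstance(images[0], (list, tuple)):
        return [image for sample in images for image in sample]
    return images
```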


```python
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=30)
processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
```

## Model optimization

### Quantization using Bitsandbytes
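For reference, a minimal sketch of loading the model in 4-bit with bitsandbytes (this assumes the bitsandbytes package is installed and reuses the llava-hf/llava-v1.6-mistral-7b-hf checkpoint; it is an illustrative sketch rather than the exact docs text):

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, LlavaNextForConditionalGeneration

# 4-bit quantization config (requires the bitsandbytes package)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
```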