LLaVa-Next: Update docs with batched inference #30857
Conversation
I will check how the fine-tuning works and will make a short Colab on that 😄
Thanks for adding! Relevant for #29850
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
Thanks for updating!
Q about the expected inputs to the processor
# We can simply feed images in the order they are used in the text prompt.
# Each "<image>" token consumes one image, leaving the rest for the subsequent "<image>" tokens.
inputs = processor(text=prompt, images=[image_stop, image_cats, image_snowman], padding=True, return_tensors="pt").to(model.device)
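To make the ordering behaviour concrete, here is a minimal, library-free sketch of the rolling assignment the snippet relies on. The helper name `assign_images_to_tokens` is hypothetical and not part of transformers; it only illustrates how each "<image>" placeholder consumes the next image from a flat list.

```python
def assign_images_to_tokens(prompts, images, token="<image>"):
    """Map each "<image>" occurrence across the prompts to one image, in order.

    Hypothetical illustration only: the real processor does this internally
    when given a flat list of images.
    """
    it = iter(images)
    mapping = []
    for prompt in prompts:
        n = prompt.count(token)
        mapping.append([next(it) for _ in range(n)])
    return mapping

prompts = ["<image> <image> What do these two images show?", "<image> Describe this."]
images = ["image_stop", "image_cats", "image_snowman"]
print(assign_images_to_tokens(prompts, images))
# [['image_stop', 'image_cats'], ['image_snowman']]
```

The first prompt contains two "<image>" tokens and therefore consumes the first two images; the single token in the second prompt gets the third.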
This is actually quite surprising behaviour and doesn't match other models which take multiple images per prompt, e.g. Idefics2. I should have caught this in the #29850 PR. I would have expected the images to be in the structure [[image_stop, image_cats], [image_snowman]]. Can the processor accept both?
Hmm, I did not know Idefics is different. No, LLaVa processors work the same way as all other image processors, so they do not accept nested lists of images.
Also, I did the same thing for Video-LLaVa, which simply assigns images on a rolling basis, replacing the special tokens.
Does that mean we should make it Idefics-style? I guess it can be done by flattening the list inside the image processor if we get a nested list. In other words, change make_list_of_images to accept a nested image list.
No, LLaVa processors work the same way as all other image processors, so they do not accept nested lists of images.
The best analogy is with a video image processor. The input to these can be:
- A single image (frame)
- A list of images (a series of frames, with batch size 1): [image_0_0, image_0_1, image_0_2]
- A nested list of images (a series of frames, with batch size b): [[image_0_0, image_0_1, image_0_2], [image_1_0, image_1_1, image_1_2]]
This is in effect a video-like input, where the number of frames can vary per sample.
It's also analogous to the question-answering input format for tokenizers, where the structure for pairs of sentences to be tokenized is [[text_a_0, text_a_1], [text_b_0, text_b_1]].
I'd rather the structure were more like this, as it makes the format explicit and removes ambiguity between this and other image processors.
Since the processor already accepts the flat format, what I would suggest is accepting either the flat format or the nested format. That way, users will be able to switch more seamlessly between different models when using the Auto classes.
Done. Similar to the video processors, I added a make_batch function which flattens the list if it's nested.
From the user's perspective nothing changes, and the processor returns the same shapes if a nested list is passed. I added a test for that.
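The behaviour described here can be sketched roughly as follows. This is not the actual transformers implementation, only a hedged illustration of a make_batch-style helper and the shape-equivalence property the test is said to cover:

```python
def make_batch(images):
    """Return a flat batch of images.

    Sketch only: flattens a nested list [[...], [...]] into one flat list,
    and passes an already-flat list through unchanged, so both input
    formats yield identical shapes downstream.
    """
    if images and isinstance(images[0], list):
        return [image for sample in images for image in sample]
    return list(images)

flat = ["image_stop", "image_cats", "image_snowman"]
nested = [["image_stop", "image_cats"], ["image_snowman"]]
print(make_batch(nested) == make_batch(flat))
# True
```

With a helper like this, the nesting only expresses which images belong to which prompt; the processor sees the same flat batch either way.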
Thanks for adding and for making the inputs more consistent across processors!
What does this PR do?
This PR adds an example of batched inference for LLaVa-Next to the docs. Also, as discussed with @NielsRogge, I verified that we can support a multi-turn conversation format by feeding multiple images in one prompt. Both usage cases are added to the example code snippet.