Process inputs directly in apply_chat_template in image-text-to-text pipeline #35616
Conversation
Nice, thanks for updating the pipeline. Left a couple comments
```diff
@@ -161,7 +161,7 @@ def __call__(
                     width // self.patch_size
                 ) + self.num_additional_image_tokens
                 if self.vision_feature_select_strategy == "default":
-                    num_image_tokens -= 1
+                    num_image_tokens -= self.num_additional_image_tokens
```
This should be `1` to work correctly with different ViT backbones. Was it causing any test failures?
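A minimal sketch of the arithmetic under discussion (a hypothetical standalone function, not the actual processor code): the two subtractions only coincide when the backbone adds exactly one extra token.

```python
# Hypothetical helper mirroring the processor logic above, for illustration only.
def count_image_tokens(height, width, patch_size, num_additional_image_tokens, strategy):
    n = (height // patch_size) * (width // patch_size) + num_additional_image_tokens
    if strategy == "default":
        # The "default" strategy drops exactly one token (the CLS token),
        # independent of how many extra tokens the backbone adds.
        n -= 1
    return n

# CLIP-style ViT (one CLS token): -1 and -num_additional_image_tokens agree.
print(count_image_tokens(336, 336, 14, 1, "default"))  # 576
# For a backbone with num_additional_image_tokens == 0, the two variants
# would diverge by one token.
```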
Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:
```python
# Excerpt from the pipeline tests (runs inside a unittest.TestCase, hence `self`).
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"
outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)
```
raises:
```
ValueError: Image features and image tokens do not match: tokens: 728, features 729
```
In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in `vision_feature_select_strategy` between the model config and the processor. Will fix that on the Hub :)
Thanks! A few comments on my side
Missing doc / examples but nice otherwise!
What does this PR do?
Follows #34275
Process inputs directly in `apply_chat_template` instead of calling `apply_chat_template` and then the processor. This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
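A rough before/after sketch of what this means inside the pipeline (simplified and illustrative; `processor`, `messages`, and `images` are assumed to be defined, and the kwargs reflect my reading of the processor API, not the exact pipeline code):

```python
# Before: render the chat to a prompt string, then run the processor on it.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
model_inputs = processor(images=images, text=prompt, return_tensors="pt")

# After: apply_chat_template tokenizes and processes the images in one step.
model_inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```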
The pipeline also supports passing images with the `images` arg even when using a chat template; the corresponding image is represented with a `{"type": "image"}` entry in the chat, as in the sketch below.
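For example, a usage sketch along those lines (the checkpoint and prompt are illustrative, reusing the test fixture from the discussion above):

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            # Placeholder entry; the actual image is supplied via `images` below.
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]

outputs = pipe(images="./tests/fixtures/tests_samples/COCO/000000039769.png", text=messages)
print(outputs)
```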
In the previous behavior, the output made no mention of the actual input images; now they are included in the output as well.
Who can review?
@zucchini-nlp @Rocketknight1