
Process inputs directly in apply_chat_template in image-text-to-text pipeline #35616

Open · yonigozlan wants to merge 2 commits into main from vectorize-input-chat-image-text-to-text-pipeline

Conversation

yonigozlan (Member)

What does this PR do?

Follows #34275
Process inputs directly in apply_chat_template instead of calling apply_chat_template and then the processor.
This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
The pipeline also supports passing images with the images arg even when using a chat template; each such image is represented by a {"type": "image"} placeholder in the chat.
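
To illustrate the change in processing flow, here is a minimal sketch (my illustration, not code from this PR; it assumes a processor whose apply_chat_template can tokenize and load the images itself, as the LLaVA processors do):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Old two-step pattern: render the template, then run the processor on text + images.
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=[...], return_tensors="pt")

# New single-step pattern: apply_chat_template tokenizes and processes the images itself.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```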

In the previous behavior, when the input was:

```python
image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s the difference between these two images?"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
outputs = pipe([image_ny, image_chicago], text=messages)
```

The output would be:

```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

With no mention of the actual input images.
Now the output is:

```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```
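
As a usage note (my sketch, not from the PR description; the model choice is an assumption), the same request can also be made with the images embedded directly in the chat instead of the separate images argument:

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's the difference between these two images?"},
            {"type": "image", "image": image_ny},        # URLs from the example above
            {"type": "image", "image": image_chicago},
        ],
    }
]

outputs = pipe(text=messages)  # equivalent to pipe([image_ny, image_chicago], text=messages)
```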

Who can review?

@zucchini-nlp @Rocketknight1

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

Nice, thanks for updating the pipeline. Left a couple comments

```diff
@@ -161,7 +161,7 @@ def __call__(
                 width // self.patch_size
             ) + self.num_additional_image_tokens
             if self.vision_feature_select_strategy == "default":
-                num_image_tokens -= 1
+                num_image_tokens -= self.num_additional_image_tokens
```
@zucchini-nlp (Member):

This should be 1 to work correctly with different ViT backbones. Was it causing any test failures?

@yonigozlan (Member, Author):

Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:

```python
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"

outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)
```

returns:

```
ValueError: Image features and image tokens do not match: tokens: 728, features 729
```
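
For context on the off-by-one (an inference on my part, not stated in the thread): llava-interleave uses a SigLIP vision tower, which has no CLS token, so the "default" strategy's unconditional subtraction of 1 drops one token too many:

```python
# Back-of-the-envelope for llava-interleave (SigLIP backbone, 384px input,
# patch size 14 -- these backbone numbers are my assumption, not from the thread):
num_image_tokens = (384 // 14) * (384 // 14)     # 27 * 27 = 729 patch features
num_additional_image_tokens = 0                  # SigLIP adds no CLS token
num_image_tokens += num_additional_image_tokens  # still 729
num_image_tokens -= 1                            # old "default" branch -> 728
# 728 prompt tokens vs. 729 vision features: the ValueError above.
```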

@zucchini-nlp (Member):

In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in vision_feature_select_strategy between the model config and the processor. Will fix that on the Hub :)
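
A quick way to spot such a mismatch (a sketch; it assumes both the config and the processor expose vision_feature_select_strategy, which LLaVA's do):

```python
from transformers import AutoConfig, AutoProcessor

ckpt = "llava-hf/llava-interleave-qwen-0.5b-hf"
config = AutoConfig.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# If these disagree (e.g. "full" vs. "default"), the token count computed by
# the processor will not match the feature count produced by the model.
print(config.vision_feature_select_strategy)
print(processor.vision_feature_select_strategy)
```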

@qubvel (Member) left a comment:

Thanks! A few comments on my side

@yonigozlan force-pushed the vectorize-input-chat-image-text-to-text-pipeline branch from e3d95fd to 37bb6fc on January 13, 2025.
@ArthurZucker (Collaborator) left a comment:

Missing doc / examples but nice otherwise!
