
Process inputs directly in apply_chat_template in image-text-to-text pipeline #35616

Open · yonigozlan wants to merge 2 commits into main from vectorize-input-chat-image-text-to-text-pipeline

Conversation

yonigozlan (Member)

What does this PR do?

Follows #34275
Process inputs directly in apply_chat_template instead of calling apply_chat_template and then the processor.
This also means that a small part of the pipeline logic needed to change, but I think it's better now :).
The pipeline also supports passing images with the images arg even when using a chat template; each such image is represented by a {"type": "image"} placeholder in the chat.
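
To illustrate the change in processing flow, here is a minimal sketch (my illustration, not code from this PR; it assumes a processor whose apply_chat_template can tokenize and load the images itself, as the LLaVA processors do):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Old two-step pattern: render the template, then run the processor on text + images.
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=[...], return_tensors="pt")

# New single-step pattern: apply_chat_template tokenizes and processes the images itself.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
)
```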

In the previous behavior, when the input was:

```python
image_ny = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image_chicago = "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s the difference between these two images?"},
            {"type": "image"},
            {"type": "image"},
        ],
    }
]
outputs = pipe([image_ny, image_chicago], text=messages)
```

The output would be:

```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {"type": "image"},
                    {"type": "image"},
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```

With no mention of the actual input images.
Now the output is:

```python
[
    {
        "input_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            }
        ],
        "generated_text": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What’s the difference between these two images?"},
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
                    },
                    {
                        "type": "image",
                        "image": "https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg",
                    },
                ],
            },
            {
                "role": "assistant",
                "content": "The first image shows a statue of Liberty in the foreground, while the second image shows a city skyline",
            },
        ],
    }
]
```
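
As a usage note (my sketch, not from the PR description; the model choice is an assumption), the same request can also be made with the images embedded directly in the chat instead of the separate images argument:

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's the difference between these two images?"},
            {"type": "image", "image": image_ny},        # URLs from the example above
            {"type": "image", "image": image_chicago},
        ],
    }
]

outputs = pipe(text=messages)  # equivalent to pipe([image_ny, image_chicago], text=messages)
```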

Who can review?

@zucchini-nlp @Rocketknight1

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

Nice, thanks for updating the pipeline. Left a couple comments

```diff
@@ -161,7 +161,7 @@ def __call__(
                 width // self.patch_size
             ) + self.num_additional_image_tokens
             if self.vision_feature_select_strategy == "default":
-                num_image_tokens -= 1
+                num_image_tokens -= self.num_additional_image_tokens
```
@zucchini-nlp (Member):

This should be 1 to work correctly with different ViT backbones. Was it causing any test failures?

@yonigozlan (Member, Author):

Without this change, I'm getting errors on pipeline tests that used to work with llava-interleave. For example:

```python
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
text = "<image> What this is? Assistant: This is"

outputs = pipe(image, text=text)
self.assertEqual(
    outputs,
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ],
)
```

returns:

```
ValueError: Image features and image tokens do not match: tokens: 728, features 729
```
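
For context on the off-by-one (an inference on my part, not stated in the thread): llava-interleave uses a SigLIP vision tower, which has no CLS token, so the "default" strategy's unconditional subtraction of 1 drops one token too many:

```python
# Back-of-the-envelope for llava-interleave (SigLIP backbone, 384px input,
# patch size 14 -- these backbone numbers are my assumption, not from the thread):
num_image_tokens = (384 // 14) * (384 // 14)     # 27 * 27 = 729 patch features
num_additional_image_tokens = 0                  # SigLIP adds no CLS token
num_image_tokens += num_additional_image_tokens  # still 729
num_image_tokens -= 1                            # old "default" branch -> 728
# 728 prompt tokens vs. 729 vision features: the ValueError above.
```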

@zucchini-nlp (Member):

In the case of llava-interleave-qwen-0.5b-hf, I see a mismatch in vision_feature_select_strategy between the model config and the processor. Will fix that on the Hub :)
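
A quick way to spot such a mismatch (a sketch; it assumes both the config and the processor expose vision_feature_select_strategy, which LLaVA's do):

```python
from transformers import AutoConfig, AutoProcessor

ckpt = "llava-hf/llava-interleave-qwen-0.5b-hf"
config = AutoConfig.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# If these disagree (e.g. "full" vs. "default"), the token count computed by
# the processor will not match the feature count produced by the model.
print(config.vision_feature_select_strategy)
print(processor.vision_feature_select_strategy)
```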

@qubvel (Member) left a comment:

Thanks! A few comments on my side

@yonigozlan force-pushed the vectorize-input-chat-image-text-to-text-pipeline branch from e3d95fd to 37bb6fc on January 13, 2025.
@ArthurZucker (Collaborator) left a comment:

Missing doc / examples but nice otherwise!
