
Pixtral: vectorize patch embeddings and enable tests #35122

Merged
merged 13 commits into huggingface:main on Jan 30, 2025

Conversation

zucchini-nlp
Member

What does this PR do?

Continuation on the discussion from #35110.

This PR gets rid of the loop over each image in the input for Pixtra and makes it more aligned with other VLMs. Now the model will pad the image on h/w dimensions and unpad it back after the vision patch embedding layer. That also helps us to get rid of extra dimension errors on processing code we've been having lately and remove the BacthFeatureMix

This PR gets rid of the loop over each image in the input for Pixtral and makes it more aligned with other VLMs. Now the model will pad the image on the h/w dimensions and unpad it back after the vision patch embedding layer. That also helps us get rid of the extra-dimension errors we've been having lately in the processing code and remove the BatchMixFeature.

Tested with the demo script from https://huggingface.co/mistral-community/pixtral-12b, as we don't have any test for Pixtral as a VLM at the moment. The generations match at the text level.

cc @Rocketknight1 wdyt about this? The design might need some changes, as I had to make llava accept extra kwargs (image_sizes) to make the model work.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ywang96

ywang96 commented Dec 6, 2024

@zucchini-nlp One concern I have about padding the images before the vision backbone and then unpadding them after is that you'll always be padding every image in the batch to the max size unnecessarily (especially since models like Pixtral support 10+ images in one inference call), which IMO could be a bit problematic. At that point it may just be better to flatten and concat pixel values into a batched tensor of shape (total_num_patches, C, H, W), since you will need to pass in additional information (image_sizes in this case) anyway.
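
A rough sketch of the flatten-and-concat alternative being suggested here (the patch size and helper name are illustrative assumptions, not the transformers API):

# Hypothetical sketch: split each image into non-overlapping patches and
# concatenate across images, avoiding any padding to the batch max.
import torch

def flatten_and_concat(images: list[torch.Tensor], patch_size: int = 16):
    all_patches, image_sizes = [], []
    for img in images:
        c, h, w = img.shape
        # (C, H, W) -> (C, H//p, W//p, p, p) -> (num_patches, C, p, p)
        patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)
        all_patches.append(patches)
        image_sizes.append((h, w))
    # (total_num_patches, C, patch_size, patch_size) plus per-image sizes
    return torch.cat(all_patches, dim=0), image_sizes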

@zucchini-nlp
Member Author

zucchini-nlp commented Dec 8, 2024

@ywang96 hmm, indeed padding everything to max size could add some extra overhead, lemme see how much overhead comes from it

"better to flatten and concat pixel values to the batched tensor of shape (total_num_patches, C, H, W)"

Not sure I got this. Currently we return pixels of shape (batch_size, C, H, W). What is the difference with making it total_num_patches? IIUC you mean smth like flattening on H*W and concatenating on that dim, similar to what Qwen2-VL does?

@zucchini-nlp
Member Author

zucchini-nlp commented Dec 11, 2024

Made some evals for memory usage with inputs where all images are low resolution (100 x 100) and one has the highest resolution possible (1024 x 1024). The memory usage between the old and the new processing logic is almost the same; excluding the memory used by the model weights, it is:

Old processor -> 41470 MiB
New processor -> 41875 MiB

A higher batch size results in OOM in both cases on an A100 80GB.

from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id
processor.tokenizer.padding_side = "left"


# two small images plus one large one so the batch mixes resolutions
url_dog = "https://picsum.photos/id/237/100/100"
url_mountain = "https://picsum.photos/seed/picsum/100/100"
url_stop = "https://picsum.photos/id/237/2000/2000"

chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"}, 
        {"type": "image"},
        {"type": "text", "content": " live here?"}, 
        {"type": "image"},
      ]
    }
]
prompt = processor.apply_chat_template(chat)

chat_2 = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "What do you see here"}, 
        {"type": "image"},
      ]
    }
]
prompt_2 = processor.apply_chat_template(chat_2)


# baseline before building inputs and loading the model
begin = torch.cuda.memory_allocated()

prompts = [prompt] * 40 + [prompt_2]
images = [[url_dog, url_mountain]] * 40 + [[url_stop]]


inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0", torch_dtype="float16")

inputs = inputs.to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=50)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# peak memory relative to the baseline, in MiB
end = torch.cuda.max_memory_allocated()
print((end - begin) // 1024 ** 2, "MiB")

@zucchini-nlp changed the title from "[WIP] Pixtral: vectorize patch embeddings" to "Pixtral: vectorize patch embeddings and enable tests" on Jan 10, 2025
@zucchini-nlp
Member Author

Ready for review! Quick update:

This PR modifies the Pixtral model code to act similarly to other VLMs that divide each image into patches. So now we pad the inputs to max_patch_size and get a more standard 4D pixel_values. After running the patch embeddings, the hidden_states can be unpadded back. A similar method is used in llava-next-based models and in Emu3 (maybe also Molmo 🤔).

Plus, I enabled the tests that were skipped and fixed what needed to be fixed. All slow tests for Pixtral are passing, and I ran a sanity check by generating with "pixtral-12b".
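
For readers of the thread, a hedged sketch of what unpadding after the patch embedding layer could look like (the shapes, names, and grid layout are assumptions, not the actual modeling code):

# Hypothetical sketch: drop the padded positions from a (B, H_p, W_p, D) grid of
# patch embeddings using the recorded image_sizes, then flatten per image.
import torch

def unpad_patch_embeddings(hidden_states: torch.Tensor,
                           image_sizes: torch.Tensor,
                           patch_size: int = 16) -> list[torch.Tensor]:
    unpadded = []
    for i, (h, w) in enumerate(image_sizes.tolist()):
        n_h, n_w = h // patch_size, w // patch_size
        # keep only the patches that belong to the real (unpadded) image
        unpadded.append(hidden_states[i, :n_h, :n_w].reshape(n_h * n_w, -1))
    return unpadded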

Rocketknight1 (Member) left a comment


If tests pass, I'm happy with it! In general I strongly approve of making Pixtral more compliant with the other VLM code. Also, the existing code doesn't work with batched inputs at all, so more or less anything is an improvement 😅

ArthurZucker (Collaborator) left a comment


Thanks! Would be nice to add an explicit test with padded patches etc.

supports_gradient_checkpointing = True
_no_split_modules = ["PixtralVisionAttention"]
_skip_keys_device_placement = "past_key_values"
Collaborator

why don't we keep _skip_keys_device_placement?

Member Author

Oh, it's because Pixtral is a vision-only model, like ViT, and we don't expect a ViT to have a cache at all. The VLM is anyway called via the llava model class, which should be the one indicating whether we support a cache or not (depending on the LM).

supports_gradient_checkpointing = True
_no_split_modules = ["PixtralVisionAttention"]
_skip_keys_device_placement = "past_key_values"
_supports_cache_class = True
Collaborator

pretty sure it does support cache class

ArthurZucker (Collaborator) left a comment

Thanks! Let's run these slow tests and merge.

return BatchMixFeature(
    data={"pixel_values": batch_images, "image_sizes": batch_image_sizes},
    tensor_type=None,
)
for image in images:
Collaborator

thanks

Comment on lines 596 to 608
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# fmt: off
EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
- **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
- **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
- **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
- **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
- **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
- **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
- **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
- **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""
Collaborator

IMO this test is important: it makes sure a single prompt can describe 4 images, and this test was passing 1-1, so let's keep it.

Member Author

Oke, but the test wasn't even loading the model correctly for me, so I removed it. I changed the ckpt now and used the correct device for the inputs.

One thing is that the model runs on CPU; we can't put it on GPU even in fp16, and 4-bit hurts performance a lot. Will just leave it as is then.

@ArthurZucker
Collaborator

run slow pixtral


This comment contains run-slow, running the specified jobs: ['models/pixtral'] ...

@zucchini-nlp zucchini-nlp merged commit 9725e5b into huggingface:main Jan 30, 2025
25 checks passed
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Feb 5, 2025
* initial POC

* - batch mix feature

* fix tests

* fix tests

* make style

* do not skip and instead fix tests

* update

* return back the test

* correct text with the correct ckpt
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025