
Pixtral: vectorize patch embeddings and enable tests #35122

Merged
merged 13 commits into huggingface:main on Jan 30, 2025

Conversation

zucchini-nlp
Member

What does this PR do?

Continuation on the discussion from #35110.

This PR gets rid of the loop over each image in the input for Pixtra and makes it more aligned with other VLMs. Now the model will pad the image on h/w dimensions and unpad it back after the vision patch embedding layer. That also helps us to get rid of extra dimension errors on processing code we've been having lately and remove the BacthFeatureMix

This PR gets rid of the loop over each image in the input for Pixtral and makes it more aligned with other VLMs. Now the model will pad the image on the h/w dimensions and unpad it back after the vision patch embedding layer. That also helps us get rid of the extra-dimension errors we've been having lately in the processing code and remove the BatchMixFeature.

Tested with the demo script from https://huggingface.co/mistral-community/pixtral-12b, as we don't have any test for Pixtral as a VLM at the moment. The generations match at the text level.

cc @Rocketknight1 wdyt about this? The design might need some changes, as I had to make llava accept extra kwargs (image_sizes) to make the model work.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ywang96

ywang96 commented Dec 6, 2024

@zucchini-nlp One concern I have about padding the images before the vision backbone and then unpadding them after is that you'll always be padding every image in the batch to the max size unnecessarily (especially since models like Pixtral support 10+ images in one inference call), which IMO could be a bit problematic. At that point it may just be better to flatten and concat pixel values into a batched tensor of shape (total_num_patches, C, H, W), since you will need to pass in additional information (image_sizes in this case) anyway.
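
A rough sketch of the flatten-and-concat alternative being suggested here (the patch size and helper name are illustrative assumptions, not the transformers API):

# Hypothetical sketch: split each image into non-overlapping patches and
# concatenate across images, avoiding any padding to the batch max.
import torch

def flatten_and_concat(images: list[torch.Tensor], patch_size: int = 16):
    all_patches, image_sizes = [], []
    for img in images:
        c, h, w = img.shape
        # (C, H, W) -> (C, H//p, W//p, p, p) -> (num_patches, C, p, p)
        patches = img.unfold(1, patch_size, patch_size).unfold(2, patch_size, patch_size)
        patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c, patch_size, patch_size)
        all_patches.append(patches)
        image_sizes.append((h, w))
    # (total_num_patches, C, patch_size, patch_size) plus per-image sizes
    return torch.cat(all_patches, dim=0), image_sizes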

@zucchini-nlp
Member Author

zucchini-nlp commented Dec 8, 2024

@ywang96 hmm, indeed padding everything to max size could add some extra overhead, lemme see how much overhead comes from it

"better to flatten and concat pixel values to the batched tensor of shape (total_num_patches, C, H, W)"

Not sure I got this. Currently we return pixels of shape (batch_size, C, H, W). What is the difference with making it total_num_patches? IIUC you mean smth like flattening on H*W and concatenating on that dim, similar to what Qwen2-VL does?

@zucchini-nlp
Member Author

zucchini-nlp commented Dec 11, 2024

Made some evals for memory usage with inputs where all images are low resolution (100 x 100) and one has the highest resolution possible (1024 x 1024). The memory usage between the old and the new processing logic is almost the same; excluding the memory used by the model weights, it is:

Old processor -> 41470 MiB
New processor -> 41875 MiB

A higher batch size results in OOM in both cases on an A100 80GB.

from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id
processor.tokenizer.padding_side = "left"


# two small images plus one large one so the batch mixes resolutions
url_dog = "https://picsum.photos/id/237/100/100"
url_mountain = "https://picsum.photos/seed/picsum/100/100"
url_stop = "https://picsum.photos/id/237/2000/2000"

chat = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "Can this animal"}, 
        {"type": "image"},
        {"type": "text", "content": " live here?"}, 
        {"type": "image"},
      ]
    }
]
prompt = processor.apply_chat_template(chat)

chat_2 = [
    {
      "role": "user", "content": [
        {"type": "text", "content": "What do you see here"}, 
        {"type": "image"},
      ]
    }
]
prompt_2 = processor.apply_chat_template(chat_2)


# baseline before building inputs and loading the model
begin = torch.cuda.memory_allocated()

prompts = [prompt] * 40 + [prompt_2]
images = [[url_dog, url_mountain]] * 40 + [[url_stop]]


inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0", torch_dtype="float16")

inputs = inputs.to(model.device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=50)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# peak memory relative to the baseline, in MiB
end = torch.cuda.max_memory_allocated()
print((end - begin) // 1024 ** 2, "MiB")

@zucchini-nlp changed the title from "[WIP] Pixtral: vectorize patch embeddings" to "Pixtral: vectorize patch embeddings and enable tests" on Jan 10, 2025
@zucchini-nlp
Member Author

Ready for review! Quick update:

This PR modifies the Pixtral model code to act similarly to other VLMs that divide each image into patches. So now we pad the inputs to max_patch_size and get a more standard 4D pixel_values. After running the patch embeddings, the hidden_states can be unpadded back. A similar method is used in llava-next-based models and in Emu3 (maybe also Molmo 🤔).

Plus, I enabled the tests that were skipped and fixed what needed to be fixed. All slow tests for Pixtral are passing, and I ran a sanity check by generating with "pixtral-12b".
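
For readers of the thread, a hedged sketch of what unpadding after the patch embedding layer could look like (the shapes, names, and grid layout are assumptions, not the actual modeling code):

# Hypothetical sketch: drop the padded positions from a (B, H_p, W_p, D) grid of
# patch embeddings using the recorded image_sizes, then flatten per image.
import torch

def unpad_patch_embeddings(hidden_states: torch.Tensor,
                           image_sizes: torch.Tensor,
                           patch_size: int = 16) -> list[torch.Tensor]:
    unpadded = []
    for i, (h, w) in enumerate(image_sizes.tolist()):
        n_h, n_w = h // patch_size, w // patch_size
        # keep only the patches that belong to the real (unpadded) image
        unpadded.append(hidden_states[i, :n_h, :n_w].reshape(n_h * n_w, -1))
    return unpadded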

Rocketknight1 (Member) left a comment


If tests pass, I'm happy with it! In general I strongly approve of making Pixtral more compliant with the other VLM code. Also, the existing code doesn't work with batched inputs at all, so more or less anything is an improvement 😅

ArthurZucker (Collaborator) left a comment


Thanks! Would be nice to add an explicit test with padded patches etc.

supports_gradient_checkpointing = True
_no_split_modules = ["PixtralVisionAttention"]
_skip_keys_device_placement = "past_key_values"
Collaborator

why don't we keep _skip_keys_device_placement?

Member Author

Oh, it's because Pixtral is a vision-only model, like ViT, and we don't expect a ViT to have a cache at all. The VLM is anyway called via the llava model class, which should be the one indicating whether we support a cache or not (depending on the LM).

supports_gradient_checkpointing = True
_no_split_modules = ["PixtralVisionAttention"]
_skip_keys_device_placement = "past_key_values"
_supports_cache_class = True
Collaborator

pretty sure it does support cache class

ArthurZucker (Collaborator) left a comment

Thanks! Let's run these slow tests and merge.

return BatchMixFeature(
    data={"pixel_values": batch_images, "image_sizes": batch_image_sizes},
    tensor_type=None,
)
for image in images:
Collaborator

thanks

Comment on lines 596 to 608
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# fmt: off
EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
- **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
- **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
- **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
- **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
- **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
- **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
- **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
- **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""
Collaborator

IMO this test is important: it makes sure a single prompt can describe 4 images, and this test was passing 1-1, so let's keep it.

Member Author

Oke, but the test wasn't even loading the model correctly for me, so I removed it. I changed the ckpt now and used the correct device for the inputs.

One thing is that the model runs on CPU; we can't put it on GPU even in fp16, and 4-bit hurts performance a lot. Will just leave it as is then.

@ArthurZucker
Collaborator

run slow pixtral


This comment contains run-slow, running the specified jobs: ['models/pixtral'] ...

@zucchini-nlp zucchini-nlp merged commit 9725e5b into huggingface:main Jan 30, 2025
25 checks passed
bursteratom pushed a commit to bursteratom/transformers that referenced this pull request Feb 5, 2025
* initial POC

* - batch mix feature

* fix tests

* fix tests

* make style

* do not skip and instead fix tests

* update

* return back the test

* correct text with the correct ckpt
elvircrn pushed a commit to elvircrn/transformers that referenced this pull request Feb 13, 2025
sbucaille pushed a commit to sbucaille/transformers that referenced this pull request Feb 16, 2025