Pixtral: vectorize patch embeddings and enable tests #35122
Conversation
@zucchini-nlp One concern I have about …
@ywang96 hmm, indeed it could induce some extra overhead by padding everything to "max-size", lemme see how much overhead comes from it
Not sure I got this. Currently we return pixels of shape (batch-size, C, H, W). What is the difference between making it …
Made some evals for memory usage with inputs where all the images are low resolution (100 x 100) and there is one with the highest resolution possible (1024 x 1024). The memory usage between the old and the new processing logic is almost the same; excluding memory used by model weights it is:

Old processor -> 41470 MiB

A higher batch size results in OOM in both cases on an A100 80GB.

```python
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
processor = AutoProcessor.from_pretrained(model_id, use_fast=True)
processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id
processor.tokenizer.padding_side = "left"

# Two low-resolution images plus one image above the model's max resolution
url_dog = "https://picsum.photos/id/237/100/100"
url_mountain = "https://picsum.photos/seed/picsum/100/100"
url_stop = "https://picsum.photos/id/237/2000/2000"

chat = [
    {
        "role": "user", "content": [
            {"type": "text", "content": "Can this animal"},
            {"type": "image"},
            {"type": "text", "content": " live here?"},
            {"type": "image"},
        ]
    }
]
prompt = processor.apply_chat_template(chat)

chat_2 = [
    {
        "role": "user", "content": [
            {"type": "text", "content": "What do you see here"},
            {"type": "image"},
        ]
    }
]
prompt_2 = processor.apply_chat_template(chat_2)

begin = torch.cuda.memory_allocated()

# 40 two-image (low-res) prompts plus one single-image (high-res) prompt
prompts = [prompt] * 40 + [prompt_2]
images = [[url_dog, url_mountain]] * 40 + [[url_stop]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0", torch_dtype="float16")
inputs = inputs.to(model.device, torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=50)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

end = torch.cuda.max_memory_allocated()
print((end - begin) // 1024 ** 2, "MiB")
```
Ready for review! Quick update: this PR modifies the Pixtral model code to act similarly to other VLMs that divide each image into patches. So now we pad the inputs to the maximum image size in the batch and unpad them back after the patch embedding layer.

Plus, I enabled the tests that were skipped and fixed what needed to be fixed. All slow tests for Pixtral are passing, and I ran a sanity check by generating with "pixtral-12b".
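For intuition, here is a minimal sketch of that pad → embed → unpad flow (the helper name, dims, and shapes are illustrative, not the actual Pixtral modeling code):

```python
# Minimal sketch: pad images to the batch max size, embed patches in one
# vectorized conv call, then "unpad" by keeping only each image's real patches.
import torch
import torch.nn.functional as F

patch_size = 16
patch_conv = torch.nn.Conv2d(3, 64, kernel_size=patch_size, stride=patch_size)

def embed_variable_size_images(images):
    """images: list of (C, H, W) tensors; H and W are multiples of patch_size."""
    max_h = max(img.shape[1] for img in images)
    max_w = max(img.shape[2] for img in images)
    # Pad each image on the bottom/right up to the batch max, then stack.
    batch = torch.stack(
        [F.pad(img, (0, max_w - img.shape[2], 0, max_h - img.shape[1])) for img in images]
    )  # (num_images, C, max_h, max_w)
    embeds = patch_conv(batch)  # (num_images, dim, max_h // patch_size, max_w // patch_size)
    # Keep only the patch embeddings belonging to each real image.
    return [
        emb[:, : h // patch_size, : w // patch_size].reshape(emb.shape[0], -1).T
        for emb, (h, w) in zip(embeds, [(img.shape[1], img.shape[2]) for img in images])
    ]

# e.g. two differently sized images -> per-image (num_patches, dim) embeddings
outs = embed_variable_size_images([torch.rand(3, 64, 96), torch.rand(3, 128, 128)])
print([o.shape for o in outs])  # [torch.Size([24, 64]), torch.Size([64, 64])]
```

Each returned tensor has exactly `(H // patch_size) * (W // patch_size)` rows, so no padded patches leak into the language model.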
If tests pass, I'm happy with it! In general I strongly approve of making Pixtral more compliant with the other VLM code. Also, the existing code doesn't work with batched inputs at all, so more or less anything is an improvement 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Would be nice to add an explicit test with padded patches etc.
```python
    supports_gradient_checkpointing = True
    _no_split_modules = ["PixtralVisionAttention"]
    _skip_keys_device_placement = "past_key_values"
```
Why don't we keep `_skip_keys_device_placement`?
Oh, it's because Pixtral is a vision-only model, like ViT, and we don't expect ViT to have a cache at all. The VLM is called with the LLaVA model class anyway, which should be the one indicating whether we support cache or not (depending on the LM).
```python
    supports_gradient_checkpointing = True
    _no_split_modules = ["PixtralVisionAttention"]
    _skip_keys_device_placement = "past_key_values"
    _supports_cache_class = True
```
Pretty sure it does support the cache class.
Thanks! Let's run these slow tests and merge.
```diff
-        return BatchMixFeature(
-            data={"pixel_values": batch_images, "image_sizes": batch_image_sizes},
-            tensor_type=None,
+        for image in images:
```
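For context, after this change the processor returns a single padded `pixel_values` tensor together with `image_sizes`, rather than the nested `BatchMixFeature`. A rough illustration (the shapes here are made-up examples):

```python
# Illustrative only: the new output layout, with example shapes.
out = processor(text=prompts, images=images, padding=True, return_tensors="pt")
print(out["pixel_values"].shape)  # e.g. torch.Size([81, 3, 1024, 1024]) - every image padded to the batch max H/W
print(out["image_sizes"])         # original per-image sizes, used to unpad after the patch embedding layer
```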
thanks
```python
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][IMG][IMG][/INST]"

# image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=500)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

# fmt: off
EXPECTED_GENERATION = """
Describe the images.
Sure, let's break down each image description:

1. **Image 1:**
   - **Description:** A black dog with a glossy coat is sitting on a wooden floor. The dog has a focused expression and is looking directly at the camera.
   - **Details:** The wooden floor has a rustic appearance with visible wood grain patterns. The dog's eyes are a striking color, possibly brown or amber, which contrasts with its black fur.

2. **Image 2:**
   - **Description:** A scenic view of a mountainous landscape with a winding road cutting through it. The road is surrounded by lush green vegetation and leads to a distant valley.
   - **Details:** The mountains are rugged with steep slopes, and the sky is clear, indicating good weather. The winding road adds a sense of depth and perspective to the image.

3. **Image 3:**
   - **Description:** A beach scene with waves crashing against the shore. There are several people in the water and on the beach, enjoying the waves and the sunset.
   - **Details:** The waves are powerful, creating a dynamic and lively atmosphere. The sky is painted with hues of orange and pink from the setting sun, adding a warm glow to the scene.

4. **Image 4:**
   - **Description:** A garden path leading to a large tree with a bench underneath it. The path is bordered by well-maintained grass and flowers.
   - **Details:** The path is made of small stones or gravel, and the tree provides a shaded area with the bench invitingly placed beneath it. The surrounding area is lush and green, suggesting a well-kept garden.

Each image captures a different scene, from a close-up of a dog to expansive natural landscapes, showcasing various elements of nature and human interaction with it.
"""
```
IMO this test is important: it makes sure a single prompt can describe 4 images, and this test was passing 1:1, so let's keep it.
Okay, but the test wasn't even loading the model correctly for me, so I removed it. I changed the ckpt now and used the correct device for the inputs.
One thing is that the model runs on CPU; we can't put it on GPU even in fp16, and 4-bit hurts performance a lot. Will just leave it as is then.
run slow pixtral
This comment contains run-slow, running the specified jobs: ['models/pixtral'] ...
* initial POC
* batch mix feature
* fix tests
* fix tests
* make style
* do not skip and instead fix tests
* update
* return back the test
* correct text with the correct ckpt
What does this PR do?
Continuation on the discussion from #35110.
This PR gets rid of the loop over each image in the input for Pixtral and makes it more aligned with other VLMs. Now the model will pad the image on the h/w dimensions and unpad it back after the vision patch embedding layer. That also helps us get rid of the extra-dimension errors in the processing code we've been having lately, and remove the `BatchMixFeature`.
Tested with the demo script from https://huggingface.co/mistral-community/pixtral-12b, as we don't have any test for Pixtral as a VLM at the moment. The generations match at the text level.
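For reference, the sanity check follows the model card's demo, roughly along these lines (the URLs and generation length are placeholders):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "mistral-community/pixtral-12b"
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="cuda:0", torch_dtype="float16")
processor = AutoProcessor.from_pretrained(model_id)

# Pixtral prompt format: one [IMG] placeholder per input image
PROMPT = "<s>[INST]Describe the images.\n[IMG][IMG][/INST]"
IMG_URLS = [
    "https://picsum.photos/id/237/400/300",
    "https://picsum.photos/seed/picsum/400/300",
]

inputs = processor(text=PROMPT, images=IMG_URLS, return_tensors="pt").to("cuda")
generate_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```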
cc @Rocketknight1 wdyt about this? The design might need some changes, as I had to make llava accept extra kwargs (`image_sizes`) to make the model work.