
Keep image_sizes in output of PixtralProcessor #35110

Closed

Conversation

ywang96 (Author) commented Dec 6, 2024

This PR keeps the image_sizes in the output of PixtralProcessor.__call__.

For context: in vLLM we have been using this particular kwarg from the output of PixtralImageProcessor to tell whether a model is Pixtral-HF or LLaVA (since both are defined as LlavaForConditionalGeneration).

We are currently migrating to use PixtralProcessor directly (and similarly AutoProcessor for many other multimodal models) to reduce development overhead, but noticed that this kwarg is dropped during the processor call. I hope it's not a big deal to keep it (since it's just a 1d tensor) in the output, but let me know if there's any issue or a better suggestion.
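
A minimal sketch of the intended behavior after this change (the checkpoint name, prompt format, and the downstream check are illustrative assumptions, not code from this PR):

```python
from PIL import Image
from transformers import AutoProcessor

# Checkpoint name and prompt format are assumptions for illustration only.
processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
image = Image.new("RGB", (512, 384))

inputs = processor(text="<s>[INST][IMG][/INST]", images=image, return_tensors="pt")

# PixtralImageProcessor already returns `image_sizes` next to `pixel_values`;
# with this PR, PixtralProcessor.__call__ would keep it in its output too,
# so downstream code (e.g. vLLM) could tell Pixtral-HF apart from plain LLaVA:
is_pixtral_hf = "image_sizes" in inputs
```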

cc @mgoin @DarkLight1337

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: Roger Wang <ywang@roblox.com>
ywang96 (Author) commented Dec 6, 2024

@zucchini-nlp let me know if I need to provide more information - thanks in advance!

zucchini-nlp (Member) left a comment

Hey @ywang96!

If we are to keep this kwarg in the outputs, it should also be accepted by the model's forward call, which is not the case. We usually don't output kwargs that are not in the forward signature.

I realize Pixtral is a bit different from LLaVA, and that causes headaches in processing. We've been discussing with @Rocketknight1 the possibility of returning one batched tensor for the pixel values instead of a list of tensors. I believe that would be a cleaner solution and in line with other VLMs.

WDYT?
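
To illustrate the constraint being described (all names below are placeholders, not transformers code): processor outputs are commonly unpacked straight into the model call, so any key missing from the forward signature fails.

```python
# Hypothetical illustration: an extra key kept by the processor breaks the call
# if the model's forward() does not accept it.
def forward(input_ids, attention_mask, pixel_values):
    return "ok"

inputs = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "pixel_values": [[0.0]],
    "image_sizes": [[512, 384]],  # extra key kept by the processor
}

try:
    forward(**inputs)
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'image_sizes'
```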

ywang96 (Author) commented Dec 6, 2024

> Hey @ywang96!
>
> If we are to keep this kwarg in the outputs, it should also be accepted by the model's forward call, which is not the case. We usually don't output kwargs that are not in the forward signature.
>
> I realize Pixtral is a bit different from LLaVA, and that causes headaches in processing. We've been discussing with @Rocketknight1 the possibility of returning one batched tensor for the pixel values instead of a list of tensors. I believe that would be a cleaner solution and in line with other VLMs.
>
> WDYT?

Yeah, I'm honestly not a huge fan of using the LLaVA definition to support Pixtral.

Regarding one batched tensor for the pixel values instead of a list of tensors: what does the shape look like in your mind? If two images have different numbers of patches, would you return a tensor of shape (total_num_patches, 3, h, w)?

This design has a caveat: if that tensor is the only output of the processor, you can't tell which patches belong to which image. That's totally fine for inference only, but it's tricky if users want to cache the embeddings of each individual image (something we want to do in vLLM).
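
To make the caveat concrete, here is a hypothetical sketch (patch size, image sizes, and tensor names are made up) of how something like `image_sizes` is what lets a flattened patch tensor be partitioned back into per-image chunks:

```python
import torch

patch_size = 16
image_sizes = torch.tensor([[512, 384], [256, 256]])  # (h, w) per image

# Flattened patches for both images: 32*24 + 16*16 = 1024 patches in total.
pixel_patches = torch.randn(32 * 24 + 16 * 16, 3, patch_size, patch_size)

# With image_sizes we can recover which patches belong to which image ...
patches_per_image = [(h // patch_size) * (w // patch_size) for h, w in image_sizes.tolist()]
per_image = torch.split(pixel_patches, patches_per_image)  # shapes (768, 3, 16, 16) and (256, 3, 16, 16)

# ... which is exactly what per-image embedding caching needs; without image_sizes,
# the flattened tensor alone cannot be split back into images.
```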

zucchini-nlp (Member) commented

@ywang96 I was mostly thinking about padding the images before the vision backbone and then unpadding them afterwards, similar to what we do in LLaVA-NeXT and also did in Emu (which is not merged into the main library yet). I am not 100% sure this will work out as expected, because the model was not trained that way and adding extra zeros on height/width might change the outputs of the patch embedding. We would need to test whether it is nearly the same as looping over the images, as it was for some other models.

> cache the embeddings of each individual image

Can you share more on this? Do you mean caching the image last hidden states? The last hidden states should not change in shape, because what currently happens is that images are embedded by looping over them and then stacked into a batch by flattening over the mismatched size dimensions.
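
For reference, a rough sketch of the pad/unpad idea being discussed (this is not the actual transformers implementation; the stand-in `vision_tower`, patch size, and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

patch_size = 16
images = [torch.randn(3, 512, 384), torch.randn(3, 256, 256)]
sizes = [(im.shape[1], im.shape[2]) for im in images]
max_h = max(h for h, _ in sizes)
max_w = max(w for _, w in sizes)

# Pad every image to (max_h, max_w) so they batch into one tensor.
batch = torch.stack(
    [F.pad(im, (0, max_w - w, 0, max_h - h)) for im, (h, w) in zip(images, sizes)]
)

# Stand-in for the real vision backbone: returns one embedding per patch,
# laid out row-major over the padded grid.
def vision_tower(x):
    b = x.shape[0]
    return torch.randn(b, (max_h // patch_size) * (max_w // patch_size), 1024)

features = vision_tower(batch)

# Unpad: keep only the patch embeddings that came from real (non-padded) pixels.
unpadded = []
for i, (h, w) in enumerate(sizes):
    grid = features[i].view(max_h // patch_size, max_w // patch_size, -1)
    unpadded.append(grid[: h // patch_size, : w // patch_size].reshape(-1, grid.shape[-1]))
# `unpadded[i]` now holds only the embeddings for image i's real patches.
```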

zucchini-nlp (Member) commented

Made a PR as a proof of concept and verified that padding the inputs matches the output of looping over the inputs: #35122

ywang96 (Author) commented Jan 1, 2025

Hi @zucchini-nlp! I'm closing this PR as we might just deal with it on the vLLM side for now. Thanks!

ywang96 closed this Jan 1, 2025