
Keep image_sizes in output of PixtralProcessor #35110

Closed

Conversation

ywang96 (Author) commented Dec 6, 2024

This PR keeps the image_sizes in the output of PixtralProcessor.__call__.

For context: in vLLM we have been using this particular kwarg from the output of PixtralImageProcessor to tell whether a model is Pixtral-HF or LLaVA (since both are defined as LlavaForConditionalGeneration).

We are currently migrating to use PixtralProcessor directly (and similarly AutoProcessor for many other multimodal models) to reduce development overhead, but noticed that this kwarg is dropped during the processor call. I hope it's not a big deal to keep it (since it's just a 1d tensor) in the output, but let me know if there's any issue or a better suggestion.
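
A minimal sketch of the intended behavior after this change (the checkpoint name, prompt format, and the downstream check are illustrative assumptions, not code from this PR):

```python
from PIL import Image
from transformers import AutoProcessor

# Checkpoint name and prompt format are assumptions for illustration only.
processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
image = Image.new("RGB", (512, 384))

inputs = processor(text="<s>[INST][IMG][/INST]", images=image, return_tensors="pt")

# PixtralImageProcessor already returns `image_sizes` next to `pixel_values`;
# with this PR, PixtralProcessor.__call__ would keep it in its output too,
# so downstream code (e.g. vLLM) could tell Pixtral-HF apart from plain LLaVA:
is_pixtral_hf = "image_sizes" in inputs
```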

cc @mgoin @DarkLight1337

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Signed-off-by: Roger Wang <ywang@roblox.com>
ywang96 (Author) commented Dec 6, 2024

@zucchini-nlp let me know if I need to provide more information - thanks in advance!

zucchini-nlp (Member) left a comment

Hey @ywang96!

If we are to keep this kwarg in the outputs, it should also be accepted by the model's forward call, which is not the case. We usually don't output kwargs that are not in the forward signature.

I realize Pixtral is a bit different from LLaVA, and that causes headaches in processing. We've been discussing with @Rocketknight1 the possibility of returning one batched tensor for the pixel values instead of a list of tensors. I believe that would be a cleaner solution and in line with other VLMs.

WDYT?
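
To illustrate the constraint being described (all names below are placeholders, not transformers code): processor outputs are commonly unpacked straight into the model call, so any key missing from the forward signature fails.

```python
# Hypothetical illustration: an extra key kept by the processor breaks the call
# if the model's forward() does not accept it.
def forward(input_ids, attention_mask, pixel_values):
    return "ok"

inputs = {
    "input_ids": [[1, 2, 3]],
    "attention_mask": [[1, 1, 1]],
    "pixel_values": [[0.0]],
    "image_sizes": [[512, 384]],  # extra key kept by the processor
}

try:
    forward(**inputs)
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'image_sizes'
```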

ywang96 (Author) commented Dec 6, 2024

> Hey @ywang96!
>
> If we are to keep this kwarg in the outputs, it should also be accepted by the model's forward call, which is not the case. We usually don't output kwargs that are not in the forward signature.
>
> I realize Pixtral is a bit different from LLaVA, and that causes headaches in processing. We've been discussing with @Rocketknight1 the possibility of returning one batched tensor for the pixel values instead of a list of tensors. I believe that would be a cleaner solution and in line with other VLMs.
>
> WDYT?

Yeah, I'm honestly not a huge fan of using the LLaVA definition to support Pixtral.

Regarding one batched tensor for the pixel values instead of a list of tensors: what does the shape look like in your mind? If two images have different numbers of patches, would you return a tensor of shape (total_num_patches, 3, h, w)?

This design has a caveat: if that tensor is the only output of the processor, you can't tell which patches belong to which image. That's totally fine for inference only, but it's tricky if users want to cache the embeddings of each individual image (something we want to do in vLLM).
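
To make the caveat concrete, here is a hypothetical sketch (patch size, image sizes, and tensor names are made up) of how something like `image_sizes` is what lets a flattened patch tensor be partitioned back into per-image chunks:

```python
import torch

patch_size = 16
image_sizes = torch.tensor([[512, 384], [256, 256]])  # (h, w) per image

# Flattened patches for both images: 32*24 + 16*16 = 1024 patches in total.
pixel_patches = torch.randn(32 * 24 + 16 * 16, 3, patch_size, patch_size)

# With image_sizes we can recover which patches belong to which image ...
patches_per_image = [(h // patch_size) * (w // patch_size) for h, w in image_sizes.tolist()]
per_image = torch.split(pixel_patches, patches_per_image)  # shapes (768, 3, 16, 16) and (256, 3, 16, 16)

# ... which is exactly what per-image embedding caching needs; without image_sizes,
# the flattened tensor alone cannot be split back into images.
```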

zucchini-nlp (Member) commented

@ywang96 I was mostly thinking about padding the images before the vision backbone and then unpadding them afterwards, similar to what we do in LLaVA-NeXT and also did in Emu (which is not merged into the main library yet). I am not 100% sure this will work out as expected, because the model was not trained that way and adding extra zeros on height/width might change the outputs of the patch embedding. We would need to test whether it is nearly the same as looping over the images, as it was for some other models.

> cache the embeddings of each individual image

Can you share more on this? Do you mean caching the image last hidden states? The last hidden states should not change in shape, because what currently happens is that images are embedded by looping over them and then stacked into a batch by flattening over the mismatched size dimensions.
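
For reference, a rough sketch of the pad/unpad idea being discussed (this is not the actual transformers implementation; the stand-in `vision_tower`, patch size, and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

patch_size = 16
images = [torch.randn(3, 512, 384), torch.randn(3, 256, 256)]
sizes = [(im.shape[1], im.shape[2]) for im in images]
max_h = max(h for h, _ in sizes)
max_w = max(w for _, w in sizes)

# Pad every image to (max_h, max_w) so they batch into one tensor.
batch = torch.stack(
    [F.pad(im, (0, max_w - w, 0, max_h - h)) for im, (h, w) in zip(images, sizes)]
)

# Stand-in for the real vision backbone: returns one embedding per patch,
# laid out row-major over the padded grid.
def vision_tower(x):
    b = x.shape[0]
    return torch.randn(b, (max_h // patch_size) * (max_w // patch_size), 1024)

features = vision_tower(batch)

# Unpad: keep only the patch embeddings that came from real (non-padded) pixels.
unpadded = []
for i, (h, w) in enumerate(sizes):
    grid = features[i].view(max_h // patch_size, max_w // patch_size, -1)
    unpadded.append(grid[: h // patch_size, : w // patch_size].reshape(-1, grid.shape[-1]))
# `unpadded[i]` now holds only the embeddings for image i's real patches.
```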

zucchini-nlp (Member) commented

Made a PR as a proof of concept and verified that padding the inputs matches the output of looping over the inputs: #35122

ywang96 (Author) commented Jan 1, 2025

Hi @zucchini-nlp! I'm closing this PR as we might just deal with it on the vLLM side for now. Thanks!

ywang96 closed this Jan 1, 2025