Keep `image_sizes` in output of PixtralProcessor #35110

Conversation
Signed-off-by: Roger Wang <ywang@roblox.com>
@zucchini-nlp let me know if I need to provide more information - thanks in advance!
Hey @ywang96!
If we are to keep this kwarg in the outputs, it should also be accepted in the model's forward call, which is currently not the case. We usually don't output kwargs that are not in the forward signature.
I realize Pixtral is a bit different from LLaVA, and that causes headaches in processing. We've been discussing with @Rocketknight1 the possibility of returning one batched tensor for the pixel values instead of a list of tensors. I believe it would be a cleaner solution and more in line with other VLMs.
WDYT?
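For illustration, a minimal sketch (plain Python, not transformers code) of why the forward-signature convention matters: processor outputs are usually unpacked straight into the model with `model(**inputs)`, so any extra key would raise a `TypeError`.

```python
# Minimal illustration of the `model(**inputs)` calling pattern: Python
# rejects keyword arguments that the forward signature does not declare.
def forward(input_ids=None, pixel_values=None):
    return input_ids, pixel_values

inputs = {"input_ids": [1, 2], "pixel_values": [[0.0]], "image_sizes": [(64, 64)]}
try:
    forward(**inputs)
except TypeError as e:
    print(e)  # forward() got an unexpected keyword argument 'image_sizes'
```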
Yea - I'm honestly not a huge fan of using the LLaVA definition to support Pixtral. Regarding the batched tensor idea: that design has a caveat that if it's the only output of the processor, you can't tell which patches belong to which image. This is totally fine for inference only, but tricky if users want to cache the embeddings of each individual image (something we want to do on vLLM).
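To illustrate the caching concern, here is a minimal sketch (the helper name and the 16-pixel patch size are assumptions, not vLLM or transformers code): with `image_sizes` available, a flattened patch-embedding tensor can be split back into per-image chunks.

```python
import torch

# Hypothetical helper (illustrative only): split patch embeddings that were
# concatenated across images back into per-image chunks, using the
# (height, width) pairs from `image_sizes`. Assumes a 16-pixel patch size.
def split_embeddings_per_image(embeddings, image_sizes, patch_size=16):
    patches_per_image = [(h // patch_size) * (w // patch_size) for h, w in image_sizes]
    return torch.split(embeddings, patches_per_image, dim=0)

image_sizes = [(128, 128), (64, 64)]               # two images, different sizes
embeddings = torch.randn(64 + 16, 1024)            # 8*8 + 4*4 patches, stacked
per_image = split_embeddings_per_image(embeddings, image_sizes)
assert [e.shape[0] for e in per_image] == [64, 16]  # each cacheable on its own
```

Without `image_sizes` (or an equivalent per-image length), a single batched tensor gives no way to recover these boundaries.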
@ywang96 I was mostly thinking about padding the images before the vision backbone and then unpadding them afterwards, similar to what we do in llava-next and also did in Emu (which is not merged into the main library yet). I am not 100% sure this will work as expected, because the model was not trained that way and adding extra zeros on height/width might change the outputs from the patch embedding. It needs to be tested whether this is almost the same as looping over the images, as it was with some other models.
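A minimal sketch of the check being proposed, using a generic conv-based patch embedding rather than the actual Pixtral backbone (all shapes and the 16-pixel patch size are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Does running a zero-padded batch through a conv patch embedding give the
# same per-image outputs as looping over the unpadded images one by one?
torch.manual_seed(0)
patch_embed = nn.Conv2d(3, 8, kernel_size=16, stride=16, bias=False)

images = [torch.randn(3, 64, 48), torch.randn(3, 32, 32)]  # varying sizes

# Loop over images one by one (the current behavior).
looped = [patch_embed(img.unsqueeze(0)) for img in images]

# Pad to a common size and run as one batch.
max_h = max(img.shape[1] for img in images)
max_w = max(img.shape[2] for img in images)
padded = torch.zeros(len(images), 3, max_h, max_w)
for i, img in enumerate(images):
    padded[i, :, : img.shape[1], : img.shape[2]] = img
batched = patch_embed(padded)

# Compare each image's valid region of the batched output to the looped result.
for i, img in enumerate(images):
    h, w = img.shape[1] // 16, img.shape[2] // 16
    assert torch.allclose(batched[i, :, :h, :w], looped[i][0], atol=1e-6)
```

For a stride-equals-kernel conv the match is expected by construction; the open question, which #35122 below addresses, is whether it holds through the whole vision backbone.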
Made a PR as a POC and verified that padding the inputs matches the output of looping over the inputs: #35122
Hi @zucchini-nlp! I'm closing this PR as we might just deal with it on the vLLM side for now. Thanks!
This PR keeps `image_sizes` in the output of `PixtralProcessor.__call__`.

For context, on vLLM we have been using this particular kwarg from the output of `PixtralImageProcessor` to differentiate whether a model is Pixtral-HF or LLaVA (since both are defined as `LlavaForConditionalGeneration`).

We are currently in the process of migrating to using `PixtralProcessor` directly (and similarly `AutoProcessor` for many other multimodal models) to reduce development overhead, but noticed that this kwarg is dropped during the processor call. I hope it's not a big deal to keep it in the output (since it's just a 1D tensor), but let me know if there's any issue or a better suggestion.

cc @mgoin @DarkLight1337
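For illustration, a rough sketch of the vLLM-side check described above (the checkpoint name and prompt format are assumptions, not taken from this PR):

```python
from PIL import Image
from transformers import AutoProcessor

# Both Pixtral-HF and LLaVA load as LlavaForConditionalGeneration, so the
# presence of `image_sizes` in the processor output is used to tell them apart.
processor = AutoProcessor.from_pretrained("mistral-community/pixtral-12b")
image = Image.new("RGB", (64, 64))

inputs = processor(
    text="<s>[INST][IMG]\nDescribe the image.[/INST]",
    images=image,
    return_tensors="pt",
)
# With this PR, `image_sizes` stays in `inputs`; plain LLaVA processors
# never produce it.
is_pixtral_hf = "image_sizes" in inputs
```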
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was it discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.