BLIP-2 request: if it's even possible, can you please provide an official example script showing how to get text (caption) features and image features into the same vector space (e.g. for cross-modal retrieval/search using BLIP-2 models, similar to what we can already do with CLIP)? Thanks in advance. #25245
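For reference, the CLIP workflow the request asks to replicate looks roughly like the following sketch: both encoders project into one shared embedding space, so cosine similarity between the two is directly usable for retrieval. The checkpoint name and image path are just illustrative placeholders.

```python
# Minimal sketch of CLIP-style cross-modal retrieval, for comparison;
# checkpoint and image path are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Both outputs live in the same 512-d projection space
    image_emb = model.get_image_features(pixel_values=inputs.pixel_values)   # (1, 512)
    text_emb = model.get_text_features(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    )                                                                        # (2, 512)

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
scores = image_emb @ text_emb.T  # cosine similarities, usable for retrieval/search
```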
Comments
Hi @wingz1, thanks for raising an issue! This is a question best placed in our forums; we try to reserve the GitHub issues for feature requests and bug reports. There are code examples of how to use BLIP and BLIP-2 in the docs. Both have a similar API to CLIP and expose the same methods, e.g. get_text_features and get_image_features.
Thanks, I figured that -- I will check the forums! Indeed those methods do exist in BLIP-2, but their outputs don't share the same dimensionality or mean the same thing as the equivalent calls in CLIP, due to how the model is set up.
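To make the mismatch concrete, here is a minimal sketch (not an official example) of what the existing Blip2Model feature helpers return, assuming the Salesforce/blip2-opt-2.7b checkpoint; unlike CLIP, the three outputs live in different spaces, and none of them is a projected retrieval embedding.

```python
# Sketch illustrating the mismatch described above; shapes are indicative
# for the blip2-opt-2.7b checkpoint and are assumptions, not guarantees.
import torch
from PIL import Image
from transformers import Blip2Model, Blip2Processor

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("photo.jpg")
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Hidden states of the ViT vision encoder (hidden size ~1408)
    image_out = model.get_image_features(pixel_values=inputs.pixel_values)
    # Outputs of the OPT language model (hidden size ~2560) -- a different space
    text_out = model.get_text_features(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        output_hidden_states=True,
    )
    # Q-Former query outputs (hidden size 768) -- yet another space
    qformer_out = model.get_qformer_features(pixel_values=inputs.pixel_values)

print(image_out.last_hidden_state.shape)    # e.g. (1, 257, 1408)
print(text_out.hidden_states[-1].shape)     # e.g. (1, seq_len, 2560)
print(qformer_out.last_hidden_state.shape)  # e.g. (1, 32, 768)
```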
Not really a useful answer, but you can start from the following lines in the modeling file (and yes, further questions on this topic should go to the forum).
Hi, I think multimodal embeddings are something lacking in the current implementation: we can't extract embeddings obtained by passing both text and image to the Q-Former. In fact, the Q-Former in HF doesn't even take text, whereas the original Q-Former implementation did take text inputs as input_ids here, along with the image, and this can be used to extract multimodal embeddings as done in the original implementation.
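For context, the feature-extraction path in the original LAVIS repository that this comment refers to looks roughly like the sketch below; the model name, attribute names, and shapes follow the LAVIS documentation at the time and are assumptions that may have changed since.

```python
# Sketch of unimodal/multimodal feature extraction with the original LAVIS
# BLIP-2 code (not the HF port); names and shapes are assumptions from the
# LAVIS docs.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a cat")
sample = {"image": image, "text_input": [text]}

# Unimodal features projected into a shared low-dimensional space (retrieval-ready)
image_feats = model.extract_features(sample, mode="image").image_embeds_proj  # (1, 32, 256)
text_feats = model.extract_features(sample, mode="text").text_embeds_proj     # (1, seq, 256)

# Multimodal features: text input_ids and image queries go through the Q-Former together
multimodal_feats = model.extract_features(sample, mode="multimodal").multimodal_embeds

# Image-text similarity: take the max over the 32 query embeddings
sim = (image_feats @ text_feats[:, 0, :].t()).max()
```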
@ayushtues Indeed, it seems that wasn't included when the model was first added to the library. @NielsRogge - was there a reason for not including this? If there wasn't a specific reason, it seems like a useful addition :) @ayushtues, would you be interested in opening a PR to add this? This would mean you get the GitHub contribution for adding the feature.
A similar request for it is here: #25300
I was working on integrating BlipDiffusion into diffusers (huggingface/diffusers#4388), which also needs multimodal features. I made a local copy of Blip2Qformer and was modifying it in that PR, but having the change integrated into HF will make it much cleaner.
Great - let's add it to transformers then :)!
@youssefadr is picking this up as discussed in #25300, happy to help him if needed.
@ayushtues Yes, I'll open a PR this week ASAP!
Hi @youssefadr, I hope it is fine that I opened a draft PR #25612 to share some progress on multimodal features. I started trying to contribute to Hugging Face this week :) The weights of the original BLIP-2 ITM model are converted into Blip2ForImageTextRetrieval. Feel free to use what I did, if it makes sense.
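For anyone following along, here is a hypothetical usage sketch of the class being converted in that draft PR. The class name comes from the PR itself, but the checkpoint ID, the use_image_text_matching_head flag, and the output fields are assumptions modeled on the original BLIP-2 ITM model, not a confirmed final API.

```python
# Hypothetical sketch only: assumes Blip2ForImageTextRetrieval exposes an ITM
# head and ITC projections mirroring the original BLIP-2 ITM model. Checkpoint
# name, flag, and output fields are assumptions, not the confirmed final API.
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g")

image = Image.open("photo.jpg")
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Image-text matching head: binary match / no-match logits
    itm_out = model(**inputs, use_image_text_matching_head=True)
    # Image-text contrastive scores: cosine similarity in a shared projected space
    itc_out = model(**inputs, use_image_text_matching_head=False)

match_prob = torch.softmax(itm_out.logits_per_image, dim=-1)[:, 1]
retrieval_score = itc_out.logits_per_image
```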
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Linux, Python 3.8+, PyTorch 1.13.0+cu116
Who can help?
@sgugger
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
N/A
Expected behavior
N/A