
BLIP-2 request: If it's even possible, can you please provide an official example script of how to get the text (caption) features and image features into the same vector space (e.g. for cross-modal retrieval/search using BLIP-2 models, similar to what we can already do with CLIP)? Thanks in advance. #25245

Closed
wingz1 opened this issue Aug 1, 2023 · 12 comments

Comments

@wingz1

wingz1 commented Aug 1, 2023

System Info

linux, python 3.8+, pytorch '1.13.0+cu116'

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

N/A

Expected behavior

N/A

@amyeroberts
Collaborator

Hi @wingz1, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve GitHub issues for feature requests and bug reports.

There are code examples of how to use BLIP and BLIP-2 in the docs. Both have an API similar to CLIP's and implement the same methods, e.g. get_text_features and get_image_features, returning similar outputs.
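For concreteness, a minimal sketch (not an official example) of calling those methods on a BLIP-2 checkpoint; the checkpoint name and the attributes accessed on the outputs are assumptions here:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Sketch only: the checkpoint name is an example, any BLIP-2 checkpoint works
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Vision encoder output: (batch, num_patches, vision_hidden_size)
    image_outputs = model.get_image_features(pixel_values=inputs.pixel_values)
    image_feats = image_outputs.last_hidden_state
    # Language model output; for the OPT-based checkpoints this is a causal-LM
    # style output rather than a CLIP-style pooled embedding
    text_outputs = model.get_text_features(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    )

Note that, unlike in CLIP, these come from different sub-modules (the vision encoder and the language model), so their shapes and semantics differ, which is what the rest of this thread is about.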

@wingz1
Author

wingz1 commented Aug 1, 2023

Thanks, I figured that -- I will check the forums! Indeed, those methods do exist in BLIP-2, but their outputs don't share the same dimensionality or mean the same thing as the equivalent calls in CLIP, because of how the model is set up.

@ydshieh
Collaborator

ydshieh commented Aug 2, 2023

Not really a useful answer, but based on the following lines in the modeling file, you can go through language_projection to get matching dimensions. It is very questionable, though, whether this is the same space as the meaningful text/image features.

(And yes, further questions on this topic should go to the forum.)

self.language_projection = nn.Linear(config.qformer_config.hidden_size, config.text_config.hidden_size)

language_model_inputs = self.language_projection(query_output)

inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1)
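Pieced together, that could look roughly like the sketch below. It reuses internal sub-modules of Blip2ForConditionalGeneration (vision_model, query_tokens, qformer, language_projection) rather than a public feature-extraction API, and, as noted above, it is questionable whether the projected features and the token embeddings really live in a comparable space:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")

with torch.no_grad():
    # Frozen ViT features
    image_embeds = model.vision_model(pixel_values=inputs.pixel_values).last_hidden_state
    # Q-Former queries attending to the image
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
    ).last_hidden_state
    # Project to the language model's hidden size, as in the lines quoted above
    language_model_inputs = model.language_projection(query_output)
    # Token embeddings of the text, with the same last dimension
    text_embeds = model.language_model.get_input_embeddings()(inputs.input_ids)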

@ayushtues
Contributor

Hi, I think multimodal embeddings are something lacking in the current implementation: we can't extract embeddings obtained by passing both text and image to the Q-Former. In fact, the Q-Former in HF doesn't even take text input_ids as input here

The original Q-Former implementation, by contrast, did take text input_ids here, along with the image, and this can be used to extract multimodal embeddings, as done in the extract_features function here
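For reference, the original LAVIS implementation exposes this via extract_features; a sketch based on the LAVIS feature-extraction docs (model name, processors and output fields follow those docs and should be double-checked):

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a dog")

sample = {"image": image, "text_input": [text]}
# Q-Former output conditioned on both the image and the text
multimodal_features = model.extract_features(sample, mode="multimodal")
multimodal_embeds = multimodal_features.multimodal_embeds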

@amyeroberts
Collaborator

@ayushtues Indeed, it seems that wasn't included when the model was first added to the library. @NielsRogge - was there a reason for not including this?

If there wasn't a specific reason, it seems like a useful addition :) @ayushtues, would you be interested in opening a PR to add this? This would mean you get the GitHub contribution for adding the feature.

@NielsRogge
Contributor

A similar request for it is here: #25300

@ayushtues
Contributor

I was working on integrating BlipDiffusion into diffusers (huggingface/diffusers#4388), which also needs multimodal features. I made a local copy of Blip2Qformer and was modifying it in this PR, but having the change integrated into HF would make it much cleaner.

@amyeroberts
Collaborator

Great - let's add it into transformers then :) !

@ayushtues
Contributor

ayushtues commented Aug 8, 2023

@youssefadr is picking this up as discussed in #25300, happy to help him if needed

@youssefadr
Contributor

@ayushtues Yes, I'll open a PR this week asap!

@jpizarrom
Contributor

jpizarrom commented Aug 19, 2023

Hi @youssefadr

I hope it is fine that I opened a draft PR, #25612, to share some progress on multimodal features. I started trying to contribute to Hugging Face this week :)

The weights of the original BLIP-2 ITM model are converted into a Blip2ForImageTextRetrieval class. The idea of adding Blip2ForImageTextRetrieval hasn't really been discussed yet, though. wdyt?

Feel free to use what I did, if it makes sense. Please let me know whether I should continue trying to implement Blip2ForImageTextRetrieval; maybe you are already working on this part, or maybe it is not really necessary.
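For context, a hypothetical usage sketch of what such a retrieval/ITM head could look like. The class name follows the draft PR; the checkpoint name, the use_image_text_matching_head argument and the output fields are assumptions at this point, not a settled API:

import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")

with torch.no_grad():
    # Image-text matching head: binary (no-match, match) logits
    itm_outputs = model(**inputs, use_image_text_matching_head=True)
    itm_probs = torch.softmax(itm_outputs.logits_per_image, dim=1)
    # Image-text contrastive head: CLIP-style similarity logits for retrieval
    itc_outputs = model(**inputs, use_image_text_matching_head=False)
    similarity = itc_outputs.logits_per_image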

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
