
BLIP-2 request: If it's even possible, can you please provide an official example script of how to get the text (caption) features and image features into the same vector space (e.g. for cross-modal retrieval/search using BLIP-2 models, similar to what we can already do with CLIP)? Thanks in advance. #25245

Closed
wingz1 opened this issue Aug 1, 2023 · 12 comments

Comments

@wingz1

wingz1 commented Aug 1, 2023

System Info

linux, python 3.8+, pytorch '1.13.0+cu116'

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

N/A

Expected behavior

N/A

@amyeroberts
Collaborator

Hi @wingz1, thanks for raising an issue!

This is a question best placed in our forums. We try to reserve GitHub issues for feature requests and bug reports.

There are code examples of how to use BLIP and BLIP-2 in the docs. Both have an API similar to CLIP's and implement the same methods, e.g. get_text_features and get_image_features, returning similar outputs.
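For concreteness, a minimal sketch (not an official example) of calling those methods on a BLIP-2 checkpoint; the checkpoint name and the attributes accessed on the outputs are assumptions here:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2Model

# Sketch only: the checkpoint name is an example, any BLIP-2 checkpoint works
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2Model.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a cat", return_tensors="pt")

with torch.no_grad():
    # Vision encoder output: (batch, num_patches, vision_hidden_size)
    image_outputs = model.get_image_features(pixel_values=inputs.pixel_values)
    image_feats = image_outputs.last_hidden_state
    # Language model output; for the OPT-based checkpoints this is a causal-LM
    # style output rather than a CLIP-style pooled embedding
    text_outputs = model.get_text_features(
        input_ids=inputs.input_ids, attention_mask=inputs.attention_mask
    )

Note that, unlike in CLIP, these come from different sub-modules (the vision encoder and the language model), so their shapes and semantics differ, which is what the rest of this thread is about.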

@wingz1
Author

wingz1 commented Aug 1, 2023

Thanks, I figured that -- I will check the forums! Indeed, those methods do exist in BLIP-2, but their outputs don't share the same dimensionality or mean the same thing as the equivalent calls in CLIP, because of how the model is set up.

@ydshieh
Collaborator

ydshieh commented Aug 2, 2023

Not really a useful answer, but based on the following lines in the modeling file, you can go through language_projection to get matching dimensions. It is very questionable, though, whether this is the same space as the meaningful text/image features.

(And yes, further questions on this topic should go to the forum.)

self.language_projection = nn.Linear(config.qformer_config.hidden_size, config.text_config.hidden_size)

language_model_inputs = self.language_projection(query_output)

inputs_embeds = self.language_model.get_input_embeddings()(input_ids)
inputs_embeds = torch.cat([language_model_inputs, inputs_embeds], dim=1)
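Pieced together, that could look roughly like the sketch below. It reuses internal sub-modules of Blip2ForConditionalGeneration (vision_model, query_tokens, qformer, language_projection) rather than a public feature-extraction API, and, as noted above, it is questionable whether the projected features and the token embeddings really live in a comparable space:

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")

with torch.no_grad():
    # Frozen ViT features
    image_embeds = model.vision_model(pixel_values=inputs.pixel_values).last_hidden_state
    # Q-Former queries attending to the image
    query_tokens = model.query_tokens.expand(image_embeds.shape[0], -1, -1)
    query_output = model.qformer(
        query_embeds=query_tokens,
        encoder_hidden_states=image_embeds,
    ).last_hidden_state
    # Project to the language model's hidden size, as in the lines quoted above
    language_model_inputs = model.language_projection(query_output)
    # Token embeddings of the text, with the same last dimension
    text_embeds = model.language_model.get_input_embeddings()(inputs.input_ids)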

@ayushtues
Contributor

Hi, I think multimodal embeddings are something lacking in the current implementation: we can't extract embeddings obtained by passing both text and image to the Q-Former. In fact, the Q-Former in HF doesn't even take text input_ids as input here

The original Q-Former implementation, by contrast, did take text input_ids here, along with the image, and this can be used to extract multimodal embeddings, as done in the extract_features function here
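For reference, the original LAVIS implementation exposes this via extract_features; a sketch based on the LAVIS feature-extraction docs (model name, processors and output fields follow those docs and should be double-checked):

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("a photo of a dog")

sample = {"image": image, "text_input": [text]}
# Q-Former output conditioned on both the image and the text
multimodal_features = model.extract_features(sample, mode="multimodal")
multimodal_embeds = multimodal_features.multimodal_embeds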

@amyeroberts
Collaborator

@ayushtues Indeed, it seems that wasn't included when the model was first added to the library. @NielsRogge - was there a reason for not including this?

If there wasn't a specific reason, it seems like a useful addition :) @ayushtues, would you be interested in opening a PR to add this? This would mean you get the GitHub contribution for adding the feature.

@NielsRogge
Contributor

A similar request for it is here: #25300

@ayushtues
Contributor

I was working on integrating BlipDiffusion into diffusers (huggingface/diffusers#4388), which also needs multimodal features. I made a local copy of Blip2Qformer and was modifying it in this PR, but having the change integrated into HF would make it much cleaner.

@amyeroberts
Collaborator

Great - let's add it into transformers then :) !

@ayushtues
Contributor

ayushtues commented Aug 8, 2023

@youssefadr is picking this up as discussed in #25300, happy to help him if needed

@youssefadr
Contributor

@ayushtues Yes, I'll open a PR this week asap!

@jpizarrom
Contributor

jpizarrom commented Aug 19, 2023

Hi @youssefadr

I hope it is fine that I opened a draft PR, #25612, to share some progress on multimodal features. I started trying to contribute to Hugging Face this week :)

The weights of the original BLIP-2 ITM model are converted into a Blip2ForImageTextRetrieval class. The idea of adding Blip2ForImageTextRetrieval hasn't really been discussed yet, though. wdyt?

Feel free to use what I did, if it makes sense. Please let me know whether I should continue trying to implement Blip2ForImageTextRetrieval; maybe you are already working on this part, or maybe it is not really necessary.
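For context, a hypothetical usage sketch of what such a retrieval/ITM head could look like. The class name follows the draft PR; the checkpoint name, the use_image_text_matching_head argument and the output fields are assumptions at this point, not a settled API:

import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

processor = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
model = Blip2ForImageTextRetrieval.from_pretrained("Salesforce/blip2-itm-vit-g")

image = Image.open("example.jpg")
inputs = processor(images=image, text="a photo of a dog", return_tensors="pt")

with torch.no_grad():
    # Image-text matching head: binary (no-match, match) logits
    itm_outputs = model(**inputs, use_image_text_matching_head=True)
    itm_probs = torch.softmax(itm_outputs.logits_per_image, dim=1)
    # Image-text contrastive head: CLIP-style similarity logits for retrieval
    itc_outputs = model(**inputs, use_image_text_matching_head=False)
    similarity = itc_outputs.logits_per_image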

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
