🚨 Add Blip2ForImageTextRetrieval #29261

Merged

38 commits
e106340
add Blip2ForImageTextRetrieval
jpizarrom Mar 22, 2024
164d824
use one line and remove unnecessary space in tests
jpizarrom Apr 6, 2024
715f6fa
use value from the config, rather than hardcoded
jpizarrom Apr 6, 2024
da0cc83
change order of params in Blip2QFormerModel.forward
jpizarrom May 1, 2024
02a0e08
update docstring
jpizarrom May 1, 2024
360f537
fix style
jpizarrom May 1, 2024
5e7764f
update test_inference_opt
jpizarrom May 1, 2024
7e7135a
move embeddings out of Blip2QFormerModel
jpizarrom May 1, 2024
258f349
remove from_vision_qformer_configs
jpizarrom May 1, 2024
cf42e9b
remove autocast float16 in Blip2QFormerModel
jpizarrom May 1, 2024
8216085
rename fiels into vision_projection,text_projection,use_image_text_ma…
jpizarrom May 1, 2024
e2781cb
use CLIPOutput for Blip2ImageTextMatchingModelOutput
jpizarrom May 12, 2024
1c9746d
remove past_key_values_length from Blip2TextEmbeddings
jpizarrom May 18, 2024
e6da638
fix small typo in the CLIPOutput docstring
jpizarrom May 25, 2024
c65ea33
add Blip2ForImageTextRetrieval to Zero Shot Image Classification mapping
jpizarrom May 25, 2024
a39f9fd
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom May 25, 2024
a2b99a9
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom May 27, 2024
6468947
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom Jun 1, 2024
9bda979
update docstring and add require_torch_fp16
jpizarrom Jun 8, 2024
efa8041
rollback test_inference_opt
jpizarrom Jun 8, 2024
cb7beae
use use_image_text_matching_head=True in convert
jpizarrom Jun 8, 2024
05cb8d4
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom Jun 8, 2024
8eca5e5
skip test_model_get_set_embeddings
jpizarrom Jun 8, 2024
e31b7e5
fix create_rename_keys error on new itm fields
jpizarrom Jun 16, 2024
8808b8a
revert to do scale after dot product between "query" and "key"
jpizarrom Jun 16, 2024
0a72567
fix ValueError on convert script for blip2-opt-2.7b
jpizarrom Jun 20, 2024
2534ca3
update org of paths to Salesforce
jpizarrom Aug 9, 2024
af53840
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom Aug 9, 2024
f841c59
add is_pipeline_test_to_skip for VisualQuestionAnsweringPipelineTests
jpizarrom Aug 9, 2024
1720ba4
[run_slow] blip_2
jpizarrom Aug 9, 2024
11f0579
removed Blip2ForImageTextRetrieval from IGNORE_NON_AUTO_CONFIGURED
jpizarrom Aug 10, 2024
5e63cdb
fix docstring of Blip2ImageTextMatchingModelOutput
jpizarrom Aug 11, 2024
95232f8
[run_slow] blip_2
jpizarrom Aug 11, 2024
04bb860
fix multi-gpu tests
jpizarrom Aug 13, 2024
5c66e50
[run_slow] blip_2
jpizarrom Aug 13, 2024
2419c85
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom Aug 13, 2024
53abc06
Merge remote-tracking branch 'upstream/main' into add_blip2_image_tex…
jpizarrom Aug 13, 2024
d32e889
[run_slow] blip_2
jpizarrom Aug 13, 2024
15 changes: 14 additions & 1 deletion docs/source/en/model_doc/blip-2.md
@@ -87,4 +87,17 @@ If you're interested in submitting a resource to be included here, please feel f

[[autodoc]] Blip2ForConditionalGeneration
- forward
- generate
- generate

## Blip2ForImageTextRetrieval

[[autodoc]] Blip2ForImageTextRetrieval
- forward

## Blip2TextModelWithProjection

[[autodoc]] Blip2TextModelWithProjection

## Blip2VisionModelWithProjection

[[autodoc]] Blip2VisionModelWithProjection
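
To make the new docs entries concrete, below is a minimal usage sketch of `Blip2ForImageTextRetrieval` covering both the image-text matching (ITM) head and the contrastive (ITC) scores. The checkpoint id `Salesforce/blip2-itm-vit-g` and the exact forward signature are assumptions inferred from the commit history (conversion to the Salesforce org, `use_image_text_matching_head`), not something spelled out in this diff.

```python
# Sketch only: the checkpoint id and forward kwargs are assumed, not taken from this diff.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Blip2ForImageTextRetrieval

checkpoint = "Salesforce/blip2-itm-vit-g"  # assumed repo id produced by the conversion script
processor = AutoProcessor.from_pretrained(checkpoint)
model = Blip2ForImageTextRetrieval.from_pretrained(checkpoint, torch_dtype=torch.float16).to("cuda")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "two cats sleeping on a couch"

inputs = processor(images=image, text=text, return_tensors="pt").to("cuda", torch.float16)

# ITM head: match/no-match logits for the image-text pair.
itm_out = model(**inputs, use_image_text_matching_head=True)
itm_probs = torch.nn.functional.softmax(itm_out.logits_per_image, dim=1)

# ITC scores: similarity between the projected image and text embeddings.
itc_out = model(**inputs, use_image_text_matching_head=False)
similarity = itc_out.logits_per_image
print(itm_probs, similarity)
```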
6 changes: 6 additions & 0 deletions src/transformers/__init__.py
@@ -1516,10 +1516,13 @@
_import_structure["models.blip_2"].extend(
[
"Blip2ForConditionalGeneration",
"Blip2ForImageTextRetrieval",
"Blip2Model",
"Blip2PreTrainedModel",
"Blip2QFormerModel",
"Blip2TextModelWithProjection",
"Blip2VisionModel",
"Blip2VisionModelWithProjection",
]
)
_import_structure["models.bloom"].extend(
@@ -6094,10 +6097,13 @@
)
from .models.blip_2 import (
Blip2ForConditionalGeneration,
Blip2ForImageTextRetrieval,
Blip2Model,
Blip2PreTrainedModel,
Blip2QFormerModel,
Blip2TextModelWithProjection,
Blip2VisionModel,
Blip2VisionModelWithProjection,
)
from .models.bloom import (
BloomForCausalLM,
12 changes: 6 additions & 6 deletions src/transformers/models/altclip/modeling_altclip.py
@@ -161,19 +161,19 @@ class AltCLIPOutput(ModelOutput):
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
Contrastive loss for image-text similarity.
logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
logits_per_image (`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
similarity scores.
logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
logits_per_text (`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
similarity scores.
text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
The text embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPTextModel`].
image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim`):
The image embeddings obtained by applying the projection layer to the pooled output of [`AltCLIPVisionModel`].
text_model_output(`BaseModelOutputWithPooling`):
text_model_output (`BaseModelOutputWithPooling`):
The output of the [`AltCLIPTextModel`].
vision_model_output(`BaseModelOutputWithPooling`):
vision_model_output (`BaseModelOutputWithPooling`):
The output of the [`AltCLIPVisionModel`].
"""

1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -1231,6 +1231,7 @@
("align", "AlignModel"),
("altclip", "AltCLIPModel"),
("blip", "BlipModel"),
("blip-2", "Blip2ForImageTextRetrieval"),
("chinese_clip", "ChineseCLIPModel"),
("clip", "CLIPModel"),
("clipseg", "CLIPSegModel"),
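
Because the change above registers `Blip2ForImageTextRetrieval` in the zero-shot image classification auto mapping, the model should also be reachable through the corresponding pipeline. A hedged sketch follows, again assuming the `Salesforce/blip2-itm-vit-g` repo id:

```python
# Sketch under the assumption that the converted ITM checkpoint is published under this id.
from transformers import pipeline

classifier = pipeline(
    task="zero-shot-image-classification",
    model="Salesforce/blip2-itm-vit-g",  # assumed repo id
)

results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["two cats", "a dog", "an airplane"],
)
print(results)  # list of {"label": ..., "score": ...} dicts, best match first
```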
6 changes: 6 additions & 0 deletions src/transformers/models/blip_2/__init__.py
@@ -33,10 +33,13 @@
else:
_import_structure["modeling_blip_2"] = [
"Blip2Model",
"Blip2VisionModelWithProjection",
"Blip2QFormerModel",
"Blip2PreTrainedModel",
"Blip2ForConditionalGeneration",
"Blip2ForImageTextRetrieval",
"Blip2VisionModel",
"Blip2TextModelWithProjection",
]

if TYPE_CHECKING:
@@ -55,10 +58,13 @@
else:
from .modeling_blip_2 import (
Blip2ForConditionalGeneration,
Blip2ForImageTextRetrieval,
Blip2Model,
Blip2PreTrainedModel,
Blip2QFormerModel,
Blip2TextModelWithProjection,
Blip2VisionModel,
Blip2VisionModelWithProjection,
)

else:
32 changes: 27 additions & 5 deletions src/transformers/models/blip_2/configuration_blip_2.py
@@ -15,7 +15,7 @@
"""BLIP-2 model configuration"""

import os
from typing import Union
from typing import Optional, Union

from ...configuration_utils import PretrainedConfig
from ...models.auto.modeling_auto import MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
@@ -172,6 +172,8 @@ class Blip2QFormerConfig(PretrainedConfig):
The frequency of adding cross-attention to the Transformer layers.
encoder_hidden_size (`int`, *optional*, defaults to 1408):
The hidden size of the hidden states for cross-attention.
use_qformer_text_input (`bool`, *optional*, defaults to `False`):
Whether to use BERT-style embeddings.

Examples:

@@ -206,6 +208,7 @@ def __init__(
position_embedding_type="absolute",
cross_attention_frequency=2,
encoder_hidden_size=1408,
use_qformer_text_input=False,
**kwargs,
):
super().__init__(pad_token_id=pad_token_id, **kwargs)
@@ -224,6 +227,7 @@ def __init__(
self.position_embedding_type = position_embedding_type
self.cross_attention_frequency = cross_attention_frequency
self.encoder_hidden_size = encoder_hidden_size
self.use_qformer_text_input = use_qformer_text_input

@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
@@ -263,7 +267,8 @@ class Blip2Config(PretrainedConfig):
Dictionary of configuration options used to initialize any [`PretrainedConfig`].
num_query_tokens (`int`, *optional*, defaults to 32):
The number of query tokens passed through the Transformer.

image_text_hidden_size (`int`, *optional*, defaults to 256):
Dimensionality of the hidden state of the image-text fusion layer.
kwargs (*optional*):
Dictionary of keyword arguments.

@@ -299,7 +304,15 @@

model_type = "blip-2"

def __init__(self, vision_config=None, qformer_config=None, text_config=None, num_query_tokens=32, **kwargs):
def __init__(
self,
vision_config=None,
qformer_config=None,
text_config=None,
num_query_tokens=32,
image_text_hidden_size=256,
**kwargs,
):
super().__init__(**kwargs)

if vision_config is None:
@@ -323,6 +336,7 @@ def __init__(self, vision_config=None, qformer_config=None, text_config=None, nu
self.is_encoder_decoder = self.text_config.is_encoder_decoder

self.num_query_tokens = num_query_tokens
self.image_text_hidden_size = image_text_hidden_size
self.qformer_config.encoder_hidden_size = self.vision_config.hidden_size
self.use_decoder_only_language_model = self.text_config.model_type in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES
self.initializer_factor = 1.0
@@ -333,20 +347,28 @@ def from_vision_qformer_text_configs(
cls,
vision_config: Blip2VisionConfig,
qformer_config: Blip2QFormerConfig,
text_config: PretrainedConfig,
text_config: Optional[PretrainedConfig] = None,
**kwargs,
):
r"""
Instantiate a [`Blip2Config`] (or a derived class) from a BLIP-2 vision model, Q-Former and language model
configurations.

Args:
vision_config (`dict`):
Dictionary of configuration options used to initialize [`Blip2VisionConfig`].
qformer_config (`dict`):
Dictionary of configuration options used to initialize [`Blip2QFormerConfig`].
text_config (`dict`, *optional*):
Dictionary of configuration options used to initialize any [`PretrainedConfig`].

Returns:
[`Blip2Config`]: An instance of a configuration object
"""

return cls(
vision_config=vision_config.to_dict(),
qformer_config=qformer_config.to_dict(),
text_config=text_config.to_dict(),
text_config=text_config.to_dict() if text_config is not None else None,

Collaborator (review comment on the change above): Making this optional is a bit funny given the name of the method. We should at least update the docstring to indicate that the language model config is optional.

Contributor Author: Docstrings were updated.

**kwargs,
)
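
As a quick illustration of the new configuration surface (`use_qformer_text_input`, `image_text_hidden_size`, and the now-optional `text_config`), here is a small sketch; the values are illustrative, not taken from a released checkpoint.

```python
# Sketch: building a retrieval-style Blip2Config without a language-model config.
from transformers import Blip2Config, Blip2QFormerConfig, Blip2VisionConfig

vision_config = Blip2VisionConfig()
# use_qformer_text_input=True enables the BERT-style text embeddings the retrieval head needs.
qformer_config = Blip2QFormerConfig(use_qformer_text_input=True)

# text_config can now be omitted, since retrieval checkpoints carry no language model.
config = Blip2Config.from_vision_qformer_text_configs(
    vision_config=vision_config,
    qformer_config=qformer_config,
    image_text_hidden_size=256,  # dimensionality of the shared image-text projection space
)
print(config.image_text_hidden_size, config.qformer_config.use_qformer_text_input)
```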