diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 7eff2a383026..8c48749bc0c9 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -852,6 +852,8 @@
title: MGP-STR
- local: model_doc/nougat
title: Nougat
+ - local: model_doc/omdet-turbo
+ title: OmDet-Turbo
- local: model_doc/oneformer
title: OneFormer
- local: model_doc/owlvit
diff --git a/docs/source/en/index.md b/docs/source/en/index.md
index c18426de4c03..478184fdd344 100644
--- a/docs/source/en/index.md
+++ b/docs/source/en/index.md
@@ -235,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
| [Nyströmformer](model_doc/nystromformer) | ✅ | ❌ | ❌ |
| [OLMo](model_doc/olmo) | ✅ | ❌ | ❌ |
| [OLMoE](model_doc/olmoe) | ✅ | ❌ | ❌ |
+| [OmDet-Turbo](model_doc/omdet-turbo) | ✅ | ❌ | ❌ |
| [OneFormer](model_doc/oneformer) | ✅ | ❌ | ❌ |
| [OpenAI GPT](model_doc/openai-gpt) | ✅ | ✅ | ❌ |
| [OpenAI GPT-2](model_doc/gpt2) | ✅ | ✅ | ✅ |
diff --git a/docs/source/en/model_doc/omdet-turbo.md b/docs/source/en/model_doc/omdet-turbo.md
new file mode 100644
index 000000000000..190ac3e31eea
--- /dev/null
+++ b/docs/source/en/model_doc/omdet-turbo.md
@@ -0,0 +1,164 @@
+
+
+# OmDet-Turbo
+
+## Overview
+
+The OmDet-Turbo model was proposed in [Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head](https://arxiv.org/abs/2403.06892) by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection while maintaining high accuracy. The base model achieves up to 100.2 FPS and 53.4 AP in zero-shot evaluation on COCO.
+
+The abstract from the paper is the following:
+
+*End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves a 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Notably, in zero-shot scenarios on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks.*
+
+
+
+ OmDet-Turbo architecture overview. Taken from the original paper.
+
+This model was contributed by [yonigozlan](https://huggingface.co/yonigozlan).
+The original code can be found [here](https://github.com/om-ai-lab/OmDet).
+
+## Usage tips
+
+One unique property of OmDet-Turbo compared to other zero-shot object detection models, such as [Grounding DINO](grounding-dino), is its decoupled class and prompt embedding structure, which allows caching of text embeddings. This means that the model needs both classes and a task as inputs, where classes is a list of objects we want to detect and task is the grounded text used to guide open-vocabulary detection. This approach limits the scope of the open-vocabulary detection and makes the decoding process faster.
+
+[`OmDetTurboProcessor`] is used to prepare the classes, task and image triplet. The task input is optional; when it is not provided, it defaults to `"Detect [class1], [class2], [class3], ..."`. To process the results from the model, one can use `post_process_grounded_object_detection` from [`OmDetTurboProcessor`]. Notably, this function takes the input classes as an argument: since classes and task embeddings are decoupled, no decoding of predicted class embeddings is needed in the post-processing step, and the predicted classes can be matched directly to the input ones.
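+
+As a minimal sketch of both options (the checkpoint name is taken from the examples below, and the blank image is only a placeholder), the processor can be called with or without an explicit task prompt:
+
+```python
+from PIL import Image
+
+from transformers import AutoProcessor
+
+processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+image = Image.new("RGB", (640, 480))  # placeholder image, replace with a real one
+classes = ["cat", "remote"]
+
+# task omitted: the processor falls back to the default "Detect [class1], [class2], ..." prompt
+inputs = processor(images=image, text=classes, return_tensors="pt")
+
+# explicit grounded task prompt
+inputs = processor(images=image, text=classes, task="Detect all cats and remotes.", return_tensors="pt")
+```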
+
+## Usage example
+
+### Single image inference
+
+Here's how to load the model and prepare the inputs to perform zero-shot object detection on a single image:
+
+```python
+import requests
+from PIL import Image
+
+from transformers import AutoProcessor, OmDetTurboForObjectDetection
+
+processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+image = Image.open(requests.get(url, stream=True).raw)
+classes = ["cat", "remote"]
+inputs = processor(image, text=classes, return_tensors="pt")
+
+outputs = model(**inputs)
+
+# convert outputs (bounding boxes and class logits)
+results = processor.post_process_grounded_object_detection(
+ outputs,
+ classes=classes,
+ target_sizes=[image.size[::-1]],
+ score_threshold=0.3,
+ nms_threshold=0.3,
+)[0]
+for score, class_name, box in zip(
+ results["scores"], results["classes"], results["boxes"]
+):
+ box = [round(i, 1) for i in box.tolist()]
+ print(
+ f"Detected {class_name} with confidence "
+ f"{round(score.item(), 2)} at location {box}"
+ )
+```
+
+### Multi-image inference
+
+OmDet-Turbo can perform batched multi-image inference, with support for different text prompts and classes in the same batch:
+
+```python
+>>> import torch
+>>> import requests
+>>> from io import BytesIO
+>>> from PIL import Image
+>>> from transformers import AutoProcessor, OmDetTurboForObjectDetection
+
+>>> processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+>>> model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+>>> url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
+>>> image1 = Image.open(BytesIO(requests.get(url1).content)).convert("RGB")
+>>> classes1 = ["cat", "remote"]
+>>> task1 = "Detect {}.".format(", ".join(classes1))
+
+>>> url2 = "http://images.cocodataset.org/train2017/000000257813.jpg"
+>>> image2 = Image.open(BytesIO(requests.get(url2).content)).convert("RGB")
+>>> classes2 = ["boat"]
+>>> task2 = "Detect everything that looks like a boat."
+
+>>> url3 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
+>>> image3 = Image.open(BytesIO(requests.get(url3).content)).convert("RGB")
+>>> classes3 = ["statue", "trees"]
+>>> task3 = "Focus on the foreground, detect statue and trees."
+
+>>> inputs = processor(
+... images=[image1, image2, image3],
+... text=[classes1, classes2, classes3],
+... task=[task1, task2, task3],
+... return_tensors="pt",
+... )
+
+>>> with torch.no_grad():
+... outputs = model(**inputs)
+
+>>> # convert outputs (bounding boxes and class logits)
+>>> results = processor.post_process_grounded_object_detection(
+... outputs,
+... classes=[classes1, classes2, classes3],
+... target_sizes=[image1.size[::-1], image2.size[::-1], image3.size[::-1]],
+... score_threshold=0.2,
+... nms_threshold=0.3,
+... )
+
+>>> for i, result in enumerate(results):
+... for score, class_name, box in zip(
+... result["scores"], result["classes"], result["boxes"]
+... ):
+... box = [round(i, 1) for i in box.tolist()]
+... print(
+... f"Detected {class_name} with confidence "
+... f"{round(score.item(), 2)} at location {box} in image {i}"
+... )
+Detected remote with confidence 0.77 at location [39.9, 70.4, 176.7, 118.0] in image 0
+Detected cat with confidence 0.72 at location [11.6, 54.2, 314.8, 474.0] in image 0
+Detected remote with confidence 0.56 at location [333.4, 75.8, 370.7, 187.0] in image 0
+Detected cat with confidence 0.55 at location [345.2, 24.0, 639.8, 371.7] in image 0
+Detected boat with confidence 0.32 at location [146.9, 219.8, 209.6, 250.7] in image 1
+Detected boat with confidence 0.3 at location [319.1, 223.2, 403.2, 238.4] in image 1
+Detected boat with confidence 0.27 at location [37.7, 220.3, 84.0, 235.9] in image 1
+Detected boat with confidence 0.22 at location [407.9, 207.0, 441.7, 220.2] in image 1
+Detected statue with confidence 0.73 at location [544.7, 210.2, 651.9, 502.8] in image 2
+Detected trees with confidence 0.25 at location [3.9, 584.3, 391.4, 785.6] in image 2
+Detected trees with confidence 0.25 at location [1.4, 621.2, 118.2, 787.8] in image 2
+Detected statue with confidence 0.2 at location [428.1, 205.5, 767.3, 759.5] in image 2
+
+```
+
+## OmDetTurboConfig
+
+[[autodoc]] OmDetTurboConfig
+
+## OmDetTurboProcessor
+
+[[autodoc]] OmDetTurboProcessor
+ - post_process_grounded_object_detection
+
+## OmDetTurboForObjectDetection
+
+[[autodoc]] OmDetTurboForObjectDetection
+ - forward
diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py
index 36775d8454ab..c3260ad0ae60 100755
--- a/src/transformers/__init__.py
+++ b/src/transformers/__init__.py
@@ -606,6 +606,10 @@
"models.nystromformer": ["NystromformerConfig"],
"models.olmo": ["OlmoConfig"],
"models.olmoe": ["OlmoeConfig"],
+ "models.omdet_turbo": [
+ "OmDetTurboConfig",
+ "OmDetTurboProcessor",
+ ],
"models.oneformer": [
"OneFormerConfig",
"OneFormerProcessor",
@@ -2844,6 +2848,12 @@
"OlmoePreTrainedModel",
]
)
+ _import_structure["models.omdet_turbo"].extend(
+ [
+ "OmDetTurboForObjectDetection",
+ "OmDetTurboPreTrainedModel",
+ ]
+ )
_import_structure["models.oneformer"].extend(
[
"OneFormerForUniversalSegmentation",
@@ -5385,6 +5395,10 @@
)
from .models.olmo import OlmoConfig
from .models.olmoe import OlmoeConfig
+ from .models.omdet_turbo import (
+ OmDetTurboConfig,
+ OmDetTurboProcessor,
+ )
from .models.oneformer import (
OneFormerConfig,
OneFormerProcessor,
@@ -7351,6 +7365,10 @@
OlmoeModel,
OlmoePreTrainedModel,
)
+ from .models.omdet_turbo import (
+ OmDetTurboForObjectDetection,
+ OmDetTurboPreTrainedModel,
+ )
from .models.oneformer import (
OneFormerForUniversalSegmentation,
OneFormerModel,
diff --git a/src/transformers/models/__init__.py b/src/transformers/models/__init__.py
index 2022048cd455..0819277194b3 100644
--- a/src/transformers/models/__init__.py
+++ b/src/transformers/models/__init__.py
@@ -171,6 +171,7 @@
nystromformer,
olmo,
olmoe,
+ omdet_turbo,
oneformer,
openai,
opt,
diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py
index 2cd7d550d90b..54df20a07b1f 100644
--- a/src/transformers/models/auto/configuration_auto.py
+++ b/src/transformers/models/auto/configuration_auto.py
@@ -60,6 +60,7 @@
("chinese_clip_vision_model", "ChineseCLIPVisionConfig"),
("clap", "ClapConfig"),
("clip", "CLIPConfig"),
+ ("clip_text_model", "CLIPTextConfig"),
("clip_vision_model", "CLIPVisionConfig"),
("clipseg", "CLIPSegConfig"),
("clvp", "ClvpConfig"),
@@ -189,6 +190,7 @@
("nystromformer", "NystromformerConfig"),
("olmo", "OlmoConfig"),
("olmoe", "OlmoeConfig"),
+ ("omdet-turbo", "OmDetTurboConfig"),
("oneformer", "OneFormerConfig"),
("open-llama", "OpenLlamaConfig"),
("openai-gpt", "OpenAIGPTConfig"),
@@ -346,6 +348,7 @@
("chinese_clip_vision_model", "ChineseCLIPVisionModel"),
("clap", "CLAP"),
("clip", "CLIP"),
+ ("clip_text_model", "CLIPTextModel"),
("clip_vision_model", "CLIPVisionModel"),
("clipseg", "CLIPSeg"),
("clvp", "CLVP"),
@@ -493,6 +496,7 @@
("nystromformer", "Nyströmformer"),
("olmo", "OLMo"),
("olmoe", "OLMoE"),
+ ("omdet-turbo", "OmDet-Turbo"),
("oneformer", "OneFormer"),
("open-llama", "OpenLlama"),
("openai-gpt", "OpenAI GPT"),
@@ -661,6 +665,7 @@
("xclip", "x_clip"),
("clip_vision_model", "clip"),
("qwen2_audio_encoder", "qwen2_audio"),
+ ("clip_text_model", "clip"),
("siglip_vision_model", "siglip"),
("chinese_clip_vision_model", "chinese_clip"),
("rt_detr_resnet", "rt_detr"),
diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py
index e0d15f1e2365..8b9e3fe5df95 100644
--- a/src/transformers/models/auto/modeling_auto.py
+++ b/src/transformers/models/auto/modeling_auto.py
@@ -60,6 +60,7 @@
("chinese_clip_vision_model", "ChineseCLIPVisionModel"),
("clap", "ClapModel"),
("clip", "CLIPModel"),
+ ("clip_text_model", "CLIPTextModel"),
("clip_vision_model", "CLIPVisionModel"),
("clipseg", "CLIPSegModel"),
("clvp", "ClvpModelForConditionalGeneration"),
@@ -179,6 +180,7 @@
("nystromformer", "NystromformerModel"),
("olmo", "OlmoModel"),
("olmoe", "OlmoeModel"),
+ ("omdet-turbo", "OmDetTurboForObjectDetection"),
("oneformer", "OneFormerModel"),
("open-llama", "OpenLlamaModel"),
("openai-gpt", "OpenAIGPTModel"),
@@ -809,6 +811,7 @@
[
# Model for Zero Shot Object Detection mapping
("grounding-dino", "GroundingDinoForObjectDetection"),
+ ("omdet-turbo", "OmDetTurboForObjectDetection"),
("owlv2", "Owlv2ForObjectDetection"),
("owlvit", "OwlViTForObjectDetection"),
]
@@ -1323,6 +1326,7 @@
("albert", "AlbertModel"),
("bert", "BertModel"),
("big_bird", "BigBirdModel"),
+ ("clip_text_model", "CLIPTextModel"),
("data2vec-text", "Data2VecTextModel"),
("deberta", "DebertaModel"),
("deberta-v2", "DebertaV2Model"),
diff --git a/src/transformers/models/auto/tokenization_auto.py b/src/transformers/models/auto/tokenization_auto.py
index e735579108d8..8a7b8c2330d3 100644
--- a/src/transformers/models/auto/tokenization_auto.py
+++ b/src/transformers/models/auto/tokenization_auto.py
@@ -344,6 +344,10 @@
),
("olmo", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
("olmoe", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
+ (
+ "omdet-turbo",
+ ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None),
+ ),
("oneformer", ("CLIPTokenizer", "CLIPTokenizerFast" if is_tokenizers_available() else None)),
(
"openai-gpt",
diff --git a/src/transformers/models/omdet_turbo/__init__.py b/src/transformers/models/omdet_turbo/__init__.py
new file mode 100644
index 000000000000..34eb6386298f
--- /dev/null
+++ b/src/transformers/models/omdet_turbo/__init__.py
@@ -0,0 +1,56 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from typing import TYPE_CHECKING
+
+from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available
+
+
+_import_structure = {
+ "configuration_omdet_turbo": ["OmDetTurboConfig"],
+ "processing_omdet_turbo": ["OmDetTurboProcessor"],
+}
+
+try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+except OptionalDependencyNotAvailable:
+ pass
+else:
+ _import_structure["modeling_omdet_turbo"] = [
+ "OmDetTurboForObjectDetection",
+ "OmDetTurboPreTrainedModel",
+ ]
+
+if TYPE_CHECKING:
+ from .configuration_omdet_turbo import (
+ OmDetTurboConfig,
+ )
+ from .processing_omdet_turbo import OmDetTurboProcessor
+
+ try:
+ if not is_torch_available():
+ raise OptionalDependencyNotAvailable()
+ except OptionalDependencyNotAvailable:
+ pass
+ else:
+ from .modeling_omdet_turbo import (
+ OmDetTurboForObjectDetection,
+ OmDetTurboPreTrainedModel,
+ )
+
+else:
+ import sys
+
+ sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__)
diff --git a/src/transformers/models/omdet_turbo/configuration_omdet_turbo.py b/src/transformers/models/omdet_turbo/configuration_omdet_turbo.py
new file mode 100644
index 000000000000..cb5e69db5f90
--- /dev/null
+++ b/src/transformers/models/omdet_turbo/configuration_omdet_turbo.py
@@ -0,0 +1,290 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""OmDet-Turbo model configuration"""
+
+from ...configuration_utils import PretrainedConfig
+from ...utils import logging
+from ...utils.backbone_utils import verify_backbone_config_arguments
+from ..auto import CONFIG_MAPPING
+
+
+logger = logging.get_logger(__name__)
+
+
+class OmDetTurboConfig(PretrainedConfig):
+ r"""
+ This is the configuration class to store the configuration of a [`OmDetTurboForObjectDetection`].
+ It is used to instantiate an OmDet-Turbo model according to the specified arguments, defining the model architecture.
+ Instantiating a configuration with the defaults will yield a similar configuration to that of the OmDet-Turbo
+ [omlab/omdet-turbo-swin-tiny-hf](https://huggingface.co/omlab/omdet-turbo-swin-tiny-hf) architecture.
+
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+ documentation from [`PretrainedConfig`] for more information.
+
+ Args:
+ text_config (`PretrainedConfig`, *optional*):
+ The configuration of the text backbone.
+ backbone_config (`PretrainedConfig`, *optional*):
+ The configuration of the vision backbone.
+ use_timm_backbone (`bool`, *optional*, defaults to `True`):
+ Whether to use the timm library for the vision backbone.
+ backbone (`str`, *optional*, defaults to `"swin_tiny_patch4_window7_224"`):
+ The name of the pretrained vision backbone to use. If `use_pretrained_backbone=False`, a randomly initialized
+ backbone with the same architecture as `backbone` is used.
+ backbone_kwargs (`dict`, *optional*):
+ Additional kwargs for the vision backbone.
+ use_pretrained_backbone (`bool`, *optional*, defaults to `False`):
+ Whether to use a pretrained vision backbone.
+ apply_layernorm_after_vision_backbone (`bool`, *optional*, defaults to `True`):
+ Whether to apply layer normalization on the feature maps of the vision backbone output.
+ image_size (`int`, *optional*, defaults to 640):
+ The size (resolution) of each image.
+ disable_custom_kernels (`bool`, *optional*, defaults to `False`):
+ Whether to disable custom kernels.
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
+ The epsilon value for layer normalization.
+ batch_norm_eps (`float`, *optional*, defaults to 1e-05):
+ The epsilon value for batch normalization.
+ init_std (`float`, *optional*, defaults to 0.02):
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+ text_projection_in_dim (`int`, *optional*, defaults to 512):
+ The input dimension for the text projection.
+ text_projection_out_dim (`int`, *optional*, defaults to 512):
+ The output dimension for the text projection.
+ task_encoder_hidden_dim (`int`, *optional*, defaults to 1024):
+ The feedforward dimension for the task encoder.
+ class_embed_dim (`int`, *optional*, defaults to 512):
+ The dimension of the class embeddings.
+ class_distance_type (`str`, *optional*, defaults to `"cosine"`):
+ The type of distance used to compare the predicted classes to the projected class embeddings.
+ Can be `"cosine"` or `"dot"`.
+ num_queries (`int`, *optional*, defaults to 900):
+ The number of queries.
+ csp_activation (`str`, *optional*, defaults to `"silu"`):
+ The activation function of the Cross Stage Partial (CSP) networks of the encoder.
+ conv_norm_activation (`str`, *optional*, defaults to `"gelu"`):
+ The activation function of the ConvNormLayer layers of the encoder.
+ encoder_feedforward_activation (`str`, *optional*, defaults to `"relu"`):
+ The activation function for the feedforward network of the encoder.
+ encoder_feedforward_dropout (`float`, *optional*, defaults to 0.0):
+ The dropout rate following the activation of the encoder feedforward network.
+ encoder_dropout (`float`, *optional*, defaults to 0.0):
+ The dropout rate of the encoder multi-head attention module.
+ hidden_expansion (`int`, *optional*, defaults to 1):
+ The hidden expansion of the CSP networks in the encoder.
+ vision_features_channels (`tuple(int)`, *optional*, defaults to `[256, 256, 256]`):
+ The channels of the projected vision features used as inputs for the decoder.
+ encoder_hidden_dim (`int`, *optional*, defaults to 256):
+ The hidden dimension of the encoder.
+ encoder_in_channels (`List(int)`, *optional*, defaults to `[192, 384, 768]`):
+ The input channels for the encoder.
+ encoder_projection_indices (`List(int)`, *optional*, defaults to `[2]`):
+ The indices of the input features projected by each layer.
+ encoder_attention_heads (`int`, *optional*, defaults to 8):
+ The number of attention heads for the encoder.
+ encoder_dim_feedforward (`int`, *optional*, defaults to 2048):
+ The feedforward dimension for the encoder.
+ encoder_layers (`int`, *optional*, defaults to 1):
+ The number of layers in the encoder.
+ positional_encoding_temperature (`int`, *optional*, defaults to 10000):
+ The positional encoding temperature in the encoder.
+ num_feature_levels (`int`, *optional*, defaults to 3):
+ The number of feature levels for the multi-scale deformable attention module of the decoder.
+ decoder_hidden_dim (`int`, *optional*, defaults to 256):
+ The hidden dimension of the decoder.
+ decoder_num_heads (`int`, *optional*, defaults to 8):
+ The number of heads for the decoder.
+ decoder_num_layers (`int`, *optional*, defaults to 6):
+ The number of layers for the decoder.
+ decoder_activation (`str`, *optional*, defaults to `"relu"`):
+ The activation function for the decoder.
+ decoder_dim_feedforward (`int`, *optional*, defaults to 2048):
+ The feedforward dimension for the decoder.
+ decoder_num_points (`int`, *optional*, defaults to 4):
+ The number of points sampled in the decoder multi-scale deformable attention module.
+ decoder_dropout (`float`, *optional*, defaults to 0.0):
+ The dropout rate for the decoder.
+ eval_size (`Tuple[int, int]`, *optional*):
+ Height and width used to compute the effective height and width of the position embeddings after taking
+ into account the stride (see RTDetr).
+ learn_initial_query (`bool`, *optional*, defaults to `False`):
+ Whether to learn the initial query.
+ cache_size (`int`, *optional*, defaults to 100):
+ The cache size for the class and prompt embedding caches.
+ is_encoder_decoder (`bool`, *optional*, defaults to `True`):
+ Whether the model is used as an encoder-decoder model or not.
+ kwargs (`Dict[str, Any]`, *optional*):
+ Additional parameters from the architecture. The values in kwargs will be saved as part of the configuration
+ and can be used to control the model outputs.
+
+ Examples:
+
+ ```python
+ >>> from transformers import OmDetTurboConfig, OmDetTurboForObjectDetection
+
+ >>> # Initializing an OmDet-Turbo omlab/omdet-turbo-swin-tiny-hf style configuration
+ >>> configuration = OmDetTurboConfig()
+
+ >>> # Initializing a model (with random weights) from the omlab/omdet-turbo-swin-tiny-hf style configuration
+ >>> model = OmDetTurboForObjectDetection(configuration)
+
+ >>> # Accessing the model configuration
+ >>> configuration = model.config
+ ```"""
+
+ model_type = "omdet-turbo"
+ attribute_map = {
+ "encoder_hidden_dim": "d_model",
+ "num_attention_heads": "encoder_attention_heads",
+ }
+
+ def __init__(
+ self,
+ text_config=None,
+ backbone_config=None,
+ use_timm_backbone=True,
+ backbone="swin_tiny_patch4_window7_224",
+ backbone_kwargs=None,
+ use_pretrained_backbone=False,
+ apply_layernorm_after_vision_backbone=True,
+ image_size=640,
+ disable_custom_kernels=False,
+ layer_norm_eps=1e-5,
+ batch_norm_eps=1e-5,
+ init_std=0.02,
+ text_projection_in_dim=512,
+ text_projection_out_dim=512,
+ task_encoder_hidden_dim=1024,
+ class_embed_dim=512,
+ class_distance_type="cosine",
+ num_queries=900,
+ csp_activation="silu",
+ conv_norm_activation="gelu",
+ encoder_feedforward_activation="relu",
+ encoder_feedforward_dropout=0.0,
+ encoder_dropout=0.0,
+ hidden_expansion=1,
+ vision_features_channels=[256, 256, 256],
+ encoder_hidden_dim=256,
+ encoder_in_channels=[192, 384, 768],
+ encoder_projection_indices=[2],
+ encoder_attention_heads=8,
+ encoder_dim_feedforward=2048,
+ encoder_layers=1,
+ positional_encoding_temperature=10000,
+ num_feature_levels=3,
+ decoder_hidden_dim=256,
+ decoder_num_heads=8,
+ decoder_num_layers=6,
+ decoder_activation="relu",
+ decoder_dim_feedforward=2048,
+ decoder_num_points=4,
+ decoder_dropout=0.0,
+ eval_size=None,
+ learn_initial_query=False,
+ cache_size=100,
+ is_encoder_decoder=True,
+ **kwargs,
+ ):
+ if use_timm_backbone:
+ if backbone_config is None:
+ backbone_kwargs = {
+ "out_indices": [1, 2, 3],
+ "img_size": image_size,
+ "always_partition": True,
+ }
+ elif backbone_config is None:
+ logger.info("`backbone_config` is `None`. Initializing the config with the default `swin` vision config.")
+ backbone_config = CONFIG_MAPPING["swin"](
+ window_size=7,
+ image_size=image_size,
+ embed_dim=96,
+ depths=[2, 2, 6, 2],
+ num_heads=[3, 6, 12, 24],
+ out_indices=[2, 3, 4],
+ )
+ elif isinstance(backbone_config, dict):
+ backbone_model_type = backbone_config.get("model_type")
+ config_class = CONFIG_MAPPING[backbone_model_type]
+ backbone_config = config_class.from_dict(backbone_config)
+
+ verify_backbone_config_arguments(
+ use_timm_backbone=use_timm_backbone,
+ use_pretrained_backbone=use_pretrained_backbone,
+ backbone=backbone,
+ backbone_config=backbone_config,
+ backbone_kwargs=backbone_kwargs,
+ )
+
+ if text_config is None:
+ logger.info(
+ "`text_config` is `None`. Initializing the config with the default `clip_text_model` text config."
+ )
+ text_config = CONFIG_MAPPING["clip_text_model"]()
+ elif isinstance(text_config, dict):
+ text_model_type = text_config.get("model_type")
+ text_config = CONFIG_MAPPING[text_model_type](**text_config)
+
+ if class_distance_type not in ["cosine", "dot"]:
+ raise ValueError(
+ f"Invalid `class_distance_type`. It should be either `cosine` or `dot`, but got {class_distance_type}."
+ )
+
+ self.text_config = text_config
+ self.backbone_config = backbone_config
+ self.use_timm_backbone = use_timm_backbone
+ self.backbone = backbone
+ self.backbone_kwargs = backbone_kwargs
+ self.use_pretrained_backbone = use_pretrained_backbone
+ self.apply_layernorm_after_vision_backbone = apply_layernorm_after_vision_backbone
+ self.image_size = image_size
+ self.disable_custom_kernels = disable_custom_kernels
+ self.layer_norm_eps = layer_norm_eps
+ self.batch_norm_eps = batch_norm_eps
+ self.init_std = init_std
+ self.text_projection_in_dim = text_projection_in_dim
+ self.text_projection_out_dim = text_projection_out_dim
+ self.task_encoder_hidden_dim = task_encoder_hidden_dim
+ self.class_embed_dim = class_embed_dim
+ self.class_distance_type = class_distance_type
+ self.num_queries = num_queries
+ self.csp_activation = csp_activation
+ self.conv_norm_activation = conv_norm_activation
+ self.encoder_feedforward_activation = encoder_feedforward_activation
+ self.encoder_feedforward_dropout = encoder_feedforward_dropout
+ self.encoder_dropout = encoder_dropout
+ self.hidden_expansion = hidden_expansion
+ self.vision_features_channels = vision_features_channels
+ self.encoder_hidden_dim = encoder_hidden_dim
+ self.encoder_in_channels = encoder_in_channels
+ self.encoder_projection_indices = encoder_projection_indices
+ self.encoder_attention_heads = encoder_attention_heads
+ self.encoder_dim_feedforward = encoder_dim_feedforward
+ self.encoder_layers = encoder_layers
+ self.positional_encoding_temperature = positional_encoding_temperature
+ self.num_feature_levels = num_feature_levels
+ self.decoder_hidden_dim = decoder_hidden_dim
+ self.decoder_num_heads = decoder_num_heads
+ self.decoder_num_layers = decoder_num_layers
+ self.decoder_activation = decoder_activation
+ self.decoder_dim_feedforward = decoder_dim_feedforward
+ self.decoder_num_points = decoder_num_points
+ self.decoder_dropout = decoder_dropout
+ self.eval_size = eval_size
+ self.learn_initial_query = learn_initial_query
+ self.cache_size = cache_size
+ self.is_encoder_decoder = is_encoder_decoder
+
+ super().__init__(is_encoder_decoder=is_encoder_decoder, **kwargs)
diff --git a/src/transformers/models/omdet_turbo/convert_omdet_turbo_to_hf.py b/src/transformers/models/omdet_turbo/convert_omdet_turbo_to_hf.py
new file mode 100644
index 000000000000..2e515e983408
--- /dev/null
+++ b/src/transformers/models/omdet_turbo/convert_omdet_turbo_to_hf.py
@@ -0,0 +1,349 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Convert OmDet-Turbo checkpoints from the original repository.
+
+URL: https://github.com/om-ai-lab/OmDet"""
+
+import argparse
+
+import requests
+import torch
+from PIL import Image
+
+from transformers import (
+ CLIPTokenizer,
+ DetrImageProcessor,
+ OmDetTurboConfig,
+ OmDetTurboForObjectDetection,
+ OmDetTurboProcessor,
+)
+
+
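+# ImageNet mean and std expressed in the 0-255 pixel range (the image processor below is created with do_rescale=False)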
+IMAGE_MEAN = [123.675, 116.28, 103.53]
+IMAGE_STD = [58.395, 57.12, 57.375]
+
+
+def get_omdet_turbo_config(model_name, use_timm_backbone):
+ if "tiny" in model_name:
+ window_size = 7
+ embed_dim = 96
+ depths = (2, 2, 6, 2)
+ num_heads = (3, 6, 12, 24)
+ image_size = 640
+ else:
+ raise ValueError("Model not supported, only supports tiny variant.")
+
+ config = OmDetTurboConfig(
+ backbone_window_size=window_size,
+ backbone_image_size=image_size,
+ backbone_embed_dim=embed_dim,
+ backbone_depths=depths,
+ backbone_num_heads=num_heads,
+ backbone_out_indices=(1, 2, 3),
+ text_config={"model_type": "clip_text_model"},
+ use_timm_backbone=use_timm_backbone,
+ backbone="swin_tiny_patch4_window7_224" if use_timm_backbone else None,
+ apply_layernorm_after_vision_backbone=True if use_timm_backbone else False,
+ use_pretrained_backbone=False,
+ )
+
+ return config
+
+
+def create_rename_keys_vision(state_dict, config):
+ rename_keys = []
+ # fmt: off
+ ########################################## VISION BACKBONE - START
+ for layer_name in state_dict.keys():
+ if layer_name.startswith("backbone") and not layer_name.startswith("backbone.norm"):
+ if config.use_timm_backbone:
+ layer_name_replace = layer_name.replace("backbone", "vision_backbone.vision_backbone._backbone")
+ layer_name_replace = layer_name_replace.replace(".layers.", ".layers_")
+ if "downsample" in layer_name:
+ # get layer number
+ layer_num = int(layer_name.split(".")[2])
+ layer_name_replace = layer_name_replace.replace(f"{layer_num}.downsample", f"{layer_num+1}.downsample")
+ else:
+ layer_name_replace = layer_name.replace("backbone", "vision_backbone.vision_backbone")
+ layer_name_replace = layer_name_replace.replace("patch_embed.proj", "embeddings.patch_embeddings.projection")
+ layer_name_replace = layer_name_replace.replace("patch_embed.norm", "embeddings.norm")
+ if layer_name.startswith("backbone.layers"):
+ layer_name_replace = layer_name_replace.replace("norm1", "layernorm_before")
+ layer_name_replace = layer_name_replace.replace("norm2", "layernorm_after")
+ layer_name_replace = layer_name_replace.replace("attn.proj", "attention.output.dense")
+ layer_name_replace = layer_name_replace.replace("mlp.fc1", "intermediate.dense")
+ layer_name_replace = layer_name_replace.replace("mlp.fc2", "output.dense")
+ layer_name_replace = layer_name_replace.replace(".layers.", ".encoder.layers.")
+ layer_name_replace = layer_name_replace.replace(".attn.", ".attention.self.")
+ elif layer_name.startswith("backbone.norm"):
+ layer_num = int(layer_name.split("norm")[1].split(".")[0])
+ if config.use_timm_backbone:
+ layer_name_replace = layer_name.replace("backbone", "vision_backbone")
+ layer_name_replace = layer_name_replace.replace(f"norm{layer_num}", f"layer_norms.{layer_num-1}")
+ else:
+ layer_name_replace = layer_name.replace(f"backbone.norm{layer_num}", f"vision_backbone.vision_backbone.hidden_states_norms.stage{layer_num+1}")
+ else:
+ continue
+ rename_keys.append((layer_name, layer_name_replace))
+ ########################################## VISION BACKBONE - END
+
+ ########################################## ENCODER - START
+ for layer_name, params in state_dict.items():
+ if "neck" in layer_name:
+ layer_name_replace = layer_name.replace("neck", "encoder")
+ layer_name_replace = layer_name_replace.replace("input_proj", "channel_projection_layers")
+ if "fpn_blocks" in layer_name or "pan_blocks" in layer_name or "lateral_convs" in layer_name or "downsample_convs" in layer_name:
+ layer_name_replace = layer_name_replace.replace(".m.", ".bottlenecks.")
+ layer_name_replace = layer_name_replace.replace(".cv", ".conv")
+ layer_name_replace = layer_name_replace.replace(".bn", ".norm")
+ if "encoder_layer" in layer_name:
+ layer_name_replace = layer_name_replace.replace("encoder_layer", "encoder.0.layers.0")
+ layer_name_replace = layer_name_replace.replace(".linear", ".fc")
+ layer_name_replace = layer_name_replace.replace("norm1", "self_attn_layer_norm")
+ layer_name_replace = layer_name_replace.replace("norm2", "final_layer_norm")
+ rename_keys.append((layer_name, layer_name_replace))
+ ########################################## ENCODER - END
+
+ ########################################## DECODER - START
+ for layer_name, params in state_dict.items():
+ if layer_name.startswith("decoder"):
+ layer_name_replace = layer_name.replace("decoder.decoder.layers", "decoder.layers")
+ layer_name_replace = layer_name_replace.replace("input_proj", "channel_projection_layers")
+ layer_name_replace = layer_name_replace.replace("query_pos_head", "query_position_head")
+ layer_name_replace = layer_name_replace.replace("enc_bbox_head", "encoder_bbox_head")
+ layer_name_replace = layer_name_replace.replace("enc_output", "encoder_vision_features")
+ layer_name_replace = layer_name_replace.replace("dec_score_head", "decoder_class_head")
+ layer_name_replace = layer_name_replace.replace("dec_bbox_head", "decoder_bbox_head")
+ layer_name_replace = layer_name_replace.replace("enc_score_head", "encoder_class_head")
+ rename_keys.append((layer_name, layer_name_replace))
+ ########################################## DECODER - END
+ # fmt: on
+ return rename_keys
+
+
+def create_rename_keys_language(state_dict):
+ rename_keys = []
+ # fmt: off
+ for layer_name in state_dict.keys():
+ if layer_name.startswith("language_backbone") and not layer_name.startswith("language_backbone.text_projection"):
+ layer_name_replace = layer_name.replace("language_backbone", "language_backbone.model.text_model")
+ layer_name_replace = layer_name_replace.replace("transformer.resblocks", "encoder.layers")
+ layer_name_replace = layer_name_replace.replace("token_embedding", "embeddings.token_embedding")
+ layer_name_replace = layer_name_replace.replace("positional_embedding", "embeddings.position_embedding.weight")
+ layer_name_replace = layer_name_replace.replace(".attn", ".self_attn")
+ layer_name_replace = layer_name_replace.replace(".mlp.c_fc", ".mlp.fc1")
+ layer_name_replace = layer_name_replace.replace(".mlp.c_proj", ".mlp.fc2")
+ layer_name_replace = layer_name_replace.replace("ln_final", "final_layer_norm")
+ layer_name_replace = layer_name_replace.replace(".ln_", ".layer_norm")
+ rename_keys.append((layer_name, layer_name_replace))
+ # fmt: on
+ return rename_keys
+
+
+def rename_key(dct, old, new):
+ val = dct.pop(old)
+ dct[new] = val
+
+
+# we split up the matrix of each encoder layer into queries, keys and values
+def read_in_q_k_v_vision(state_dict, config):
+ state_dict_keys = list(state_dict.keys())
+ for layer_name_vision in state_dict_keys:
+ if layer_name_vision.startswith("vision_backbone") and "qkv" in layer_name_vision:
+ layer_num = int(layer_name_vision.split(".")[4])
+ hidden_size = config.backbone_config.embed_dim * 2**layer_num
+ if "weight" in layer_name_vision:
+ in_proj_weight = state_dict.pop(layer_name_vision)
+ state_dict[layer_name_vision.replace("qkv.weight", "key.weight")] = in_proj_weight[:hidden_size, :]
+ state_dict[layer_name_vision.replace("qkv.weight", "query.weight")] = in_proj_weight[
+ hidden_size : hidden_size * 2, :
+ ]
+ state_dict[layer_name_vision.replace("qkv.weight", "value.weight")] = in_proj_weight[-hidden_size:, :]
+ elif "bias" in layer_name_vision:
+ in_proj_bias = state_dict.pop(layer_name_vision)
+ state_dict[layer_name_vision.replace("qkv.bias", "key.bias")] = in_proj_bias[:hidden_size]
+ state_dict[layer_name_vision.replace("qkv.bias", "query.bias")] = in_proj_bias[
+ hidden_size : hidden_size * 2
+ ]
+ state_dict[layer_name_vision.replace("qkv.bias", "value.bias")] = in_proj_bias[-hidden_size:]
+
+
+def read_in_q_k_v_text(state_dict, config):
+ state_dict_keys = list(state_dict.keys())
+ hidden_size = config.text_config.projection_dim
+ for layer_name_text in state_dict_keys:
+ if layer_name_text.startswith("language_backbone") and "in_proj" in layer_name_text:
+ if "weight" in layer_name_text:
+ in_proj_weight = state_dict.pop(layer_name_text)
+ state_dict[layer_name_text.replace("in_proj_weight", "q_proj.weight")] = in_proj_weight[
+ :hidden_size, :
+ ]
+ state_dict[layer_name_text.replace("in_proj_weight", "k_proj.weight")] = in_proj_weight[
+ hidden_size : hidden_size * 2, :
+ ]
+ state_dict[layer_name_text.replace("in_proj_weight", "v_proj.weight")] = in_proj_weight[
+ -hidden_size:, :
+ ]
+ elif "bias" in layer_name_text:
+ in_proj_bias = state_dict.pop(layer_name_text)
+ state_dict[layer_name_text.replace("in_proj_bias", "q_proj.bias")] = in_proj_bias[:hidden_size]
+ state_dict[layer_name_text.replace("in_proj_bias", "k_proj.bias")] = in_proj_bias[
+ hidden_size : hidden_size * 2
+ ]
+ state_dict[layer_name_text.replace("in_proj_bias", "v_proj.bias")] = in_proj_bias[-hidden_size:]
+
+
+def read_in_q_k_v_encoder(state_dict, config):
+ embed_dim = config.encoder_hidden_dim
+ # read in weights + bias of input projection layer (in original implementation, this is a single matrix + bias)
+ in_proj_weight = state_dict.pop("encoder.encoder.0.layers.0.self_attn.in_proj_weight")
+ in_proj_bias = state_dict.pop("encoder.encoder.0.layers.0.self_attn.in_proj_bias")
+ # next, add query, keys and values (in that order) to the state dict
+ state_dict["encoder.encoder.0.layers.0.self_attn.query.weight"] = in_proj_weight[:embed_dim, :]
+ state_dict["encoder.encoder.0.layers.0.self_attn.query.bias"] = in_proj_bias[:embed_dim]
+ state_dict["encoder.encoder.0.layers.0.self_attn.key.weight"] = in_proj_weight[embed_dim : embed_dim * 2, :]
+ state_dict["encoder.encoder.0.layers.0.self_attn.key.bias"] = in_proj_bias[embed_dim : embed_dim * 2]
+ state_dict["encoder.encoder.0.layers.0.self_attn.value.weight"] = in_proj_weight[-embed_dim:, :]
+ state_dict["encoder.encoder.0.layers.0.self_attn.value.bias"] = in_proj_bias[-embed_dim:]
+
+
+def read_in_q_k_v_decoder(state_dict, config):
+ for layer_num in range(config.decoder_num_layers):
+ embed_dim = config.decoder_hidden_dim
+ # read in weights + bias of input projection layer (in original implementation, this is a single matrix + bias)
+ in_proj_weight = state_dict.pop(f"decoder.layers.{layer_num}.self_attn.in_proj_weight")
+ in_proj_bias = state_dict.pop(f"decoder.layers.{layer_num}.self_attn.in_proj_bias")
+ # next, add query, keys and values (in that order) to the state dict
+ state_dict[f"decoder.layers.{layer_num}.self_attn.query.weight"] = in_proj_weight[:embed_dim, :]
+ state_dict[f"decoder.layers.{layer_num}.self_attn.query.bias"] = in_proj_bias[:embed_dim]
+ state_dict[f"decoder.layers.{layer_num}.self_attn.key.weight"] = in_proj_weight[embed_dim : embed_dim * 2, :]
+ state_dict[f"decoder.layers.{layer_num}.self_attn.key.bias"] = in_proj_bias[embed_dim : embed_dim * 2]
+ state_dict[f"decoder.layers.{layer_num}.self_attn.value.weight"] = in_proj_weight[-embed_dim:, :]
+ state_dict[f"decoder.layers.{layer_num}.self_attn.value.bias"] = in_proj_bias[-embed_dim:]
+
+
+def run_test(model, processor):
+ # We will verify our results on an image of cute cats
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+
+ classes = ["cat", "remote"]
+ task = "Detect {}.".format(", ".join(classes))
+ inputs = processor(image, text=classes, task=task, return_tensors="pt")
+
+ # Running forward
+ with torch.no_grad():
+ outputs = model(**inputs)
+
+ predicted_slice = outputs[1][0, :3, :3]
+ print(predicted_slice)
+ expected_slice = torch.tensor([[0.9427, -2.5958], [0.2105, -3.4569], [-2.6364, -4.1610]])
+
+ assert torch.allclose(predicted_slice, expected_slice, atol=1e-4)
+ print("Looks ok!")
+
+
+@torch.no_grad()
+def convert_omdet_turbo_checkpoint(args):
+ model_name = args.model_name
+ pytorch_dump_folder_path = args.pytorch_dump_folder_path
+ push_to_hub = args.push_to_hub
+ use_timm_backbone = args.use_timm_backbone
+
+ checkpoint_mapping = {
+ "omdet-turbo-tiny": [
+ "https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/OmDet-Turbo_tiny_SWIN_T.pth",
+ "https://huggingface.co/omlab/OmDet-Turbo_tiny_SWIN_T/resolve/main/ViT-B-16.pt",
+ ],
+ }
+ # Define default OmDetTurbo configuration
+ config = get_omdet_turbo_config(model_name, use_timm_backbone)
+
+ # Load original checkpoint
+ checkpoint_url = checkpoint_mapping[model_name]
+ original_state_dict_vision = torch.hub.load_state_dict_from_url(checkpoint_url[0], map_location="cpu")["model"]
+ original_state_dict_vision = {k.replace("module.", ""): v for k, v in original_state_dict_vision.items()}
+
+ # Rename keys
+ new_state_dict = original_state_dict_vision.copy()
+ rename_keys_vision = create_rename_keys_vision(new_state_dict, config)
+
+ rename_keys_language = create_rename_keys_language(new_state_dict)
+
+ for src, dest in rename_keys_vision:
+ rename_key(new_state_dict, src, dest)
+
+ for src, dest in rename_keys_language:
+ rename_key(new_state_dict, src, dest)
+
+ if not use_timm_backbone:
+ read_in_q_k_v_vision(new_state_dict, config)
+ read_in_q_k_v_text(new_state_dict, config)
+ read_in_q_k_v_encoder(new_state_dict, config)
+ read_in_q_k_v_decoder(new_state_dict, config)
+ # add "model" prefix to all keys
+ new_state_dict = {f"model.{k}": v for k, v in new_state_dict.items()}
+
+ # Load HF model
+ model = OmDetTurboForObjectDetection(config)
+ model.eval()
+ missing_keys, unexpected_keys = model.load_state_dict(new_state_dict, strict=False)
+ print("Missing keys:", missing_keys)
+ print("Unexpected keys:", unexpected_keys)
+
+ image_processor = DetrImageProcessor(
+ size={"height": config.backbone_image_size, "width": config.backbone_image_size},
+ do_rescale=False,
+ image_mean=IMAGE_MEAN,
+ image_std=IMAGE_STD,
+ do_pad=False,
+ )
+ tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
+ processor = OmDetTurboProcessor(image_processor=image_processor, tokenizer=tokenizer)
+
+ # end-to-end consistency test
+ run_test(model, processor)
+
+ if pytorch_dump_folder_path is not None:
+ model.save_pretrained(pytorch_dump_folder_path)
+ processor.save_pretrained(pytorch_dump_folder_path)
+
+ if push_to_hub:
+ model.push_to_hub(f"omlab/{model_name}")
+ processor.push_to_hub(f"omlab/{model_name}")
+
+
+if __name__ == "__main__":
+ parser = argparse.ArgumentParser()
+ # Required parameters
+ parser.add_argument(
+ "--model_name",
+ default="omdet-turbo-tiny",
+ type=str,
+ choices=["omdet-turbo-tiny"],
+ help="Name of the OmDetTurbo model you'd like to convert.",
+ )
+ parser.add_argument(
+ "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
+ )
+ parser.add_argument(
+ "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub."
+ )
+ parser.add_argument(
+ "--use_timm_backbone", action="store_true", help="Whether or not to use timm backbone for vision backbone."
+ )
+
+ args = parser.parse_args()
+ convert_omdet_turbo_checkpoint(args)
diff --git a/src/transformers/models/omdet_turbo/modeling_omdet_turbo.py b/src/transformers/models/omdet_turbo/modeling_omdet_turbo.py
new file mode 100644
index 000000000000..bb6c8838ff8c
--- /dev/null
+++ b/src/transformers/models/omdet_turbo/modeling_omdet_turbo.py
@@ -0,0 +1,1810 @@
+# coding=utf-8
+# Copyright 2024 Om Research Lab and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""PyTorch OmDet-Turbo model."""
+
+import math
+import os
+import warnings
+from collections import OrderedDict
+from dataclasses import dataclass
+from functools import lru_cache
+from pathlib import Path
+from typing import Optional, Tuple, Union
+
+import torch
+import torch.nn.functional as F
+from torch import Tensor, nn
+from torch.autograd import Function
+from torch.autograd.function import once_differentiable
+
+from ...activations import ACT2CLS, ACT2FN
+from ...file_utils import (
+ ModelOutput,
+ add_start_docstrings,
+ add_start_docstrings_to_model_forward,
+ is_torch_cuda_available,
+ replace_return_docstrings,
+)
+from ...modeling_attn_mask_utils import _prepare_4d_attention_mask
+from ...modeling_utils import PreTrainedModel
+from ...utils import is_ninja_available, logging
+from ...utils.backbone_utils import load_backbone
+from ..auto import AutoModel
+from .configuration_omdet_turbo import OmDetTurboConfig
+
+
+MultiScaleDeformableAttention = None
+
+logger = logging.get_logger(__name__)
+_CONFIG_FOR_DOC = "OmDetTurboConfig"
+
+
+@dataclass
+class OmDetTurboEncoderOutput(ModelOutput):
+ """
+ Base class for outputs of the OmDetTurboHybridEncoder.
+
+ Args:
+ last_hidden_state (`torch.FloatTensor`):
+ Last hidden states of the encoder.
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
+
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
+ sequence_length)`.
+
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
+ heads.
+ extracted_states (`Tuple[torch.FloatTensor]`):
+ The extracted states from the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) of the encoder.
+ """
+
+ last_hidden_state: torch.FloatTensor = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
+ extracted_states: Tuple[torch.FloatTensor] = None
+
+
+@dataclass
+class OmDetTurboDecoderOutput(ModelOutput):
+ """
+ Base class for outputs of the OmDetTurboDecoder.
+
+ Args:
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+ Sequence of hidden-states at the output of the last layer of the decoder.
+ decoder_coords (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The predicted coordinates of the objects.
+ decoder_classes (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes)`):
+ The predicted classes of the objects.
+ encoder_coord_logits (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The predicted coordinates of the objects from the encoder.
+ encoder_class_logits (`Tuple[torch.FloatTensor]` of shape `(batch_size, num_queries, num_classes)`):
+ The predicted class of the objects from the encoder.
+ init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The initial reference points.
+ intermediate_reference_points (`Tuple[Tuple[torch.FloatTensor]]`):
+ The intermediate reference points.
+ hidden_states (`Optional[Tuple[torch.FloatTensor]]`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of shape
+ `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
+ plus the initial embedding outputs.
+ attentions (`Optional[Tuple[Tuple[torch.FloatTensor]]]`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
+ Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads,
+ sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
+ weighted average in the self-attention, cross-attention and multi-scale deformable attention heads.
+ """
+
+ last_hidden_state: torch.FloatTensor = None
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
+ decoder_coords: torch.FloatTensor = None
+ decoder_classes: torch.FloatTensor = None
+ encoder_coord_logits: torch.FloatTensor = None
+ encoder_class_logits: Tuple[torch.FloatTensor] = None
+ init_reference_points: torch.FloatTensor = None
+ intermediate_reference_points: Tuple[Tuple[torch.FloatTensor]] = None
+
+
+@dataclass
+class OmDetTurboObjectDetectionOutput(ModelOutput):
+ """
+ Output type of [`OmDetTurboForObjectDetection`].
+
+ Args:
+ loss (`torch.FloatTensor`):
+ The loss value.
+ decoder_coord_logits (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The predicted coordinates logits of the objects.
+ decoder_class_logits (`torch.FloatTensor` of shape `(batch_size, num_queries, num_classes)`):
+ The predicted class of the objects.
+ init_reference_points (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The initial reference points.
+ intermediate_reference_points (`Tuple[Tuple[torch.FloatTensor]]`):
+ The intermediate reference points.
+ encoder_coord_logits (`torch.FloatTensor` of shape `(batch_size, num_queries, 4)`):
+ The predicted coordinates of the objects from the encoder.
+ encoder_class_logits (`Tuple[torch.FloatTensor]`):
+ The predicted class of the objects from the encoder.
+ encoder_extracted_states (`torch.FloatTensor`):
+ The extracted states from the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) of the encoder.
+ decoder_hidden_states (`Optional[Tuple[torch.FloatTensor]]`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of shape
+ `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
+ plus the initial embedding outputs.
+ decoder_attentions (`Optional[Tuple[Tuple[torch.FloatTensor]]]`):
+ Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads,
+ sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
+ weighted average in the self-attention, cross-attention and multi-scale deformable attention heads.
+ encoder_hidden_states (`Optional[Tuple[torch.FloatTensor]]`):
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of shape
+ `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer
+ plus the initial embedding outputs.
+ encoder_attentions (`Optional[Tuple[Tuple[torch.FloatTensor]]]`):
+ Tuple of tuples of `torch.FloatTensor` (one for attention for each layer) of shape `(batch_size, num_heads,
+ sequence_length, sequence_length)`. Attentions weights after the attention softmax, used to compute the
+ weighted average in the self-attention, cross-attention and multi-scale deformable attention heads.
+ """
+
+ loss: torch.FloatTensor = None
+ decoder_coord_logits: torch.FloatTensor = None
+ decoder_class_logits: torch.FloatTensor = None
+ init_reference_points: torch.FloatTensor = None
+ intermediate_reference_points: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
+ encoder_coord_logits: torch.FloatTensor = None
+ encoder_class_logits: Tuple[torch.FloatTensor] = None
+ encoder_extracted_states: torch.FloatTensor = None
+ decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ decoder_attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
+ encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+ encoder_attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
+
+
+# Copied from models.deformable_detr.load_cuda_kernels
+def load_cuda_kernels():
+ from torch.utils.cpp_extension import load
+
+ global MultiScaleDeformableAttention
+
+ root = Path(__file__).resolve().parent.parent.parent / "kernels" / "deformable_detr"
+ src_files = [
+ root / filename
+ for filename in [
+ "vision.cpp",
+ os.path.join("cpu", "ms_deform_attn_cpu.cpp"),
+ os.path.join("cuda", "ms_deform_attn_cuda.cu"),
+ ]
+ ]
+
+ MultiScaleDeformableAttention = load(
+ "MultiScaleDeformableAttention",
+ src_files,
+ with_cuda=True,
+ extra_include_paths=[str(root)],
+ extra_cflags=["-DWITH_CUDA=1"],
+ extra_cuda_cflags=[
+ "-DCUDA_HAS_FP16=1",
+ "-D__CUDA_NO_HALF_OPERATORS__",
+ "-D__CUDA_NO_HALF_CONVERSIONS__",
+ "-D__CUDA_NO_HALF2_OPERATORS__",
+ ],
+ )
+
+
+# Copied from transformers.models.deformable_detr.modeling_deformable_detr.multi_scale_deformable_attention
+def multi_scale_deformable_attention(
+ value: Tensor, value_spatial_shapes: Tensor, sampling_locations: Tensor, attention_weights: Tensor
+) -> Tensor:
+ batch_size, _, num_heads, hidden_dim = value.shape
+ _, num_queries, num_heads, num_levels, num_points, _ = sampling_locations.shape
+ # Ignore copy
+ value_list = value.split([height * width for height, width in value_spatial_shapes], dim=1)
+ sampling_grids = 2 * sampling_locations - 1
+ sampling_value_list = []
+ for level_id, (height, width) in enumerate(value_spatial_shapes):
+ # batch_size, height*width, num_heads, hidden_dim
+ # -> batch_size, height*width, num_heads*hidden_dim
+ # -> batch_size, num_heads*hidden_dim, height*width
+ # -> batch_size*num_heads, hidden_dim, height, width
+ value_l_ = (
+ value_list[level_id].flatten(2).transpose(1, 2).reshape(batch_size * num_heads, hidden_dim, height, width)
+ )
+ # batch_size, num_queries, num_heads, num_points, 2
+ # -> batch_size, num_heads, num_queries, num_points, 2
+ # -> batch_size*num_heads, num_queries, num_points, 2
+ sampling_grid_l_ = sampling_grids[:, :, :, level_id].transpose(1, 2).flatten(0, 1)
+ # batch_size*num_heads, hidden_dim, num_queries, num_points
+ sampling_value_l_ = nn.functional.grid_sample(
+ value_l_, sampling_grid_l_, mode="bilinear", padding_mode="zeros", align_corners=False
+ )
+ sampling_value_list.append(sampling_value_l_)
+ # (batch_size, num_queries, num_heads, num_levels, num_points)
+ # -> (batch_size, num_heads, num_queries, num_levels, num_points)
+ # -> (batch_size, num_heads, 1, num_queries, num_levels*num_points)
+ attention_weights = attention_weights.transpose(1, 2).reshape(
+ batch_size * num_heads, 1, num_queries, num_levels * num_points
+ )
+ output = (
+ (torch.stack(sampling_value_list, dim=-2).flatten(-2) * attention_weights)
+ .sum(-1)
+ .view(batch_size, num_heads * hidden_dim, num_queries)
+ )
+ return output.transpose(1, 2).contiguous()
+
+
+class OmDetTurboLRUCache:
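+ """
+ Minimal least-recently-used cache used to store class and task (prompt) embeddings, so that prompts seen
+ before do not need to be re-encoded by the language backbone.
+ """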
+ def __init__(self, capacity: int):
+ self.cache = OrderedDict()
+ self.capacity = capacity
+ self.current_load = 0
+
+ def has(self, key) -> bool:
+ return key in self.cache
+
+ def get(self, key):
+ """
+ Get the value of the key if the key exists in the cache, otherwise return None.
+ Move the key to the end of the cache to show that it was recently used.
+ """
+ if key not in self.cache:
+ return None
+ self.cache.move_to_end(key)
+ return self.cache[key]
+
+ def put(self, key, value) -> None:
+ """
+ Add the key-value pair to the cache.
+ Move the key to the end of the cache to show that it was recently used.
+ If the cache is full, remove the first key (least recently used).
+ """
+ if key not in self.cache:
+ self.current_load += 1
+ if self.current_load > self.capacity:
+ self.cache.popitem(last=False)
+ self.current_load -= 1
+
+ self.cache[key] = value
+ self.cache.move_to_end(key)
+
+
+class OmDetTurboLanguageBackbone(nn.Module):
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+ self.model = AutoModel.from_config(config.text_config, attn_implementation=config._attn_implementation)
+ self.text_projection = nn.Parameter(torch.zeros(config.text_projection_in_dim, config.text_projection_out_dim))
+
+ def forward(self, hidden_states, mask=None, encode_type="task"):
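+ # "task" encoding returns per-token features truncated to the longest non-padded prompt in the batch,
+ # while "class" encoding pools the feature at the highest token id (the end-of-sequence token for
+ # CLIP-like tokenizers) and projects it with `text_projection`.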
+ text_outputs = self.model(hidden_states)
+ pooled_output = text_outputs[0]
+ if encode_type == "task":
+ if mask is None:
+ raise ValueError("mask is required for task encoding")
+ max_len = (mask != 0).sum(1).max().item()
+ truncated_mask = mask[:, :max_len]
+ truncated_output = pooled_output[:, :max_len, :]
+ return truncated_output.transpose(0, 1), truncated_mask
+ elif encode_type == "class":
+ max_pooled_output = pooled_output[torch.arange(pooled_output.shape[0]), hidden_states.argmax(dim=-1)]
+ projected_output = max_pooled_output @ self.text_projection
+ return projected_output
+ else:
+ raise ValueError(f"encode_type {encode_type} is not supported")
+
+
+class OmDetTurboVisionBackbone(nn.Module):
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+ self.apply_layernorm_after_vision_backbone = config.apply_layernorm_after_vision_backbone
+ self.vision_backbone = load_backbone(config)
+ self.layer_norms = nn.ModuleList(
+ [nn.LayerNorm(in_channel_dim, eps=config.layer_norm_eps) for in_channel_dim in config.encoder_in_channels]
+ )
+
+ def forward(self, pixel_values):
+ outputs = self.vision_backbone(pixel_values).feature_maps
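+ # when enabled, layer-norm each channels-last feature map and convert it to channels-first (NCHW)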
+ if self.apply_layernorm_after_vision_backbone:
+ outputs = [
+ layer_norm(output).permute(0, 3, 1, 2).contiguous()
+ for layer_norm, output in zip(self.layer_norms, outputs)
+ ]
+
+ return outputs
+
+
+# Copied from transformers.models.deformable_detr.modeling_deformable_detr.MultiScaleDeformableAttentionFunction
+class MultiScaleDeformableAttentionFunction(Function):
+ @staticmethod
+ def forward(
+ context,
+ value,
+ value_spatial_shapes,
+ value_level_start_index,
+ sampling_locations,
+ attention_weights,
+ im2col_step,
+ ):
+ context.im2col_step = im2col_step
+ output = MultiScaleDeformableAttention.ms_deform_attn_forward(
+ value,
+ value_spatial_shapes,
+ value_level_start_index,
+ sampling_locations,
+ attention_weights,
+ context.im2col_step,
+ )
+ context.save_for_backward(
+ value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights
+ )
+ return output
+
+ @staticmethod
+ @once_differentiable
+ def backward(context, grad_output):
+ (
+ value,
+ value_spatial_shapes,
+ value_level_start_index,
+ sampling_locations,
+ attention_weights,
+ ) = context.saved_tensors
+ grad_value, grad_sampling_loc, grad_attn_weight = MultiScaleDeformableAttention.ms_deform_attn_backward(
+ value,
+ value_spatial_shapes,
+ value_level_start_index,
+ sampling_locations,
+ attention_weights,
+ grad_output,
+ context.im2col_step,
+ )
+
+ return grad_value, None, None, grad_sampling_loc, grad_attn_weight, None
+
+
+# Copied from transformers.models.deformable_detr.modeling_deformable_detr.DeformableDetrMultiscaleDeformableAttention with DeformableDetr->OmDetTurbo, Deformable DETR->OmDet-Turbo
+class OmDetTurboMultiscaleDeformableAttention(nn.Module):
+ """
+ Multiscale deformable attention as proposed in Deformable DETR.
+ """
+
+ def __init__(self, config: OmDetTurboConfig, num_heads: int, n_points: int):
+ super().__init__()
+
+ kernel_loaded = MultiScaleDeformableAttention is not None
+ if is_torch_cuda_available() and is_ninja_available() and not kernel_loaded:
+ try:
+ load_cuda_kernels()
+ except Exception as e:
+ logger.warning(f"Could not load the custom kernel for multi-scale deformable attention: {e}")
+
+ if config.d_model % num_heads != 0:
+ raise ValueError(
+ f"embed_dim (d_model) must be divisible by num_heads, but got {config.d_model} and {num_heads}"
+ )
+ dim_per_head = config.d_model // num_heads
+ # check if dim_per_head is power of 2
+ if not ((dim_per_head & (dim_per_head - 1) == 0) and dim_per_head != 0):
+ warnings.warn(
+ "You'd better set embed_dim (d_model) in OmDetTurboMultiscaleDeformableAttention to make the"
+ " dimension of each attention head a power of 2 which is more efficient in the authors' CUDA"
+ " implementation."
+ )
+
+ self.im2col_step = 64
+
+ self.d_model = config.d_model
+ self.n_levels = config.num_feature_levels
+ self.n_heads = num_heads
+ self.n_points = n_points
+
+ self.sampling_offsets = nn.Linear(config.d_model, num_heads * self.n_levels * n_points * 2)
+ self.attention_weights = nn.Linear(config.d_model, num_heads * self.n_levels * n_points)
+ self.value_proj = nn.Linear(config.d_model, config.d_model)
+ self.output_proj = nn.Linear(config.d_model, config.d_model)
+
+ self.disable_custom_kernels = config.disable_custom_kernels
+
+ self._reset_parameters()
+
+ def _reset_parameters(self):
+ nn.init.constant_(self.sampling_offsets.weight.data, 0.0)
+ default_dtype = torch.get_default_dtype()
+ thetas = torch.arange(self.n_heads, dtype=torch.int64).to(default_dtype) * (2.0 * math.pi / self.n_heads)
+ grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
+ grid_init = (
+ (grid_init / grid_init.abs().max(-1, keepdim=True)[0])
+ .view(self.n_heads, 1, 1, 2)
+ .repeat(1, self.n_levels, self.n_points, 1)
+ )
+ for i in range(self.n_points):
+ grid_init[:, :, i, :] *= i + 1
+ with torch.no_grad():
+ self.sampling_offsets.bias = nn.Parameter(grid_init.view(-1))
+ nn.init.constant_(self.attention_weights.weight.data, 0.0)
+ nn.init.constant_(self.attention_weights.bias.data, 0.0)
+ nn.init.xavier_uniform_(self.value_proj.weight.data)
+ nn.init.constant_(self.value_proj.bias.data, 0.0)
+ nn.init.xavier_uniform_(self.output_proj.weight.data)
+ nn.init.constant_(self.output_proj.bias.data, 0.0)
+
+ def with_pos_embed(self, tensor: torch.Tensor, position_embeddings: Optional[Tensor]):
+ return tensor if position_embeddings is None else tensor + position_embeddings
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: Optional[torch.Tensor] = None,
+ encoder_hidden_states=None,
+ encoder_attention_mask=None,
+ position_embeddings: Optional[torch.Tensor] = None,
+ reference_points=None,
+ spatial_shapes=None,
+ spatial_shapes_list=None,
+ level_start_index=None,
+ output_attentions: bool = False,
+ ):
+ # add position embeddings to the hidden states before projecting to queries and keys
+ if position_embeddings is not None:
+ hidden_states = self.with_pos_embed(hidden_states, position_embeddings)
+
+ batch_size, num_queries, _ = hidden_states.shape
+ batch_size, sequence_length, _ = encoder_hidden_states.shape
+ # Ignore copy
+ total_elements = sum([shape[0] * shape[1] for shape in spatial_shapes_list])
+ if total_elements != sequence_length:
+ raise ValueError(
+ "Make sure to align the spatial shapes with the sequence length of the encoder hidden states"
+ )
+
+ value = self.value_proj(encoder_hidden_states)
+ if attention_mask is not None:
+ # we invert the attention_mask
+ value = value.masked_fill(~attention_mask[..., None], float(0))
+ value = value.view(batch_size, sequence_length, self.n_heads, self.d_model // self.n_heads)
+ sampling_offsets = self.sampling_offsets(hidden_states).view(
+ batch_size, num_queries, self.n_heads, self.n_levels, self.n_points, 2
+ )
+ attention_weights = self.attention_weights(hidden_states).view(
+ batch_size, num_queries, self.n_heads, self.n_levels * self.n_points
+ )
+ attention_weights = F.softmax(attention_weights, -1).view(
+ batch_size, num_queries, self.n_heads, self.n_levels, self.n_points
+ )
+ # batch_size, num_queries, n_heads, n_levels, n_points, 2
+ num_coordinates = reference_points.shape[-1]
+ if num_coordinates == 2:
+ offset_normalizer = torch.stack([spatial_shapes[..., 1], spatial_shapes[..., 0]], -1)
+ sampling_locations = (
+ reference_points[:, :, None, :, None, :]
+ + sampling_offsets / offset_normalizer[None, None, None, :, None, :]
+ )
+ elif num_coordinates == 4:
+ sampling_locations = (
+ reference_points[:, :, None, :, None, :2]
+ + sampling_offsets / self.n_points * reference_points[:, :, None, :, None, 2:] * 0.5
+ )
+ else:
+ raise ValueError(f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]}")
+
+ if self.disable_custom_kernels:
+ # PyTorch implementation
+ output = multi_scale_deformable_attention(
+ value, spatial_shapes_list, sampling_locations, attention_weights
+ )
+ else:
+ try:
+ # custom kernel
+ output = MultiScaleDeformableAttentionFunction.apply(
+ value,
+ spatial_shapes,
+ level_start_index,
+ sampling_locations,
+ attention_weights,
+ self.im2col_step,
+ )
+ except Exception:
+ # PyTorch implementation
+ output = multi_scale_deformable_attention(
+ value, spatial_shapes_list, sampling_locations, attention_weights
+ )
+ output = self.output_proj(output)
+
+ return output, attention_weights
+
+
+# Copied from transformers.models.rt_detr.modeling_rt_detr.RTDetrConvNormLayer with RTDetr->OmDetTurbo
+class OmDetTurboConvNormLayer(nn.Module):
+ def __init__(self, config, in_channels, out_channels, kernel_size, stride, padding=None, activation=None):
+ super().__init__()
+ self.conv = nn.Conv2d(
+ in_channels,
+ out_channels,
+ kernel_size,
+ stride,
+ padding=(kernel_size - 1) // 2 if padding is None else padding,
+ bias=False,
+ )
+ self.norm = nn.BatchNorm2d(out_channels, config.batch_norm_eps)
+ self.activation = nn.Identity() if activation is None else ACT2CLS[activation]()
+
+ def forward(self, hidden_state):
+ hidden_state = self.conv(hidden_state)
+ hidden_state = self.norm(hidden_state)
+ hidden_state = self.activation(hidden_state)
+ return hidden_state
+
+
+# Copied from transformers.models.rt_detr.modeling_rt_detr.RTDetrRepVggBlock with RTDetr->OmDetTurbo, activation_function->csp_activation
+class OmDetTurboRepVggBlock(nn.Module):
+ """
+ RepVGG architecture block introduced by the work "RepVGG: Making VGG-style ConvNets Great Again".
+ """
+
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+
+ activation = config.csp_activation
+ hidden_channels = int(config.encoder_hidden_dim * config.hidden_expansion)
+ self.conv1 = OmDetTurboConvNormLayer(config, hidden_channels, hidden_channels, 3, 1, padding=1)
+ self.conv2 = OmDetTurboConvNormLayer(config, hidden_channels, hidden_channels, 1, 1, padding=0)
+ self.activation = nn.Identity() if activation is None else ACT2CLS[activation]()
+
+ def forward(self, x):
+ y = self.conv1(x) + self.conv2(x)
+ return self.activation(y)
+
+
+# Copied from transformers.models.rt_detr.modeling_rt_detr.RTDetrCSPRepLayer with RTDetr->OmDetTurbo, activation_function->csp_activation
+class OmDetTurboCSPRepLayer(nn.Module):
+ """
+ Cross Stage Partial (CSP) network layer with RepVGG blocks.
+ """
+
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+
+ in_channels = config.encoder_hidden_dim * 2
+ out_channels = config.encoder_hidden_dim
+ num_blocks = 3
+ activation = config.csp_activation
+
+ hidden_channels = int(out_channels * config.hidden_expansion)
+ self.conv1 = OmDetTurboConvNormLayer(config, in_channels, hidden_channels, 1, 1, activation=activation)
+ self.conv2 = OmDetTurboConvNormLayer(config, in_channels, hidden_channels, 1, 1, activation=activation)
+ self.bottlenecks = nn.Sequential(*[OmDetTurboRepVggBlock(config) for _ in range(num_blocks)])
+ if hidden_channels != out_channels:
+ self.conv3 = OmDetTurboConvNormLayer(config, hidden_channels, out_channels, 1, 1, activation=activation)
+ else:
+ self.conv3 = nn.Identity()
+
+ def forward(self, hidden_state):
+ device = hidden_state.device
+ hidden_state_1 = self.conv1(hidden_state)
+ hidden_state_1 = self.bottlenecks(hidden_state_1).to(device)
+ hidden_state_2 = self.conv2(hidden_state).to(device)
+ return self.conv3(hidden_state_1 + hidden_state_2)
+
+
+class OmDetTurboMultiheadAttention(nn.Module):
+ """Equivalent implementation of nn.MultiheadAttention with `batch_first=True`."""
+
+ def __init__(self, config, hidden_size, num_attention_heads, dropout):
+ super().__init__()
+ if hidden_size % num_attention_heads != 0:
+ raise ValueError(
+ f"The hidden size ({hidden_size}) is not a multiple of the number of attention "
+ f"heads ({num_attention_heads})"
+ )
+ self.num_attention_heads = num_attention_heads
+ self.attention_head_size = int(hidden_size / num_attention_heads)
+ self.all_head_size = self.num_attention_heads * self.attention_head_size
+ self.query = nn.Linear(hidden_size, self.all_head_size)
+ self.key = nn.Linear(hidden_size, self.all_head_size)
+ self.value = nn.Linear(hidden_size, self.all_head_size)
+ self.out_proj = nn.Linear(hidden_size, hidden_size)
+ self.dropout = nn.Dropout(dropout)
+
+ def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
+ new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
+ x = x.view(new_x_shape)
+ return x.permute(0, 2, 1, 3)
+
+ def forward(
+ self,
+ queries: torch.Tensor,
+ keys: torch.Tensor,
+ values: torch.Tensor,
+ attention_mask: Optional[torch.FloatTensor] = None,
+ output_attentions: Optional[bool] = False,
+ ) -> Tuple[torch.Tensor]:
+ query_layer = self.transpose_for_scores(self.query(queries))
+ key_layer = self.transpose_for_scores(self.key(keys))
+ value_layer = self.transpose_for_scores(self.value(values))
+
+ # Take the dot product between "query" and "key" to get the raw attention scores.
+ attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
+ attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+
+ if attention_mask is not None:
+ attention_scores = attention_scores + attention_mask
+
+ # Normalize the attention scores to probabilities.
+ attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+
+ # This is actually dropping out entire tokens to attend to, which might
+ # seem a bit unusual, but is taken from the original Transformer paper.
+ attention_probs = self.dropout(attention_probs)
+
+ context_layer = torch.matmul(attention_probs, value_layer)
+
+ context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+ new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+ context_layer = context_layer.view(new_context_layer_shape)
+
+ context_layer = self.out_proj(context_layer)
+
+ outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
+
+ return outputs
+
+
+class OmDetTurboEncoderLayer(nn.Module):
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+ self.self_attn = OmDetTurboMultiheadAttention(
+ config,
+ hidden_size=config.encoder_hidden_dim,
+ num_attention_heads=config.num_attention_heads,
+ dropout=config.encoder_dropout,
+ )
+ self.self_attn_layer_norm = nn.LayerNorm(config.encoder_hidden_dim, eps=config.layer_norm_eps)
+ self.dropout = nn.Dropout(config.encoder_dropout)
+ self.activation_fn = ACT2FN[config.encoder_feedforward_activation]
+ self.encoder_feedforward_dropout = nn.Dropout(config.encoder_feedforward_dropout)
+ self.fc1 = nn.Linear(config.encoder_hidden_dim, config.encoder_dim_feedforward)
+ self.fc2 = nn.Linear(config.encoder_dim_feedforward, config.encoder_hidden_dim)
+ self.final_layer_norm = nn.LayerNorm(config.encoder_hidden_dim, eps=config.layer_norm_eps)
+
+ @staticmethod
+ def with_pos_embed(tensor, pos_embed):
+ return tensor if pos_embed is None else tensor + pos_embed
+
+ def forward(
+ self,
+ hidden_states: torch.Tensor,
+ attention_mask: torch.Tensor,
+ position_embeddings: torch.Tensor = None,
+ output_attentions: bool = False,
+ ):
+ """
+ Args:
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
+ attention_mask (`torch.FloatTensor`): attention mask of size
+ `(batch, 1, target_len, source_len)` where padding elements are indicated by very large negative
+ values.
+ position_embeddings (`torch.FloatTensor`, *optional*):
+ Object queries (also called content embeddings), to be added to the hidden states.
+ output_attentions (`bool`, *optional*, defaults to `False`):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+ returned tensors for more detail.
+ """
+ residual = hidden_states
+ query = key = self.with_pos_embed(hidden_states, position_embeddings)
+
+ hidden_states = self.self_attn(
+ queries=query,
+ keys=key,
+ values=hidden_states,
+ attention_mask=attention_mask,
+ output_attentions=output_attentions,
+ )
+ hidden_states, attentions = hidden_states if output_attentions else (hidden_states[0], None)
+ hidden_states = self.dropout(hidden_states)
+ hidden_states = residual + hidden_states
+ hidden_states = self.self_attn_layer_norm(hidden_states)
+ residual = hidden_states
+ hidden_states = self.activation_fn(self.fc1(hidden_states))
+ hidden_states = self.encoder_feedforward_dropout(hidden_states)
+ hidden_states = self.fc2(hidden_states)
+ hidden_states = self.dropout(hidden_states)
+ hidden_states = residual + hidden_states
+ hidden_states = self.final_layer_norm(hidden_states)
+ if self.training:
+ if torch.isinf(hidden_states).any() or torch.isnan(hidden_states).any():
+ clamp_value = torch.finfo(hidden_states.dtype).max - 1000
+ hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
+
+ if output_attentions:
+ return hidden_states, attentions
+
+ return (hidden_states,)
+
+
+class OmDetTurboEncoder(nn.Module):
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+
+ self.layers = nn.ModuleList([OmDetTurboEncoderLayer(config) for _ in range(config.encoder_layers)])
+
+ def forward(
+ self, src, src_mask=None, pos_embed=None, output_attentions: bool = False
+ ) -> Tuple[Union[torch.Tensor, Tuple[torch.Tensor]]]:
+ hidden_states = src
+ attention = () if output_attentions else None
+ for layer in self.layers:
+ hidden_states = layer(
+ hidden_states,
+ attention_mask=src_mask,
+ position_embeddings=pos_embed,
+ output_attentions=output_attentions,
+ )
+ if output_attentions:
+ attention = attention + (hidden_states[1],)
+ hidden_states = hidden_states[0]
+
+ return hidden_states, attention
+
+
+class OmDetTurboHybridEncoder(nn.Module):
+ """
+ Encoder consisting of channel projection layers, a set of `OmDetTurboEncoder`, a top-down Feature Pyramid Network
+ (FPN) and a bottom-up Path Aggregation Network (PAN). More details on the paper: https://arxiv.org/abs/2304.08069
+
+ Args:
+ config: OmDetTurboConfig
+ """
+
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__()
+ self.config = config
+ self.in_channels = config.encoder_in_channels
+ self.encoder_hidden_dim = config.encoder_hidden_dim
+ self.encoder_projection_indices = config.encoder_projection_indices
+ self.positional_encoding_temperature = config.positional_encoding_temperature
+ self.eval_size = config.eval_size
+ self.out_channels = [self.encoder_hidden_dim for _ in self.in_channels]
+
+ self.channel_projection_layers = nn.ModuleList()
+ for in_channel in self.in_channels:
+ self.channel_projection_layers.append(
+ nn.Sequential(
+ nn.Conv2d(in_channel, self.encoder_hidden_dim, kernel_size=(1, 1), bias=False),
+ nn.BatchNorm2d(self.encoder_hidden_dim),
+ )
+ )
+
+ # encoder transformer
+ self.encoder = nn.ModuleList([OmDetTurboEncoder(config) for _ in range(len(self.encoder_projection_indices))])
+ # top-down fpn
+ self.lateral_convs = nn.ModuleList()
+ self.fpn_blocks = nn.ModuleList()
+ for _ in range(len(self.in_channels) - 1, 0, -1):
+ self.lateral_convs.append(
+ OmDetTurboConvNormLayer(
+ config,
+ in_channels=self.encoder_hidden_dim,
+ out_channels=self.encoder_hidden_dim,
+ kernel_size=1,
+ stride=1,
+ activation=config.conv_norm_activation,
+ )
+ )
+ self.fpn_blocks.append(OmDetTurboCSPRepLayer(config))
+
+ # bottom-up pan
+ self.downsample_convs = nn.ModuleList()
+ self.pan_blocks = nn.ModuleList()
+ for _ in range(len(self.in_channels) - 1):
+ self.downsample_convs.append(
+ OmDetTurboConvNormLayer(
+ config,
+ in_channels=self.encoder_hidden_dim,
+ out_channels=self.encoder_hidden_dim,
+ kernel_size=3,
+ stride=2,
+ activation=config.conv_norm_activation,
+ )
+ )
+ self.pan_blocks.append(OmDetTurboCSPRepLayer(config))
+
+ @staticmethod
+ def build_2d_sincos_position_embedding(
+ width, height, embed_dim=256, temperature=10000.0, device="cpu", dtype=torch.float32
+ ):
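+ # returns sinusoidal position embeddings of shape (1, height*width, embed_dim)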
+ grid_w = torch.arange(int(width), dtype=dtype, device=device)
+ grid_h = torch.arange(int(height), dtype=dtype, device=device)
+ grid_w, grid_h = torch.meshgrid(grid_w, grid_h, indexing="ij")
+ if embed_dim % 4 != 0:
+ raise ValueError("Embed dimension must be divisible by 4 for 2D sin-cos position embedding")
+ pos_dim = embed_dim // 4
+ omega = torch.arange(pos_dim, dtype=dtype, device=device) / pos_dim
+ omega = 1.0 / (temperature**omega)
+
+ out_w = grid_w.flatten()[..., None] @ omega[None]
+ out_h = grid_h.flatten()[..., None] @ omega[None]
+
+ return torch.concat([out_w.sin(), out_w.cos(), out_h.sin(), out_h.cos()], dim=1)[None, :, :]
+
+ def forward(
+ self,
+ inputs_embeddings=None,
+ output_attentions=None,
+ output_hidden_states=None,
+ return_dict=None,
+ ):
+ r"""
+ Args:
+ inputs_embeddings (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
+ Flattened feature map (output of the backbone + projection layers) that is passed to the encoder.
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+ returned tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+ for more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain tuple.
+ """
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ hidden_states = inputs_embeddings
+
+ encoder_states = () if output_hidden_states else None
+ all_attentions = () if output_attentions else None
+ # get projection features
+ projected_features = [self.channel_projection_layers[i](feature) for i, feature in enumerate(hidden_states)]
+ # encoder
+ for encoder_layer_index, feature_to_project_index in enumerate(self.encoder_projection_indices):
+ if output_hidden_states:
+ encoder_states = encoder_states + (projected_features[feature_to_project_index],)
+ height, width = projected_features[feature_to_project_index].shape[2:]
+ # flatten [batch, channel, height, width] to [batch, height*width, channel]
+ src_flatten = projected_features[feature_to_project_index].flatten(2).permute(0, 2, 1)
+ if self.training or self.eval_size is None:
+ pos_embed = self.build_2d_sincos_position_embedding(
+ width,
+ height,
+ self.encoder_hidden_dim,
+ self.positional_encoding_temperature,
+ device=src_flatten.device,
+ dtype=src_flatten.dtype,
+ ).to(src_flatten.device, src_flatten.dtype)
+ else:
+ pos_embed = None
+ layer_outputs = self.encoder[encoder_layer_index](
+ src_flatten,
+ pos_embed=pos_embed,
+ output_attentions=output_attentions,
+ )
+ projected_features[feature_to_project_index] = (
+ layer_outputs[0].permute(0, 2, 1).reshape(-1, self.encoder_hidden_dim, height, width).contiguous()
+ )
+
+ if output_attentions:
+ all_attentions = all_attentions + (layer_outputs[1],)
+
+ if output_hidden_states:
+ encoder_states = encoder_states + (projected_features[feature_to_project_index],)
+
+ # Feature Pyramid Network (FPN)
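+ # top-down path: upsample the coarser map and fuse it with the next finer projected feature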
+ fpn_feature_maps = [projected_features[-1]]
+ for idx in range(len(self.in_channels) - 1, 0, -1):
+ feat_high = fpn_feature_maps[0]
+ feat_low = projected_features[idx - 1]
+ feat_high = self.lateral_convs[len(self.in_channels) - 1 - idx](feat_high)
+ fpn_feature_maps[0] = feat_high
+ upsample_feat = F.interpolate(feat_high, scale_factor=2.0, mode="nearest")
+ fps_map = self.fpn_blocks[len(self.in_channels) - 1 - idx](torch.concat([upsample_feat, feat_low], dim=1))
+ fpn_feature_maps.insert(0, fps_map)
+
+ # Path Aggregation Network (PAN)
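+ # bottom-up path: downsample the finer map and fuse it with the next coarser FPN output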
+ fpn_states = [fpn_feature_maps[0]]
+ for idx in range(len(self.in_channels) - 1):
+ feat_low = fpn_states[-1]
+ feat_high = fpn_feature_maps[idx + 1]
+ downsample_feat = self.downsample_convs[idx](feat_low)
+ hidden_states = self.pan_blocks[idx](
+ torch.concat([downsample_feat, feat_high.to(downsample_feat.device)], dim=1)
+ )
+ fpn_states.append(hidden_states)
+ if not return_dict:
+ return (fpn_states[-1], encoder_states, all_attentions, fpn_states)
+ return OmDetTurboEncoderOutput(
+ last_hidden_state=fpn_states[-1],
+ hidden_states=encoder_states,
+ attentions=all_attentions,
+ extracted_states=fpn_states,
+ )
+
+
+class OmDetTurboMLPWithDropout(nn.Module):
+ def __init__(self, config):
+ super().__init__()
+ self.linear1 = nn.Linear(config.class_embed_dim, config.task_encoder_hidden_dim)
+ self.activation = ACT2FN[config.decoder_activation]
+ self.dropout = nn.Dropout(config.decoder_dropout)
+ self.linear2 = nn.Linear(config.task_encoder_hidden_dim, config.class_embed_dim)
+
+ def forward(self, x):
+ return self.linear2(self.dropout(self.activation(self.linear1(x))))
+
+
+class OmDetTurboMLP(nn.Module):
+ """Very simple multi-layer perceptron (also called FFN)"""
+
+ def __init__(self, input_dim, hidden_dim, output_dim, num_layers):
+ super().__init__()
+ self.num_layers = num_layers
+ hidden_layers_dims = [hidden_dim] * (num_layers - 1)
+ layers_dims = [input_dim] + hidden_layers_dims + [output_dim]
+ self.layers = nn.ModuleList(
+ [nn.Linear(in_dim, out_dim) for in_dim, out_dim in zip(layers_dims[:-1], layers_dims[1:])]
+ )
+
+ def forward(self, x):
+ for i, layer in enumerate(self.layers):
+ x = F.relu(layer(x)) if i < self.num_layers - 1 else layer(x)
+ return x
+
+
+class OmDetTurboResidualLayer(nn.Module):
+ """
+ A residual connection followed by a layer norm.
+ """
+
+ def __init__(self, config):
+ super().__init__()
+ self.norm1 = nn.LayerNorm(config.class_embed_dim, eps=config.layer_norm_eps)
+ self.dropout = nn.Dropout(config.decoder_dropout)
+
+ def forward(self, x, y):
+ return self.norm1(x + self.dropout(y))
+
+
+class OmDetTurboTaskEncoder(nn.Module):
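+ """
+ MLP with a residual connection and layer norm, applied to the task (prompt) features before they are fused
+ with the object queries in the decoder.
+ """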
+ def __init__(self, config):
+ super().__init__()
+ self.mlp = OmDetTurboMLPWithDropout(config)
+ self.res1 = OmDetTurboResidualLayer(config)
+
+ def forward(self, x):
+ mlp_out = self.mlp(x)
+ x = self.res1(x, mlp_out)
+ return x
+
+
+class OmDetTurboDeformableTransformerDecoderLayer(nn.Module):
+ """
+ A single layer of the Deformable Transformer Decoder.
+ """
+
+ def __init__(self, config):
+ super().__init__()
+ # self attention
+ self.self_attn = OmDetTurboMultiheadAttention(
+ config,
+ hidden_size=config.decoder_hidden_dim,
+ num_attention_heads=config.decoder_num_heads,
+ dropout=config.decoder_dropout,
+ )
+ self.dropout1 = nn.Dropout(config.decoder_dropout)
+ self.norm1 = nn.LayerNorm(config.decoder_hidden_dim, eps=config.layer_norm_eps)
+
+ # cross attention
+ self.cross_attn = OmDetTurboMultiscaleDeformableAttention(
+ config, num_heads=config.decoder_num_heads, n_points=config.decoder_num_points
+ )
+ self.dropout2 = nn.Dropout(config.decoder_dropout)
+ self.norm2 = nn.LayerNorm(config.decoder_hidden_dim, eps=config.layer_norm_eps)
+
+ # feed forward network
+ self.linear1 = nn.Linear(config.decoder_hidden_dim, config.decoder_dim_feedforward)
+ self.act = ACT2FN[config.decoder_activation]
+ self.dropout3 = nn.Dropout(config.decoder_dropout)
+ self.linear2 = nn.Linear(config.decoder_dim_feedforward, config.decoder_hidden_dim)
+ self.dropout4 = nn.Dropout(config.decoder_dropout)
+ self.norm3 = nn.LayerNorm(config.decoder_hidden_dim, eps=config.layer_norm_eps)
+
+ self.output_attentions = config.output_attentions
+ self.output_hidden_states = config.output_hidden_states
+
+ @staticmethod
+ def with_pos_embed(tensor, pos):
+ return tensor if pos is None else tensor + pos
+
+ def forward(
+ self,
+ decoder_embeddings,
+ task_features,
+ reference_points,
+ vision_features,
+ vision_shapes,
+ vision_shapes_list,
+ level_start_index=None,
+ attention_mask=None,
+ padding_mask=None,
+ query_position=None,
+ output_attentions=None,
+ output_hidden_states=None,
+ ):
+ output_attentions = output_attentions if output_attentions is not None else self.output_attentions
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.output_hidden_states
+
+ origin_embedding_len = decoder_embeddings.shape[1]
+
+ # self attention
+ query = key = self.with_pos_embed(decoder_embeddings, query_position)
+ # combine task_features with query, key, value
+ task_features = task_features.transpose(0, 1)
+ query = torch.cat((query, task_features), dim=1)
+ key = torch.cat((key, task_features), dim=1)
+ decoder_embeddings = torch.cat((decoder_embeddings, task_features), dim=1)
+
+ outputs = self.self_attn(
+ query,
+ key,
+ decoder_embeddings,
+ attention_mask=attention_mask,
+ output_attentions=output_attentions,
+ )
+ context, self_attention = outputs if output_attentions else (outputs[0], None)
+ decoder_embeddings = decoder_embeddings + self.dropout1(context)
+ decoder_embeddings = self.norm1(decoder_embeddings)
+
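+ # split the fused sequence back into task features and object query embeddings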
+ task_features = decoder_embeddings[:, origin_embedding_len:, :].transpose(0, 1)
+ decoder_embeddings = decoder_embeddings[:, :origin_embedding_len, :]
+
+ # cross attention
+ hidden_states = self.with_pos_embed(decoder_embeddings, query_position)
+ reference_points = reference_points.unsqueeze(2)
+ outputs, cross_attention = self.cross_attn(
+ hidden_states=hidden_states,
+ attention_mask=padding_mask,
+ encoder_hidden_states=vision_features,
+ reference_points=reference_points,
+ spatial_shapes=vision_shapes,
+ spatial_shapes_list=vision_shapes_list,
+ level_start_index=level_start_index,
+ )
+ decoder_embeddings = decoder_embeddings + self.dropout2(outputs)
+ residual = self.norm2(decoder_embeddings)
+
+ # feed forward network
+ decoder_embeddings = self.linear2(self.dropout3(self.act(self.linear1(residual))))
+ decoder_embeddings = residual + self.dropout4(decoder_embeddings)
+ decoder_embeddings = self.norm3(decoder_embeddings)
+
+ return (
+ decoder_embeddings,
+ task_features,
+ self_attention if output_attentions else None,
+ cross_attention if output_attentions else None,
+ )
+
+
+class OmDetTurboPreTrainedModel(PreTrainedModel):
+ config_class = OmDetTurboConfig
+ base_model_prefix = "model"
+ main_input_name = "pixel_values"
+
+ def _init_weights(self, module):
+ def linear_init_(module_to_init):
+ bound = 1 / math.sqrt(module_to_init.weight.shape[0])
+ nn.init.uniform_(module_to_init.weight, -bound, bound)
+ if hasattr(module_to_init, "bias") and module_to_init.bias is not None:
+ nn.init.uniform_(module_to_init.bias, -bound, bound)
+
+ if isinstance(module, OmDetTurboEncoderLayer):
+ linear_init_(module.fc1)
+ linear_init_(module.fc2)
+ elif isinstance(module, OmDetTurboDecoder):
+ nn.init.constant_(module.encoder_bbox_head.layers[-1].weight, 0.0)
+ nn.init.constant_(module.encoder_bbox_head.layers[-1].bias, 0.0)
+ for mlp in module.decoder_bbox_head:
+ nn.init.constant_(mlp.layers[-1].weight, 0.0)
+ nn.init.constant_(mlp.layers[-1].bias, 0.0)
+ linear_init_(module.encoder_vision_features[0])
+ nn.init.xavier_uniform_(module.encoder_vision_features[0].weight)
+ if module.learn_initial_query:
+ nn.init.xavier_uniform_(module.tgt_embed.weight)
+ nn.init.xavier_uniform_(module.query_position_head.layers[0].weight)
+ nn.init.xavier_uniform_(module.query_position_head.layers[1].weight)
+ for layer in module.channel_projection_layers:
+ nn.init.xavier_uniform_(layer[0].weight)
+ elif isinstance(module, (nn.Linear, nn.Conv2d, nn.BatchNorm2d)):
+ module.weight.data.normal_(mean=0.0, std=self.config.init_std)
+ if module.bias is not None:
+ module.bias.data.zero_()
+
+ def _set_gradient_checkpointing(self, module, value=False):
+ if isinstance(module, OmDetTurboDecoder):
+ module.gradient_checkpointing = value
+
+ @staticmethod
+ def _get_cache_key_at_index(input_ids, attention_mask, index):
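+ """Build a hashable cache key from the non-padded token ids of the sequence at `index`."""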
+ input_ids = input_ids[index]
+ input_mask = attention_mask[index]
+ cache_key = tuple(input_ids[input_mask != 0].tolist())
+ return cache_key
+
+ def get_cached_class_embeddings(self, classes_input_ids, classes_attention_mask):
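+ """Return one embedding per class prompt, encoding only the prompts missing from the class LRU cache."""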
+ not_cached_index = []
+ not_cached_classes = []
+ total_embeddings = []
+ for idx, _ in enumerate(classes_input_ids):
+ cache_key = self._get_cache_key_at_index(classes_input_ids, classes_attention_mask, idx)
+ if self.language_cache_class.has(cache_key):
+ total_embeddings.append(self.language_cache_class.get(cache_key))
+ else:
+ total_embeddings.append(None)
+ not_cached_index.append(idx)
+ not_cached_classes.append(cache_key)
+
+ if not_cached_classes:
+ not_cached_classes_ids = torch.stack([classes_input_ids[idx] for idx in not_cached_index])
+ embeddings = self.language_backbone(not_cached_classes_ids, encode_type="class")
+ for idx, emb in enumerate(embeddings):
+ idx_to_put = not_cached_index[idx]
+ total_embeddings[idx_to_put] = emb
+ self.language_cache_class.put(not_cached_classes[idx], emb)
+
+ total_class_embs = torch.stack(total_embeddings).to(self.device)
+ return total_class_embs
+
+ def get_cached_task_embeddings(self, tasks_input_ids, tasks_attention_mask):
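+ """Return padded task features and masks, encoding only the prompts missing from the task LRU cache."""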
+ not_cached_index = []
+ not_cached_tasks = []
+ total_task_features = []
+ total_task_masks = []
+ for idx, _ in enumerate(tasks_input_ids):
+ cache_key = self._get_cache_key_at_index(tasks_input_ids, tasks_attention_mask, idx)
+ if self.language_cache_prompt.has(cache_key):
+ task_feature, task_mask = self.language_cache_prompt.get(cache_key)
+ total_task_features.append(task_feature)
+ total_task_masks.append(task_mask)
+ else:
+ total_task_features.append(None)
+ total_task_masks.append(None)
+ not_cached_index.append(idx)
+ not_cached_tasks.append(cache_key)
+
+ if not_cached_tasks:
+ not_cached_index_ids = torch.stack([tasks_input_ids[idx] for idx in not_cached_index])
+ not_cached_mask = torch.stack([tasks_attention_mask[idx] for idx in not_cached_index])
+ embeddings, masks = self.language_backbone(not_cached_index_ids, mask=not_cached_mask, encode_type="task")
+
+ for idx in range(embeddings.shape[1]):
+ emb = embeddings[:, [idx], :]
+ idx_to_put = not_cached_index[idx]
+ cur_mask = torch.unsqueeze(masks[idx], dim=0).to(self.device)
+ total_task_features[idx_to_put] = emb
+ total_task_masks[idx_to_put] = cur_mask
+ self.language_cache_prompt.put(not_cached_tasks[idx], (emb, cur_mask))
+
+ # pad before concat if needed
+ max_len = max([task.shape[0] for task in total_task_features])
+ for idx, task in enumerate(total_task_features):
+ if task.shape[0] < max_len:
+ pad_size = max_len - task.shape[0]
+ total_task_features[idx] = F.pad(task, (0, 0, 0, 0, 0, pad_size))
+ total_task_masks[idx] = F.pad(total_task_masks[idx], (0, pad_size))
+
+ total_task_features = torch.cat(total_task_features, dim=1).to(self.device)
+ total_task_masks = torch.cat(total_task_masks, dim=0).to(self.device)
+
+ return total_task_features, total_task_masks
+
+ def get_language_embedding(
+ self,
+ classes_input_ids,
+ classes_attention_mask,
+ tasks_input_ids,
+ tasks_attention_mask,
+ classes_structure,
+ ):
+ batched_classes_embeddings = self.get_cached_class_embeddings(classes_input_ids, classes_attention_mask)
+ # regroup class embeddings using saved structure
+ max_class_size = torch.max(classes_structure)
+ class_embeddings_regrouped = []
+ start = 0
+ for size in classes_structure:
+ pad_size = max_class_size - size
+ class_embeddings_regrouped.append(
+ F.pad(batched_classes_embeddings[start : start + size], (0, 0, 0, pad_size)).unsqueeze(1)
+ )
+ start += size
+ class_embeddings = torch.cat(class_embeddings_regrouped, dim=1)
+
+ task_embeddings, task_mask = self.get_cached_task_embeddings(tasks_input_ids, tasks_attention_mask)
+
+ return class_embeddings, task_embeddings, task_mask
+
+
+OMDET_TURBO_START_DOCSTRING = r"""
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
+ etc.)
+
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
+ and behavior.
+
+ Parameters:
+ config ([`OmDetTurboConfig`]):
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
+ load the weights associated with the model, only the configuration. Check out the
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
+"""
+
+OMDET_TURBO_INPUTS_DOCSTRING = r"""
+ Args:
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
+ Pixel values. Padding will be ignored by default should you provide it.
+
+ Pixel values can be obtained using [`AutoImageProcessor`]. See [`DetrImageProcessor.__call__`] for
+ details.
+
+ classes_input_ids (`torch.LongTensor` of shape `(total_classes (>= batch_size), sequence_length)`):
+ Indices of input classes sequence tokens in the vocabulary of the language model.
+ Several classes can be provided for each task, so the tokenized classes are flattened
+ and their structure is provided in the `classes_structure` argument.
+
+ Indices can be obtained using [`OmDetTurboProcessor`]. See [`OmDetTurboProcessor.__call__`] for
+ details.
+
+ [What are input IDs?](../glossary#input-ids)
+
+ classes_attention_mask (`torch.BoolTensor` of shape `(total_classes (>= batch_size), sequence_length)`):
+ Attention mask for the classes. This is a binary mask that indicates which tokens should be attended to,
+ and which should not.
+
+ tasks_input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+ Indices of input tasks sequence tokens in the vocabulary of the language model.
+
+ Indices can be obtained using [`OmDetTurboProcessor`]. See [`OmDetTurboProcessor.__call__`] for
+ details.
+
+ [What are input IDs?](../glossary#input-ids)
+
+ tasks_attention_mask (`torch.BoolTensor` of shape `(batch_size, sequence_length)`):
+ Attention mask for the tasks. This is a binary mask that indicates which tokens should be attended to,
+ and which should not.
+
+ classes_structure (`torch.LongTensor` of shape `(batch_size)`):
+ Structure of the classes. This tensor indicates the number of classes for each task.
+
+ output_attentions (`bool`, *optional*):
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
+ tensors for more detail.
+ output_hidden_states (`bool`, *optional*):
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
+ more detail.
+ return_dict (`bool`, *optional*):
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+
+ """
+
+
+def _cosine_similarity_scaled(a, b, logit_scale):
+ a = a / a.norm(dim=2, keepdim=True).clamp_min(1e-12)
+ b = b / b.norm(dim=1, keepdim=True).clamp_min(1e-12)
+ logit_scale = logit_scale.exp()
+ logits_per_image = logit_scale * torch.bmm(a, b)
+ return logits_per_image
+
+
+def get_class_similarity(class_distance_type, cls_feature, class_proj):
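+ # CLIP-style temperature: cosine similarities are scaled by exp(log(1/0.07)) = 1/0.07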
+ logit_scale = torch.tensor(1 / 0.07).log()
+ if class_distance_type == "cosine":
+ class_logits = _cosine_similarity_scaled(cls_feature, class_proj, logit_scale)
+ elif class_distance_type == "dot":
+ class_logits = torch.bmm(cls_feature, class_proj)
+ else:
+ raise ValueError(f"Unknown class_distance_type {class_distance_type}")
+ return class_logits
+
+
+def _inverse_sigmoid(x, eps=1e-5):
+ x = x.clamp(min=0, max=1)
+ x1 = x.clamp(min=eps)
+ x2 = (1 - x).clamp(min=eps)
+ return torch.log(x1 / x2)
+
+
+class OmDetTurboDecoder(OmDetTurboPreTrainedModel):
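+ """
+ Deformable transformer decoder that selects the top-`num_queries` encoder proposals as initial object
+ queries and iteratively refines their boxes and class similarities while fusing the task (prompt) features.
+ """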
+ def __init__(self, config: OmDetTurboConfig):
+ self.config = config
+ super().__init__(config)
+ self.gradient_checkpointing = False
+
+ hidden_dim = config.decoder_hidden_dim
+ self.num_queries = config.num_queries
+ self.class_distance_type = config.class_distance_type
+ self.learn_initial_query = config.learn_initial_query
+
+ # backbone feature projection
+ self.channel_projection_layers = nn.ModuleList(
+ nn.Sequential(nn.Conv2d(x, hidden_dim, 1, bias=False), nn.BatchNorm2d(hidden_dim))
+ for x in config.vision_features_channels
+ )
+ self.task_encoder = OmDetTurboTaskEncoder(config)
+ # default to None so the attribute always exists when checked in `forward`
+ self.task_project = None
+ if config.class_embed_dim != hidden_dim:
+ self.task_project = nn.Linear(config.class_embed_dim, hidden_dim)
+
+ # Transformer module
+ self.layers = nn.ModuleList(
+ [OmDetTurboDeformableTransformerDecoderLayer(config) for _ in range(config.decoder_num_layers)]
+ )
+ self.decoder_num_layers = config.decoder_num_layers
+ # decoder embedding
+ if self.learn_initial_query:
+ self.tgt_embed = nn.Embedding(self.num_queries, hidden_dim)
+ self.query_position_head = OmDetTurboMLP(
+ input_dim=4, hidden_dim=2 * hidden_dim, output_dim=hidden_dim, num_layers=2
+ )
+
+ # encoder head
+ self.encoder_vision_features = nn.Sequential(
+ nn.Linear(hidden_dim, hidden_dim), nn.LayerNorm(hidden_dim, eps=config.layer_norm_eps)
+ )
+ self.encoder_class_head = nn.Linear(config.class_embed_dim, hidden_dim)
+ self.encoder_bbox_head = OmDetTurboMLP(input_dim=hidden_dim, hidden_dim=hidden_dim, output_dim=4, num_layers=3)
+
+ # decoder head
+ self.decoder_class_head = nn.ModuleList(
+ [nn.Linear(config.class_embed_dim, hidden_dim) for _ in range(config.decoder_num_layers)]
+ )
+ self.decoder_bbox_head = nn.ModuleList(
+ [OmDetTurboMLP(hidden_dim, hidden_dim, 4, num_layers=3) for _ in range(config.decoder_num_layers)]
+ )
+
+ # Initialize weights and apply final processing
+ self.post_init()
+
+ @lru_cache(maxsize=32)
+ def generate_anchors(self, spatial_shapes=None, grid_size=0.05, device="cpu", dtype=torch.float32):
+ # We always generate anchors in float32 to preserve equivalence between
+ # dynamic and static anchor inference
+ # Ignore copy
+ if spatial_shapes is None:
+ raise ValueError("spatial_shapes must be provided")
+
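+ # one anchor per feature-map cell, centered on the cell, with width/height growing with the pyramid level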
+ anchors = []
+ for level, (height, width) in enumerate(spatial_shapes):
+ grid_y, grid_x = torch.meshgrid(
+ torch.arange(end=height, dtype=dtype, device=device),
+ torch.arange(end=width, dtype=dtype, device=device),
+ indexing="ij",
+ )
+ grid_xy = torch.stack([grid_x, grid_y], -1)
+ valid_wh = torch.tensor([width, height], dtype=dtype, device=device)
+ grid_xy = (grid_xy.unsqueeze(0) + 0.5) / valid_wh
+ wh = torch.ones_like(grid_xy, dtype=dtype, device=device) * grid_size * (2.0**level)
+ anchors.append(torch.concat([grid_xy, wh], -1).reshape(-1, height * width, 4))
+ # define the valid range for anchor coordinates
+ eps = 1e-2
+ anchors = torch.concat(anchors, 1)
+ valid_mask = ((anchors > eps) * (anchors < 1 - eps)).all(-1, keepdim=True)
+ anchors = torch.log(anchors / (1 - anchors))
+ anchors = torch.where(valid_mask, anchors, torch.inf)
+
+ return anchors, valid_mask
+
+ def _get_encoder_input(self, vision_features):
+ # get projection features
+ vision_features = [self.channel_projection_layers[i](feat) for i, feat in enumerate(vision_features)]
+ # get encoder inputs
+ new_vision_features = []
+ new_vision_shapes_list = []
+ for feat in vision_features:
+ height, width = feat.shape[2:]
+ # [batch_size, channels, height, width] -> [batch_size, height*width, channels]
+ new_vision_features.append(feat.flatten(2).permute(0, 2, 1))
+ # [num_feature_levels, 2]
+ new_vision_shapes_list.append((height, width))
+
+ # [batch_size, height*width, channels]
+ new_vision_features = torch.cat(new_vision_features, 1)
+ new_vision_shapes = torch.tensor(new_vision_shapes_list, dtype=torch.int64).to(vision_features[0].device)
+ level_start_index = torch.cat((new_vision_shapes.new_zeros((1,)), new_vision_shapes.prod(1).cumsum(0)[:-1]))
+
+ return new_vision_features, new_vision_shapes, new_vision_shapes_list, level_start_index
+
+ def _get_decoder_input(
+ self, vision_features, vision_shapes, class_features, denoise_embeddings=None, denoise_bboxes=None
+ ):
+ batch_size = len(vision_features)
+ # prepare input for decoder
+ anchors, valid_mask = self.generate_anchors(
+ vision_shapes, device=vision_features.device, dtype=vision_features.dtype
+ )
+ predicted_class_features = self.encoder_vision_features(
+ torch.where(
+ valid_mask, vision_features, torch.tensor(0.0, dtype=vision_features.dtype).to(vision_features.device)
+ )
+ )
+
+ original_class_projected = self.encoder_class_head(class_features).permute(1, 2, 0)
+ encoder_class_similarity = get_class_similarity(
+ self.class_distance_type, predicted_class_features, original_class_projected
+ )
+
+ # dynamic anchors + static content
+ # (batch_size, height*width, 4)
+ encoder_outputs_bboxes = self.encoder_bbox_head(predicted_class_features) + anchors
+
+ # query selection
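+ # select the `num_queries` encoder proposals with the highest class similarity; their boxes become the
+ # initial reference points for the decoder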
+ # (batch_size, num_queries)
+ topk_ind = torch.topk(encoder_class_similarity.max(-1).values, self.num_queries, dim=1).indices.view(-1)
+ # (batch_size, num_queries)
+ batch_ind = (
+ torch.arange(end=batch_size, dtype=topk_ind.dtype, device=topk_ind.device)
+ .unsqueeze(-1)
+ .repeat(1, self.num_queries)
+ .view(-1)
+ )
+
+ reference_points = encoder_outputs_bboxes[batch_ind, topk_ind].view(batch_size, self.num_queries, -1)
+ encoder_bboxes = reference_points.sigmoid()
+ if denoise_bboxes is not None:
+ reference_points = torch.cat([denoise_bboxes, reference_points], 1)
+ if self.training:
+ reference_points = reference_points.detach()
+ encoder_class_similarity = encoder_class_similarity[batch_ind, topk_ind].view(batch_size, self.num_queries, -1)
+
+ if self.learn_initial_query:
+ embeddings = self.tgt_embed.weight.unsqueeze(0).repeat(batch_size, 1, 1)
+ else:
+ embeddings = predicted_class_features[batch_ind, topk_ind].view(batch_size, self.num_queries, -1)
+ if self.training:
+ embeddings = embeddings.detach()
+ if denoise_embeddings is not None:
+ embeddings = torch.cat([denoise_embeddings, embeddings], 1)
+
+ return embeddings, reference_points, encoder_bboxes, encoder_class_similarity, anchors
+
+ def forward(
+ self,
+ vision_features,
+ class_features,
+ task_features,
+ task_mask,
+ output_attentions=None,
+ output_hidden_states=None,
+ return_dict=None,
+ ):
+ """
+ Args:
+ vision_features (`torch.FloatTensor`): The sequence of vision features. Shape depends on the vision
+ backbone.
+ class_features (`torch.FloatTensor`): The sequence of class features of shape
+ `(class_sequence_length, batch_size, class_embed_dim)`.
+ task_features (`torch.FloatTensor`): The sequence of task features of shape
+ `(task_sequence_length, batch_size, decoder_hidden_dim)`.
+ task_mask (`torch.LongTensor`): The mask for the task features of shape `(batch_size, task_sequence_length)`.
+ output_attentions (`bool`, *optional*): Whether or not to return the attentions tensors of all attention
+ layers. See `attentions` under returned tensors for more detail.
+ output_hidden_states (`bool`, *optional*): Whether or not to return the hidden states of all layers. See
+ `hidden_states` under returned tensors for more detail.
+ return_dict (`bool`, *optional*): Whether or not to return a [`~file_utils.ModelOutput`] instead of a plain
+ tuple.
+ """
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ vision_features, vision_shapes, vision_shapes_list, level_start_index = self._get_encoder_input(
+ vision_features
+ )
+
+ # todo add denoising for training
+ denoise_embeddings, denoise_bboxes, key_padding_mask = None, None, None
+ batch_size = task_mask.shape[0]
+
+ # compose attn_mask for vision_emb and task_emb fusion
+ task_features = self.task_encoder(task_features)
+ if self.task_project is not None:
+ task_features = self.task_project(task_features)
+ src_key_mask = (task_mask == 0).detach()
+ attn_mask_len = self.num_queries
+ fusion_size = attn_mask_len + task_features.shape[0]
+ key_padding_mask = torch.zeros([batch_size, fusion_size], dtype=torch.bool).to(task_features.device)
+ key_padding_mask[:, attn_mask_len:] = src_key_mask
+ attention_mask = _prepare_4d_attention_mask(~key_padding_mask, dtype=vision_features.dtype)
+ decoder_embeddings, reference_points, encoder_bboxes, encoder_class_similarity, init_reference_points = (
+ self._get_decoder_input(
+ vision_features, tuple(vision_shapes_list), class_features, denoise_embeddings, denoise_bboxes
+ )
+ )
+
+ all_hidden_states = () if output_hidden_states else None
+ all_attns = () if output_attentions else None
+ all_self_attns = () if output_attentions else None
+ all_cross_attns = () if output_attentions else None
+ predicted_class_features = decoder_embeddings
+
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (predicted_class_features,)
+ decoder_bboxes = []
+ decoder_classes = []
+ last_refined_bbox = None
+ reference_points = reference_points.sigmoid()
+ for i, layer in enumerate(self.layers):
+ if self.gradient_checkpointing and self.training:
+ predicted_class_features, task_features, self_attention, cross_attention = (
+ self._gradient_checkpointing_func(
+ layer.__call__,
+ predicted_class_features,
+ task_features,
+ reference_points,
+ vision_features,
+ vision_shapes,
+ vision_shapes_list,
+ level_start_index=level_start_index,
+ attention_mask=attention_mask,
+ query_position=self.query_position_head(reference_points),
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ )
+ )
+ else:
+ predicted_class_features, task_features, self_attention, cross_attention = layer(
+ predicted_class_features,
+ task_features,
+ reference_points,
+ vision_features,
+ vision_shapes,
+ vision_shapes_list,
+ level_start_index=level_start_index,
+ attention_mask=attention_mask,
+ query_position=self.query_position_head(reference_points),
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ )
+ if output_attentions:
+ all_self_attns = all_self_attns + (self_attention,)
+ all_cross_attns = all_cross_attns + (cross_attention,)
+ if output_hidden_states:
+ all_hidden_states = all_hidden_states + (predicted_class_features,)
+
+ refined_bbox = torch.sigmoid(
+ self.decoder_bbox_head[i](predicted_class_features) + _inverse_sigmoid(reference_points)
+ )
+ original_class_projected = self.decoder_class_head[i](class_features).permute(1, 2, 0)
+ if self.training:
+ decoder_classes.append(
+ get_class_similarity(
+ class_distance_type=self.class_distance_type,
+ cls_feature=predicted_class_features,
+ class_proj=original_class_projected,
+ )
+ )
+ if i == 0:
+ decoder_bboxes.append(refined_bbox)
+ else:
+ decoder_bboxes.append(
+ torch.sigmoid(
+ self.decoder_bbox_head[i](predicted_class_features) + _inverse_sigmoid(last_refined_bbox)
+ )
+ )
+ elif i == self.decoder_num_layers - 1:
+ decoder_classes.append(
+ get_class_similarity(self.class_distance_type, predicted_class_features, original_class_projected)
+ )
+ decoder_bboxes.append(refined_bbox)
+ break
+ last_refined_bbox = refined_bbox
+ reference_points = refined_bbox.detach() if self.training else refined_bbox
+ if output_attentions:
+ all_attns += (all_self_attns, all_cross_attns)
+
+ last_hidden_state = predicted_class_features
+ decoder_bboxes = torch.stack(decoder_bboxes)
+ decoder_classes = torch.stack(decoder_classes)
+
+ if not return_dict:
+ return (
+ last_hidden_state,
+ all_hidden_states,
+ all_attns,
+ decoder_bboxes,
+ decoder_classes,
+ encoder_bboxes,
+ encoder_class_similarity,
+ init_reference_points,
+ reference_points,
+ )
+
+ return OmDetTurboDecoderOutput(
+ last_hidden_state=last_hidden_state,
+ hidden_states=all_hidden_states,
+ attentions=all_attns,
+ decoder_coords=decoder_bboxes,
+ decoder_classes=decoder_classes,
+ encoder_coord_logits=encoder_bboxes,
+ encoder_class_logits=encoder_class_similarity,
+ init_reference_points=init_reference_points,
+ intermediate_reference_points=reference_points,
+ )
+
+
+@add_start_docstrings(
+ """
+ OmDetTurbo Model (consisting of a vision and a text backbone, and encoder-decoder architecture) outputting
+ bounding boxes and class scores for tasks such as COCO detection.
+ """,
+ OMDET_TURBO_START_DOCSTRING,
+)
+class OmDetTurboForObjectDetection(OmDetTurboPreTrainedModel):
+ def __init__(self, config: OmDetTurboConfig):
+ super().__init__(config)
+ self.vision_backbone = OmDetTurboVisionBackbone(config)
+ self.language_backbone = OmDetTurboLanguageBackbone(config)
+ self.encoder = OmDetTurboHybridEncoder(config)
+ self.decoder = OmDetTurboDecoder(config)
+ self.num_queries = config.num_queries
+
+ self.language_cache_class = OmDetTurboLRUCache(config.cache_size)
+ self.language_cache_prompt = OmDetTurboLRUCache(config.cache_size)
+ self.vocab_size = config.text_config.vocab_size
+ self.post_init()
+
+ def get_input_embeddings(self):
+ return self.language_backbone.model.get_input_embeddings()
+
+ def set_input_embeddings(self, value):
+ self.language_backbone.model.set_input_embeddings(value)
+
+ def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
+ model_embeds = self.language_backbone.model.resize_token_embeddings(
+ new_num_tokens=new_num_tokens, pad_to_multiple_of=pad_to_multiple_of
+ )
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
+ self.vocab_size = model_embeds.num_embeddings
+ return model_embeds
+
+ @add_start_docstrings_to_model_forward(OMDET_TURBO_INPUTS_DOCSTRING)
+ @replace_return_docstrings(output_type=OmDetTurboObjectDetectionOutput, config_class=_CONFIG_FOR_DOC)
+ def forward(
+ self,
+ pixel_values: Tensor,
+ classes_input_ids: Tensor,
+ classes_attention_mask: Tensor,
+ tasks_input_ids: Tensor,
+ tasks_attention_mask: Tensor,
+ classes_structure: Tensor,
+ labels: Optional[Tensor] = None,
+ output_attentions=None,
+ output_hidden_states=None,
+ return_dict=None,
+ ) -> Union[Tuple[torch.FloatTensor], OmDetTurboObjectDetectionOutput]:
+ r"""
+ Returns:
+
+ Examples:
+
+ ```python
+ >>> import requests
+ >>> from PIL import Image
+
+ >>> from transformers import AutoProcessor, OmDetTurboForObjectDetection
+
+ >>> processor = AutoProcessor.from_pretrained("omlab/omdet-turbo-tiny")
+ >>> model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-tiny")
+
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ >>> image = Image.open(requests.get(url, stream=True).raw)
+ >>> classes = ["cat", "remote"]
+ >>> task = "Detect {}.".format(", ".join(classes))
+ >>> inputs = processor(image, text=classes, task=task, return_tensors="pt")
+
+ >>> outputs = model(**inputs)
+
+ >>> # convert outputs (bounding boxes and class logits)
+ >>> results = processor.post_process_grounded_object_detection(
+ ... outputs,
+ ... classes=classes,
+ ... target_sizes=[image.size[::-1]],
+ ... score_threshold=0.3,
+ ... nms_threshold=0.3,
+ ... )[0]
+ >>> for score, class_name, box in zip(results["scores"], results["classes"], results["boxes"]):
+ ... box = [round(i, 1) for i in box.tolist()]
+ ... print(
+ ... f"Detected {class_name} with confidence "
+ ... f"{round(score.item(), 2)} at location {box}"
+ ... )
+ Detected remote with confidence 0.76 at location [39.9, 71.3, 176.5, 117.9]
+ Detected cat with confidence 0.72 at location [345.1, 22.5, 639.7, 371.9]
+ Detected cat with confidence 0.65 at location [12.7, 53.8, 315.5, 475.3]
+ Detected remote with confidence 0.57 at location [333.4, 75.6, 370.7, 187.0]
+ ```"""
+ if labels is not None:
+ raise NotImplementedError("Training is not implemented yet")
+
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+ output_hidden_states = (
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+ )
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+ loss = None
+ image_features = self.vision_backbone(pixel_values)
+ encoder_outputs = self.encoder(
+ image_features,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
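+ # Embed the class names and the grounding task separately; this decoupled structure is what makes the text embeddings cacheable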
+ class_features, task_features, task_mask = self.get_language_embedding(
+ classes_input_ids,
+ classes_attention_mask,
+ tasks_input_ids,
+ tasks_attention_mask,
+ classes_structure,
+ )
+ encoder_extracted_states = encoder_outputs.extracted_states if return_dict else encoder_outputs[-1]
+ decoder_outputs = self.decoder(
+ encoder_extracted_states,
+ class_features,
+ task_features,
+ task_mask,
+ output_attentions=output_attentions,
+ output_hidden_states=output_hidden_states,
+ return_dict=return_dict,
+ )
+
+ if not return_dict:
+ return tuple(
+ output
+ for output in [
+ loss,
+ decoder_outputs[3][-1],
+ decoder_outputs[4][-1],
+ decoder_outputs[7],
+ decoder_outputs[8],
+ decoder_outputs[5],
+ decoder_outputs[6],
+ encoder_outputs[-1],
+ decoder_outputs[1],
+ decoder_outputs[2],
+ encoder_outputs[1],
+ encoder_outputs[2],
+ ]
+ if output is not None
+ )
+
+ return OmDetTurboObjectDetectionOutput(
+ loss=loss,
+ decoder_coord_logits=decoder_outputs.decoder_coords[-1],
+ decoder_class_logits=decoder_outputs.decoder_classes[-1],
+ init_reference_points=decoder_outputs.init_reference_points,
+ intermediate_reference_points=decoder_outputs.intermediate_reference_points,
+ encoder_coord_logits=decoder_outputs.encoder_coord_logits,
+ encoder_class_logits=decoder_outputs.encoder_class_logits,
+ encoder_extracted_states=encoder_outputs.extracted_states,
+ decoder_hidden_states=decoder_outputs.hidden_states,
+ decoder_attentions=decoder_outputs.attentions,
+ encoder_hidden_states=encoder_outputs.hidden_states,
+ encoder_attentions=encoder_outputs.attentions,
+ )
diff --git a/src/transformers/models/omdet_turbo/processing_omdet_turbo.py b/src/transformers/models/omdet_turbo/processing_omdet_turbo.py
new file mode 100644
index 000000000000..909281b0c686
--- /dev/null
+++ b/src/transformers/models/omdet_turbo/processing_omdet_turbo.py
@@ -0,0 +1,362 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+Processor class for OmDet-Turbo.
+"""
+
+import sys
+from typing import List, Optional, Tuple, Union
+
+from ...feature_extraction_utils import BatchFeature
+from ...image_transforms import center_to_corners_format
+from ...image_utils import ImageInput
+from ...processing_utils import ProcessingKwargs, ProcessorMixin, TextKwargs
+from ...tokenization_utils_base import PreTokenizedInput, TextInput
+from ...utils import (
+ TensorType,
+ is_torch_available,
+ is_torchvision_available,
+)
+
+
+if sys.version_info >= (3, 11):
+ from typing import Unpack
+else:
+ from typing_extensions import Unpack
+
+
+class OmDetTurboTextKwargs(TextKwargs, total=False):
+ task: Optional[Union[str, List[str], TextInput, PreTokenizedInput]]
+
+
+class OmDetTurboProcessorKwargs(ProcessingKwargs, total=False):
+ text_kwargs: OmDetTurboTextKwargs
+ _defaults = {
+ "text_kwargs": {
+ "add_special_tokens": True,
+ "padding": "max_length",
+ "truncation": True,
+ "max_length": 77,
+ "stride": 0,
+ "return_overflowing_tokens": False,
+ "return_special_tokens_mask": False,
+ "return_offsets_mapping": False,
+ "return_token_type_ids": False,
+ "return_length": False,
+ "verbose": True,
+ "task": None,
+ },
+ "images_kwargs": {},
+ }
+
+
+if is_torch_available():
+ import torch
+
+if is_torchvision_available():
+ from torchvision.ops.boxes import batched_nms
+
+
+def clip_boxes(box, box_size: Tuple[int, int]):
+ """
+ Clip the boxes by limiting x coordinates to the range [0, width]
+ and y coordinates to the range [0, height].
+
+ Args:
+ box (Tensor): The boxes to be clipped.
+ box_size (Tuple[int, int]): The (height, width) of the image to clip the boxes to.
+ """
+ assert torch.isfinite(box).all(), "Box tensor contains infinite or NaN!"
+ height, width = box_size
+ x1 = box[:, 0].clamp(min=0, max=width)
+ y1 = box[:, 1].clamp(min=0, max=height)
+ x2 = box[:, 2].clamp(min=0, max=width)
+ y2 = box[:, 3].clamp(min=0, max=height)
+ box = torch.stack((x1, y1, x2, y2), dim=-1)
+
+ return box
+
+
+def compute_score(boxes):
+ """
+ Compute per-class scores (sigmoid of the class logits) for each box (proposal), together with an array of
+ class indices corresponding to each proposal, flattened across proposals.
+ The indices in `classes` are later used to filter and match the predicted classes
+ with the input class names.
+ """
+ num_classes = boxes.shape[2]
+ proposal_num = boxes.shape[1]
+ scores = torch.sigmoid(boxes)
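+ # Build a flat class-index lookup [0, ..., num_classes - 1] repeated for every proposal, aligned with the flattened scores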
+ classes = torch.arange(num_classes, device=boxes.device).unsqueeze(0).repeat(proposal_num, 1).flatten(0, 1)
+ return scores, classes
+
+
+def _post_process_boxes_for_image(
+ boxes: TensorType,
+ scores: TensorType,
+ predicted_classes: TensorType,
+ classes: List[str],
+ image_size: Tuple[int, int],
+ num_classes: int,
+ score_threshold: float,
+ nms_threshold: float,
+ max_num_det: Optional[int] = None,
+) -> dict:
+ """
+ Filter predicted results using given thresholds and NMS.
+ Args:
+ boxes (torch.Tensor): A Tensor of predicted class-specific or class-agnostic
+ boxes for the image. Shape : (num_queries, max_num_classes_in_batch * 4) if doing
+ class-specific regression, or (num_queries, 4) if doing class-agnostic
+ regression.
+ scores (torch.Tensor): A Tensor of predicted class scores for the image.
+ Shape : (num_queries, max_num_classes_in_batch + 1)
+ predicted_classes (torch.Tensor): A Tensor of predicted classes for the image.
+ Shape : (num_queries * (max_num_classes_in_batch + 1),)
+ classes (List[str]): The input classes names.
+ image_size (tuple): A tuple of (height, width) for the image.
+ num_classes (int): The number of classes given for this image.
+ score_threshold (float): Only return detections with a confidence score exceeding this
+ threshold.
+ nms_threshold (float): The threshold to use for box non-maximum suppression. Value in [0, 1].
+ max_num_det (int, optional): The maximum number of detections to return. Default is None.
+ Returns:
+ dict: A dictionary with the following keys:
+ "boxes" (Tensor): A tensor of shape (num_filtered_objects, 4), containing the predicted boxes in (x1, y1, x2, y2) format.
+ "scores" (Tensor): A tensor of shape (num_filtered_objects,), containing the predicted confidence scores for each detection.
+ "classes" (List[str]): A list of strings, where each string is the predicted class for the
+ corresponding detection
+ """
+ proposal_num = len(boxes) if max_num_det is None else max_num_det
+ scores_per_image, topk_indices = scores.flatten(0, 1).topk(proposal_num, sorted=False)
+ classes_per_image = predicted_classes[topk_indices]
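+ # Boxes are class-agnostic: repeat each box once per class so that indexing with the flattened top-k indices stays aligned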
+ box_pred_per_image = boxes.view(-1, 1, 4).repeat(1, num_classes, 1).view(-1, 4)
+ box_pred_per_image = box_pred_per_image[topk_indices]
+
+ # Score filtering
+ box_pred_per_image = center_to_corners_format(box_pred_per_image)
+ box_pred_per_image = box_pred_per_image * torch.tensor(image_size[::-1]).repeat(2).to(box_pred_per_image.device)
+ filter_mask = scores_per_image > score_threshold # R x K
+ score_keep = filter_mask.nonzero(as_tuple=False).view(-1)
+ box_pred_per_image = box_pred_per_image[score_keep]
+ scores_per_image = scores_per_image[score_keep]
+ classes_per_image = classes_per_image[score_keep]
+
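+ # Drop detections that point at padded class slots (classes are padded up to the largest class count in the batch)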
+ filter_classes_mask = classes_per_image < len(classes)
+ classes_keep = filter_classes_mask.nonzero(as_tuple=False).view(-1)
+ box_pred_per_image = box_pred_per_image[classes_keep]
+ scores_per_image = scores_per_image[classes_keep]
+ classes_per_image = classes_per_image[classes_keep]
+
+ # NMS
+ keep = batched_nms(box_pred_per_image, scores_per_image, classes_per_image, nms_threshold)
+ box_pred_per_image = box_pred_per_image[keep]
+ scores_per_image = scores_per_image[keep]
+ classes_per_image = classes_per_image[keep]
+ classes_per_image = [classes[i] for i in classes_per_image]
+
+ # create an instance
+ result = {}
+ result["boxes"] = clip_boxes(box_pred_per_image, image_size)
+ result["scores"] = scores_per_image
+ result["classes"] = classes_per_image
+
+ return result
+
+
+class OmDetTurboProcessor(ProcessorMixin):
+ r"""
+ Constructs an OmDet-Turbo processor which wraps a Deformable DETR image processor and an AutoTokenizer into a
+ single processor.
+
+ [`OmDetTurboProcessor`] offers all the functionalities of [`DetrImageProcessor`] and
+ [`AutoTokenizer`]. See the docstring of [`~OmDetTurboProcessor.__call__`] and [`~OmDetTurboProcessor.decode`]
+ for more information.
+
+ Args:
+ image_processor (`DetrImageProcessor`):
+ An instance of [`DetrImageProcessor`]. The image processor is a required input.
+ tokenizer (`AutoTokenizer`):
+ An instance of [`PreTrainedTokenizer`]. The tokenizer is a required input.
+ """
+
+ attributes = ["image_processor", "tokenizer"]
+ image_processor_class = "DetrImageProcessor"
+ tokenizer_class = "AutoTokenizer"
+
+ def __init__(self, image_processor, tokenizer):
+ super().__init__(image_processor, tokenizer)
+
+ def __call__(
+ self,
+ images: ImageInput = None,
+ text: Union[str, List[str], List[List[str]]] = None,
+ audio=None,
+ videos=None,
+ **kwargs: Unpack[OmDetTurboProcessorKwargs],
+ ) -> BatchFeature:
+ """
+ This method uses [`DetrImageProcessor.__call__`] to prepare image(s) for the model, and
+ [`CLIPTokenizerFast.__call__`] to prepare text for the model.
+
+ Please refer to the docstring of the above two methods for more information.
+
+ Args:
+ images (`ImageInput`):
+ Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255.
+ text (`Union[str, List[str], List[List[str]]]`):
+ The classes used to limit the scope of the open vocabulary detection. Expects a list of strings, a list
+ of lists of strings (one list per image), or a single comma-separated string of class names. Batched
+ classes can be of different lengths.
+ Examples: ["cat", "dog", "bird"], [["cat", "dog", "bird"], ["hat", "person"], ["car"]]
+ Kwargs:
+ task (`Union[str, List[str], TextInput, PreTokenizedInput]`):
+ The grounded text used to guide open vocabulary detection. Expects a single string or a list of strings.
+ Examples: "Detect a cat, a dog, and a bird.",[ "Detect everything.", "Detect trees and flowers."]
+ When not provided, the default task is "Detect [class1], [class2], [class3]" etc.
+ ...
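+
+ Example (illustrative sketch; assumes `image` is an already-loaded `PIL.Image.Image`):
+
+ ```python
+ >>> inputs = processor(images=image, text=[["cat", "remote"]], task="Detect cat, remote.", return_tensors="pt")
+ >>> sorted(inputs.keys())
+ ['classes_attention_mask', 'classes_input_ids', 'classes_structure', 'pixel_mask', 'pixel_values', 'tasks_attention_mask', 'tasks_input_ids']
+ ```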
+ """
+ if images is None or text is None:
+ raise ValueError("You have to specify both `images` and `text`")
+
+ output_kwargs = self._merge_kwargs(
+ OmDetTurboProcessorKwargs,
+ tokenizer_init_kwargs=self.tokenizer.init_kwargs,
+ **kwargs,
+ )
+
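+ # Accept a single comma-separated string of class names and normalize the input to a batch (list of lists) of classes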
+ if isinstance(text, str):
+ text = text.strip(" ").split(",")
+
+ if not (len(text) and isinstance(text[0], (list, tuple))):
+ text = [text]
+
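+ # When no task is provided, build a default grounding prompt from the class names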
+ task = output_kwargs["text_kwargs"].pop("task", None)
+ if task is None:
+ task = ["Detect {}.".format(", ".join(text_single)) for text_single in text]
+ elif not isinstance(task, (list, tuple)):
+ task = [task]
+
+ encoding_image_processor = self.image_processor(images, **output_kwargs["images_kwargs"])
+ tasks_encoding = self.tokenizer(text=task, **output_kwargs["text_kwargs"])
+
+ classes = text
+
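+ # Record how many classes each batch element has, since the class names are flattened before tokenization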
+ classes_structure = torch.tensor([len(class_single) for class_single in classes], dtype=torch.long)
+ classes_flattened = [class_single for class_batch in classes for class_single in class_batch]
+ classes_encoding = self.tokenizer(text=classes_flattened, **output_kwargs["text_kwargs"])
+
+ encoding = BatchFeature()
+ encoding.update({f"tasks_{key}": value for key, value in tasks_encoding.items()})
+ encoding.update({f"classes_{key}": value for key, value in classes_encoding.items()})
+ encoding.update({"classes_structure": classes_structure})
+ encoding.update(encoding_image_processor)
+
+ return encoding
+
+ # Copied from transformers.models.blip.processing_blip.BlipProcessor.batch_decode with BertTokenizerFast->PreTrainedTokenizer
+ def batch_decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
+ refer to the docstring of this method for more information.
+ """
+ return self.tokenizer.batch_decode(*args, **kwargs)
+
+ # Copied from transformers.models.blip.processing_blip.BlipProcessor.decode with BertTokenizerFast->PreTrainedTokenizer
+ def decode(self, *args, **kwargs):
+ """
+ This method forwards all its arguments to PreTrainedTokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
+ the docstring of this method for more information.
+ """
+ return self.tokenizer.decode(*args, **kwargs)
+
+ def post_process_grounded_object_detection(
+ self,
+ outputs,
+ classes: Union[List[str], List[List[str]]],
+ score_threshold: float = 0.3,
+ nms_threshold: float = 0.5,
+ target_sizes: Optional[Union[TensorType, List[Tuple]]] = None,
+ max_num_det: Optional[int] = None,
+ ):
+ """
+ Converts the raw output of [`OmDetTurboForObjectDetection`] into final bounding boxes in (top_left_x, top_left_y,
+ bottom_right_x, bottom_right_y) format and gets the associated text classes.
+
+ Args:
+ outputs ([`OmDetTurboObjectDetectionOutput`]):
+ Raw outputs of the model.
+ classes (Union[List[str], List[List[str]]]): The input classes names.
+ score_threshold (float, defaults to 0.3): Only return detections with a confidence score exceeding this
+ threshold.
+ nms_threshold (float, defaults to 0.5): The threshold to use for box non-maximum suppression. Value in [0, 1].
+ target_sizes (`torch.Tensor` or `List[Tuple[int, int]]`, *optional*, defaults to None):
+ Tensor of shape `(batch_size, 2)` or list of tuples (`Tuple[int, int]`) containing the target size
+ `(height, width)` of each image in the batch. If unset, predictions will not be resized.
+ max_num_det (int, *optional*, defaults to None): The maximum number of detections to return.
+ Returns:
+ `List[Dict]`: A list of dictionaries, each dictionary containing the scores, classes and boxes for an image
+ in the batch as predicted by the model.
+ """
+ if isinstance(classes[0], str):
+ classes = [classes]
+
+ boxes_logits = outputs.decoder_coord_logits
+ scores_logits = outputs.decoder_class_logits
+
+ # Inputs consistency check
+ if target_sizes is None:
+ height = (
+ self.image_processor.size["height"]
+ if "height" in self.image_processor.size
+ else self.image_processor.size["shortest_edge"]
+ )
+ width = (
+ self.image_processor.size["width"]
+ if "width" in self.image_processor.size
+ else self.image_processor.size["longest_edge"]
+ )
+ target_sizes = ((height, width),) * len(boxes_logits)
+ elif len(target_sizes[0]) != 2:
+ raise ValueError(
+ "Each element of target_sizes must contain the size (height, width) of each image of the batch"
+ )
+ if len(target_sizes) != len(boxes_logits):
+ raise ValueError("Make sure that you pass in as many target sizes as output sequences")
+ if len(classes) != len(boxes_logits):
+ raise ValueError("Make sure that you pass in as many classes group as output sequences")
+
+ # Convert target_sizes to list for easier handling
+ if isinstance(target_sizes, torch.Tensor):
+ target_sizes = target_sizes.tolist()
+
+ scores, predicted_classes = compute_score(scores_logits)
+ num_classes = scores_logits.shape[2]
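+ # Post-process each image independently: score filtering, padded-class filtering, NMS, then clipping to the image size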
+ results = []
+ for scores_img, box_per_img, image_size, class_names in zip(scores, boxes_logits, target_sizes, classes):
+ results.append(
+ _post_process_boxes_for_image(
+ box_per_img,
+ scores_img,
+ predicted_classes,
+ class_names,
+ image_size,
+ num_classes,
+ score_threshold=score_threshold,
+ nms_threshold=nms_threshold,
+ max_num_det=max_num_det,
+ )
+ )
+
+ return results
diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py
index 2db7b38b5803..ef10b91ea558 100644
--- a/src/transformers/utils/dummy_pt_objects.py
+++ b/src/transformers/utils/dummy_pt_objects.py
@@ -6552,6 +6552,20 @@ def __init__(self, *args, **kwargs):
requires_backends(self, ["torch"])
+class OmDetTurboForObjectDetection(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
+class OmDetTurboPreTrainedModel(metaclass=DummyObject):
+ _backends = ["torch"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch"])
+
+
class OneFormerForUniversalSegmentation(metaclass=DummyObject):
_backends = ["torch"]
diff --git a/tests/models/omdet_turbo/__init__.py b/tests/models/omdet_turbo/__init__.py
new file mode 100644
index 000000000000..e69de29bb2d1
diff --git a/tests/models/omdet_turbo/test_modeling_omdet_turbo.py b/tests/models/omdet_turbo/test_modeling_omdet_turbo.py
new file mode 100644
index 000000000000..ed85c4c00078
--- /dev/null
+++ b/tests/models/omdet_turbo/test_modeling_omdet_turbo.py
@@ -0,0 +1,904 @@
+# coding=utf-8
+# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""Testing suite for the PyTorch OmDet-Turbo model."""
+
+import copy
+import unittest
+from io import BytesIO
+
+import requests
+
+from transformers import OmDetTurboConfig, is_torch_available, is_vision_available
+from transformers.feature_extraction_utils import BatchFeature
+from transformers.file_utils import cached_property
+from transformers.testing_utils import (
+ require_timm,
+ require_torch,
+ require_torch_gpu,
+ require_vision,
+ slow,
+ torch_device,
+)
+
+from ...test_configuration_common import ConfigTester
+from ...test_modeling_common import ModelTesterMixin, _config_zero_init, floats_tensor, ids_tensor
+from ...test_pipeline_mixin import PipelineTesterMixin
+
+
+if is_torch_available():
+ import torch
+ import torch.nn.functional as F
+
+ from transformers import OmDetTurboForObjectDetection
+
+
+if is_vision_available():
+ from PIL import Image
+
+ from transformers import AutoProcessor
+
+
+class OmDetTurboModelTester:
+ def __init__(
+ self,
+ parent,
+ batch_size=6,
+ is_training=False,
+ num_channels=3,
+ max_text_len=7,
+ num_classes=3,
+ use_timm_backbone=False,
+ backbone=None,
+ apply_layernorm_after_vision_backbone=False,
+ image_size=224,
+ text_projection_in_dim=16,
+ text_projection_out_dim=16,
+ class_embed_dim=16,
+ hidden_size=8,
+ num_hidden_layers=2,
+ num_attention_heads=2,
+ num_queries=20,
+ encoder_in_channels=(16, 32, 64),
+ encoder_dim_feedforward=32,
+ num_projection_layers=1,
+ decoder_n_points=4,
+ num_feature_levels=3,
+ ):
+ super().__init__()
+ self.parent = parent
+ self.batch_size = batch_size
+ self.is_training = is_training
+ self.num_channels = num_channels
+ self.max_text_len = max_text_len
+ self.num_classes = num_classes
+ self.use_timm_backbone = use_timm_backbone
+ self.backbone = backbone
+ self.apply_layernorm_after_vision_backbone = apply_layernorm_after_vision_backbone
+ self.image_size = image_size
+ self.text_projection_in_dim = text_projection_in_dim
+ self.text_projection_out_dim = text_projection_out_dim
+ self.class_embed_dim = class_embed_dim
+ self.hidden_size = hidden_size
+ self.num_hidden_layers = num_hidden_layers
+ self.num_attention_heads = num_attention_heads
+ self.num_queries = num_queries
+ self.encoder_in_channels = encoder_in_channels
+ self.encoder_dim_feedforward = encoder_dim_feedforward
+ self.num_projection_layers = num_projection_layers
+ self.decoder_n_points = decoder_n_points
+ self.num_feature_levels = num_feature_levels
+
+ self.encoder_seq_length_vision = self.image_size // 32
+ self.decoder_seq_length = self.num_queries
+
+ def prepare_config_and_inputs(self):
+ pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size])
+
+ input_ids_tasks = ids_tensor([self.batch_size, self.max_text_len], self.num_classes)
+ input_ids_tasks = input_ids_tasks.to(torch_device)
+ input_ids_classes = torch.cat(
+ [ids_tensor([self.num_classes, self.max_text_len], self.num_classes) for _ in range(self.batch_size)]
+ )
+ input_ids_classes = input_ids_classes.to(torch_device)
+ attention_mask_tasks = torch.ones_like(input_ids_tasks, device=torch_device)
+ attention_mask_classes = torch.ones_like(input_ids_classes, device=torch_device)
+ classes_structure = torch.ones(self.batch_size, dtype=torch.long, device=torch_device) * self.num_classes
+ encoding = BatchFeature()
+ encoding.update(
+ {
+ "pixel_values": pixel_values,
+ "classes_input_ids": input_ids_classes,
+ "classes_attention_mask": attention_mask_classes,
+ "tasks_input_ids": input_ids_tasks,
+ "tasks_attention_mask": attention_mask_tasks,
+ "classes_structure": classes_structure,
+ }
+ )
+ config = self.get_config()
+ return config, encoding
+
+ def get_config(self):
+ text_backbone = {
+ "hidden_size": 16,
+ "num_hidden_layers": 2,
+ "num_attention_heads": 2,
+ "intermediate_size": 16,
+ "max_position_embeddings": 8,
+ "model_type": "clip_text_model",
+ }
+ backbone_config = {
+ "embed_dim": self.hidden_size,
+ "depths": (1, 1, 1, 1),
+ "num_heads": (1, 1, 1, 1),
+ "window_size": 7,
+ "image_size": self.image_size,
+ "out_indices": (2, 3, 4),
+ "model_type": "swin",
+ }
+
+ return OmDetTurboConfig(
+ text_config=text_backbone,
+ backbone_config=backbone_config,
+ use_timm_backbone=self.use_timm_backbone,
+ backbone=self.backbone,
+ apply_layernorm_after_vision_backbone=self.apply_layernorm_after_vision_backbone,
+ decoder_num_layers=self.num_hidden_layers,
+ image_size=self.image_size,
+ encoder_in_channels=self.encoder_in_channels,
+ num_queries=self.num_queries,
+ encoder_layers=self.num_hidden_layers,
+ encoder_projection_indices=[2] * self.num_projection_layers,
+ encoder_attention_heads=self.num_attention_heads,
+ decoder_num_heads=self.num_attention_heads,
+ decoder_num_points=self.decoder_n_points,
+ num_feature_levels=self.num_feature_levels,
+ encoder_dim_feedforward=self.encoder_dim_feedforward,
+ task_encoder_hidden_dim=self.encoder_dim_feedforward,
+ decoder_dim_feedforward=self.encoder_dim_feedforward,
+ class_embed_dim=self.class_embed_dim,
+ text_projection_in_dim=self.text_projection_in_dim,
+ text_projection_out_dim=self.text_projection_out_dim,
+ encoder_hidden_dim=self.hidden_size,
+ decoder_hidden_dim=self.hidden_size,
+ vision_features_channels=[self.hidden_size, self.hidden_size, self.hidden_size],
+ )
+
+ def prepare_config_and_inputs_for_common(self):
+ config, inputs_dict = self.prepare_config_and_inputs()
+ return config, inputs_dict
+
+ def create_and_check_object_detection_head_model(self, config, inputs_dict):
+ model = OmDetTurboForObjectDetection(config=config)
+ model.to(torch_device)
+ model.eval()
+
+ result = model(**inputs_dict)
+
+ self.parent.assertEqual(result.decoder_coord_logits.shape, (self.batch_size, self.num_queries, 4))
+ self.parent.assertEqual(
+ result.decoder_class_logits.shape, (self.batch_size, self.num_queries, self.num_classes)
+ )
+
+
+@require_torch
+class OmDetTurboModelTest(ModelTesterMixin, PipelineTesterMixin, unittest.TestCase):
+ all_model_classes = (OmDetTurboForObjectDetection,) if is_torch_available() else ()
+ is_encoder_decoder = True
+ test_pruning = False
+ test_head_masking = False
+ pipeline_model_mapping = (
+ {"zero-shot-object-detection": OmDetTurboForObjectDetection} if is_torch_available() else {}
+ )
+
+ # special case for head models
+ def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
+ inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
+
+ return inputs_dict
+
+ def setUp(self):
+ self.model_tester = OmDetTurboModelTester(self)
+ self.config_tester = ConfigTester(
+ self,
+ config_class=OmDetTurboConfig,
+ has_text_modality=False,
+ common_properties=["d_model", "encoder_attention_heads", "decoder_num_heads"],
+ )
+
+ def test_config(self):
+ self.config_tester.run_common_tests()
+
+ def test_object_detection_head_model(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs()
+ self.model_tester.create_and_check_object_detection_head_model(config, inputs_dict)
+
+ @unittest.skip(
+ reason="Unsupported as classes_input_ids are classes input are flattened by the processor: https://github.com/huggingface/transformers/issues/33669"
+ )
+ def test_multi_gpu_data_parallel_forward(self):
+ pass
+
+ @unittest.skip(reason="OmDet-Turbo does not use inputs_embeds")
+ def test_inputs_embeds(self):
+ pass
+
+ @unittest.skip(reason="OmDet-Turbo does not have 'input_ids' and 'attention_mask'")
+ def test_torchscript_output_attentions(self):
+ pass
+
+ @unittest.skip(reason="OmDet-Turbo does not have 'input_ids' and 'attention_mask'")
+ def test_torchscript_output_hidden_states(self):
+ pass
+
+ @unittest.skip(reason="OmDet-Turbo does not have 'input_ids' and 'attention_mask'")
+ def test_torchscript_simple(self):
+ pass
+
+ @unittest.skip(reason="OmDet-Turbo does not have 'input_ids' and 'attention_mask'")
+ def test_torchscript_output_hidden_state(self):
+ pass
+
+ def test_resize_tokens_embeddings(self):
+ # rewrite as OmDet-Turbo does not have "input_ids" and "decoder_input_ids"
+ (
+ original_config,
+ inputs_dict,
+ ) = self.model_tester.prepare_config_and_inputs_for_common()
+ if not self.test_resize_embeddings:
+ self.skipTest(reason="test_resize_embeddings is set to `False`")
+
+ for model_class in self.all_model_classes:
+ config = copy.deepcopy(original_config)
+ model = model_class(config)
+ model.to(torch_device)
+ model_embed_pre_resize = model.get_input_embeddings()
+ type_model_embed_pre_resize = type(model_embed_pre_resize)
+
+ if self.model_tester.is_training is False:
+ model.eval()
+
+ model_vocab_size = config.text_config.vocab_size if hasattr(config, "text_config") else config.vocab_size
+ # Retrieve the embeddings and clone them
+ model_embed = model.resize_token_embeddings(model_vocab_size)
+ cloned_embeddings = model_embed.weight.clone()
+
+ # Check that resizing the token embeddings with a larger vocab size increases the model's vocab size
+ model_embed = model.resize_token_embeddings(model_vocab_size + 10)
+ new_model_vocab_size = (
+ model.config.text_config.vocab_size
+ if hasattr(model.config, "text_config")
+ else model.config.vocab_size
+ )
+ self.assertEqual(new_model_vocab_size, model_vocab_size + 10)
+ # Check that it actually resizes the embeddings matrix
+ self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] + 10)
+ # Check to make sure the type of embeddings returned post resizing is same as type of input
+ type_model_embed_post_resize = type(model_embed)
+ self.assertEqual(type_model_embed_pre_resize, type_model_embed_post_resize)
+ # Check that the model can still do a forward pass successfully (every parameter should be resized)
+ model(**self._prepare_for_class(inputs_dict, model_class))
+
+ # Check that resizing the token embeddings with a smaller vocab size decreases the model's vocab size
+ model_embed = model.resize_token_embeddings(model_vocab_size - 15)
+ new_model_vocab_size = (
+ model.config.text_config.vocab_size
+ if hasattr(model.config, "text_config")
+ else model.config.vocab_size
+ )
+ self.assertEqual(new_model_vocab_size, model_vocab_size - 15)
+ # Check that it actually resizes the embeddings matrix
+ self.assertEqual(model_embed.weight.shape[0], cloned_embeddings.shape[0] - 15)
+
+ # Check that the model can still do a forward pass successfully (every parameter should be resized)
+ # Input ids should be clamped to the maximum size of the vocabulary
+ inputs_dict["tasks_input_ids"].clamp_(max=model_vocab_size - 15 - 1)
+
+ # make sure that classes_input_ids are resized as well
+ if "classes_input_ids" in inputs_dict:
+ inputs_dict["classes_input_ids"].clamp_(max=model_vocab_size - 15 - 1)
+ model(**self._prepare_for_class(inputs_dict, model_class))
+
+ # Check that adding and removing tokens has not modified the first part of the embedding matrix.
+ models_equal = True
+ for p1, p2 in zip(cloned_embeddings, model_embed.weight):
+ if p1.data.ne(p2.data).sum() > 0:
+ models_equal = False
+
+ self.assertTrue(models_equal)
+
+ config = copy.deepcopy(original_config)
+ model = model_class(config)
+ model.to(torch_device)
+
+ model_vocab_size = config.text_config.vocab_size if hasattr(config, "text_config") else config.vocab_size
+ model.resize_token_embeddings(model_vocab_size + 10, pad_to_multiple_of=1)
+ new_model_vocab_size = (
+ model.config.text_config.vocab_size
+ if hasattr(model.config, "text_config")
+ else model.config.vocab_size
+ )
+ self.assertEqual(new_model_vocab_size, model_vocab_size + 10)
+
+ model_embed = model.resize_token_embeddings(model_vocab_size, pad_to_multiple_of=64)
+ new_model_vocab_size = (
+ model.config.text_config.vocab_size
+ if hasattr(model.config, "text_config")
+ else model.config.vocab_size
+ )
+ self.assertEqual(model_embed.weight.shape[0] % 64, 0)
+
+ self.assertEqual(model_embed.weight.shape[0], new_model_vocab_size)
+ self.assertEqual(new_model_vocab_size, model.vocab_size)
+
+ model_embed = model.resize_token_embeddings(model_vocab_size + 13, pad_to_multiple_of=64)
+ self.assertEqual(model_embed.weight.shape[0] % 64, 0)
+
+ # Check that resizing a model to a multiple of pad_to_multiple leads to a model of exactly that size
+ target_dimension = 128
+ model_embed = model.resize_token_embeddings(target_dimension, pad_to_multiple_of=64)
+ self.assertEqual(model_embed.weight.shape[0], target_dimension)
+
+ with self.assertRaisesRegex(
+ ValueError,
+ "Asking to pad the embedding matrix to a multiple of `1.3`, which is not and integer. Please make sure to pass an integer",
+ ):
+ model.resize_token_embeddings(model_vocab_size, pad_to_multiple_of=1.3)
+
+ # Overwrite as `init_reference_points` is not batch dependent and contains `inf` values
+ def test_batching_equivalence(self):
+ """
+ Tests that the model supports batching and that the output is nearly the same for the same input in
+ different batch sizes.
+ (Why "nearly the same" not "exactly the same"? Batching uses different matmul shapes, which often leads to
+ different results: https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535)
+ """
+
+ def get_tensor_equivalence_function(batched_input):
+ # models operating on continuous spaces have higher abs difference than LMs
+ # instead, we can rely on cos distance for image/speech models, similar to `diffusers`
+ if "input_ids" not in batched_input:
+ return lambda tensor1, tensor2: (
+ 1.0 - F.cosine_similarity(tensor1.float().flatten(), tensor2.float().flatten(), dim=0, eps=1e-38)
+ )
+ return lambda tensor1, tensor2: torch.max(torch.abs(tensor1 - tensor2))
+
+ def recursive_check(batched_object, single_row_object, model_name, key):
+ if isinstance(batched_object, (list, tuple)):
+ for batched_object_value, single_row_object_value in zip(batched_object, single_row_object):
+ recursive_check(batched_object_value, single_row_object_value, model_name, key)
+ elif isinstance(batched_object, dict):
+ for batched_object_value, single_row_object_value in zip(
+ batched_object.values(), single_row_object.values()
+ ):
+ recursive_check(batched_object_value, single_row_object_value, model_name, key)
+ # do not compare returned loss (0-dim tensor) / codebook ids (int) / caching objects
+ elif batched_object is None or not isinstance(batched_object, torch.Tensor):
+ return
+ elif batched_object.dim() == 0:
+ return
+ elif key != "init_reference_points":
+ # `init_reference_points` is not compared: it is not batch dependent and contains `inf` values
+ # indexing the first element does not always work
+ # e.g. models that output similarity scores of size (N, M) would need to index [0, 0]
+ slice_ids = [slice(0, index) for index in single_row_object.shape]
+ batched_row = batched_object[slice_ids]
+ self.assertFalse(
+ torch.isnan(batched_row).any(), f"Batched output has `nan` in {model_name} for key={key}"
+ )
+ self.assertFalse(
+ torch.isinf(batched_row).any(), f"Batched output has `inf` in {model_name} for key={key}"
+ )
+ self.assertFalse(
+ torch.isnan(single_row_object).any(), f"Single row output has `nan` in {model_name} for key={key}"
+ )
+ self.assertFalse(
+ torch.isinf(single_row_object).any(),
+ f"Single row output has `inf` in {model_name} for key={key}",
+ )
+ self.assertTrue(
+ (equivalence(batched_row, single_row_object)) <= 1e-03,
+ msg=(
+ f"Batched and Single row outputs are not equal in {model_name} for key={key}. "
+ f"Difference={equivalence(batched_row, single_row_object)}."
+ ),
+ )
+
+ config, batched_input = self.model_tester.prepare_config_and_inputs_for_common()
+ equivalence = get_tensor_equivalence_function(batched_input)
+
+ for model_class in self.all_model_classes:
+ config.output_hidden_states = True
+
+ model_name = model_class.__name__
+ if hasattr(self.model_tester, "prepare_config_and_inputs_for_model_class"):
+ config, batched_input = self.model_tester.prepare_config_and_inputs_for_model_class(model_class)
+ batched_input_prepared = self._prepare_for_class(batched_input, model_class)
+ model = model_class(config).to(torch_device).eval()
+ batch_size = self.model_tester.batch_size
+ single_row_input = {}
+ for key, value in batched_input_prepared.items():
+ single_batch_shape = value.shape[0] // batch_size
+ single_row_input[key] = value[:single_batch_shape]
+
+ with torch.no_grad():
+ model_batched_output = model(**batched_input_prepared)
+ model_row_output = model(**single_row_input)
+
+ if isinstance(model_batched_output, torch.Tensor):
+ model_batched_output = {"model_output": model_batched_output}
+ model_row_output = {"model_output": model_row_output}
+
+ for key in model_batched_output:
+ # DETR starts from zero-init queries to decoder, leading to cos_similarity = `nan`
+ if hasattr(self, "zero_init_hidden_state") and "decoder_hidden_states" in key:
+ model_batched_output[key] = model_batched_output[key][1:]
+ model_row_output[key] = model_row_output[key][1:]
+ if key in ("decoder_class_logits", "decoder_classes", "encoder_class_logits"):
+ # if all elements are close to 0, skip the comparison as the check struggles with
+ # tensors whose elements are all close to 0
+ if torch.allclose(
+ model_batched_output[key], torch.zeros_like(model_batched_output[key]), atol=1e-6
+ ) and torch.allclose(model_row_output[key], torch.zeros_like(model_row_output[key]), atol=1e-6):
+ continue
+
+ recursive_check(model_batched_output[key], model_row_output[key], model_name, key)
+
+ def test_attention_outputs(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+ config.return_dict = True
+
+ for model_class in self.all_model_classes:
+ inputs_dict["output_attentions"] = True
+ inputs_dict["output_hidden_states"] = False
+ config.return_dict = True
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+ with torch.no_grad():
+ outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+ attentions = outputs.encoder_attentions[-1]
+ self.assertEqual(len(attentions), self.model_tester.num_hidden_layers)
+
+ # check that output_attentions also work using config
+ del inputs_dict["output_attentions"]
+ config.output_attentions = True
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+ with torch.no_grad():
+ outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+ attentions = outputs.encoder_attentions[-1]
+ self.assertEqual(
+ len(attentions), self.model_tester.num_hidden_layers * self.model_tester.num_projection_layers
+ )
+ # Rest of the shape seems to depend on backbone output shapes and image size
+ self.assertListEqual(
+ list(attentions[0].shape[-3:]),
+ [
+ self.model_tester.num_attention_heads,
+ self.model_tester.encoder_seq_length_vision**2,
+ self.model_tester.encoder_seq_length_vision**2,
+ ],
+ )
+ # decoder attentions
+ decoder_attentions = outputs.decoder_attentions[0]
+ self.assertIsInstance(decoder_attentions, (list, tuple))
+ self.assertEqual(len(decoder_attentions), self.model_tester.num_hidden_layers)
+ self.assertListEqual(
+ list(decoder_attentions[0].shape[-3:]),
+ [
+ self.model_tester.num_attention_heads,
+ self.model_tester.num_queries + self.model_tester.max_text_len,
+ self.model_tester.num_queries + self.model_tester.max_text_len,
+ ],
+ )
+
+ # cross attentions
+ cross_attentions = outputs.decoder_attentions[-1]
+ self.assertIsInstance(cross_attentions, (list, tuple))
+ self.assertEqual(len(cross_attentions), self.model_tester.num_hidden_layers)
+ self.assertListEqual(
+ list(cross_attentions[0].shape[-3:]),
+ [
+ self.model_tester.num_attention_heads,
+ self.model_tester.num_feature_levels,
+ self.model_tester.decoder_n_points,
+ ],
+ )
+
+ # Check attention is always last and order is fine
+ inputs_dict["output_attentions"] = True
+ inputs_dict["output_hidden_states"] = True
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+ with torch.no_grad():
+ outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+
+ self_attentions = outputs.encoder_attentions[-1]
+
+ self.assertEqual(
+ len(self_attentions), self.model_tester.num_hidden_layers * self.model_tester.num_projection_layers
+ )
+ self.assertListEqual(
+ list(attentions[0].shape[-3:]),
+ [
+ self.model_tester.num_attention_heads,
+ self.model_tester.encoder_seq_length_vision**2,
+ self.model_tester.encoder_seq_length_vision**2,
+ ],
+ )
+
+ # overwrite since encoder_hidden_states are 3-dim and not 2-dim
+ def test_hidden_states_output(self):
+ def check_hidden_states_output(inputs_dict, config, model_class):
+ model = model_class(config)
+ model.to(torch_device)
+ model.eval()
+
+ with torch.no_grad():
+ outputs = model(**self._prepare_for_class(inputs_dict, model_class))
+
+ hidden_states = outputs.encoder_hidden_states
+ expected_num_layers = getattr(
+ self.model_tester, "expected_num_hidden_layers", self.model_tester.num_projection_layers + 1
+ )
+ self.assertEqual(len(hidden_states), expected_num_layers)
+
+ seq_len = self.model_tester.encoder_seq_length_vision
+
+ self.assertListEqual(list(hidden_states[0].shape[-3:]), [self.model_tester.hidden_size, seq_len, seq_len])
+
+ hidden_states = outputs.decoder_hidden_states
+ expected_num_layers = getattr(
+ self.model_tester, "expected_num_hidden_layers", self.model_tester.num_hidden_layers + 1
+ )
+ self.assertIsInstance(hidden_states, (list, tuple))
+ self.assertEqual(len(hidden_states), expected_num_layers)
+ self.assertListEqual(
+ list(hidden_states[0].shape[-2:]),
+ [self.model_tester.decoder_seq_length, self.model_tester.hidden_size],
+ )
+
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ for model_class in self.all_model_classes:
+ inputs_dict["output_hidden_states"] = True
+ check_hidden_states_output(inputs_dict, config, model_class)
+
+ # check that output_hidden_states also work using config
+ del inputs_dict["output_hidden_states"]
+ config.output_hidden_states = True
+
+ check_hidden_states_output(inputs_dict, config, model_class)
+
+ # removed retain_grad and grad on decoder_hidden_states, as queries don't require grad
+ def test_retain_grad_hidden_states_attentions(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+ config.output_hidden_states = True
+ config.output_attentions = True
+
+ # no need to test all models as different heads yield the same functionality
+ model_class = self.all_model_classes[0]
+ model = model_class(config)
+ model.to(torch_device)
+
+ inputs = self._prepare_for_class(inputs_dict, model_class)
+
+ outputs = model(**inputs)
+
+ output = outputs[0]
+
+ encoder_hidden_states = outputs.encoder_hidden_states[0]
+ encoder_attentions = outputs.encoder_attentions[0][0]
+ encoder_hidden_states.retain_grad()
+ encoder_attentions.retain_grad()
+
+ cross_attentions = outputs.decoder_attentions[-1][0]
+ cross_attentions.retain_grad()
+
+ output.flatten()[0].backward(retain_graph=True)
+
+ self.assertIsNotNone(encoder_hidden_states.grad)
+ self.assertIsNotNone(encoder_attentions.grad)
+ self.assertIsNotNone(cross_attentions.grad)
+
+ def test_initialization(self):
+ config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common()
+
+ configs_no_init = _config_zero_init(config)
+ for model_class in self.all_model_classes:
+ model = model_class(config=configs_no_init)
+ for name, param in model.named_parameters():
+ if param.requires_grad:
+ if (
+ "embeddings" in name
+ or ".fc" in name
+ or "decoder.channel_projection_layers" in name
+ or "query_position_head" in name
+ or "decoder.encoder_vision_features" in name
+ ):
+ continue
+ self.assertIn(
+ ((param.data.mean() * 1e9).round() / 1e9).item(),
+ [0.0, 1.0],
+ msg=f"Parameter {name} seems not properly initialized",
+ )
+
+
+# We will verify our results on an image of cute cats
+def prepare_img():
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+ return image
+
+
+def prepare_text():
+ classes = ["cat", "remote"]
+ task = "Detect {}.".format(", ".join(classes))
+ return classes, task
+
+
+def prepare_img_batched():
+ url1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ url2 = "http://images.cocodataset.org/train2017/000000257813.jpg"
+ url3 = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
+
+ return [Image.open(BytesIO(requests.get(url).content)).convert("RGB") for url in [url1, url2, url3]]
+
+
+def prepare_text_batched():
+ classes1 = ["cat", "remote"]
+ classes2 = ["boat"]
+ classes3 = ["statue", "trees", "torch"]
+
+ task1 = "Detect {}.".format(", ".join(classes1))
+ task2 = "Detect all the boat in the image."
+ task3 = "Focus on the foreground, detect statue, torch and trees."
+ return [classes1, classes2, classes3], [task1, task2, task3]
+
+
+@require_timm
+@require_vision
+@slow
+class OmDetTurboModelIntegrationTests(unittest.TestCase):
+ @cached_property
+ def default_processor(self):
+ return AutoProcessor.from_pretrained("omlab/omdet-turbo-swin-tiny-hf") if is_vision_available() else None
+
+ def test_inference_object_detection_head(self):
+ model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf").to(torch_device)
+
+ processor = self.default_processor
+ image = prepare_img()
+ classes, task = prepare_text()
+ encoding = processor(images=image, text=classes, task=task, return_tensors="pt").to(torch_device)
+
+ with torch.no_grad():
+ outputs = model(**encoding)
+
+ expected_shape_coord_logits = torch.Size((1, model.config.num_queries, 4))
+ expected_shape_class_logits = torch.Size((1, model.config.num_queries, 2))
+ self.assertEqual(outputs.decoder_coord_logits.shape, expected_shape_coord_logits)
+ self.assertEqual(outputs.decoder_class_logits.shape, expected_shape_class_logits)
+
+ expected_class_logits = torch.tensor([[[0.9427, -2.5958], [0.2105, -3.4569], [-2.6364, -4.1610]]]).to(
+ torch_device
+ )
+ expected_coord_logits = torch.tensor(
+ [[[0.2550, 0.5501, 0.4738, 0.8745], [0.7695, 0.4121, 0.4603, 0.7244], [0.7691, 0.4117, 0.4603, 0.7214]]]
+ ).to(torch_device)
+
+ self.assertTrue(torch.allclose(outputs.decoder_class_logits[:3, :3], expected_class_logits, atol=1e-1))
+ self.assertTrue(torch.allclose(outputs.decoder_coord_logits[:3, :3], expected_coord_logits, atol=1e-3))
+
+ # verify grounded postprocessing
+ results = processor.post_process_grounded_object_detection(
+ outputs, classes=[classes], target_sizes=[image.size[::-1]]
+ )[0]
+ expected_scores = torch.tensor([0.7675, 0.7196, 0.5634, 0.5524]).to(torch_device)
+ expected_slice_boxes = torch.tensor([39.8870, 70.3522, 176.7424, 118.0354]).to(torch_device)
+
+ self.assertEqual(len(results["scores"]), 4)
+ self.assertTrue(torch.allclose(results["scores"], expected_scores, atol=1e-2))
+ self.assertTrue(torch.allclose(results["boxes"][0, :], expected_slice_boxes, atol=1e-2))
+
+ expected_classes = ["remote", "cat", "remote", "cat"]
+ self.assertListEqual(results["classes"], expected_classes)
+
+ def test_inference_object_detection_head_fp16(self):
+ model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf").to(
+ torch_device, dtype=torch.float16
+ )
+
+ processor = self.default_processor
+ image = prepare_img()
+ classes, task = prepare_text()
+ encoding = processor(images=image, text=classes, task=task, return_tensors="pt").to(
+ torch_device, dtype=torch.float16
+ )
+
+ with torch.no_grad():
+ outputs = model(**encoding)
+
+ expected_shape_coord_logits = torch.Size((1, model.config.num_queries, 4))
+ expected_shape_class_logits = torch.Size((1, model.config.num_queries, 2))
+ self.assertEqual(outputs.decoder_coord_logits.shape, expected_shape_coord_logits)
+ self.assertEqual(outputs.decoder_class_logits.shape, expected_shape_class_logits)
+
+ expected_class_logits = torch.tensor([[[0.9427, -2.5958], [0.2105, -3.4569], [-2.6364, -4.1610]]]).to(
+ torch_device, dtype=torch.float16
+ )
+ expected_coord_logits = torch.tensor(
+ [[[0.2550, 0.5501, 0.4738, 0.8745], [0.7695, 0.4121, 0.4603, 0.7244], [0.7691, 0.4117, 0.4603, 0.7214]]]
+ ).to(torch_device, dtype=torch.float16)
+
+ self.assertTrue(torch.allclose(outputs.decoder_class_logits[:3, :3], expected_class_logits, atol=1e-1))
+ self.assertTrue(torch.allclose(outputs.decoder_coord_logits[:3, :3], expected_coord_logits, atol=1e-3))
+
+ # verify grounded postprocessing
+ results = processor.post_process_grounded_object_detection(
+ outputs, classes=[classes], target_sizes=[image.size[::-1]]
+ )[0]
+ expected_scores = torch.tensor([0.7675, 0.7196, 0.5634, 0.5524]).to(torch_device, dtype=torch.float16)
+ expected_slice_boxes = torch.tensor([39.8870, 70.3522, 176.7424, 118.0354]).to(
+ torch_device, dtype=torch.float16
+ )
+
+ self.assertEqual(len(results["scores"]), 4)
+ self.assertTrue(torch.allclose(results["scores"], expected_scores, atol=1e-2))
+ self.assertTrue(torch.allclose(results["boxes"][0, :], expected_slice_boxes, atol=1e-1))
+
+ expected_classes = ["remote", "cat", "remote", "cat"]
+ self.assertListEqual(results["classes"], expected_classes)
+
+ def test_inference_object_detection_head_no_task(self):
+ model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf").to(torch_device)
+
+ processor = self.default_processor
+ image = prepare_img()
+ classes, _ = prepare_text()
+ encoding = processor(images=image, text=classes, return_tensors="pt").to(torch_device)
+
+ with torch.no_grad():
+ outputs = model(**encoding)
+
+ expected_shape_coord_logits = torch.Size((1, model.config.num_queries, 4))
+ expected_shape_class_logits = torch.Size((1, model.config.num_queries, 2))
+ self.assertEqual(outputs.decoder_coord_logits.shape, expected_shape_coord_logits)
+ self.assertEqual(outputs.decoder_class_logits.shape, expected_shape_class_logits)
+
+ expected_class_logits = torch.tensor([[[0.9427, -2.5958], [0.2105, -3.4569], [-2.6364, -4.1610]]]).to(
+ torch_device
+ )
+ expected_coord_logits = torch.tensor(
+ [[[0.2550, 0.5501, 0.4738, 0.8745], [0.7695, 0.4121, 0.4603, 0.7244], [0.7691, 0.4117, 0.4603, 0.7214]]]
+ ).to(torch_device)
+
+ self.assertTrue(torch.allclose(outputs.decoder_class_logits[:3, :3], expected_class_logits, atol=1e-1))
+ self.assertTrue(torch.allclose(outputs.decoder_coord_logits[:3, :3], expected_coord_logits, atol=1e-3))
+
+ # verify grounded postprocessing
+ results = processor.post_process_grounded_object_detection(
+ outputs, classes=[classes], target_sizes=[image.size[::-1]]
+ )[0]
+ expected_scores = torch.tensor([0.7675, 0.7196, 0.5634, 0.5524]).to(torch_device)
+ expected_slice_boxes = torch.tensor([39.8870, 70.3522, 176.7424, 118.0354]).to(torch_device)
+
+ self.assertEqual(len(results["scores"]), 4)
+ self.assertTrue(torch.allclose(results["scores"], expected_scores, atol=1e-2))
+ self.assertTrue(torch.allclose(results["boxes"][0, :], expected_slice_boxes, atol=1e-2))
+
+ expected_classes = ["remote", "cat", "remote", "cat"]
+ self.assertListEqual(results["classes"], expected_classes)
+
+ def test_inference_object_detection_head_batched(self):
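+ # run this batched test on CPU regardless of the globally selected device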
+ torch_device = "cpu"
+ model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf").to(torch_device)
+
+ processor = self.default_processor
+ images_batched = prepare_img_batched()
+ classes_batched, tasks_batched = prepare_text_batched()
+ encoding = processor(images=images_batched, text=classes_batched, task=tasks_batched, return_tensors="pt").to(
+ torch_device
+ )
+
+ with torch.no_grad():
+ outputs = model(**encoding)
+
+ expected_shape_coord_logits = torch.Size((len(images_batched), model.config.num_queries, 4))
+ expected_shape_class_logits = torch.Size((len(images_batched), model.config.num_queries, 3))
+ self.assertEqual(outputs.decoder_coord_logits.shape, expected_shape_coord_logits)
+ self.assertEqual(outputs.decoder_class_logits.shape, expected_shape_class_logits)
+
+ expected_class_logits = torch.tensor(
+ [[[0.9427, -2.5958, -7.7601]], [[-2.3408, -9.3516, -9.3516]], [[1.0740, -2.3315, -1.1885]]]
+ ).to(torch_device)
+
+ expected_coord_logits = torch.tensor(
+ [[[0.2550, 0.5501, 0.4738]], [[0.2535, 0.6006, 0.0353]], [[0.3742, 0.3337, 0.0666]]]
+ ).to(torch_device)
+
+ self.assertTrue(torch.allclose(outputs.decoder_class_logits[:, :1, :3], expected_class_logits, atol=1e-1))
+ self.assertTrue(torch.allclose(outputs.decoder_coord_logits[:, :1, :3], expected_coord_logits, atol=1e-3))
+
+ # verify grounded postprocessing
+ results = processor.post_process_grounded_object_detection(
+ outputs,
+ classes=classes_batched,
+ target_sizes=[image.size[::-1] for image in images_batched],
+ score_threshold=0.2,
+ )
+ expected_scores = torch.tensor([0.7675, 0.3016, 0.7454]).to(torch_device)
+ expected_slice_boxes = torch.tensor(
+ [
+ [39.8870, 70.3522, 176.7424, 118.0354],
+ [146.5446, 219.7132, 209.6983, 251.0456],
+ [545.3470, 209.9055, 651.9860, 502.1882],
+ ]
+ ).to(torch_device)
+
+ self.assertListEqual([len(result["scores"]) for result in results], [4, 4, 6])
+ self.assertTrue(
+ torch.allclose(torch.stack([result["scores"][0] for result in results]), expected_scores, atol=1e-2)
+ )
+ self.assertTrue(
+ torch.allclose(torch.stack([result["boxes"][0, :] for result in results]), expected_slice_boxes, atol=1e-2)
+ )
+
+ expected_classes = [
+ ["remote", "cat", "remote", "cat"],
+ ["boat", "boat", "boat", "boat"],
+ ["statue", "trees", "trees", "torch", "statue", "statue"],
+ ]
+ self.assertListEqual([result["classes"] for result in results], expected_classes)
+
+ @require_torch_gpu
+ def test_inference_object_detection_head_equivalence_cpu_gpu(self):
+ processor = self.default_processor
+ image = prepare_img()
+ classes, task = prepare_text()
+ encoding = processor(images=image, text=classes, task=task, return_tensors="pt")
+ # 1. run model on CPU
+ model = OmDetTurboForObjectDetection.from_pretrained("omlab/omdet-turbo-swin-tiny-hf")
+
+ with torch.no_grad():
+ cpu_outputs = model(**encoding)
+
+ # 2. run model on GPU
+ model.to("cuda")
+ encoding = encoding.to("cuda")
+ with torch.no_grad():
+ gpu_outputs = model(**encoding)
+
+ # 3. assert equivalence
+ expected_class_logits = torch.tensor([[[0.9427, -2.5958], [0.2105, -3.4569], [-2.6364, -4.1610]]])
+ expected_coord_logits = torch.tensor(
+ [[[0.2550, 0.5501, 0.4738, 0.8745], [0.7695, 0.4121, 0.4603, 0.7244], [0.7691, 0.4117, 0.4603, 0.7214]]]
+ )
+
+ self.assertTrue(torch.allclose(cpu_outputs.decoder_class_logits[:3, :3], expected_class_logits, atol=1e-1))
+ self.assertTrue(torch.allclose(cpu_outputs.decoder_coord_logits[:3, :3], expected_coord_logits, atol=1e-3))
+
+ # verify grounded postprocessing
+ results_cpu = processor.post_process_grounded_object_detection(
+ cpu_outputs, classes=[classes], target_sizes=[image.size[::-1]]
+ )[0]
+ result_gpu = processor.post_process_grounded_object_detection(
+ gpu_outputs, classes=[classes], target_sizes=[image.size[::-1]]
+ )[0]
+
+ self.assertTrue(torch.allclose(results_cpu["scores"], result_gpu["scores"].cpu(), atol=1e-2))
+ self.assertTrue(torch.allclose(results_cpu["boxes"][0, :], result_gpu["boxes"][0, :].cpu(), atol=1e-2))
diff --git a/tests/models/omdet_turbo/test_processor_omdet_turbo.py b/tests/models/omdet_turbo/test_processor_omdet_turbo.py
new file mode 100644
index 000000000000..e6e2a1f50c52
--- /dev/null
+++ b/tests/models/omdet_turbo/test_processor_omdet_turbo.py
@@ -0,0 +1,363 @@
+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+import shutil
+import tempfile
+import unittest
+
+import numpy as np
+import pytest
+
+from transformers import AutoProcessor, CLIPTokenizerFast, OmDetTurboProcessor
+from transformers.testing_utils import require_torch, require_vision
+from transformers.utils import is_torch_available, is_vision_available
+
+from ...test_processing_common import ProcessorTesterMixin
+
+
+IMAGE_MEAN = [123.675, 116.28, 103.53]
+IMAGE_STD = [58.395, 57.12, 57.375]
+
+if is_torch_available():
+ import torch
+
+ from transformers.models.omdet_turbo.modeling_omdet_turbo import OmDetTurboObjectDetectionOutput
+
+if is_vision_available():
+ from PIL import Image
+
+ from transformers import DetrImageProcessor
+
+
+@require_torch
+@require_vision
+class OmDetTurboProcessorTest(ProcessorTesterMixin, unittest.TestCase):
+ processor_class = OmDetTurboProcessor
+
+ def setUp(self):
+ self.tmpdirname = tempfile.mkdtemp()
+ image_processor = DetrImageProcessor()
+ tokenizer = CLIPTokenizerFast.from_pretrained("openai/clip-vit-base-patch32")
+
+ processor = OmDetTurboProcessor(image_processor, tokenizer)
+ processor.save_pretrained(self.tmpdirname)
+
+ self.input_keys = [
+ "tasks_input_ids",
+ "tasks_attention_mask",
+ "classes_input_ids",
+ "classes_attention_mask",
+ "classes_structure",
+ "pixel_values",
+ "pixel_mask",
+ ]
+
+ self.batch_size = 5
+ self.num_queries = 5
+ self.embed_dim = 3
+
+ def get_tokenizer(self, **kwargs):
+ return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).tokenizer
+
+ def get_image_processor(self, **kwargs):
+ return AutoProcessor.from_pretrained(self.tmpdirname, **kwargs).image_processor
+
+ def tearDown(self):
+ shutil.rmtree(self.tmpdirname)
+
+ def prepare_image_inputs(self):
+ """This function prepares a list of PIL images, or a list of numpy arrays if one specifies numpify=True,
+ or a list of PyTorch tensors if one specifies torchify=True.
+ """
+
+ image_inputs = [np.random.randint(255, size=(3, 30, 400), dtype=np.uint8)]
+
+ image_inputs = [Image.fromarray(np.moveaxis(x, 0, -1)) for x in image_inputs]
+
+ return image_inputs
+
+ def get_fake_omdet_turbo_output(self):
+ torch.manual_seed(42)
+ return OmDetTurboObjectDetectionOutput(
+ decoder_coord_logits=torch.rand(self.batch_size, self.num_queries, 4),
+ decoder_class_logits=torch.rand(self.batch_size, self.num_queries, self.embed_dim),
+ )
+
+ def get_fake_omdet_turbo_classes(self):
+ return [[f"class{i}_{j}" for i in range(self.num_queries)] for j in range(self.batch_size)]
+
+ def test_post_process_grounded_object_detection(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ omdet_turbo_output = self.get_fake_omdet_turbo_output()
+ omdet_turbo_classes = self.get_fake_omdet_turbo_classes()
+
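+ # Post-processing converts the raw logits into per-image dicts of boxes (rescaled to target_sizes), scores and predicted classes.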
+ post_processed = processor.post_process_grounded_object_detection(
+ omdet_turbo_output, omdet_turbo_classes, target_sizes=[(400, 30) for _ in range(self.batch_size)]
+ )
+
+ self.assertEqual(len(post_processed), self.batch_size)
+ self.assertEqual(list(post_processed[0].keys()), ["boxes", "scores", "classes"])
+ self.assertEqual(post_processed[0]["boxes"].shape, (self.num_queries, 4))
+ self.assertEqual(post_processed[0]["scores"].shape, (self.num_queries,))
+ expected_scores = torch.tensor([0.7310, 0.6579, 0.6513, 0.6444, 0.6252])
+ self.assertTrue(torch.allclose(post_processed[0]["scores"], expected_scores, atol=1e-4))
+
+ expected_box_slice = torch.tensor([14.9657, 141.2052, 30.0000, 312.9670])
+ self.assertTrue(torch.allclose(post_processed[0]["boxes"][0], expected_box_slice, atol=1e-4))
+
+ def test_save_load_pretrained_additional_features(self):
+ processor = OmDetTurboProcessor(tokenizer=self.get_tokenizer(), image_processor=self.get_image_processor())
+ processor.save_pretrained(self.tmpdirname)
+
+ tokenizer_add_kwargs = self.get_tokenizer(bos_token="(BOS)", eos_token="(EOS)")
+ image_processor_add_kwargs = self.get_image_processor(do_normalize=False, padding_value=1.0)
+
+ processor = OmDetTurboProcessor.from_pretrained(
+ self.tmpdirname, bos_token="(BOS)", eos_token="(EOS)", do_normalize=False, padding_value=1.0
+ )
+
+ self.assertEqual(processor.tokenizer.get_vocab(), tokenizer_add_kwargs.get_vocab())
+ self.assertIsInstance(processor.tokenizer, CLIPTokenizerFast)
+
+ self.assertEqual(processor.image_processor.to_json_string(), image_processor_add_kwargs.to_json_string())
+ self.assertIsInstance(processor.image_processor, DetrImageProcessor)
+
+ def test_image_processor(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor).image_processor
+
+ image_input = self.prepare_image_inputs()
+
+ input_image_proc = image_processor(image_input, return_tensors="np")
+ input_processor = processor(images=image_input, return_tensors="np")
+
+ for key in input_image_proc.keys():
+ self.assertAlmostEqual(input_image_proc[key].sum(), input_processor[key].sum(), delta=1e-2)
+
+ def test_tokenizer(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor).tokenizer
+
+ input_str = "lower newer"
+
+ encoded_processor = processor(text=input_str, padding="max_length", truncation=True, max_length=77)
+
+ encoded_tok = tokenizer(input_str, padding="max_length", truncation=True, max_length=77)
+
+ for key in encoded_tok.keys():
+ self.assertListEqual(encoded_tok[key], encoded_processor[key])
+
+ def test_processor(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ input_tasks = "task"
+ input_classes = ["class1", "class2"]
+ image_input = self.prepare_image_inputs()
+
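+ # The candidate classes go in as `text` and the grounding prompt as `task`, alongside the images.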
+ input_processor = processor(images=image_input, text=input_classes, task=input_tasks, return_tensors="pt")
+
+ for key in self.input_keys:
+ assert torch.is_tensor(input_processor[key])
+ # Verify that a ValueError is raised when no input is passed.
+ with pytest.raises(ValueError):
+ processor()
+
+ def test_tokenizer_decode(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ predicted_ids = [[1, 4, 5, 8, 1, 0, 8], [3, 4, 3, 1, 1, 8, 9]]
+
+ decoded_processor = processor.batch_decode(predicted_ids)
+ decoded_tok = tokenizer.batch_decode(predicted_ids)
+
+ self.assertListEqual(decoded_tok, decoded_processor)
+
+ def test_model_input_names(self):
+ image_processor = self.get_image_processor()
+ tokenizer = self.get_tokenizer()
+
+ processor = OmDetTurboProcessor(tokenizer=tokenizer, image_processor=image_processor)
+
+ input_tasks = "task"
+ input_classes = ["class1", "class2"]
+ image_input = self.prepare_image_inputs()
+ inputs = processor(images=image_input, text=input_classes, task=input_tasks, return_tensors="pt")
+
+ self.assertListEqual(list(inputs.keys()), self.input_keys)
+
+ @require_vision
+ @require_torch
+ def test_tokenizer_defaults_preserved_by_kwargs(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer", max_length=117)
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
+ inputs = processor(images=image_input, text=[input_str], task=input_str, return_tensors="pt")
+
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 117)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 117)
+
+ @require_vision
+ @require_torch
+ def test_kwargs_overrides_default_tokenizer_kwargs(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer", max_length=117)
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
+ inputs = processor(images=image_input, text=[input_str], task=input_str, return_tensors="pt", max_length=112)
+
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 112)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 112)
+
+ @require_torch
+ @require_vision
+ def test_unstructured_kwargs(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer")
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
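+ # Flat kwargs should be dispatched to the right component: `size` to the image processor, `padding`/`max_length` to the tokenizer.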
+ inputs = processor(
+ images=image_input,
+ text=[input_str],
+ task=input_str,
+ return_tensors="pt",
+ size={"height": 214, "width": 214},
+ padding="max_length",
+ max_length=76,
+ )
+
+ self.assertEqual(inputs["pixel_values"].shape[2], 214)
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 76)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 76)
+
+ @require_torch
+ @require_vision
+ def test_unstructured_kwargs_batched(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer")
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+
+ input_str = ["lower newer", "upper older longer string"]
+ image_input = self.prepare_image_inputs() * 2
+ inputs = processor(
+ images=image_input,
+ text=[input_str],
+ task=input_str,
+ return_tensors="pt",
+ size={"height": 214, "width": 214},
+ padding="longest",
+ max_length=76,
+ )
+
+ self.assertEqual(inputs["pixel_values"].shape[2], 214)
+
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 6)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 6)
+
+ @require_torch
+ @require_vision
+ def test_structured_kwargs_nested(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer")
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
+
+ # Define the kwargs for each modality
+ all_kwargs = {
+ "common_kwargs": {"return_tensors": "pt"},
+ "images_kwargs": {"size": {"height": 214, "width": 214}},
+ "text_kwargs": {"padding": "max_length", "max_length": 76, "task": input_str},
+ }
+
+ inputs = processor(images=image_input, text=[input_str], **all_kwargs)
+ self.skip_processor_without_typed_kwargs(processor)
+
+ self.assertEqual(inputs["pixel_values"].shape[2], 214)
+
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 76)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 76)
+
+ @require_torch
+ @require_vision
+ def test_structured_kwargs_nested_from_dict(self):
+ # Overridden because the OmDet-Turbo processor returns separate "tasks_input_ids" and "classes_input_ids" rather than a single "input_ids".
+ if "image_processor" not in self.processor_class.attributes:
+ self.skipTest(f"image_processor attribute not present in {self.processor_class}")
+
+ image_processor = self.get_component("image_processor")
+ tokenizer = self.get_component("tokenizer")
+
+ processor = self.processor_class(tokenizer=tokenizer, image_processor=image_processor)
+ self.skip_processor_without_typed_kwargs(processor)
+ input_str = "lower newer"
+ image_input = self.prepare_image_inputs()
+
+ # Define the kwargs for each modality
+ all_kwargs = {
+ "common_kwargs": {"return_tensors": "pt"},
+ "images_kwargs": {"size": {"height": 214, "width": 214}},
+ "text_kwargs": {"padding": "max_length", "max_length": 76, "task": input_str},
+ }
+
+ inputs = processor(images=image_input, text=[input_str], **all_kwargs)
+ self.assertEqual(inputs["pixel_values"].shape[2], 214)
+
+ self.assertEqual(len(inputs["tasks_input_ids"][0]), 76)
+ self.assertEqual(len(inputs["classes_input_ids"][0]), 76)
diff --git a/utils/check_table.py b/utils/check_table.py
index 02541e87ddba..587681844955 100644
--- a/utils/check_table.py
+++ b/utils/check_table.py
@@ -173,7 +173,13 @@ def _center_text(text: str, width: int) -> str:
"XLS-R": "Wav2Vec2",
"XLSR-Wav2Vec2": "Wav2Vec2",
}
-MODEL_NAMES_TO_IGNORE = ["CLIPVisionModel", "SiglipVisionModel", "ChineseCLIPVisionModel", "Qwen2AudioEncoder"]
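+# Model classes to skip in the auto-generated model table (e.g. inner text/vision/audio components of composite models).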
+MODEL_NAMES_TO_IGNORE = [
+ "ChineseCLIPVisionModel",
+ "CLIPTextModel",
+ "CLIPVisionModel",
+ "Qwen2AudioEncoder",
+ "SiglipVisionModel",
+]
def get_model_table_from_auto_modules() -> str: