Uniformize zero-shot object detection postprocessing methods #34926

qubvel · 2024-11-25T18:37:14Z

Uniformizing Zero-Shot Object Detection post-processing

Introduction

Currently, we have four zero-shot object detection models in the Transformers library:

OwlVit
OwlV2
Grounding Dino
OmDet Turbo

Each model uses slightly different postprocessing arguments and produces different output formats, which complicates user experience and makes it harder to use them in pipelines.

To address these inconsistencies, I propose a unified postprocessing interface for all four models. This will enhance usability, reduce confusion, and enable seamless integration with existing pipelines.

Comparison of Postprocessing Methods

Below is a comparison of the current postprocessing methods and their arguments:

Model	Postprocessing Method	Key Arguments
OwlVit / OwlV2	post_process_object_detection	`outputs`, `threshold`, `target_sizes`
Grounding Dino	post_process_grounded_object_detection	`outputs`, `input_ids`, `box_threshold`, `text_threshold`, `target_sizes`
OmDet Turbo	post_process_grounded_object_detection	`outputs`, `classes`, `score_threshold`, `nms_threshold`, `target_sizes`, `max_num_det`

Suggested Changes to Arguments

To standardize postprocessing across all models, the following suggestions are proposed:

Standardize Method Naming:

Use a single method, post_process_grounded_object_detection, for text-guided object detection across all models. For backward compatibility, retain the existing methods (e.g., OwlVit/OwlV2’s post_process_object_detection) through a deprecation cycle.

Unify Required Arguments:

Make outputs the only required argument.
- For Grounding Dino, pass input_ids inside the outputs parameter.
- For OmDet Turbo, make classes optional to provide additional flexibility.

Rename Threshold Parameters:

Standardize parameter names (score_threshold and box_threshold) to a single name: threshold. These parameters perform the same function (filtering detections by confidence score), so a uniform name reduces confusion.
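One way such a rename could be handled during the deprecation cycle is sketched below. The helper `resolve_threshold` and its `default` argument are hypothetical names introduced for illustration; they are not part of the Transformers API, and the actual default value is left to the caller because the proposal does not fix one.

```python
import warnings
from typing import Optional

# Old argument names that the proposal folds into `threshold`.
LEGACY_NAMES = ("box_threshold", "score_threshold")


def resolve_threshold(threshold: Optional[float] = None, *, default: float, **legacy) -> float:
    """Return the effective threshold, accepting legacy names with a warning.

    Hypothetical helper: `default` is supplied by the caller (model-specific).
    """
    given = {name: value for name, value in legacy.items() if value is not None}
    unknown = set(given) - set(LEGACY_NAMES)
    if unknown:
        raise TypeError(f"Unexpected arguments: {sorted(unknown)}")
    if len(given) > 1 or (given and threshold is not None):
        raise ValueError("Pass only one of `threshold`, `box_threshold`, `score_threshold`.")
    for name, value in given.items():
        warnings.warn(f"`{name}` is deprecated, use `threshold` instead.", FutureWarning)
        threshold = value
    return default if threshold is None else threshold
```

With this shape, existing calls that pass `box_threshold=0.4` keep working and emit a `FutureWarning`, while new code passes `threshold=0.4` directly.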

Add text_labels Argument:

Introduce an optional text_labels parameter to map detected labels (integer IDs) to their corresponding text names.
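The mapping itself can be illustrated with plain Python lists. This is a minimal sketch that assumes each integer label indexes into the list of candidate texts supplied for that image; `attach_text_labels` is a hypothetical helper, and the real methods operate on tensors rather than lists.

```python
from typing import Dict, List


def attach_text_labels(result: Dict[str, list], text_labels: List[str]) -> Dict[str, list]:
    """Return a copy of `result` with `text_labels` looked up by integer ID."""
    named = dict(result)  # shallow copy so the input dict is left untouched
    named["text_labels"] = [text_labels[i] for i in result["labels"]]
    return named


# Made-up detections for one image: label 1 -> second text, label 0 -> first text.
result = {
    "scores": [0.91, 0.47],
    "labels": [1, 0],
    "boxes": [[10, 20, 110, 220], [5, 5, 50, 60]],
}
print(attach_text_labels(result, ["a cat", "a remote control"])["text_labels"])
# prints ['a remote control', 'a cat']
```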

Final Unified Method Signature

The new method would look like this:

def post_process_grounded_object_detection(
    self,
    outputs,
    threshold: float = ...,
    target_sizes: Optional[Union[TensorType, List[Tuple]]] = None,
    text_labels: Optional[Union[List[str], List[List[str]]]] = None,
    # <additional model-specific params>
):
    ...

Postprocessing Outputs

Current postprocessing outputs

Model	Current Output Format
OwlVit / OwlV2	`{"scores": score, "labels": label, "boxes": box}` (labels are integer class IDs)
Grounding Dino	`{"scores": score, "labels": label, "boxes": box}` (labels are text names of detected objects, decoded from `input_ids`)
OmDet Turbo	`{"scores": score, "classes": class, "boxes": box}` (classes are text names of detected objects)

Suggested unified output format

The output format will be standardized to:

{
    "scores": score,
    "labels": label,        # Integer class IDs
    "boxes": box,           # Detected bounding boxes
    "text_labels": text     # Optional: text labels 
}
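To show what a shared format buys downstream code, here is a rough sketch of a consumer that does not need to know which model produced the result. `describe_detections` is a hypothetical function, and plain lists stand in for the tensors the real methods return.

```python
from typing import Any, Dict, List


def describe_detections(result: Dict[str, Any]) -> List[str]:
    """Format one image's detections, preferring text labels when present."""
    names = result.get("text_labels")
    if names is None:
        # `text_labels` is optional, so fall back to the integer class IDs.
        names = [str(label) for label in result["labels"]]
    lines = []
    for score, name, box in zip(result["scores"], names, result["boxes"]):
        coords = ", ".join(f"{value:.0f}" for value in box)
        lines.append(f"{name}: {score:.2f} at [{coords}]")
    return lines
```

The same function would work for any of the four models once they all return `scores`, `labels`, `boxes`, and optionally `text_labels`.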

Detailed Model Changes

OwlVit / OwlV2

Current:

{"scores": score, "labels": label, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": label,
    "boxes": box,
    "text_labels": text
}

Grounding Dino

Current:

{"scores": score, "labels": label, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": text,  # Will be set to `None` with deprecation cycle
    "boxes": box,
    "text_labels": text
}

OmDet Turbo

Current:

{"scores": score, "classes": class, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": label,         # Add integer labels
    "boxes": box,
    "text_labels": text,     # Copy of current `classes`
    "classes": text         # Retain temporarily, remove with deprecation cycle
}
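One possible way to keep `classes` available while steering users toward `text_labels` is a dict that warns on access to the old key. The class below is a hypothetical sketch written for this discussion, not a description of how the library implements the deprecation.

```python
import warnings


class DeprecatedKeyDict(dict):
    """Dict that emits a FutureWarning when a deprecated key is read with [].

    Hypothetical helper. Note that `dict.get` bypasses `__getitem__`,
    so only subscript access triggers the warning in this sketch.
    """

    def __init__(self, *args, deprecated: dict, **kwargs):
        super().__init__(*args, **kwargs)
        self._deprecated = deprecated  # maps old key -> replacement key

    def __getitem__(self, key):
        if key in self._deprecated:
            new_key = self._deprecated[key]
            warnings.warn(
                f"`{key}` is deprecated, use `{new_key}` instead.",
                FutureWarning,
                stacklevel=2,
            )
        return super().__getitem__(key)


# Made-up result in the proposed OmDet Turbo format.
result = DeprecatedKeyDict(
    {"scores": [0.8], "labels": [0], "boxes": [[0, 0, 10, 10]],
     "text_labels": ["cat"], "classes": ["cat"]},
    deprecated={"classes": "text_labels"},
)
```

Reading `result["classes"]` still returns the text names but warns, while `result["text_labels"]` is silent.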

Feel free to provide feedback on the suggested changes!

Motivation

This will enhance usability, reduce confusion, and enable integration with existing zero-shot object detection pipelines.

Your contribution

I will work on this and already have draft PRs.


daniel-bogdoll · 2025-02-19T15:57:12Z

Thanks @qubvel for all that work! One of my projects actually uses all 4 architectures and my code just got so much cleaner <3

qubvel · 2025-02-19T17:04:58Z

Thanks for the feedback 🤗

qubvel added the Feature request, Vision, and Processing labels on Nov 25, 2024

This was referenced Nov 25, 2024

OwlViT/Owlv2 post processing standardization #34929

Merged

Grounding DINO Processor standardization #34853

Merged

OmDet Turbo processor standardization #34937

Merged

qubvel closed this as completed Feb 19, 2025
