Uniformize zero-shot object detection postprocessing methods #34926

Closed
qubvel opened this issue Nov 25, 2024 · 2 comments

@qubvel
Member

qubvel commented Nov 25, 2024

Uniformizing Zero-Shot Object Detection post-processing

Introduction

Currently, we have four zero-shot object detection models in the Transformers library:

  • OwlVit
  • OwlV2
  • Grounding Dino
  • OmDet Turbo

Each model uses slightly different postprocessing arguments and produces a different output format, which complicates the user experience and makes the models harder to use in pipelines.

To address these inconsistencies, this issue proposes a unified postprocessing interface for all four models. This will enhance usability, reduce confusion, and enable seamless integration with existing pipelines.

Comparison of Postprocessing Methods

Below is a comparison of the current postprocessing methods and their arguments:

| Model | Postprocessing Method | Key Arguments |
|---|---|---|
| OwlVit / OwlV2 | post_process_object_detection | outputs, threshold, target_sizes |
| Grounding Dino | post_process_grounded_object_detection | outputs, input_ids, box_threshold, text_threshold, target_sizes |
| OmDet Turbo | post_process_grounded_object_detection | outputs, classes, score_threshold, nms_threshold, target_sizes, max_num_det |

Suggested Changes to Arguments

To standardize postprocessing across all models, the following suggestions are proposed:

  1. Standardize Method Naming:

Use a single method, post_process_grounded_object_detection, for text-guided object detection across all models. For backward compatibility, retain the existing methods (e.g., OwlVit/OwlV2’s post_process_object_detection) through a deprecation cycle.

  2. Unify Required Arguments:

Make outputs the only required argument.
- For Grounding Dino, pass input_ids inside the outputs parameter.
- For OmDet Turbo, make classes optional to provide additional flexibility.

  3. Rename Threshold Parameters:

Standardize parameter names (score_threshold and box_threshold) to a single name: threshold. These parameters perform the same function (filtering detections by confidence score), so a uniform name reduces confusion.

  4. Add text_labels Argument:

Introduce an optional text_labels parameter to map detected labels (integer IDs) to their corresponding text names.
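
A minimal sketch of how the threshold rename in item 3 could be handled during a deprecation cycle. The helper name resolve_threshold, the DEFAULT_THRESHOLD value, and the choice of FutureWarning are assumptions made for illustration; none of them are taken from the Transformers codebase.

```python
import warnings
from typing import Optional

DEFAULT_THRESHOLD = 0.3  # hypothetical default, not taken from any model


def resolve_threshold(
    threshold: Optional[float] = None,
    box_threshold: Optional[float] = None,
    score_threshold: Optional[float] = None,
) -> float:
    """Return the effective threshold, accepting legacy names with a warning."""
    legacy = {"box_threshold": box_threshold, "score_threshold": score_threshold}
    for name, value in legacy.items():
        if value is None:
            continue
        # Refuse ambiguous calls that set more than one threshold name.
        if threshold is not None:
            raise ValueError(f"Pass either `threshold` or `{name}`, not both.")
        warnings.warn(
            f"`{name}` is deprecated; use `threshold` instead.",
            FutureWarning,
            stacklevel=2,
        )
        threshold = value
    return DEFAULT_THRESHOLD if threshold is None else threshold
```

Callers that still pass box_threshold or score_threshold keep working but see a warning, while new code only needs to know about threshold.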

Final Unified Method Signature

The new method would look like this:

def post_process_grounded_object_detection(
    self,
    outputs,
    threshold: float = ...,  # default value is model-specific
    target_sizes: Optional[Union[TensorType, List[Tuple]]] = None,
    text_labels: Optional[Union[List[str], List[List[str]]]] = None,
    # <additional model-specific params>
):
    ...
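
To illustrate the intent of this signature, here is a toy, single-image sketch in plain Python (no tensors, no model). The function name post_process_sketch, the default threshold of 0.3, and the strict `>` comparison are assumptions for illustration only; the real methods operate on model outputs and may differ in all of these details.

```python
from typing import Dict, List, Optional, Sequence


def post_process_sketch(
    scores: Sequence[float],
    labels: Sequence[int],
    boxes: Sequence[Sequence[float]],
    threshold: float = 0.3,  # hypothetical default
    text_labels: Optional[List[str]] = None,
) -> Dict[str, Optional[list]]:
    """Filter detections by score and optionally attach text names."""
    # Indices of detections whose confidence exceeds the threshold.
    keep = [i for i, score in enumerate(scores) if score > threshold]
    result: Dict[str, Optional[list]] = {
        "scores": [scores[i] for i in keep],
        "labels": [labels[i] for i in keep],
        "boxes": [list(boxes[i]) for i in keep],
        "text_labels": None,
    }
    # Map integer class IDs to their text names when names are provided.
    if text_labels is not None:
        result["text_labels"] = [text_labels[labels[i]] for i in keep]
    return result
```

When text_labels is omitted, the "text_labels" entry stays None, matching the optional nature of that argument.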

Postprocessing Outputs

Current postprocessing outputs

| Model | Current Output Format |
|---|---|
| OwlVit / OwlV2 | {"scores": score, "labels": label, "boxes": box} (labels are integer class IDs) |
| Grounding Dino | {"scores": score, "labels": label, "boxes": box} (labels are text names of detected objects, decoded from input_ids) |
| OmDet Turbo | {"scores": score, "classes": class, "boxes": box} (classes are text names of detected objects) |

Suggested unified output format

The output format will be standardized to:

{
    "scores": score,
    "labels": label,        # Integer class IDs
    "boxes": box,           # Detected bounding boxes
    "text_labels": text     # Optional: text labels 
}
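
With a single output format, downstream code no longer needs a branch per model. The sketch below assumes the result is a plain dict with list values (real outputs would typically hold tensors), and describe_detections is a hypothetical helper name, not part of the library.

```python
from typing import Any, Dict, List


def describe_detections(result: Dict[str, Any]) -> List[str]:
    """Format detections from any model that follows the unified output."""
    # Prefer text names; fall back to integer class IDs when they are absent.
    names = result.get("text_labels") or [str(label) for label in result["labels"]]
    return [
        f"{name}: {score:.2f} at {box}"
        for name, score, box in zip(names, result["scores"], result["boxes"])
    ]


example = {
    "scores": [0.9],
    "labels": [0],
    "boxes": [[0, 0, 10, 10]],
    "text_labels": ["cat"],
}
print(describe_detections(example))  # ['cat: 0.90 at [0, 0, 10, 10]']
```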

Detailed Model Changes

OwlVit / OwlV2

Current:

{"scores": score, "labels": label, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": label,
    "boxes": box,
    "text_labels": text
}

Grounding Dino

Current:

{"scores": score, "labels": label, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": text,  # Will be set to `None` with deprecation cycle
    "boxes": box,
    "text_labels": text
}

OmDet Turbo

Current:

{"scores": score, "classes": class, "boxes": box}

Proposed:

{
    "scores": score,
    "labels": label,         # Add integer labels
    "boxes": box,
    "text_labels": text,     # Copy of current `classes`
    "classes": text         # Retain temporarily, remove with deprecation cycle
}
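
A rough sketch of how a current OmDet Turbo result could be converted to the proposed format. It assumes plain-list values and that the integer label is the index of the class name in the list of queried classes; that indexing rule and the helper name upgrade_omdet_result are assumptions, not something this issue specifies.

```python
from typing import Any, Dict, List


def upgrade_omdet_result(
    result: Dict[str, Any], queried_classes: List[str]
) -> Dict[str, Any]:
    """Convert a legacy {"scores", "classes", "boxes"} dict to the proposed keys."""
    # Assumed rule: integer label = position of the name in the queried classes.
    index = {name: i for i, name in enumerate(queried_classes)}
    text = list(result["classes"])
    return {
        "scores": result["scores"],
        "labels": [index[name] for name in text],
        "boxes": result["boxes"],
        "text_labels": text,
        "classes": text,  # kept temporarily for the deprecation cycle
    }
```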

Feel free to provide feedback on the suggested changes!

Motivation

This will enhance usability, reduce confusion, and enable integration with existing zero-shot object detection pipelines.

Your contribution

I will work on this and already have draft PRs.

@daniel-bogdoll
Contributor

Thanks @qubvel for all that work! One of my projects actually uses all 4 architectures and my code just got so much cleaner <3

@qubvel
Member Author

qubvel commented Feb 19, 2025

Thanks for the feedback 🤗

@qubvel qubvel closed this as completed Feb 19, 2025