Uniformizing Zero-Shot Object Detection post-processing
Introduction
Currently, we have four zero-shot object detection models in the Transformers library:
OwlVit
OwlV2
Grounding Dino
OmDet Turbo
Each model uses slightly different postprocessing arguments and produces a different output format, which complicates the user experience and makes the models harder to use in pipelines.
To address these inconsistencies, I propose a unified postprocessing interface for all four models. This will enhance usability, reduce confusion, and enable seamless integration with existing pipelines.
Comparison of Postprocessing Methods
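Below is a comparison of the current postprocessing methods and their arguments:
OwlVit / OwlV2: post_process_object_detection(outputs, threshold, target_sizes)
Grounding Dino: post_process_grounded_object_detection(outputs, input_ids, box_threshold, text_threshold, target_sizes)
OmDet Turbo: post_process_grounded_object_detection(outputs, classes, score_threshold, nms_threshold, target_sizes, max_num_det)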
Suggested Changes to Arguments
To standardize postprocessing across all models, the following suggestions are proposed:
Standardize Method Naming:
Use a single method, post_process_grounded_object_detection, for all models for text-guided object detection. For backward compatibility, retain additional methods (e.g., OwlVit/OwlV2’s post_process_object_detection) with a deprecation cycle.
Unify Required Arguments:
Make outputs the only required argument.
- For Grounding Dino, pass input_ids inside the outputs parameter.
- For OmDet Turbo, make classes optional to provide additional flexibility.
Rename Threshold Parameters:
Standardize parameter names (score_threshold and box_threshold) to a single name: threshold. These parameters perform the same function (filtering detections by confidence score), so a uniform name reduces confusion.
Add text_labels Argument:
Introduce an optional text_labels parameter to map detected labels (integer IDs) to their corresponding text names.
Final Unified Method Signature
The new method would look like this:
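As a rough illustration of the argument changes discussed above, here is a minimal pure-Python sketch, not the Transformers implementation. It assumes a toy `outputs` dict holding per-image lists of scores, integer labels, and normalized (x0, y0, x1, y1) boxes. The default threshold of 0.3, the `(height, width)` convention for `target_sizes`, and the `text_labels` key in the result are also assumptions made for this example.

```python
import warnings
from typing import Optional, Sequence

# Old threshold names mentioned in this proposal.
LEGACY_THRESHOLD_NAMES = ("score_threshold", "box_threshold")


def post_process_grounded_object_detection(
    outputs: dict,
    threshold: float = 0.3,  # assumed default, for illustration only
    target_sizes: Optional[Sequence[tuple]] = None,
    text_labels: Optional[Sequence[Sequence[str]]] = None,
    **legacy_kwargs,
) -> list:
    """Toy version: `outputs` is a plain dict of per-image lists (an assumption)."""
    # Accept the old threshold names during a deprecation period.
    # A legacy name, if given, overrides `threshold`.
    for name in LEGACY_THRESHOLD_NAMES:
        if name in legacy_kwargs:
            warnings.warn(
                f"`{name}` is deprecated, use `threshold` instead.",
                FutureWarning,
                stacklevel=2,
            )
            threshold = legacy_kwargs.pop(name)
    if legacy_kwargs:
        raise TypeError(f"Unexpected arguments: {sorted(legacy_kwargs)}")

    results = []
    per_image = zip(outputs["scores"], outputs["labels"], outputs["boxes"])
    for i, (scores, labels, boxes) in enumerate(per_image):
        # Keep detections whose confidence exceeds the threshold.
        keep = [j for j, score in enumerate(scores) if score > threshold]
        kept_boxes = [tuple(boxes[j]) for j in keep]
        if target_sizes is not None:
            # Rescale normalized boxes to pixels; sizes are (height, width).
            height, width = target_sizes[i]
            kept_boxes = [
                (x0 * width, y0 * height, x1 * width, y1 * height)
                for x0, y0, x1, y1 in kept_boxes
            ]
        kept_labels = [labels[j] for j in keep]
        # Map integer IDs to text names only when names were supplied.
        names = None
        if text_labels is not None:
            names = [text_labels[i][label] for label in kept_labels]
        results.append(
            {
                "scores": [scores[j] for j in keep],
                "labels": kept_labels,
                "boxes": kept_boxes,
                "text_labels": names,
            }
        )
    return results


# Usage with one image of height 100 and width 200.
outputs = {
    "scores": [[0.9, 0.1, 0.6]],
    "labels": [[0, 1, 1]],
    "boxes": [[(0.25, 0.5, 0.5, 0.75), (0.0, 0.0, 1.0, 1.0), (0.125, 0.25, 0.75, 0.5)]],
}
results = post_process_grounded_object_detection(
    outputs, threshold=0.5, target_sizes=[(100, 200)], text_labels=[["cat", "dog"]]
)
```

In this sketch only `outputs` is required. The legacy keyword names still work but emit a `FutureWarning`, which is one way a deprecation cycle could be handled.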
Postprocessing Outputs
Current postprocessing outputs
OwlVit / OwlV2: {"scores": score, "labels": label, "boxes": box} (labels are integer class IDs)
Grounding Dino: {"scores": score, "labels": label, "boxes": box} (labels are text names of detected objects, decoded from input_ids)
OmDet Turbo: {"scores": score, "classes": class, "boxes": box} (classes are text names of detected objects)
Suggested unified output format
The output format will be standardized to:
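To make the idea of a common format concrete, the following sketch converts one per-image result in any of the three current formats into a single dict. The exact key set used here (`scores`, `boxes`, `labels` for integer IDs, `text_labels` for names) is an assumption for illustration, and `unify_detection_result` is a hypothetical helper, not a library function.

```python
from typing import Optional, Sequence


def unify_detection_result(
    result: dict, text_labels: Optional[Sequence[str]] = None
) -> dict:
    """Convert one per-image result into an assumed common format."""
    if "classes" in result:
        # OmDet Turbo style: text names are stored under "classes".
        names, ids = list(result["classes"]), None
    else:
        labels = list(result["labels"])
        if labels and all(isinstance(label, str) for label in labels):
            # Grounding Dino style: "labels" already holds text names.
            names, ids = labels, None
        else:
            # OwlVit / OwlV2 style: "labels" holds integer class IDs.
            ids = labels
            names = None
            if text_labels is not None:
                names = [text_labels[i] for i in ids]
    return {
        "scores": list(result["scores"]),
        "boxes": list(result["boxes"]),
        "labels": ids,
        "text_labels": names,
    }


# One result per current format, with the same single detection.
box = (10, 20, 30, 40)
owl = unify_detection_result(
    {"scores": [0.8], "labels": [1], "boxes": [box]}, text_labels=["cat", "dog"]
)
dino = unify_detection_result({"scores": [0.8], "labels": ["dog"], "boxes": [box]})
omdet = unify_detection_result({"scores": [0.8], "classes": ["dog"], "boxes": [box]})
```

After conversion, all three results expose the detected name under the same key, so downstream pipeline code would not need model-specific branches.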
Detailed Model Changes
OwlVit / OwlV2
Current:
Proposed:
Grounding Dino
Current:
Proposed:
OmDet Turbo
Current:
Proposed:
Feel free to provide feedback on the suggested changes!
Motivation
This will enhance usability, reduce confusion, and enable integration with existing zero-shot object detection pipelines.
Your contribution
I will work on this and already have draft PRs.