
Add OmDet-Turbo #31843

Merged 62 commits into huggingface:main from add-om-det-turbo on Sep 25, 2024
Conversation

Member

@yonigozlan yonigozlan commented Jul 8, 2024

What does this PR do?

This PR adds support for OmDet-Turbo, an open-vocabulary detection model from Om Research Lab.

Who can review?

@amyeroberts @qubvel

@yonigozlan yonigozlan requested a review from amyeroberts July 8, 2024 16:02
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan yonigozlan requested a review from qubvel July 8, 2024 16:33
@yonigozlan yonigozlan changed the title from "Add OmDet-Turbo" to "[WIP] Add OmDet-Turbo" Jul 9, 2024
@yonigozlan
Member Author

yonigozlan commented Jul 17, 2024

Hey @amyeroberts @qubvel!
I think this PR is ready for a first review. What's missing for now is adding/modifying tests (the current ones are for Grounding DINO and not adapted to this model).
Short-term TODOs left:

  • Tests
  • Docs, examples and some docstrings left to modify

Longer-term TODOs:

  • Add support for training

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for opening this PR!

Overall looks good for a first draft! I've just done a fairly high-level review pointing out some library patterns and structures to be propagated throughout the rest of the PR.

Regarding training: if the loss function isn't available or is hard to add into the library, it's OK to add it at a later stage if requested by the community.

Member

@qubvel qubvel left a comment

Good job!

@amyeroberts did a review already (I also learned new things from it 🙂 ), so I just added a few more comments below.

For draft PR reviews, it would probably be helpful to specify which files should be reviewed and which are not ready yet.

@yonigozlan
Member Author

Hi @amyeroberts and @qubvel! When you have some time, could you please take another look at this PR? I've resolved your previous remarks and left the ones where I had questions open.

All the files I've modified are ready for review. I've commented out the modeling tests I haven't addressed yet (there's only one end-to-end integration test for now), but the processor tests are ready for review.

I'd like to highlight a significant issue: the OmDetTurboMultiheadAttention in the OmDetTurboEncoderLayer module. It reproduces the original model's behavior, but that behavior appears to be incorrectly implemented for batch inference. This makes the whole model unusable for batch inference if we want to use the original model weights and stay consistent with the original model's outputs.

Thanks in advance!
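A note for readers: judging by the later key_padding_mask commits in this PR, one likely aspect of the batching problem is that prompts of different lengths get padded to a common length and the padding positions then leak into attention. A minimal sketch of the masking idea with plain torch.nn.MultiheadAttention follows; it is illustrative only, not the PR's OmDetTurboMultiheadAttention.

import torch
import torch.nn as nn

hidden_size, num_heads = 256, 8
attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

# Two text prompts padded to a common length; the second one has 2 padding positions.
prompt_embeds = torch.randn(2, 6, hidden_size)  # (batch_size, sequence_length, hidden_size)
key_padding_mask = torch.tensor(
    [[False, False, False, False, False, False],
     [False, False, False, False, True, True]]  # True = ignore this (padding) position
)

# Without the mask, padding tokens of the shorter prompt take part in attention and
# the batched outputs drift away from the per-sample outputs.
attn_output, _ = attention(
    prompt_embeds, prompt_embeds, prompt_embeds, key_padding_mask=key_padding_mask
)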

@qubvel qubvel self-requested a review July 26, 2024 12:29
Member

@qubvel qubvel left a comment

Added a few more comments regarding the modeling file. I forgot to turn on review mode in VS Code, so the earlier comments appear as separate comments, sorry for that :)

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating!

As before, I've done another high-level pass, but not everything, as the review was getting quite long. A few general comments:

  • We should use variable names that are as clear and explicit as possible, both within code and in comments, e.g. bs -> batch_size
  • Code should be self-documenting as much as possible: we should try to avoid needing to refer back to docstrings and comments to figure out what's happening. Clear variable names will help with this
  • There are specific cases where static methods are needed, but not requiring self isn't a sufficient condition. It's normally an indication that the method should just be a function outside the class (a small illustration follows this comment)
  • +1 to all of @qubvel's comments - in particular in relation to the attention implementation
  • Make sure to refer to other models' implementations (not just Grounding DINO) to see the library patterns for e.g. naming, position of objects in the file, etc.

Next steps are adding the model tests and addressing all the outstanding comments. With regard to reviewers' comments, it's important to fully address each one before marking it as resolved (apply the change / reply explaining why it's not applicable / ask follow-up questions).
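The illustration mentioned above: a @staticmethod that never touches self can usually become a module-level function. The names below are made up for the example (inverse_sigmoid-style helpers do exist as module-level functions in other detection models); this is not code from this PR.

import torch

# Before: a @staticmethod that never uses `self` or any class state.
class SomeDecoder(torch.nn.Module):
    @staticmethod
    def inverse_sigmoid(x, eps=1e-5):
        x = x.clamp(min=0, max=1)
        return torch.log(x.clamp(min=eps) / (1 - x).clamp(min=eps))

# After: the same logic as a plain module-level function, the more common library pattern.
def inverse_sigmoid(x, eps=1e-5):
    x = x.clamp(min=0, max=1)
    return torch.log(x.clamp(min=eps) / (1 - x).clamp(min=eps))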

output_kwargs = self._merge_kwargs(
    OmDetTurboProcessorKwargs,
    tokenizer_init_kwargs=OmDetTurboProcessorKwargs._defaults["text_kwargs"],
    # tokenizer_init_kwargs=self.tokenizer.init_kwargs,
Collaborator

to delete?

Member Author

I left this here because it should merge the tokenizer init_kwargs to be consistent with other processors, but as we discussed, when I load the model from a saved checkpoint, the tokenizer init_kwargs contains kwargs that shouldn't be there such as padding_side etc. I'm really not sure what causes this as the checkpoint is saved and loaded similarly to other models.

@yonigozlan yonigozlan force-pushed the add-om-det-turbo branch 2 times, most recently from 9ceb5fc to d52c345 on August 5, 2024 19:00
@yonigozlan yonigozlan marked this pull request as ready for review August 8, 2024 04:44
Member

@qubvel qubvel left a comment

@yonigozlan Thanks for the update; in general it looks good to me, and the code is easier to read and understand! I left some nit comments regarding variable naming, type hints, and the attention mask. My main concern is the removal of OmDetTurboModel: I suppose we have to have a FooModel class for every model, where the idea is to run the model without heads and then add the specific heads inside the task-specific class, such as OmDetTurboForObjectDetection.

@yonigozlan
Member Author

Thanks for the review @qubvel! For OmDetTurboModel, the task-specific part of the model starts at the very beginning of the decoder, where two heads are defined, one for the object detection scores and the other for the bounding-box coordinates. So it is quite difficult to define a non-task-specific OmDetTurboModel. It wouldn't be the first model not to have a FooModel; for example, LLaVaNext doesn't have one either, maybe for the same reason?
@amyeroberts if you think not having OmDetTurboModel is OK, I think this should be ready for you to review! Sorry, I probably pinged you for a final review too early before, but I'm hopeful it's better this time :)
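For context, the FooModel / FooForTask convention being discussed looks roughly like the sketch below, with placeholder classes and made-up config fields rather than OmDet-Turbo's actual modules.

import torch.nn as nn

class FooModel(nn.Module):
    """Bare architecture: backbone + encoder + decoder, no task-specific heads."""
    def __init__(self, config):
        super().__init__()
        self.backbone = nn.Identity()  # placeholders for the real submodules
        self.encoder = nn.Identity()
        self.decoder = nn.Identity()

    def forward(self, pixel_values):
        return self.decoder(self.encoder(self.backbone(pixel_values)))

class FooForObjectDetection(nn.Module):
    """Task-specific model: wraps FooModel and adds the detection heads on top."""
    def __init__(self, config):
        super().__init__()
        self.model = FooModel(config)
        self.class_head = nn.Linear(config.hidden_size, config.num_classes)
        self.bbox_head = nn.Linear(config.hidden_size, 4)

    def forward(self, pixel_values):
        hidden_states = self.model(pixel_values)
        return self.class_head(hidden_states), self.bbox_head(hidden_states)

The difficulty raised above is that OmDet-Turbo's scoring and box heads live inside the decoder itself, so there is no clean head-free FooModel to factor out.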

Collaborator

@amyeroberts amyeroberts left a comment

Looking good!

I think we're close to being ready for merge 🤗 Most comments are about making sure we're using clear variable names.

General comments:

  • It looks like there's no logic to account for gradient checkpointing, which would be nice to support if possible; see the sketch after this list for the usual pattern
  • For the tests, in particular processing and one inference test, it would be good to account for batched inputs. This is something we often forget to test for, and particularly for models like this, which are more complicated than just input_ids, we want to make sure it all works!
  • Make sure docstrings are in sync with the code they document and in line with the library patterns.
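The gradient-checkpointing sketch mentioned in the first bullet: a toy encoder that checkpoints each layer during training. In the library this usually goes through the _gradient_checkpointing_func hook that PreTrainedModel sets up; the version below calls torch.utils.checkpoint directly and is illustrative only, not code from this PR.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class TinyEncoder(nn.Module):
    def __init__(self, hidden_size=32, num_layers=2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden_size, hidden_size) for _ in range(num_layers)])
        self.gradient_checkpointing = False

    def forward(self, hidden_states):
        for layer in self.layers:
            if self.gradient_checkpointing and self.training:
                # Recompute this layer's activations during backward instead of storing them.
                hidden_states = checkpoint(layer.__call__, hidden_states, use_reentrant=False)
            else:
                hidden_states = layer(hidden_states)
        return hidden_states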

Comment on lines 83 to 90
decoder_coord_logits: torch.FloatTensor = None
decoder_class_logits: torch.FloatTensor = None
encoder_coord_logits: torch.FloatTensor = None
encoder_class_logits: Tuple[torch.FloatTensor] = None
init_reference_points: torch.FloatTensor = None
intermediate_reference_points: Tuple[Tuple[torch.FloatTensor]] = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
Collaborator

> Apparently I cannot have more than one required field if I inherit from ModelOutput and use the dataclass decorator. @amyeroberts do you know the reason for that? And what should I do in this case? Thanks

Could you share a code snippet and the error? I seem to be able to create a model output with more than one required input:

In [1]: from transformers.utils import ModelOutput

In [2]: class FooOutput(ModelOutput):
   ...:     a: int
   ...:     b: int
   ...:

One thing to note is that although some inputs aren't typed as optional in our modeling_outputs module, they still default to None. Not sure why this is the case - the decision was before my time :)
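For completeness, a sketch of the constraint being hit, based on my reading of ModelOutput.__post_init__ at the time (the exact error message may differ); FooOutput is illustrative:

from dataclasses import dataclass

import torch
from transformers.utils import ModelOutput

@dataclass
class FooOutput(ModelOutput):
    a: torch.FloatTensor  # one required field is allowed
    b: torch.FloatTensor  # a second field without a `= None` default is not

# Defining the class works; instantiating it triggers the __post_init__ check and
# raises roughly "FooOutput should not have more than one required field."
# FooOutput(a=torch.zeros(1), b=torch.zeros(1))

@dataclass
class FooOutputFixed(ModelOutput):
    a: torch.FloatTensor = None
    b: torch.FloatTensor = None  # the library pattern: every field defaults to None

out = FooOutputFixed(a=torch.zeros(1), b=torch.ones(1))  # fine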

Comment on lines +223 to +228
Kwargs:
    task (`Union[str, List[str], TextInput, PreTokenizedInput]`):
        The grounded text used to guide open-vocabulary detection. Expects a single string or a list of strings.
        Examples: "Detect a cat, a dog, and a bird.", ["Detect everything.", "Detect trees and flowers."]
        When not provided, the default task is "Detect [class1], [class2], [class3]" etc.
...
Member Author

@yonigozlan yonigozlan Aug 22, 2024

I removed task from the Args section of the docstring, and added this Kwargs section, as the task kwarg is very specific to this model, and I figured it would be nice to have a description of what it does here.

Member

@qubvel qubvel Aug 22, 2024

> When not provided, the default task is "Detect [class1], [class2], [class3]" etc.

Is it better than "Detect everything."? Is there any mention in the paper of which one the authors used?

Member Author

A mix of prompt formats is used during training. The paper also says that generic directives like "Detect all objects" are used in large-vocabulary detection scenarios. There's no mention of which works best, but in my experience, a non-generic task prompt works better when there is a small number of classes.
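To make the task kwarg concrete, a usage sketch; the checkpoint id and the exact processor arguments are assumptions for illustration and may differ from the merged API.

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, OmDetTurboForObjectDetection

checkpoint = "omlab/omdet-turbo-swin-tiny-hf"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(checkpoint)
model = OmDetTurboForObjectDetection.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
classes = ["cat", "remote"]

# Without `task`, the processor builds "Detect cat, remote."-style text from the classes;
# a non-generic prompt can be passed explicitly, which tends to help for small class sets.
inputs = processor(image, text=classes, task="Detect a cat and a remote.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)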

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for all the continued work on this!

It's looking great - I think we're very close to merge. The two main comments are the cache logic and @qubvel's point about the OmDetTurboModel class.


@add_start_docstrings_to_model_forward(OMDET_TURBO_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=OmDetTurboModelOutput, config_class=_CONFIG_FOR_DOC)
def forward(
Collaborator

Sorry - I missed this as it was marked as resolved.

> but all models should have FooModel

Generally, yes, we want this. It's not a hard-and-fast rule though; other models like Fuyu have just a task-specific model.

However, yes, you're right in this case - since there are decoder heads, we can and should separate these out in order to have an OmDetTurboModel.

Collaborator

@amyeroberts amyeroberts left a comment

All looks good to me - thanks for all the work adding this model!

Let's make sure to remove the [WIP] from the title before merge

Comment on lines 1666 to 1668
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.use_return_dict = config.use_return_dict
Collaborator

We don't need to store these in the model - the typical pattern is to store self.config in the publicly importable classes and use output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions

Suggested change
self.output_attentions = config.output_attentions
self.output_hidden_states = config.output_hidden_states
self.use_return_dict = config.use_return_dict
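A minimal sketch of the suggested pattern, with an illustrative module name:

import torch.nn as nn

class ExampleDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Keep a reference to the config instead of copying individual flags onto the module.
        self.config = config

    def forward(self, hidden_states, output_attentions=None, output_hidden_states=None, return_dict=None):
        # Resolve per-call overrides against the config defaults.
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        return hidden_states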

@yonigozlan yonigozlan changed the title from "[WIP] Add OmDet-Turbo" to "Add OmDet-Turbo" Sep 25, 2024
@yonigozlan yonigozlan merged commit 94f18cf into huggingface:main Sep 25, 2024
24 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Add template with add-new-model-like

* Add rough OmDetTurboEncoder and OmDetTurboDecoder

* Add working OmDetTurbo convert to hf

* Change OmDetTurbo encoder to RT-DETR encoder

* Add swin timm backbone as default, add always partition fix for swin timm

* Add labels and tasks caching

* Fix make fix-copies

* Format omdet_turbo

* fix Tokenizer tests

* Fix style and quality

* Reformat omdet_turbo

* Fix quality, style, copies

* Standardize processor kwargs

* Fix style

* Add output_hidden_states and output_attentions

* Add personalized multi-head attention, improve docstrings

* Add integrated test and fix copy, style, quality

* Fix unprotected import

* Cleanup comments and fix unprotected imports

* Add fix different prompts in batch (key_padding_mask)

* Add key_padding_mask to custom multi-head attention module

* Replace attention_mask by key_padding_mask

* Remove OmDetTurboModel and refactor

* Refactor processing of classes and abstract use of timm backbone

* Add testing, fix output attentions and hidden states, add cache for anchors generation

* Fix copies, style, quality

* Add documentation, convert key_padding_mask to attention_mask

* revert changes to backbone_utils

* Fix docstrings rst

* Fix unused argument in config

* Fix image link documentation

* Reorder config and cleanup

* Add tokenizer_init_kwargs in merge_kwargs of the processor

* Change AutoTokenizer to CLIPTokenizer in convert

* Fix init_weights

* Add ProcessorMixin tests, Fix convert while waiting on uniform kwargs

* change processor kwargs and make task input optional

* Fix omdet docs

* Remove unnecessary tests for processor kwargs

* Replace nested BatchEncoding output of the processor by a flattened BatchFeature

* Make modifications from Pavel review

* Add changes Amy review

* Remove unused param

* Remove normalize_before param, Modify processor call docstring

* Remove redundant decoder class, add gradient checkpointing for decoder

* Remove commented out code

* Fix inference in fp16 and add fp16 integrated test

* update omdet md doc

* Add OmdetTurboModel

* fix caching and nit

* add OmDetTurboModel to tests

* nit change repeated key test

* Improve inference speed in eager mode

* fix copies

* Fix nit

* remove OmdetTurboModel

* [run-slow] omdet_turbo

* [run-slow] omdet_turbo

* skip dataparallel test

* [run-slow] omdet_turbo

* update weights to new path

* remove unnecessary config in class

---------

Co-authored-by: Ubuntu <ubuntu@ip-172-31-91-248.ec2.internal>