[WIP] Uniformize processors in text+image multimodal models. #27768

molbap · 2023-11-30T10:24:33Z

What does this PR do?

This PR is a work in progress that aims to make all text-image multimodal processors uniform. Ideally, every model's processor would be usable through AutoProcessor(...) or an equivalent.

The processor is one of the most fundamental building blocks of transformers, so it can only be modified through careful deprecation cycles. This is, however, an opportunity to enforce a design standard for future processing utilities and downstream pipeline integrations.

For instance, align currently has the following __call__ method:

    def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs)

altclip has:

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs)

blip has:

    def __call__(
        self,
        images: ImageInput = None,
        text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
        add_special_tokens: bool = True,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        stride: int = 0,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_overflowing_tokens: bool = False,
        return_special_tokens_mask: bool = False,
        return_offsets_mapping: bool = False,
        return_token_type_ids: bool = False,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchEncoding:

And so on. A more recent example is Kosmos-2:

    def __call__(
        self,
        images: ImageInput = None,
        text: Union[TextInput, List[TextInput]] = None,
        bboxes: BboxInput = None,
        num_image_tokens: Optional[int] = 64,
        first_image_token_id: Optional[int] = None,
        add_special_tokens: bool = True,
        add_eos_token: bool = False,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length: Optional[int] = None,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_length: bool = False,
        verbose: bool = True,
        return_tensors: Optional[Union[str, TensorType]] = None,
        **kwargs,
    ) -> BatchFeature:
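The divergence between these signatures can also be surfaced programmatically. The following is a minimal sketch, assuming two stand-in classes (`AlignLikeProcessor` and `AltCLIPLikeProcessor`, both hypothetical) that only copy the align and altclip `__call__` signatures quoted above. It does not import transformers, and the helper names `call_parameters` and `signature_diff` are illustrative, not part of any library.

```python
import inspect


class AlignLikeProcessor:
    # Stand-in that copies the align signature quoted above; not the real class.
    def __call__(self, text=None, images=None, padding="max_length", max_length=64, return_tensors=None, **kwargs):
        return {}


class AltCLIPLikeProcessor:
    # Stand-in that copies the altclip signature quoted above; not the real class.
    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        return {}


def call_parameters(processor_cls):
    """Map each named __call__ parameter (excluding self and **kwargs) to its default."""
    signature = inspect.signature(processor_cls.__call__)
    return {
        name: param.default
        for name, param in signature.parameters.items()
        if name != "self" and param.kind is not inspect.Parameter.VAR_KEYWORD
    }


def signature_diff(cls_a, cls_b):
    """Split parameter names into those shared by both classes and those unique to each."""
    params_a, params_b = call_parameters(cls_a), call_parameters(cls_b)
    shared = [name for name in params_a if name in params_b]
    only_a = [name for name in params_a if name not in params_b]
    only_b = [name for name in params_b if name not in params_a]
    return shared, only_a, only_b


shared, only_align, only_altclip = signature_diff(AlignLikeProcessor, AltCLIPLikeProcessor)
print("shared:", shared)
print("only align-like:", only_align)
print("only altclip-like:", only_altclip)
```

Running the same comparison over the real processor classes would give a quick inventory of which parameters are explicit in one model and hidden behind `**kwargs` in another.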

Currently, 30 text + image models have a dedicated processing_<model> file. Each of them should be reviewed and made pipeline-compatible, either by modifying it directly or by wrapping it with a common class.
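One way to picture the "wrapped with a common class" option is sketched below. This is only an illustration under stated assumptions: `UniformProcessor` is a hypothetical name, the images-first calling order is an arbitrary choice made for the example, and the two toy processors stand in for real ones whose `__call__` orders differ (text first as in align, images first as in blip).

```python
class TextFirstProcessor:
    # Toy processor whose __call__ takes text first, like the align signature.
    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        return {"text": text, "images": images, "return_tensors": return_tensors}


class ImagesFirstProcessor:
    # Toy processor whose __call__ takes images first, like the blip signature.
    def __call__(self, images=None, text=None, return_tensors=None, **kwargs):
        return {"text": text, "images": images, "return_tensors": return_tensors}


class UniformProcessor:
    """Hypothetical wrapper exposing one call convention over any wrapped processor."""

    def __init__(self, processor):
        self.processor = processor

    def __call__(self, images=None, text=None, **kwargs):
        # Forward by keyword so the wrapped processor's positional order is irrelevant.
        return self.processor(images=images, text=text, **kwargs)


for wrapped in (TextFirstProcessor(), ImagesFirstProcessor()):
    uniform = UniformProcessor(wrapped)
    # The same positional call works regardless of the wrapped signature.
    output = uniform("<image>", "a photo of a cat", return_tensors="pt")
    print(type(wrapped).__name__, output)
```

Forwarding by keyword keeps the wrapper independent of each model's positional order, at the cost of requiring every wrapped processor to accept `images` and `text` as keyword arguments.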

Related works:

See the insightful discussion about invariants and their importance in the PR "Add InstructBlip to VQA pipeline" #26885.
@NielsRogge has also started working on new processor tests in a separate PR, "Add common processor tests" #27720. Please follow both, as tests will enforce signatures.
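To illustrate what "tests will enforce signatures" could mean in practice, here is a rough sketch of a signature check. It is not taken from #27720. The expected leading parameters (`images`, then `text`) are an assumption made for the example, and both processor classes and the helper `signature_violations` are hypothetical.

```python
import inspect

# Assumed convention for illustration only; the PR does not fix a parameter order.
EXPECTED_LEADING_PARAMETERS = ("images", "text")


class ConformingProcessor:
    # Toy processor that follows the assumed convention.
    def __call__(self, images=None, text=None, return_tensors=None, **kwargs):
        return {}


class NonConformingProcessor:
    # Toy processor that takes text before images and lacks return_tensors.
    def __call__(self, text=None, images=None, **kwargs):
        return {}


def signature_violations(processor_cls):
    """Return a list of human-readable problems with a processor's __call__ signature."""
    parameters = inspect.signature(processor_cls.__call__).parameters
    names = [name for name in parameters if name != "self"]
    problems = []
    leading = tuple(names[: len(EXPECTED_LEADING_PARAMETERS)])
    if leading != EXPECTED_LEADING_PARAMETERS:
        problems.append(f"leading parameters are {leading}, expected {EXPECTED_LEADING_PARAMETERS}")
    if "return_tensors" not in names:
        problems.append("return_tensors is not an explicit parameter")
    return problems


for cls in (ConformingProcessor, NonConformingProcessor):
    print(cls.__name__, signature_violations(cls) or "ok")
```

A common test suite could run a check of this kind over every processor class and fail when the returned list is non-empty.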

Before submitting

Did you read the contributor guideline, Pull Request section?
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

github-actions · 2023-12-31T08:03:19Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

LysandreJik · 2023-12-31T17:11:05Z

Still being worked on but a longer-term project; putting the WIP label so that the bot doesn't close it.

add TODOs in processor signatures

416d40f

LysandreJik added the WIP label Dec 31, 2023

molbap mentioned this pull request Jan 25, 2024

[WIP] Improve multimodal processors - rely less on kwargs #28711

