[Llava] Add Llava to transformers #27662

Conversation
prompts: Union[List[TextInput], List[List[TextInput]]],
images=None,
cc @ydshieh: as can be seen, every multimodal processor has a different signature here, which will make it very difficult to support them in the image-to-text pipeline, for instance. Hence we need additional tests that check whether they all comply with the same API. In this case I'd advocate for the `images` keyword argument followed by `text`, as that's what most processors use.
I would suggest using `text` as the argument name for `LlavaProcessor`.
Yes, but also in reversed order, right?
Hmm, when I worked on Kosmos-2 I more or less copied from Blip2/InstructBlip, where the order is `images, text`. I see the Fuyu and CLIP processors use `text, images`.
I don't know if there is a clear criterion to decide this. In this case, we can probably say `images` is the main input, so it should come before `text`.
What's your opinion on this ordering?
OK, so that's another reason why we need tests for this 😓. Perhaps we could extend `test_forward_signature` to cover both models and processors (currently it's only implemented for text-only models). Personally I would go for `images` and then `text` (and ideally the latter should have been called `texts`, to also be plural).
The current `text` is likely because it is the argument name in `PreTrainedTokenizerBase` (so every tokenizer class):
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
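To make the signature-test idea above concrete, here is a minimal sketch (a hypothetical helper, not an existing transformers test; the `("images", "text")` order is an assumption pending the decision above):

```python
import inspect


def check_processor_call_signature(processor_class, expected_leading_args=("images", "text")):
    """Assert that the first parameters of `__call__` follow the agreed-upon order."""
    signature = inspect.signature(processor_class.__call__)
    # Drop `self`, then compare the leading parameter names against the expected order.
    param_names = [name for name in signature.parameters if name != "self"]
    actual = tuple(param_names[: len(expected_leading_args)])
    assert actual == expected_leading_args, (
        f"{processor_class.__name__}.__call__ starts with {actual}, "
        f"expected {expected_leading_args}"
    )
```

Such a check could then be run from a shared processor tester mixin, e.g. `check_processor_call_signature(LlavaProcessor)`.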
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)

# 2. Compute the positions where text should be written
new_token_positions = torch.cumsum((image_token_mask * (nb_text_tokens_per_images - 1) + 1), -1) - 1
That's very smart! Maybe add a comment there, like

# Calculate new positions for text tokens in the merged image-text sequence.
# `image_token_mask` identifies image tokens. Each image token will be replaced by `nb_text_tokens_per_images - 1` text tokens.
# `torch.cumsum` computes how each image token shifts subsequent text token positions.
# - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.

or make it part of a step-by-step explanation in the docstrings.
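For reference, a small standalone example (toy values, not taken from the PR; `32000` stands in for `config.image_token_index`) showing what the expression computes:

```python
import torch

image_token_index = 32000        # assumed placeholder id
nb_text_tokens_per_images = 3    # assume each image expands into 3 embedding slots

input_ids = torch.tensor([[1, 32000, 5, 6, 7]])     # <bos> <image> tok tok tok
image_token_mask = input_ids == image_token_index   # [[False, True, False, False, False]]

# Each image token reserves (nb_text_tokens_per_images - 1) extra slots; cumsum turns these
# per-token widths into final positions, and -1 converts them to zero-based indices.
new_token_positions = torch.cumsum(image_token_mask * (nb_text_tokens_per_images - 1) + 1, -1) - 1
print(new_token_positions)  # tensor([[0, 3, 4, 5, 6]]) -> slots 1..3 are left for the image embeddings
```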
super().__init__(image_processor, tokenizer)

def __call__(
Nit: maybe add type hints here, and maybe pass explicit args to the tokenizer?

```python
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    images: ImageInput = None,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> BatchFeature:
```

Also, I would either pass `**kwargs` to the tokenizer or pass the explicitly needed arguments; I'm thinking about `pad_to_multiple_of` and `verbose`, for instance.
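As a rough sketch of that second point (assuming the usual `ProcessorMixin` attributes `self.tokenizer` and `self.image_processor`; this is not the PR's actual implementation), the explicit tokenizer arguments could be forwarded like this:

```python
from transformers import BatchFeature


# Method-body sketch: assumes it lives on a processor class exposing `self.tokenizer`
# and `self.image_processor`.
def __call__(self, text=None, images=None, padding=False, truncation=None, max_length=None,
             pad_to_multiple_of=None, verbose=True, return_tensors=None, **kwargs):
    data = {}
    if images is not None:
        # The image processor contributes e.g. `pixel_values`.
        data.update(self.image_processor(images, return_tensors=None))
    if text is not None:
        data.update(
            self.tokenizer(
                text,
                padding=padding,
                truncation=truncation,
                max_length=max_length,
                pad_to_multiple_of=pad_to_multiple_of,
                verbose=verbose,
                **kwargs,
            )
        )
    return BatchFeature(data=data, tensor_type=return_tensors)
```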
"""The LLAVA model which consists of a vision backbone and a language model.""", | ||
LLAVA_START_DOCSTRING, | ||
) | ||
class LlavaForConditionalGeneration(LlavaPreTrainedModel): |
Just a question here: are we OK with not having a `LlavaModel`? I remember for BLIP-2 people were asking for it afterwards, and as we're using `AutoModelForCausalLM` in the head class, we cannot use `AutoModel` in the base `LlavaModel` class, as it would not be compatible with the weights. Hence if we were to add a `LlavaModel`, we would need to use `AutoModelForCausalLM` and remove the head, I assume.
I'd rather we leave it as is, without an extra layer of complexity! We could just add a new model output class, maybe with the hidden states before the head.
(Also, I don't want to run into issues with base model prefixes.)
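If it helps the discussion, a minimal sketch (the class and field names are hypothetical, not what the PR ships) of such an output class keeping the hidden states before the head:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from transformers.utils import ModelOutput


@dataclass
class LlavaCausalLMOutputWithPast(ModelOutput):
    """Hypothetical output class exposing the language-model hidden states before the LM head."""

    loss: Optional[torch.FloatTensor] = None
    logits: torch.FloatTensor = None
    past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
    hidden_states: Optional[Tuple[torch.FloatTensor]] = None
    attentions: Optional[Tuple[torch.FloatTensor]] = None
    image_hidden_states: Optional[torch.FloatTensor] = None
```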
# Retrieve the first layer to inspect the logits and mask out the hidden states
# that are set to 0
first_layer_past_key_value = past_key_values[0][0][:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)
very nice trick
cool trick
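To spell the trick out for future readers, a toy reproduction (it assumes, as the code above does, that unattended/padded cache slots are exactly zero):

```python
import torch

batch, num_heads, seq_len, head_dim = 2, 4, 5, 8
keys = torch.randn(batch, num_heads, seq_len, head_dim)
keys[0, :, 3:, :] = 0  # pretend the last two positions of sample 0 were padding

# Same indexing as above: head 0, feature 0 of the first layer's keys -> shape (batch, seq_len).
first_layer_past_key_value = keys[:, 0, :, 0]
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value == 0)

# Rebuild an attention mask that zeroes out the unattended (padded) cache positions.
extended_attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
extended_attention_mask[batch_index, non_attended_tokens] = 0
print(extended_attention_mask)  # tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
```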
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# this is not memory efficient at all: output_hidden_states=True will save all the hidden states.
This made me realize the current way of getting features out of our `AutoBackbone` classes is very inefficient as well (they also use `output_hidden_states=True`, as is done here, for instance).
Will update this for all backbones.
Will create a GitHub issue for this. Maybe we can leverage that here as well?
Yep that would be nice I think
See #27873
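For context, the pattern being discussed looks roughly like this (`vision_feature_layer = -2` mirrors the common CLIP penultimate-layer choice and is an assumption, not something fixed in this thread):

```python
# Current approach: every layer's hidden state is materialized just to pick one of them.
vision_feature_layer = -2
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]

# A leaner backbone API (see the linked issue) would let callers request only the layer(s)
# they actually need, instead of keeping the full tuple of hidden states in memory.
```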
Hello, can you tell me how to use…
What does this PR do?
Adds Llava, a multimodal model, to the transformers library.
Llava is a multimodal model that claims competitive performance with GPT-4 on multimodal tasks. There are currently 3 main variants of this architecture:
This implementation leverages `AutoModelForCausalLM`, similarly to `Blip2`, to load the correct language model. The goal of this PR is to make it agnostic across all language model architectures.
Closes #25789
Closes #27221
Original llava author: https://github.com/haotian-liu/LLaVA @haotian-liu
Original PR author: @shauray8
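For readers landing here later, a hedged end-to-end usage sketch: the checkpoint name and the `USER: <image> ... ASSISTANT:` prompt template are assumptions about the released conversions, not something stated in this PR description.

```python
import requests
from PIL import Image

from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"  # assumed prompt format

inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```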