Adding Llava to transformers #25789
Conversation
@ArthurZucker Right now I've added Llava support directly to the MPT model. And there's no preprocessor_config unfortunately, so do I go about making one and pushing it to a new Hugging Face repo, or just integrate all the CLIP preprocessing and tokenization in the class itself?
Hey! You should probably take inspiration from what was done with other composition models like Blip2 or encodec!
You need to create a new folder for this new model too! Following this
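For reference, new models in the library each get their own folder under src/transformers/models/; a rough sketch of what that could look like here (file names are an assumption following the library's usual layout, not necessarily what this PR ends up with):

    src/transformers/models/llava/
        __init__.py
        configuration_llava.py
        convert_llava_original_checkpoint_to_hf.py   # conversion script, name assumed
        modeling_llava.py
        processing_llava.py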
Okay, on it! 🫡
>>> PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-7B-hf"
>>> model = LlavaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
Suggested change:
>>> checkpoint = "shauray/Llava-Llama-2-7B-hf"
>>> model = LlavaForCausalLM.from_pretrained(checkpoint)
>>> processor = LlavaProcessor.from_pretrained(checkpoint)
LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "shauray/Llava-Llama-2-7B-hf": "https://huggingface.co/shauray/Llava-Llama-2-7B-hf/resolve/main/config.json",
to be updated once transferred
This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the LLaMA-7B.
Suggested change:
This is the configuration class to store the configuration of a [`LlavaTextModel`]. It is used to instantiate a LLaVa text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the LLaMA-7B.
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`LlamaModel`]
Suggested change:
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`LlavaTextModel`].
# for backward compatibility
if num_key_value_heads is None:
    num_key_value_heads = num_attention_heads
This is a new model, so we don't need to add backward-compatible arguments.
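For illustration, a minimal sketch of what dropping that fallback could look like in the text config (class and argument names are assumptions based on this PR, and the defaults are just placeholders):

    from transformers import PretrainedConfig

    class LlavaTextConfig(PretrainedConfig):
        model_type = "llava_text_model"  # name assumed

        def __init__(self, vocab_size=32000, num_attention_heads=32, num_key_value_heads=32, **kwargs):
            self.vocab_size = vocab_size
            self.num_attention_heads = num_attention_heads
            # new model: take the value directly, no `if num_key_value_heads is None` fallback
            self.num_key_value_heads = num_key_value_heads
            super().__init__(**kwargs)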
class LlavaVisionConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a [`MptModel`]. It is used to instantiate a Mpt model
MptModel?
I've looked a bit into the design of this PR (as well as the LLaVa paper). First of all, I really appreciate the effort you're making to integrate it into the library 🙏 already some nice work.
However, for the model to get integrated into the library, there are some changes to be made. Specifically, I see you're using the vision_model (CLIP) inside the preprocessor class. This is very different from all other models in the Transformers library, and not compliant with its design. What should actually be done is defining something along the lines of:

    class LlavaModel(config):
        def __init__(self, config):
            self.vision_model = LlavaVisionModel(config.vision_config)
            self.projection_layer = nn.Linear(...)
            self.text_model = AutoModel(config.text_config)

for the base model (i.e. LLaVa without a language modeling head on top), and then the head model:

    class LlavaForCausalLM(config):
        def __init__(self, config):
            self.model = LlavaModel(config)
            self.lm_head = nn.Linear(...)

i.e. the vision_model is a PyTorch model, hence it needs to be part of the PyTorch implementation of LLaVa. The LlavaProcessor class should combine a CLIPImageProcessor and a LlamaTokenizer; it takes in text and images and produces input_ids and pixel_values, which are the inputs to the model. Refer to implementations like BLIP and BLIP-2 as examples of other multimodal models which also leverage CLIP as the vision encoder, combined with a language model.
The LlavaVisionConfig then includes all attributes regarding the vision encoder (very similar to Blip2VisionConfig). Since the language model is just LLaMA as a decoder-only model, one can leverage the AutoModel class to support any decoder-only LLM (this was also done for BLIP-2 - see here), and specify any AutoConfig as the text config (see BLIP-2 as an example). Additional attributes, like things regarding the projection layers, can be defined as part of LlavaConfig.
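To make that structure concrete, here is a hedged sketch of the modeling side (it follows the outline above; `LlavaConfig` and `LlavaVisionModel` are the classes being added in this PR, and the projection layer shape is a placeholder rather than the final design):

    from torch import nn
    from transformers import AutoModel, PreTrainedModel

    class LlavaModel(PreTrainedModel):
        # base model: vision tower + projection + language model, no LM head
        def __init__(self, config):
            super().__init__(config)
            self.vision_model = LlavaVisionModel(config.vision_config)
            # projects vision hidden states into the language model's embedding space
            self.projection_layer = nn.Linear(
                config.vision_config.hidden_size, config.text_config.hidden_size
            )
            # any decoder-only LM through AutoModel, as done for BLIP-2
            self.text_model = AutoModel.from_config(config.text_config)

    class LlavaForCausalLM(PreTrainedModel):
        # head model: wraps the base model and adds a language modeling head
        def __init__(self, config):
            super().__init__(config)
            self.model = LlavaModel(config)
            self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)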
Thank you @NielsRogge for the review, I had my doubts regarding this. I'll make all the necessary changes as soon as I can.
To make sure I understand everything: rather than having a vision model inside the processor, it should be part of the model itself?
Yes, the most important thing is to remove the vision encoder from the preprocessor class and instead make it part of the model.
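Along the same lines, a minimal sketch of the processor without the vision model inside it (following the usual ProcessorMixin pattern used by BLIP-2; the exact signature here is an assumption, not the final code):

    from transformers import CLIPImageProcessor, LlamaTokenizer
    from transformers.processing_utils import ProcessorMixin

    class LlavaProcessor(ProcessorMixin):
        # register the two sub-components so save_pretrained/from_pretrained work
        attributes = ["image_processor", "tokenizer"]
        image_processor_class = "CLIPImageProcessor"
        tokenizer_class = "LlamaTokenizer"

        def __init__(self, image_processor, tokenizer):
            super().__init__(image_processor, tokenizer)

        def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
            # only turns raw text/images into input_ids and pixel_values; no forward pass here
            encoding = {}
            if text is not None:
                encoding.update(self.tokenizer(text, return_tensors=return_tensors, **kwargs))
            if images is not None:
                encoding["pixel_values"] = self.image_processor(images, return_tensors=return_tensors)["pixel_values"]
            return encoding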
Thanks!
Noticed some other nits and improvement opportunities.
@NielsRogge also included some relevant points that should be addressed.
def from_llava_configs(
    cls,
    text_config: PretrainedConfig,
    #text_config: LlavaTextConfig,
?
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
        pixel_values: Optional[torch.FloatTensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
Same thing here regarding the docstrings. Use the forward of Blip2ForConditionalGeneration (here) as a reference. Please leave just the examples here.
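For illustration, the kind of examples-only docstring being asked for could look roughly like this (checkpoint name taken from this PR; the prompt, image URL, and decoding call are just an assumed usage sketch):

    r"""
    Returns:

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import LlavaForCausalLM, LlavaProcessor

    >>> checkpoint = "shauray/Llava-Llama-2-7B-hf"
    >>> processor = LlavaProcessor.from_pretrained(checkpoint)
    >>> model = LlavaForCausalLM.from_pretrained(checkpoint)

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)
    >>> inputs = processor(text="<image> How would you best describe this image?", images=image, return_tensors="pt")

    >>> generated_ids = model.generate(**inputs, max_new_tokens=30)
    >>> print(processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
    ```
    """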
return_token_type_ids=return_token_type_ids,
return_length=return_length,
verbose=verbose,
# return_tensors=return_tensors,
If it's not used, please delete it.
    for chunk in text.split("<image>")
]

def insert_separator(X, sep):
Could you consider renaming `X` to a more descriptive variable name? Maybe `word` or `token`?
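For illustration, a sketch of the same helper with more descriptive names (behavior inferred from the surrounding code: interleave a separator between consecutive text chunks):

    def insert_separator(chunks, separator):
        # [a, b, c] with separator s -> [a, s, b, s, c]
        return [item for pair in zip(chunks, [separator] * len(chunks)) for item in pair][:-1]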
#vision_encoding = self.vision_model(image_encoding, output_hidden_states=True)
#image_features = vision_encoding.hidden_states[-2]
#image_features = image_features[:, 1:]
Please remove any commented-out code if it's not being used.
    result.paste(pil_img, ((height - width) // 2, 0))
    return result

def feature_select(image_forward_outs):
Missing `self`?
Is it really being used anywhere?
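For context, based on the commented-out lines above, the helper presumably selects the penultimate hidden state of the vision tower and drops the CLS token; a hedged sketch (signature assumed):

    def feature_select(self, image_forward_outs):
        # penultimate layer of the vision tower is used as the visual feature
        image_features = image_forward_outs.hidden_states[-2]
        # drop the CLS token, keep only the patch tokens
        return image_features[:, 1:]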
self.DEFAULT_IMAGE_TOKEN = "<image>"
self.IMAGE_TOKEN_INDEX = -200
From PEP 8: "Constants are usually defined on a module level and written in all capital letters with underscores separating words."
So it would be better to have these as module-level constants, as done in IDEFICS here.
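Concretely, that would mean something like this at module level in the processing file (values taken from the lines above):

    # module-level constants, as in the IDEFICS processing code
    DEFAULT_IMAGE_TOKEN = "<image>"
    IMAGE_TOKEN_INDEX = -200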
@rafaelpadilla I'm still working on it; I'll let you know when it's ready for review.
I also want to contribute to the Llava implementation. But I have a question: why do we need to copy the vision encoder part but use AutoModel for the language model part? e.g., blip, blip_2...
Hi @shauray8, I'm working on doing some Llava training experiments and hopefully contributing. Could you share any guidance on how to test this Llava implementation in its current state, starting from scratch with just the PyTorch weights (https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main) and CLIP?
Hi @shauray8, any updates on this model addition? We'd like to have this model merged in within the next few weeks - is this something that would fit in your timeline?
@amyeroberts I'm done with the architectural changes @NielsRogge suggested and am uploading new weights for LLaVa and LLaVa 1.5. Writing new tests and documentation could take 2-3 days as I'm pretty caught up with my placements.
Hi @shauray8 - glad to hear the arch changes are done! I can see that there are still outstanding suggestions from @rafaelpadilla's review which will also need to be addressed alongside tests etc. As there are currently 3 in-progress model PRs - #25001, #26360, #25789 - all of which we'd like to have in the library soon, and you mention you're busy with placements, I propose that you continue with one and someone else can help finish off the other PRs. As Llava is the most complete and has already had some reviews, this is the one I suggest you focus on. Let us know if you need any help!
Hi @shauray8!
@younesbelkada Thank you for the positive feedback! I appreciate your willingness to help complete the PR. Opening a new PR would be a much cleaner way of doing it, but I don't mind the approach. I'm open to discussing any details or providing additional information you might need.
What does this PR do?
Adds LLAVA to transformers.
author - https://github.com/haotian-liu/LLaVA
hub - https://huggingface.co/shauray/Llava-Llama-2-7B-hf
Fixes #25060
Who can review?
@ArthurZucker @amyeroberts
@younesbelkada
Results
prompt - "How would you best describe this image?"
-- The photograph shows a wooden dock floating on the water, with mountains in the background. It is an idyllic scene that captures both natural beauty and human-made structures like docks at their most serene state of being surrounded by nature's wonders such as lakes or oceans (in case it isn’t just any body). This type of setting can be found all over North America where there are numerous bodies of freshwater available for recreational activities including fishing from piers near these locations; however, they also provide opportunities to observe wildlife