
Adding Llava to transformers #25789

Closed · wants to merge 107 commits

Conversation

@shauray8 (Contributor) commented on Aug 28, 2023:

What does this PR do?

Adds LLaVA to transformers.
author - https://github.com/haotian-liu/LLaVA
hub - https://huggingface.co/shauray/Llava-Llama-2-7B-hf

Fixes #25060

Who can review?

@ArthurZucker @amyeroberts
@younesbelkada

Results

[test image: a photo of a wooden dock on a lake with mountains in the background]
prompt - "How would you best describe this image? "


-- The photograph shows a wooden dock floating on the water, with mountains in the background. It is an idyllic scene that captures both natural beauty and human-made structures like docks at their most serene state of being surrounded by nature's wonders such as lakes or oceans (in case it isn’t just any body). This type of setting can be found all over North America where there are numerous bodies of freshwater available for recreational activities including fishing from piers near these locations; however, they also provide opportunities to observe wildlife
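
For reference, here is a minimal inference sketch for reproducing a result like the one above, using the class and checkpoint names proposed in this PR. The processor call and generation arguments are assumptions (the API was still being finalized), and the image path is a hypothetical local copy of the test image.

from PIL import Image
from transformers import LlavaForCausalLM, LlavaProcessor  # classes proposed in this PR

checkpoint = "shauray/Llava-Llama-2-7B-hf"  # hub repo linked in the PR description
model = LlavaForCausalLM.from_pretrained(checkpoint)
processor = LlavaProcessor.from_pretrained(checkpoint)

image = Image.open("dock.jpg")  # hypothetical local copy of the test image
prompt = "How would you best describe this image? "

# Assumed processor interface: text + image in, input_ids + pixel_values out.
inputs = processor(text=prompt, images=image, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])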

@shauray8 (Contributor, Author):

@ArthurZucker Right now I've added LLaVA support directly to the MPT model as LlavaMptForCausalLM. Do you think I should add LLaVA as a separate model, or is this good enough?

@shauray8 (Contributor, Author):

Also, there's no preprocessor_config unfortunately. Should I create one and push it to a new Hugging Face repo, or just integrate all the CLIP preprocessing and tokenization into the class itself?

@ArthurZucker (Collaborator) left a comment:

Hey! You should probably take inspiration from what was done with other composition models like Blip2 or EnCodec! You need to create a new folder for this new model too, following this.

@shauray8 (Contributor, Author):

Okay, on it! 🫡

Comment on lines +49 to +52
>>> PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-7B-hf"

>>> model = LlavaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)

Suggested change
- >>> PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-7B-hf"
- >>> model = LlavaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
- >>> processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
+ >>> checkpoint = "shauray/Llava-Llama-2-7B-hf"
+ >>> model = LlavaForCausalLM.from_pretrained(checkpoint)
+ >>> processor = LlavaProcessor.from_pretrained(checkpoint)



LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"shauray/Llava-Llama-2-7B-hf": "https://huggingface.co/shauray/Llava-Llama-2-7B-hf/resolve/main/config.json",

to be updated once transferred

Comment on lines 34 to 36
This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the LLaMA-7B.

Suggested change
- This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
- model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
- defaults will yield a similar configuration to that of the LLaMA-7B.
+ This is the configuration class to store the configuration of a [`LlavaTextModel`]. It is used to instantiate a LLaVa text encoder
+ according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+ defaults will yield a similar configuration to that of the LLaMA-7B.

Comment on lines 44 to 45
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`LlamaModel`]

Suggested change
- Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
- `inputs_ids` passed when calling [`LlamaModel`]
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
+ `inputs_ids` passed when calling [`LlavaTextModel`].

Comment on lines 142 to 144
# for backward compatibility
if num_key_value_heads is None:
    num_key_value_heads = num_attention_heads

This is a new model, so we don't need to add backward-compatible arguments.


class LlavaVisionConfig(PretrainedConfig):
"""
This is the configuration class to store the configuration of a [`MptModel`]. It is used to instantiate a Mpt model

MptModel?

@NielsRogge (Contributor) left a comment:

I've looked a bit into the design of this PR (as well as the LLaVa paper). First of all, I really appreciate the effort you're putting into integrating it into the library 🙏 there's already some nice work here.

However, for the model to get integrated into the library, there are some changes to be made. Specifically, I see you're using the vision_model (CLIP) inside the preprocessor class. This is very different from all other models in the Transformers library and not compliant with its design. What should actually be done is to define something along the lines of:

class LlavaModel(config):
    def __init__(self, config):
        self.vision_model = LlavaVisionModel(config.vision_config)
        self.projection_layer = nn.Linear(...)
        self.text_model = AutoModel(config.text_config)

for the base model (i.e. LLaVa without language modeling head on top), and then the head model:

class LlavaForCausalLM(config):
    def __init__(self, config):
        self.model = LlavaModel(config)
        self.lm_head = nn.Linear(...)

i.e. the vision_model is a PyTorch model, hence it needs to be part of the PyTorch implementation of LLaVa. The LlavaProcessor class should combine a CLIPImageProcessor and a LlamaTokenizer, taking in text and images and producing input_ids and pixel_values, which are the inputs to the model. Refer to implementations like BLIP and BLIP-2 as examples of other multimodal models that also leverage CLIP as the vision encoder, combined with a language model.
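
A minimal sketch of what such a processor could look like, following the ProcessorMixin pattern used by Blip2Processor (the wiring below is illustrative, not the code from this PR):

from transformers.processing_utils import ProcessorMixin


class LlavaProcessor(ProcessorMixin):
    # Combines a CLIP image processor with a LLaMA tokenizer, as suggested above.
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "CLIPImageProcessor"
    tokenizer_class = "LlamaTokenizer"

    def __init__(self, image_processor, tokenizer):
        super().__init__(image_processor, tokenizer)

    def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
        # Tokenize the text prompt(s); `text` is assumed to be provided.
        encoding = self.tokenizer(text, return_tensors=return_tensors, **kwargs)
        if images is not None:
            # Turn the image(s) into pixel_values for the vision encoder inside the model.
            image_features = self.image_processor(images, return_tensors=return_tensors)
            encoding["pixel_values"] = image_features["pixel_values"]
        return encoding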

The LlavaVisionConfig then includes all attributes regarding the vision encoder (very similar to Blip2VisionConfig). Since the language model is just LLaMA as a decoder-only model, one can leverage the AutoModel class to support any decoder-only LLM (this was also done for BLIP-2 - see here), and specify any AutoConfig as the text config (see BLIP-2 as an example). Additional attributes, like those regarding the projection layers, can be defined as part of LlavaConfig.
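
As a rough illustration of that config split (the attribute and parameter names below, e.g. projection_dim, are assumptions mirroring how Blip2Config composes its sub-configs, with CLIPVisionConfig standing in for the PR's LlavaVisionConfig):

from transformers import CLIPVisionConfig, CONFIG_MAPPING, PretrainedConfig


class LlavaConfig(PretrainedConfig):
    model_type = "llava"

    def __init__(self, vision_config=None, text_config=None, projection_dim=4096, **kwargs):
        super().__init__(**kwargs)
        # Vision-encoder attributes live in their own sub-config (cf. Blip2VisionConfig).
        self.vision_config = CLIPVisionConfig(**(vision_config or {}))
        # Any decoder-only LLM can back the text side via AutoConfig; default to LLaMA.
        text_config = text_config if text_config is not None else {"model_type": "llama"}
        self.text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        # Extra attributes, e.g. for the projection layer, sit directly on LlavaConfig.
        self.projection_dim = projection_dim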

@shauray8 (Contributor, Author):

> (quoting @NielsRogge's review above; the quoted version suggests AutoModelForCausalLM for the text model)
Thank you @NielsRogge for the review. I had my doubts about this; I'll make all the necessary changes as soon as I can.

@shauray8 (Contributor, Author):

To make sure I understand everything: rather than having a LlavaTextModel for LLaMA, I should support it through AutoConfig, basically copy all the code from CLIP for LlavaVisionModel, and have a LlavaModel as the bare base model.

@NielsRogge (Contributor):

Yes, the most important thing is to remove the vision encoder from the preprocessor class and instead make it part of the model.

@rafaelpadilla (Contributor) left a comment:

Thanks!
Noticed some other nits and improvement opportunities.
@NielsRogge also included some relevant points that should be addressed.

def from_llava_configs(
    cls,
    text_config: PretrainedConfig,
    #text_config: LlavaTextConfig,

?

return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:

You can use the forward in Blip2Model (here) as a reference for these docstrings.

Use the LLAVA_INPUTS_DOCSTRING to include the args. Here you can just leave the examples, as done in Blip2 (here).
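
For reference, here is a minimal sketch of how that is usually wired up in Transformers. The decorator utilities are the library's real ones; LLAVA_INPUTS_DOCSTRING, the class, and the config name are placeholders for what this PR would define, and the base class is simplified to nn.Module for the sketch:

from typing import Optional, Tuple, Union

import torch
from torch import nn
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.utils import add_start_docstrings_to_model_forward, replace_return_docstrings

LLAVA_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor`): Indices of input sequence tokens.
        pixel_values (`torch.FloatTensor`, *optional*): Pixel values produced by the image processor.
"""


class LlavaForCausalLM(nn.Module):  # base class simplified for this sketch
    @add_start_docstrings_to_model_forward(LLAVA_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class="LlavaConfig")
    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        pixel_values: Optional[torch.FloatTensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Returns:

        Examples: only the usage examples go here, as in Blip2ForConditionalGeneration.
        """
        ...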

pixel_values: Optional[torch.FloatTensor] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""

Same thing here regarding the docstrings. Use the forward of Blip2ForConditionalGeneration (here) as a reference. Please leave just the examples here.

return_token_type_ids=return_token_type_ids,
return_length=return_length,
verbose=verbose,
# return_tensors=return_tensors,

If it's not used, please delete it.

for chunk in text.split("<image>")
]

def insert_separator(X, sep):

Could you consider renaming X to a more descriptive variable name? Maybe word or token?
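
For example, a more descriptive version of that helper could look like the following. The interleaving behaviour is inferred from how the surrounding code splits the prompt on "<image>"; treat it as a sketch rather than the exact function from this PR:

def insert_separator(chunks, separator):
    # Interleave `separator` between consecutive `chunks`,
    # e.g. [a, b, c] -> [a, sep, b, sep, c].
    interleaved = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            interleaved.append(separator)
        interleaved.append(chunk)
    return interleaved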

Comment on lines 157 to 159
#vision_encoding = self.vision_model(image_encoding, output_hidden_states=True)
#image_features = vision_encoding.hidden_states[-2]
#image_features = image_features[:, 1:]

Please remove any commented-out code if it's not being used.

result.paste(pil_img, ((height - width) // 2, 0))
return result

def feature_select(image_forward_outs):

missing self?


Is it really being used anywhere?

Comment on lines +66 to +67
self.DEFAULT_IMAGE_TOKEN = "<image>"
self.IMAGE_TOKEN_INDEX = -200

From PEP 8: "Constants are usually defined on a module level and written in all capital letters with underscores separating words."
So it would be better to have these as module-level constants, as done in IDEFICS here.
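
Concretely, that would mean defining these at module level (values copied from the lines above) and referencing them directly instead of through self:

# Module-level constants, per PEP 8, instead of per-instance attributes.
DEFAULT_IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200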

@shauray8 (Contributor, Author):

@rafaelpadilla I'm still working on it; I'll let you know when it's ready for review.

@lz1oceani (Contributor):

I also want to contribute to the LLaVA implementation. But I have a question: why do we need to copy the vision encoder part but use AutoModel for the language model part? E.g., blip, blip_2...

@iocuydi commented on Nov 14, 2023:

Hi @shauray8, I'm working on doing some Llava training experiments and hopefully contributing.

Could you share any guidance on how to test this Llava implementation in its current state, starting from scratch with just the PyTorch weights (https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main) and CLIP?
For example, which of the weight conversion scripts should be run, in what order, and how do I run inference? I've been trying the examples in the docs and in the tests, but they seem to yield NaN tensors or various other errors due to processor inconsistency, etc.
Thank you!

@amyeroberts (Collaborator):

Hi @shauray8, any updates on this model addition? We'd like to have this model merged in within the next few weeks - is this something that would fit in your timeline?

@shauray8 (Contributor, Author):

@amyeroberts I'm done with the architectural changes @NielsRogge suggested. Uploading new weights for LLaVa and LLaVa 1.5 and writing new tests and documentation could take 2-3 days, as I'm pretty caught up with my placements.

@amyeroberts (Collaborator):

Hi @shauray8 - glad to hear the arch changes are done! I can see that there are still outstanding suggestions from @rafaelpadilla's review which will also need to be addressed, alongside tests etc.

As there are currently 3 in-progress model PRs - #25001, #26360, #25789 - all of which we'd like to have in the library soon, and you mention you're busy with placements, I propose that you continue with one and someone else can help finish off the other PRs. As Llava is the most complete and has already had some reviews, this is the one I suggest you focus on.

Let us know if you need any help!

@younesbelkada (Contributor):

Hi @shauray8!
Great work on the PR! I am super excited about this architecture, and I am happy to help you finish up the PR by taking it over, or by creating a new PR to add this architecture. I'll make sure to add you as the main author of this contribution, as you did most of the work. Let me know how that sounds to you.

@shauray8 (Contributor, Author):

@younesbelkada Thank you for the positive feedback! I appreciate your willingness to help in completing the PR. Opening a new PR would be a much cleaner way of doing it, but I don't mind either approach. I'm open to discussing any details or providing additional information you might need.
