Adding Llava to transformers #25789
Conversation
@ArthurZucker Right now I've added Llava support directly to the MPT model. And there's no preprocessor_config unfortunately, so do I go about making one and pushing it to a new Hugging Face repo, or just integrate all the CLIP preprocessing and tokenization in the class itself?
Hey! You should probably take inspiration from what was done with other composition models like Blip2 or encodec!
You need to create a new folder for this new model too! Following this
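For reference, new models in the library each get their own folder under src/transformers/models/; a rough sketch of what that could look like here (file names are an assumption following the library's usual layout, not necessarily what this PR ends up with):

    src/transformers/models/llava/
        __init__.py
        configuration_llava.py
        convert_llava_original_checkpoint_to_hf.py   # conversion script, name assumed
        modeling_llava.py
        processing_llava.py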
Okay, on it! 🫡
>>> PATH_TO_CONVERTED_WEIGHTS = "shauray/Llava-Llama-2-7B-hf"
>>> model = LlavaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> processor = LlavaProcessor.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
Suggested change:
>>> checkpoint = "shauray/Llava-Llama-2-7B-hf"
>>> model = LlavaForCausalLM.from_pretrained(checkpoint)
>>> processor = LlavaProcessor.from_pretrained(checkpoint)
LLAVA_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "shauray/Llava-Llama-2-7B-hf": "https://huggingface.co/shauray/Llava-Llama-2-7B-hf/resolve/main/config.json",
to be updated once transferred
This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the LLaMA-7B.
Suggested change:
This is the configuration class to store the configuration of a [`LlavaTextModel`]. It is used to instantiate a LLaVa text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the LLaMA-7B.
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`LlamaModel`]
Suggested change:
Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`LlavaTextModel`].
# for backward compatibility
if num_key_value_heads is None:
    num_key_value_heads = num_attention_heads
This is a new model, so we don't need to add backward-compatible arguments.
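For illustration, a minimal sketch of what dropping that fallback could look like in the text config (class and argument names are assumptions based on this PR, and the defaults are just placeholders):

    from transformers import PretrainedConfig

    class LlavaTextConfig(PretrainedConfig):
        model_type = "llava_text_model"  # name assumed

        def __init__(self, vocab_size=32000, num_attention_heads=32, num_key_value_heads=32, **kwargs):
            self.vocab_size = vocab_size
            self.num_attention_heads = num_attention_heads
            # new model: take the value directly, no `if num_key_value_heads is None` fallback
            self.num_key_value_heads = num_key_value_heads
            super().__init__(**kwargs)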
class LlavaVisionConfig(PretrainedConfig):
    """
    This is the configuration class to store the configuration of a [`MptModel`]. It is used to instantiate a Mpt model
MptModel?
I've looked a bit into the design of this PR (as well as the LLaVa paper). First of all, I really appreciate the effort you're making to integrate it into the library 🙏 already some nice work.
However, for the model to get integrated into the library, there are some changes to be made. Specifically, I see you're using the vision_model (CLIP) inside the preprocessor class. This is very different from all other models in the Transformers library, and not compliant with its design. What should actually be done is defining something along the lines of:

    class LlavaModel(config):
        def __init__(self, config):
            self.vision_model = LlavaVisionModel(config.vision_config)
            self.projection_layer = nn.Linear(...)
            self.text_model = AutoModel(config.text_config)

for the base model (i.e. LLaVa without a language modeling head on top), and then the head model:

    class LlavaForCausalLM(config):
        def __init__(self, config):
            self.model = LlavaModel(config)
            self.lm_head = nn.Linear(...)

i.e. the vision_model is a PyTorch model, hence it needs to be part of the PyTorch implementation of LLaVa. The LlavaProcessor class should combine a CLIPImageProcessor and a LlamaTokenizer; it takes in text and images and produces input_ids and pixel_values, which are the inputs to the model. Refer to implementations like BLIP and BLIP-2 as examples of other multimodal models which also leverage CLIP as the vision encoder, combined with a language model.
The LlavaVisionConfig then includes all attributes regarding the vision encoder (very similar to Blip2VisionConfig). Since the language model is just LLaMA as a decoder-only model, one can leverage the AutoModel class to support any decoder-only LLM (this was also done for BLIP-2 - see here), and specify any AutoConfig as the text config (see BLIP-2 as an example). Additional attributes, like things regarding the projection layers, can be defined as part of LlavaConfig.
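To make that structure concrete, here is a hedged sketch of the modeling side (it follows the outline above; `LlavaConfig` and `LlavaVisionModel` are the classes being added in this PR, and the projection layer shape is a placeholder rather than the final design):

    from torch import nn
    from transformers import AutoModel, PreTrainedModel

    class LlavaModel(PreTrainedModel):
        # base model: vision tower + projection + language model, no LM head
        def __init__(self, config):
            super().__init__(config)
            self.vision_model = LlavaVisionModel(config.vision_config)
            # projects vision hidden states into the language model's embedding space
            self.projection_layer = nn.Linear(
                config.vision_config.hidden_size, config.text_config.hidden_size
            )
            # any decoder-only LM through AutoModel, as done for BLIP-2
            self.text_model = AutoModel.from_config(config.text_config)

    class LlavaForCausalLM(PreTrainedModel):
        # head model: wraps the base model and adds a language modeling head
        def __init__(self, config):
            super().__init__(config)
            self.model = LlavaModel(config)
            self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)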
Thank you @NielsRogge for the review, I had my doubts regarding this. I'll make all the necessary changes as soon as I can.
To make sure I understand everything: rather than having a vision model inside the processor, it should be part of the model itself?
Yes, the most important thing is to remove the vision encoder from the preprocessor class and instead make it part of the model.
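Along the same lines, a minimal sketch of the processor without the vision model inside it (following the usual ProcessorMixin pattern used by BLIP-2; the exact signature here is an assumption, not the final code):

    from transformers import CLIPImageProcessor, LlamaTokenizer
    from transformers.processing_utils import ProcessorMixin

    class LlavaProcessor(ProcessorMixin):
        # register the two sub-components so save_pretrained/from_pretrained work
        attributes = ["image_processor", "tokenizer"]
        image_processor_class = "CLIPImageProcessor"
        tokenizer_class = "LlamaTokenizer"

        def __init__(self, image_processor, tokenizer):
            super().__init__(image_processor, tokenizer)

        def __call__(self, text=None, images=None, return_tensors=None, **kwargs):
            # only turns raw text/images into input_ids and pixel_values; no forward pass here
            encoding = {}
            if text is not None:
                encoding.update(self.tokenizer(text, return_tensors=return_tensors, **kwargs))
            if images is not None:
                encoding["pixel_values"] = self.image_processor(images, return_tensors=return_tensors)["pixel_values"]
            return encoding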
Thanks!
Noticed some other nits and improvement opportunities.
@NielsRogge also included some relevant points that should be addressed.
def from_llava_configs(
    cls,
    text_config: PretrainedConfig,
    #text_config: LlavaTextConfig,
?
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
        pixel_values: Optional[torch.FloatTensor] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
Same thing here regarding the docstrings. Use the forward of Blip2ForConditionalGeneration (here) as a reference. Please leave just the examples here.
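For illustration, the kind of examples-only docstring being asked for could look roughly like this (checkpoint name taken from this PR; the prompt, image URL, and decoding call are just an assumed usage sketch):

    r"""
    Returns:

    Examples:

    ```python
    >>> from PIL import Image
    >>> import requests
    >>> from transformers import LlavaForCausalLM, LlavaProcessor

    >>> checkpoint = "shauray/Llava-Llama-2-7B-hf"
    >>> processor = LlavaProcessor.from_pretrained(checkpoint)
    >>> model = LlavaForCausalLM.from_pretrained(checkpoint)

    >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    >>> image = Image.open(requests.get(url, stream=True).raw)
    >>> inputs = processor(text="<image> How would you best describe this image?", images=image, return_tensors="pt")

    >>> generated_ids = model.generate(**inputs, max_new_tokens=30)
    >>> print(processor.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
    ```
    """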
return_token_type_ids=return_token_type_ids,
return_length=return_length,
verbose=verbose,
# return_tensors=return_tensors,
If it's not used, please delete it.
    for chunk in text.split("<image>")
]

def insert_separator(X, sep):
Could you consider renaming `X` to a more descriptive variable name? Maybe `word` or `token`?
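For illustration, a sketch of the same helper with more descriptive names (behavior inferred from the surrounding code: interleave a separator between consecutive text chunks):

    def insert_separator(chunks, separator):
        # [a, b, c] with separator s -> [a, s, b, s, c]
        return [item for pair in zip(chunks, [separator] * len(chunks)) for item in pair][:-1]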
#vision_encoding = self.vision_model(image_encoding, output_hidden_states=True)
#image_features = vision_encoding.hidden_states[-2]
#image_features = image_features[:, 1:]
Please remove any commented-out code if it's not being used.
    result.paste(pil_img, ((height - width) // 2, 0))
    return result

def feature_select(image_forward_outs):
Missing `self`?
Is it really being used anywhere?
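For context, based on the commented-out lines above, the helper presumably selects the penultimate hidden state of the vision tower and drops the CLS token; a hedged sketch (signature assumed):

    def feature_select(self, image_forward_outs):
        # penultimate layer of the vision tower is used as the visual feature
        image_features = image_forward_outs.hidden_states[-2]
        # drop the CLS token, keep only the patch tokens
        return image_features[:, 1:]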
self.DEFAULT_IMAGE_TOKEN = "<image>"
self.IMAGE_TOKEN_INDEX = -200
From PEP 8: "Constants are usually defined on a module level and written in all capital letters with underscores separating words."
So it would be better to have these as module-level constants, as done in IDEFICS here.
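Concretely, that would mean something like this at module level in the processing file (values taken from the lines above):

    # module-level constants, as in the IDEFICS processing code
    DEFAULT_IMAGE_TOKEN = "<image>"
    IMAGE_TOKEN_INDEX = -200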
@rafaelpadilla I'm still working on it; I'll let you know when it's ready for review.
I also want to contribute to the Llava implementation. But I have a question: why do we need to copy the vision encoder part but use AutoModel for the language model part? e.g., blip, blip_2...
Hi @shauray8, I'm working on doing some Llava training experiments and hopefully contributing. Could you share any guidance on how to test this Llava implementation in its current state, starting from scratch with just the PyTorch weights (https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main) and CLIP?
Hi @shauray8, any updates on this model addition? We'd like to have this model merged in within the next few weeks - is this something that would fit in your timeline?
@amyeroberts I'm done with the architectural changes @NielsRogge suggested and am uploading new weights for LLaVa and LLaVa 1.5. Writing new tests and documentation could take 2-3 days as I'm pretty caught up with my placements.
Hi @shauray8 - glad to hear the arch changes are done! I can see that there are still outstanding suggestions from @rafaelpadilla's review which will also need to be addressed alongside tests etc. As there are currently 3 in-progress model PRs - #25001, #26360, #25789 - all of which we'd like to have in the library soon, and you mention you're busy with placements, I propose that you continue with one and someone else can help finish off the other PRs. As Llava is the most complete and has already had some reviews, this is the one I suggest you focus on. Let us know if you need any help!
Hi @shauray8!
@younesbelkada Thank you for the positive feedback! I appreciate your willingness to help complete the PR. Opening a new PR would be a much cleaner way of doing it, but I don't mind the approach. I'm open to discussing any details or providing additional information you might need.
What does this PR do?
Adds LLAVA to transformers.
author - https://github.com/haotian-liu/LLaVA
hub - https://huggingface.co/shauray/Llava-Llama-2-7B-hf
Fixes #25060
Who can review?
@ArthurZucker @amyeroberts
@younesbelkada
Results
prompt - "How would you best describe this image?"
-- The photograph shows a wooden dock floating on the water, with mountains in the background. It is an idyllic scene that captures both natural beauty and human-made structures like docks at their most serene state of being surrounded by nature's wonders such as lakes or oceans (in case it isn’t just any body). This type of setting can be found all over North America where there are numerous bodies of freshwater available for recreational activities including fishing from piers near these locations; however, they also provide opportunities to observe wildlife