
Add llama3-llava-next-8b to llava_next conversion script #31395

Merged

Conversation

@jamt9000 (Contributor) commented Jun 12, 2024

What does this PR do?

Adds support for the lmms-lab/llama3-llava-next-8b model to the convert_llava_next_weights_to_hf.py script, along with an example prompt generated from the llava_llama_3 conv_template in the LLaVA-NeXT repo.

I've confirmed that the model converts without errors and that the output is similar (but not identical) to the inference example from the LLaVA-NeXT repo; however, I'm not sure if the logic around the added tokens is correct.

The batched generation also seems to be off:

Batched generation...
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
['system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nWhat is shown in this image?assistant\n\n\nThe image appears to be a radar chart, also known as a spider chart or a web chart', '[INST] \nHow many cats are there? [/INST]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']

Fixes #31394

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge @zucchini-nlp

@zucchini-nlp (Member) left a comment

Hey! Thanks so much for adding support for new checkpoints ❤️

Just a few comments:

  • Can you also add the chat format to the docs, to let users know which format to use? We don't have a chat_template yet, so it's better to indicate it in the docs for now.
  • Weird that batched generation didn't work. Can you verify that tokenizer.padding_side = "left", or maybe it's an incorrect prompt format? (See the sketch after this list.)
  • I am wondering if we need to add the other checkpoints, of 72b and 110b size. This is more of a question for @NielsRogge.
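
A minimal sketch of the left-padded batched run being suggested here; the local checkpoint path, image URLs, and prompts are illustrative and not part of the PR:

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

checkpoint = "/tmp/llava_llama3"  # illustrative path to the converted weights

processor = LlavaNextProcessor.from_pretrained(checkpoint)
# Decoder-only models should be left-padded at inference time; right padding puts pad
# tokens between the prompt and the newly generated tokens.
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Prompts in the llama3-llava chat format (system turn omitted here for brevity).
prompts = [
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nHow many cats are there?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n",
]

inputs = processor(text=prompts, images=[image1, image2], padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```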

@zucchini-nlp zucchini-nlp requested a review from NielsRogge June 13, 2024 04:56
@NielsRogge (Contributor):

Yes, it would be great to have all checkpoints converted :)

@NielsRogge (Contributor):

Please also make sure to verify the outputs of the model against the original implementation, also at the logits level (ideally with torch.allclose(original_logits[0, :3, :3], huggingface_logits[0, :3, :3]) on a dummy input). Getting sensible output is usually not enough.
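
Not part of the original comment, but roughly the kind of check being asked for (the helper name is illustrative; original_logits and hf_logits are whatever the two forward passes return on the same dummy input):

```python
import torch

def compare_logits(original_logits: torch.Tensor, hf_logits: torch.Tensor, atol: float = 1e-4) -> None:
    """Compare a small slice of logits from the original repo and the HF port."""
    original_slice = original_logits[0, :3, :3].float().cpu()
    hf_slice = hf_logits[0, :3, :3].float().cpu()
    print("original:", original_slice)
    print("hf:      ", hf_slice)
    # Raises with a readable report of the max absolute/relative difference on mismatch.
    torch.testing.assert_close(original_slice, hf_slice, atol=atol, rtol=atol)
```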

@jamt9000 (Contributor, Author) commented Jun 15, 2024

Currently doing a bunch of debugging to try to get near-identical logits. Findings so far are:

I'd added <|begin_of_text|> to the prompt example, but it looks like this gets added automatically by the HF processor, so the input_ids had two begin-of-text ids. I've also checked that the 5-crop image preprocessing is completely identical.
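
A quick way to reproduce the double-BOS observation (model id assumed; the converted processor wraps the same Llama 3 tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# The fast tokenizer already prepends <|begin_of_text|> (id 128000), so spelling it out
# in the prompt string yields two BOS ids at the start of input_ids.
print(tokenizer("<|begin_of_text|>Hello").input_ids[:2])  # expected: [128000, 128000]
```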

After fixing the begin of text the logits are:

# LLaVA-NeXT repo
tensor([[ -3.9648,   1.1396,   3.3145],
        [ -5.3398,  -1.5537,  -1.9512],
        [-12.3828, -10.6797,  -9.3047]], device='cuda:0',
       grad_fn=<SliceBackward0>)

# HF LlavaNextForConditionalGeneration
tensor([[ -3.9648,   1.1396,   3.3145],
        [ -5.3594,  -1.5654,  -1.9619],
        [-12.3750, -10.6797,  -9.3125]], device='cuda:2', dtype=torch.float32)

This is close, but not within the atol=1e-4 used in the script. However, the generated text is now identical (and matches the quickstart example in the LLaVA-NeXT repo):

The image shows a radar chart, also known as a spider chart or a web chart, which is a type of graph used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nIn this particular radar chart, there are several axes labeled with different variables, such as "MM-Vet," "LL

This token gets added automatically, so it should not be included in the
prompt example.
@jamt9000 (Contributor, Author) commented Jun 16, 2024

Also noting that Llama 3 and Qwen actually have extra unused space in their embeddings, so we don't need to resize.

For Llama 3, these spare slots are included in the tokenizer as <|reserved_special_token_0|> etc. (you can see them all in https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json), and from the discussion in meta-llama/llama3#77 it looks like they are available for use (although renaming a reserved token to <image> might involve some hacky stuff: #27974 (comment)).

For Qwen, the unused space isn't already allocated in the tokenizer, but len(tokenizer) == 151646 whereas config.text_config.vocab_size == 152064, due to padding to a multiple of 128 (see QwenLM/Qwen#553).
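
A rough sketch of the Qwen-style check described above (the helper name and example id are illustrative; for Llama 3 the spare capacity instead takes the form of <|reserved_special_token_*|> entries already present in the tokenizer):

```python
from transformers import AutoConfig, AutoTokenizer

def has_spare_embedding_rows(model_id: str) -> bool:
    """True if the embedding matrix is larger than the tokenizer vocab, i.e. an unused
    row could hold <image> without calling model.resize_token_embeddings()."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = AutoConfig.from_pretrained(model_id)
    return config.vocab_size > len(tokenizer)

# e.g. has_spare_embedding_rows("Qwen/Qwen1.5-110B-Chat")  # expected True per the note above
```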

Adds the Qwen-based LLaVA-Next models to the conversion script, along
with changes to load the models on multiple GPUs for inference.
@whyiug commented Jun 28, 2024

Hey, I wonder why this PR is still pending.

@zucchini-nlp (Member):

Sorry, I totally forgot about this PR. I think the only things left now are to fix batched generation, if not done yet, and to create a section in the docs for the new models, as outlined above.

@jamt9000 let me know if you're stuck and need help or ping me if it's ready for review :)

@jamt9000 (Contributor, Author):

I've added the prompt examples to the docs and checked that, with the correct prompt and padding_side = "left", batched generation is working (but with padding_side = "right" it still doesn't generate any answer to the cats question).

[
'system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nWhat is shown in this image?assistant\n\n\nThe image shows a radar chart, also known as a spider chart or a web chart, which',
'system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nHow many cats are there?assistant\n\n\nThere are two cats in the image.'
]

What I'm still stuck on is:

  • I'm unsure whether the llama3 logits need to be within a tighter tolerance of the reference.
  • Should batched generation be working with padding_side = "right"? And why does it say "Setting pad_token_id to eos_token_id:128009 for open-end generation." when a padding token has been added?
  • Whether it's necessary to resize the vocabulary (or if it could use a reserved token instead, see #31395 (comment)).

@zucchini-nlp (Member):

@jamt9000 again, sorry for the delayed reply, I was off on vacation. Regarding the questions:

  • Yes, padding_side="left" for batched inference is exactly what we need, as right padding is used only for training.
  • If we can make use of reserved tokens, then it's not necessary to resize the vocab embeddings. Just make sure that the tokenizer and model both have the correct image token ids in their configs. A simple test comparing against the original implementation should be okay.
  • Hmm, from the previous example it seems like the higher logit difference is a result of using different dtypes in the original and HF implementations. AFAIK the original always uses bf16, so it might be interesting to check (see the sketch after this list). In any case, the current logits difference looks okay to me.
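
A minimal sketch of that dtype check, assuming a locally converted checkpoint path:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Load the HF port in bfloat16 so both implementations run in the same precision
# before re-running the logits comparison (the path is illustrative).
model = LlavaNextForConditionalGeneration.from_pretrained(
    "/tmp/llava_llama3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```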

@zucchini-nlp (Member) left a comment

I left some comments. Let me know if you have bandwidth to address them, otherwise I can finish the PR for you and merge it next week :)

Comment on lines 69 to 80
llama3-llava-next-8b-hf requires the following format:

```bash
"<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

llava-next-72b-hf and llava-next-110b-hf require the following format:

```bash
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"
```

Member:

Great job! ❤️

We're trying to move to using processor.apply_chat_template(), so I will also add a chat_template for these models when uploading the weights.
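
An illustrative sketch of that flow, assuming the converted checkpoint is published as llava-hf/llama3-llava-next-8b-hf with a chat_template attached:

```python
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf")

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]},
]
# Renders the conversation with the checkpoint's chat_template instead of hand-writing
# the <|start_header_id|>... format documented above.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```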

Comment on lines 190 to 193
# These big models need multi-GPU inference, so reload the model with device_map="auto" in order to use accelerate
# Is there a way to do this without saving and reloading the model?
model.save_pretrained("/tmp/llava_qwen")
model = LlavaNextForConditionalGeneration.from_pretrained("/tmp/llava_qwen", device_map="auto")
Member:

Hmm, IMO it's okay to save and load the weights back. I would even say that pytorch_dump_folder_path doesn't have to be optional, so we save the model there and then load it back for inference.

Comment on lines 284 to 343
text=[prompt, "[INST] <image>\nHow many cats are there? [/INST]"],
text=[prompt, cats_prompt],
Member:

This can be changed to [prompt, prompt] so we don't have to manually match chat templates for the second prompt.

@jamt9000 (Contributor, Author):

@zucchini-nlp Thanks! I'm on vacation this week, so do go ahead if you'd like to finish the PR yourself!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp requested a review from amyeroberts July 19, 2024 10:38
@zucchini-nlp (Member):

Ready for review. I made the final changes and uploaded the weights to the Hub, along with their chat templates.

@amyeroberts (Collaborator) left a comment

Thanks for adding!

Some general comments, e.g. not pushing to the Hub automatically.

],
required=False,
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
"--pytorch_dump_folder_path", type=str, required=True, help="Path to the output PyTorch model directory."
Collaborator:

Why make this required?

Member:

We want to save the model and then load it again, so that it's loaded by automatically inferring how to split the weights across devices/CPU/disk depending on how much memory is available.

This is especially required for the big models we're adding: they need multi-GPU inference, but we can't init the model from the config in a multi-GPU setting.

zucchini-nlp and others added 4 commits on July 22, 2024 (co-authored by amyeroberts).
@zucchini-nlp (Member):

@amyeroberts ready for review, addressed the comments.

@zucchini-nlp zucchini-nlp requested a review from amyeroberts July 22, 2024 05:18
@amyeroberts (Collaborator) left a comment

Thanks for adding and iterating!

@zucchini-nlp zucchini-nlp merged commit 251a240 into huggingface:main Jul 23, 2024
8 checks passed
@zucchini-nlp (Member):

Thanks @jamt9000 💛 The converted model weights are already uploaded to the llava-hf org on the Hub!
