
Add llama3-llava-next-8b to llava_next conversion script #31395

Merged

Conversation

@jamt9000 (Contributor) commented Jun 12, 2024

What does this PR do?

Adds support for the lmms-lab/llama3-llava-next-8b model to the convert_llava_next_weights_to_hf.py script, along with an example prompt generated from the llava_llama_3 conv_template in the LLaVA-NeXT repo.

I've confirmed that the model converts without errors and that the output is similar (but not identical) to the inference example from the LLaVA-NeXT repo; however, I'm not sure if the logic around the added tokens is correct.

The batched generation also seems to be off:

Batched generation...
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
['system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nWhat is shown in this image?assistant\n\n\nThe image appears to be a radar chart, also known as a spider chart or a web chart', '[INST] \nHow many cats are there? [/INST]\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n']

Fixes #31394

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@NielsRogge @zucchini-nlp

@zucchini-nlp (Member) left a comment

Hey! Thanks so much for adding support for new checkpoints ❤️

Just a few comments:

  • Can you also add the chat format to the docs, to let users know which format to use? We don't have a chat_template yet, so it's better to indicate it in the docs for now.
  • Weird that batched generation didn't work. Can you verify that tokenizer.padding_side = "left", or maybe it's an incorrect prompt format? (See the sketch after this list.)
  • I am wondering if we need to add the other checkpoints, of 72b and 110b size. This is more of a question for @NielsRogge.
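
A minimal sketch of the left-padded batched run being suggested here; the local checkpoint path, image URLs, and prompts are illustrative and not part of the PR:

```python
import requests
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

checkpoint = "/tmp/llava_llama3"  # illustrative path to the converted weights

processor = LlavaNextProcessor.from_pretrained(checkpoint)
# Decoder-only models should be left-padded at inference time; right padding puts pad
# tokens between the prompt and the newly generated tokens.
processor.tokenizer.padding_side = "left"

model = LlavaNextForConditionalGeneration.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

image1 = Image.open(requests.get("https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)
image2 = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)

# Prompts in the llama3-llava chat format (system turn omitted here for brevity).
prompts = [
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n",
    "<|start_header_id|>user<|end_header_id|>\n\n<image>\nHow many cats are there?<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n",
]

inputs = processor(text=prompts, images=[image1, image2], padding=True, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True))
```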

@zucchini-nlp zucchini-nlp requested a review from NielsRogge June 13, 2024 04:56
@NielsRogge (Contributor):

Yes, it would be great to have all checkpoints converted :)

@NielsRogge (Contributor):

Please also make sure to verify the outputs of the model against the original implementation, also at the logits level (ideally with torch.allclose(original_logits[0, :3, :3], huggingface_logits[0, :3, :3]) on a dummy input). Getting sensible output is usually not enough.
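
Not part of the original comment, but roughly the kind of check being asked for (the helper name is illustrative; original_logits and hf_logits are whatever the two forward passes return on the same dummy input):

```python
import torch

def compare_logits(original_logits: torch.Tensor, hf_logits: torch.Tensor, atol: float = 1e-4) -> None:
    """Compare a small slice of logits from the original repo and the HF port."""
    original_slice = original_logits[0, :3, :3].float().cpu()
    hf_slice = hf_logits[0, :3, :3].float().cpu()
    print("original:", original_slice)
    print("hf:      ", hf_slice)
    # Raises with a readable report of the max absolute/relative difference on mismatch.
    torch.testing.assert_close(original_slice, hf_slice, atol=atol, rtol=atol)
```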

@jamt9000 (Contributor, Author) commented Jun 15, 2024

Currently doing a bunch of debugging to try to get near-identical logits. Findings so far are:

I'd added <|begin_of_text|> to the prompt example, but it looks like this gets added automatically by the HF processor, so the input_ids had two begin-of-text ids. I've also checked that the 5-crop image preprocessing is completely identical.
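
A quick way to reproduce the double-BOS observation (model id assumed; the converted processor wraps the same Llama 3 tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
# The fast tokenizer already prepends <|begin_of_text|> (id 128000), so spelling it out
# in the prompt string yields two BOS ids at the start of input_ids.
print(tokenizer("<|begin_of_text|>Hello").input_ids[:2])  # expected: [128000, 128000]
```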

After fixing the begin of text the logits are:

# LLaVA-NeXT repo
tensor([[ -3.9648,   1.1396,   3.3145],
        [ -5.3398,  -1.5537,  -1.9512],
        [-12.3828, -10.6797,  -9.3047]], device='cuda:0',
       grad_fn=<SliceBackward0>)

# HF LlavaNextForConditionalGeneration
tensor([[ -3.9648,   1.1396,   3.3145],
        [ -5.3594,  -1.5654,  -1.9619],
        [-12.3750, -10.6797,  -9.3125]], device='cuda:2', dtype=torch.float32)

This is close, but not within the atol=1e-4 used in the script. However, the generated text is now identical (and matches the quickstart example in the LLaVA-NeXT repo):

The image shows a radar chart, also known as a spider chart or a web chart, which is a type of graph used to display multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point. Each axis represents a different variable, and the values are plotted along each axis and connected to form a polygon.\n\nIn this particular radar chart, there are several axes labeled with different variables, such as "MM-Vet," "LL

This token gets added automatically, so it should not be included in the
prompt example.
@jamt9000 (Contributor, Author) commented Jun 16, 2024

Also noting that Llama 3 and Qwen actually have extra unused space in their embeddings, so we don't need to resize.

For Llama 3, these spare slots are included in the tokenizer as <|reserved_special_token_0|> etc. (you can see them all in https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/blob/main/tokenizer_config.json), and from the discussion in meta-llama/llama3#77 it looks like they are available for use (although renaming a reserved token to <image> might involve some hacky stuff: #27974 (comment)).

For Qwen, the unused space isn't already allocated in the tokenizer, but len(tokenizer) == 151646 whereas config.text_config.vocab_size == 152064, due to padding to a multiple of 128 (see QwenLM/Qwen#553).
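
A rough sketch of the Qwen-style check described above (the helper name and example id are illustrative; for Llama 3 the spare capacity instead takes the form of <|reserved_special_token_*|> entries already present in the tokenizer):

```python
from transformers import AutoConfig, AutoTokenizer

def has_spare_embedding_rows(model_id: str) -> bool:
    """True if the embedding matrix is larger than the tokenizer vocab, i.e. an unused
    row could hold <image> without calling model.resize_token_embeddings()."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    config = AutoConfig.from_pretrained(model_id)
    return config.vocab_size > len(tokenizer)

# e.g. has_spare_embedding_rows("Qwen/Qwen1.5-110B-Chat")  # expected True per the note above
```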

Adds the Qwen-based LLaVA-Next models to the conversion script, along
with changes to load the models on multiple GPUs for inference.
@whyiug commented Jun 28, 2024

Hey, I wonder why this PR is still pending.

@zucchini-nlp (Member):

Sorry, I totally forgot about this PR. I think the only things left now are to fix batched generation, if not done yet, and to create a section in the docs for the new models, as outlined above.

@jamt9000 let me know if you're stuck and need help or ping me if it's ready for review :)

@jamt9000 (Contributor, Author):

I've added the prompt examples to the docs and checked that, with the correct prompt and padding_side = "left", batched generation is working (but with padding_side = "right" it still doesn't generate any answer to the cats question).

[
'system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nWhat is shown in this image?assistant\n\n\nThe image shows a radar chart, also known as a spider chart or a web chart, which',
'system\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.user\n\n\nHow many cats are there?assistant\n\n\nThere are two cats in the image.'
]

What I'm still stuck on is:

  • I'm unsure whether the llama3 logits need to be within a tighter tolerance of the reference.
  • Should batched generation be working with padding_side = "right"? And why does it say "Setting pad_token_id to eos_token_id:128009 for open-end generation." when a padding token has been added?
  • Whether it's necessary to resize the vocabulary (or if it could use a reserved token instead, see #31395 (comment)).

@zucchini-nlp (Member):

@jamt9000 again, sorry for the delayed reply, I was off on vacation. Regarding the questions:

  • Yes, padding_side="left" for batched inference is exactly what we need, as right padding is used only for training.
  • If we can make use of reserved tokens, then it's not necessary to resize the vocab embeddings. Just make sure that the tokenizer and model both have the correct image token ids in their configs. A simple test comparing against the original implementation should be okay.
  • Hmm, from the previous example it seems like the higher logit difference is a result of using different dtypes in the original and HF implementations. AFAIK the original always uses bf16, so it might be interesting to check (see the sketch after this list). In any case, the current logits difference looks okay to me.
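
A minimal sketch of that dtype check, assuming a locally converted checkpoint path:

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Load the HF port in bfloat16 so both implementations run in the same precision
# before re-running the logits comparison (the path is illustrative).
model = LlavaNextForConditionalGeneration.from_pretrained(
    "/tmp/llava_llama3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```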

@zucchini-nlp (Member) left a comment

I left some comments. Let me know if you have bandwidth to address them, otherwise I can finish the PR for you and merge it next week :)

Comment on lines 69 to 80
llama3-llava-next-8b-hf requires the following format:

```bash
"<|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|><|start_header_id|>user<|end_header_id|>\n\n<image>\nWhat is shown in this image?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

llava-next-72b-hf and llava-next-110b-hf require the following format:

```bash
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|>\n<|im_start|>assistant\n"
```

Member:

Great job! ❤️

We're trying to move to using processor.apply_chat_template(), so I will also add a chat_template for these models when uploading the weights.
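
An illustrative sketch of that flow, assuming the converted checkpoint is published as llava-hf/llama3-llava-next-8b-hf with a chat_template attached:

```python
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf")

conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown in this image?"}]},
]
# Renders the conversation with the checkpoint's chat_template instead of hand-writing
# the <|start_header_id|>... format documented above.
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(prompt)
```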

Comment on lines 190 to 193
# These big models need multi-GPU inference, so reload the model with device_map="auto" in order to use accelerate
# Is there a way to do this without saving and reloading the model?
model.save_pretrained("/tmp/llava_qwen")
model = LlavaNextForConditionalGeneration.from_pretrained("/tmp/llava_qwen", device_map="auto")
Member:

Hmm, IMO it's okay to save and load the weights back. I would even say that pytorch_dump_folder_path doesn't have to be optional, so we save the model there and then load it back for inference.

Comment on lines 284 to 343
text=[prompt, "[INST] <image>\nHow many cats are there? [/INST]"],
text=[prompt, cats_prompt],
Member:

This can be changed to [prompt, prompt] so we don't have to manually match chat templates for the second prompt.

@jamt9000 (Contributor, Author):

@zucchini-nlp Thanks! I'm on vacation this week, so do go ahead if you'd like to finish the PR yourself!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp zucchini-nlp requested a review from amyeroberts July 19, 2024 10:38
@zucchini-nlp (Member):

Ready for review. I made the final changes and uploaded the weights to the Hub, along with their chat templates.

@amyeroberts (Collaborator) left a comment

Thanks for adding!

Some general comments, e.g. not pushing to the Hub automatically.

],
required=False,
)
parser.add_argument(
"--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory."
"--pytorch_dump_folder_path", type=str, required=True, help="Path to the output PyTorch model directory."
Collaborator:

Why make this required?

Member:

We want to save the model and then load it again, so that it's loaded by automatically inferring how to split the weights across devices/CPU/disk depending on how much memory is available.

This is especially required for the big models we're adding: they need multi-GPU inference, but we can't init the model from the config in a multi-GPU setting.

zucchini-nlp and others added 4 commits on July 22, 2024 (co-authored by amyeroberts).
@zucchini-nlp (Member):

@amyeroberts ready for review, addressed the comments.

@zucchini-nlp zucchini-nlp requested a review from amyeroberts July 22, 2024 05:18
@amyeroberts (Collaborator) left a comment

Thanks for adding and iterating!

@zucchini-nlp zucchini-nlp merged commit 251a240 into huggingface:main Jul 23, 2024
8 checks passed
@zucchini-nlp (Member):

Thanks @jamt9000 💛 The converted model weights are already uploaded to the llava-hf org on the Hub!
