Mllama: update docs #34334

Merged · 4 commits · Oct 30, 2024

26 changes: 26 additions & 0 deletions docs/source/en/model_doc/mllama.md
@@ -30,6 +30,32 @@ The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a
- The text passed to the processor should have the `"<|image|>"` tokens where the images should be inserted.
- The processor has its own `apply_chat_template` method to convert chat messages to text that can then be passed as text to the processor, as sketched below.

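For example, a minimal sketch of that flow (the checkpoint name and the blank test image are assumptions for illustration):

```python
from PIL import Image
from transformers import AutoProcessor

# Checkpoint name is an assumption for illustration
processor = AutoProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# `apply_chat_template` inserts the `"<|image|>"` placeholder into the prompt
text = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.new("RGB", (224, 224))  # placeholder image for the sketch
inputs = processor(images=image, text=text, return_tensors="pt")
```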

<Tip warning={true}>

Mllama has an extra token, `"<|image|>"`, used as a placeholder for image positions in the text. This means that the input ids and the input embedding layer have one extra token. However, since the weights for the input and output embeddings are not tied, the `lm_head` layer has one token fewer and will fail if you want to calculate loss on image tokens or apply logit processors to them. If you are training, make sure to mask out the special `"<|image|>"` tokens in the `labels`, as the model should not be trained on predicting them; see the sketch below.

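A minimal sketch of that masking, assuming `processor` and `inputs` built as in the example above (`-100` is the default ignore index of PyTorch's cross-entropy loss):

```python
image_token_id = processor.tokenizer.convert_tokens_to_ids("<|image|>")

labels = inputs["input_ids"].clone()
labels[labels == image_token_id] = -100  # positions set to -100 are ignored by the loss
```
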
Otherwise, if you see CUDA-side index errors when generating, use the code below to expand the `lm_head` by one more token.


```python
import torch

# Fit a multivariate normal to the existing `lm_head` rows so the new row is
# sampled from a similar distribution
pre_expansion_embeddings = model.language_model.lm_head.weight.data
mu = torch.mean(pre_expansion_embeddings, dim=0).float()
n = pre_expansion_embeddings.size()[0]
sigma = ((pre_expansion_embeddings - mu).T @ (pre_expansion_embeddings - mu)) / n
dist = torch.distributions.multivariate_normal.MultivariateNormal(mu, covariance_matrix=1e-5 * sigma)

num_new_tokens = 1  # 1 for the `"<|image|>"` token
lm_head_weights = model.language_model.lm_head.weight

# Sample one row per new token and append it to the head's weight matrix
new_token_embedding = torch.stack(tuple(dist.sample() for _ in range(num_new_tokens)), dim=0).to(device=lm_head_weights.device, dtype=lm_head_weights.dtype)
lm_head_weights.data = torch.cat([lm_head_weights.data, new_token_embedding], dim=0)
# Keep the bookkeeping attribute in sync with the new size
lm_head_weights.num_embeddings = lm_head_weights.data.shape[0]
```
Collaborator:
This should already be done internally if you use the correct flag for resizing token embeddings.

Member Author:
From what I see, we don't let users specify which embeddings to resize, and we use the input embeddings by default. In case the weights are tied (not the case for Mllama) we also resize the output embeddings.

Or do you mean there is another method similar to `resize_token_embeddings`? I might have overlooked that.

Collaborator:
We use the output of `get_input_embeddings`, which should always be the input embedding, and by default the output embedding from `get_output_embeddings` is resized when the weights are tied. But you are right, you can't only resize the lm head.

Though there might be some util function you can re-use, no? 🤗 Feel free to merge!

Member Author:
Right, there was a way to hide all the ugly code in private methods.
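
For reference, a minimal sketch of the generic util discussed above, assuming `model` and `processor` as in the earlier examples; it resizes the input embeddings and keeps the output embeddings in sync, so it cannot resize only the `lm_head`:

```python
# Resizes `get_input_embeddings()` and keeps the output embeddings in sync;
# there is no flag to resize only the lm_head.
model.resize_token_embeddings(len(processor.tokenizer))
```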

</Tip>


## Usage Example

#### Instruct model