
[Question] how to run Llama-3.1-Minitron-4B-Width-Base #2820

Closed
huanglizhuo opened this issue Aug 18, 2024 · 9 comments
@huanglizhuo
Contributor

❓ General Questions

I am trying to run Llama-3.1-Minitron-4B-Width-Base. In its README they mention:

Pull requests to support this model in Hugging Face Transformers are currently under review (#32495 and #32502) and are expected to be merged soon. In the meantime, please follow the installation instructions below:


# Fetch PR 32502
$ git clone -b suhara/llama-kv-channels --single-branch https://github.com/suhara/transformers.git && cd transformers

# Fetch changes from PR 32495
$ git fetch https://github.com/suiyoubi/transformers.git aot/head_dim_rope && git cherry-pick FETCH_HEAD --strategy-option theirs

# Install transformers
$ pip install -e .

After checking the PRs mentioned above, I found that head_dim is already supported by mlc-llm, and it looks like the assert self.head_dim * self.num_attention_heads == self.hidden_size in llama_model.py is not required, so I did the following steps:

  1. Remove assert self.head_dim * self.num_attention_heads == self.hidden_size from llama_model.py.
  2. Build mlc_llm from source and follow the steps here: https://llm.mlc.ai/docs/compilation/compile_models.html#compile-model-libraries to convert the weights and compile the model library.
  3. Verify the output with chat, but the chat response is nonsense, sometimes mixing different languages with nonsensical content.

I think there must be some misunderstanding on my side. Can anyone give a hint about which direction I should check in order to run the Llama-3.1-Minitron-4B-Width-Base model?

@huanglizhuo huanglizhuo added the question Question about the usage label Aug 18, 2024
@suhara

suhara commented Aug 18, 2024

@huanglizhuo

I think you are referring to this block:
https://github.com/suhara/transformers/blob/e0af55227f022c535b5e71ebc89257956cede8bf/src/transformers/models/llama/modeling_llama.py#L350-L354

        if self.head_dim is None and (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )

Because of L343, which defines self.head_dim, the block will never be used. I'll remove it from the branch. But it should work even with this block.
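For context, the gist of that change is that head_dim is taken from the config when it is set, instead of always being derived from hidden_size. A simplified sketch of the pattern (not the exact code from the branch or PR):

# Simplified illustration of a configurable head_dim; not the real LlamaAttention.
import torch.nn as nn

class AttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, head_dim: int | None = None):
        super().__init__()
        self.num_heads = num_heads
        # Use the explicit head_dim when the config provides one, otherwise
        # fall back to the usual hidden_size // num_heads.
        self.head_dim = head_dim if head_dim is not None else hidden_size // num_heads
        # Projections are sized from num_heads * head_dim, so head_dim * num_heads
        # no longer has to equal hidden_size.
        self.q_proj = nn.Linear(hidden_size, self.num_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, hidden_size, bias=False)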

FYI, PR #32495 has been merged now

  1. verify output and chat, but the chat response is nonsense, sometimes mix different language with nonsense content.

The model is a base model, not an instruct model. It may still have some minimal conversational ability. Please post a question/request on the HF model hub page.
https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

@suhara thank you for checking, but the mlc-llm codebase has similar code here:

assert self.head_dim * self.num_attention_heads == self.hidden_size

But based on the PR you mention in https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base, self.head_dim * self.num_attention_heads will not equal self.hidden_size. Is my understanding correct, or did I miss something?
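For concreteness, with the values I see in the model's config.json (please double-check them), the product is 4096 rather than 3072:

# Config values as I read them from Llama-3.1-Minitron-4B-Width-Base (please verify).
hidden_size = 3072
num_attention_heads = 32
head_dim = 128  # kept at the original Llama-3.1-8B head size

print(head_dim * num_attention_heads)  # 4096, which is not equal to hidden_size (3072)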

@suhara

suhara commented Aug 19, 2024

@huanglizhuo Your understanding is correct. The custom head_dim should be supported on the MLC side as well. Each inference engine (e.g., HF, Llama.cpp, MLC) should support the architecture.

I'm not familiar with MLC at all, but do you think you can make the necessary changes?

FYI, you can refer to the PR for HF.
https://github.com/huggingface/transformers/pull/32502/files

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

@suhara Thank you for confirming. I actually checked your PR for HF, and head_dim is already supported by MLC; the only blocker seems to be the line below:

assert self.head_dim * self.num_attention_heads == self.hidden_size

I removed it as I mentioned here, but the chat output is nonsense. Let me read the MLC code more carefully and see if I can find what I missed.
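For what it's worth, one way to relax the check instead of deleting it outright could look roughly like this (just a sketch; resolve_head_dim is a hypothetical helper, and the parameter names simply mirror the config fields):

# Sketch: only derive head_dim when it is not given explicitly, and only
# enforce the divisibility check in that derived case.
def resolve_head_dim(hidden_size: int, num_attention_heads: int, head_dim: int = 0) -> int:
    if head_dim:
        # Explicit head_dim (e.g. this width-pruned model): trust the config value.
        return head_dim
    # Derived head_dim: keep the original sanity check.
    assert hidden_size % num_attention_heads == 0
    return hidden_size // num_attention_heads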

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

--quantization QUANTIZATION_MODE
The quantization mode we use to compile.
See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.
We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

By the way, when I do the weight conversion I use --quantization q4f16_1. Is it possible that the quantization causes the model output to be nonsense?

@huanglizhuo
Contributor Author

Will try converting the weights without --quantization q4f16_1 to see if there is any difference.

@huanglizhuo
Contributor Author

Tried with --quantization q4f32_1; it still gives nonsense responses 😢
There must be something missing. I will try to read more of the mlc source code.

@YiyanZhai YiyanZhai self-assigned this Aug 19, 2024
@YiyanZhai
Contributor

Hi @huanglizhuo, thank you for bringing this issue to our attention.

The removal of assert self.head_dim * self.num_attention_heads == self.hidden_size from llama_model.py is now included in PR #2848.

However, we've encountered a similar observation when running inference on the Llama-3.1-Minitron-4B-Width-Base model using Hugging Face's transformers library directly:
[Screenshot (2024-08-23): sample generation from the transformers run]
The model appears to produce reasonable output initially, but after generating a certain number of tokens the output begins to lose coherence. This suggests that while removing the assertion allows the model to run, the model is not optimized for, and possibly not intended for, open-ended text generation or conversational tasks.
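For anyone who wants to reproduce that HF-side check, something along these lines should work (assuming a transformers build that already includes the head_dim PRs; note the plain completion-style prompt, since this is a base model):

# Minimal HF-side sanity check; plain completion prompt, no chat template.
# device_map="auto" needs the accelerate package installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Minitron-4B-Width-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))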

@huanglizhuo
Contributor Author

@YiyanZhai thank you for the update. Then the issue is actually due to:

The model is a base model, not an instruct model. It may still have some minimal conversational ability. Please post a question/request on the HF model hub page.
