
[Question] how to run Llama-3.1-Minitron-4B-Width-Base #2820

Closed
huanglizhuo opened this issue Aug 18, 2024 · 9 comments
@huanglizhuo
Contributor

❓ General Questions

I am trying to run Llama-3.1-Minitron-4B-Width-Base. In its README they mention:

Pull requests to support this model in Hugging Face Transformers are currently under review (#32495 and #32502) and are expected to be merged soon. In the meantime, please follow the installation instructions below:


# Fetch PR 32502
$ git clone -b suhara/llama-kv-channels --single-branch https://github.com/suhara/transformers.git && cd transformers

# Fetch changes from PR 32495
$ git fetch https://github.com/suiyoubi/transformers.git aot/head_dim_rope && git cherry-pick FETCH_HEAD --strategy-option theirs

# Install transformers
$ pip install -e .

After checking the PRs mentioned above, I found that head_dim is already supported by mlc-llm, and it looks like the assert self.head_dim * self.num_attention_heads == self.hidden_size in llama_model.py is not required, so I did the following steps:

  1. Remove assert self.head_dim * self.num_attention_heads == self.hidden_size from llama_model.py.
  2. Build mlc_llm from source and follow the steps here: https://llm.mlc.ai/docs/compilation/compile_models.html#compile-model-libraries to convert the weights and compile the model library.
  3. Verify the output with chat, but the chat response is nonsense, sometimes mixing different languages with nonsensical content.

I think there must be some misunderstanding on my side. Can anyone give a hint about which direction I should check in order to run the Llama-3.1-Minitron-4B-Width-Base model?

@huanglizhuo huanglizhuo added the question Question about the usage label Aug 18, 2024
@suhara

suhara commented Aug 18, 2024

@huanglizhuo

I think you are referring to this block:
https://github.com/suhara/transformers/blob/e0af55227f022c535b5e71ebc89257956cede8bf/src/transformers/models/llama/modeling_llama.py#L350-L354

        if self.head_dim is None and (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )

Because of L343, which defines self.head_dim, the block will never be used. I'll remove it from the branch. But it should work even with this block.
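For context, the gist of that change is that head_dim is taken from the config when it is set, instead of always being derived from hidden_size. A simplified sketch of the pattern (not the exact code from the branch or PR):

# Simplified illustration of a configurable head_dim; not the real LlamaAttention.
import torch.nn as nn

class AttentionSketch(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, head_dim: int | None = None):
        super().__init__()
        self.num_heads = num_heads
        # Use the explicit head_dim when the config provides one, otherwise
        # fall back to the usual hidden_size // num_heads.
        self.head_dim = head_dim if head_dim is not None else hidden_size // num_heads
        # Projections are sized from num_heads * head_dim, so head_dim * num_heads
        # no longer has to equal hidden_size.
        self.q_proj = nn.Linear(hidden_size, self.num_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, hidden_size, bias=False)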

FYI, PR #32495 has been merged now

  1. verify output and chat, but the chat response is nonsense, sometimes mix different language with nonsense content.

The model is a base model, not an instruct model. It may still have some minimal conversational ability. Please post a question/request on the HF model hub page.
https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

@suhara thank you for checking, but the mlc-llm codebase has similar code here:

assert self.head_dim * self.num_attention_heads == self.hidden_size

But based on the PR you mention in https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base, self.head_dim * self.num_attention_heads will not equal self.hidden_size. Is my understanding correct, or did I miss something?
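For concreteness, with the values I see in the model's config.json (please double-check them), the product is 4096 rather than 3072:

# Config values as I read them from Llama-3.1-Minitron-4B-Width-Base (please verify).
hidden_size = 3072
num_attention_heads = 32
head_dim = 128  # kept at the original Llama-3.1-8B head size

print(head_dim * num_attention_heads)  # 4096, which is not equal to hidden_size (3072)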

@suhara

suhara commented Aug 19, 2024

@huanglizhuo Your understanding is correct. The custom head_dim should be supported on the MLC side as well. Each inference engine (e.g., HF, Llama.cpp, MLC) should support the architecture.

I'm not familiar with MLC at all, but do you think you can make the necessary changes?

FYI, you can refer to the PR for HF.
https://github.com/huggingface/transformers/pull/32502/files

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

@suhara Thank you for confirming. I actually checked your PR for HF, and head_dim is already supported by MLC; the only blocker seems to be the line below:

assert self.head_dim * self.num_attention_heads == self.hidden_size

I removed it as I mentioned here, but the chat output is nonsense. Let me read the MLC code more carefully and see if I can find what I missed.
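For what it's worth, one way to relax the check instead of deleting it outright could look roughly like this (just a sketch; resolve_head_dim is a hypothetical helper, and the parameter names simply mirror the config fields):

# Sketch: only derive head_dim when it is not given explicitly, and only
# enforce the divisibility check in that derived case.
def resolve_head_dim(hidden_size: int, num_attention_heads: int, head_dim: int = 0) -> int:
    if head_dim:
        # Explicit head_dim (e.g. this width-pruned model): trust the config value.
        return head_dim
    # Derived head_dim: keep the original sanity check.
    assert hidden_size % num_attention_heads == 0
    return hidden_size // num_attention_heads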

@huanglizhuo
Contributor Author

huanglizhuo commented Aug 19, 2024

--quantization QUANTIZATION_MODE
The quantization mode we use to compile.
See Quantization Mode for more information. Available options are: q0f16, q0f32, q3f16_1, q4f16_1, q4f32_1, and q4f16_awq.
We encourage you to use 4-bit quantization, as the text generated by 3-bit quantized models may have bad quality depending on the model.

By the way, when I do the weight conversion I use --quantization q4f16_1. Is it possible that the quantization causes the model output to be nonsense?

@huanglizhuo
Contributor Author

Will try converting the weights without --quantization q4f16_1 to see if there is any difference.

@huanglizhuo
Contributor Author

Tried with --quantization q4f32_1; it still gives nonsense responses 😢
There must be something missing. I will try to read more of the mlc source code.

@YiyanZhai YiyanZhai self-assigned this Aug 19, 2024
@YiyanZhai
Contributor

Hi @huanglizhuo, thank you for bringing this issue to our attention.

The removal of assert self.head_dim * self.num_attention_heads == self.hidden_size from llama_model.py is now included in PR #2848.

However, we've encountered a similar observation when running inference on the Llama-3.1-Minitron-4B-Width-Base model using Hugging Face's transformers library directly:
[Screenshot (2024-08-23): sample generation from the transformers run]
The model appears to produce reasonable output initially, but after generating a certain number of tokens the output begins to lose coherence. This suggests that while removing the assertion allows the model to run, the model is not optimized for, and possibly not intended for, open-ended text generation or conversational tasks.
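For anyone who wants to reproduce that HF-side check, something along these lines should work (assuming a transformers build that already includes the head_dim PRs; note the plain completion-style prompt, since this is a base model):

# Minimal HF-side sanity check; plain completion prompt, no chat template.
# device_map="auto" needs the accelerate package installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Minitron-4B-Width-Base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))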

@huanglizhuo
Contributor Author

@YiyanZhai thank you for the update. Then the issue is actually due to:

The model is a base model, not an instruct model. It may still have some minimal conversational ability. Please post a question/request on the HF model hub page.
