Support Mistral-Nemo-Instruct-2407 128K #8577

Closed
4 tasks done
mirek190 opened this issue Jul 18, 2024 · 61 comments
Labels: enhancement (New feature or request), stale

Comments

@mirek190

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Any plans to support Mistral-Nemo-Instruct-2407 128K?

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

Motivation

enhancement

Possible Implementation

No response

@mirek190 mirek190 added the enhancement New feature or request label Jul 18, 2024
@0wwafa

0wwafa commented Jul 18, 2024

Yes, please. This one is going to be good, and soon finetunes will start to pop up...

@0wwafa

0wwafa commented Jul 18, 2024

And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

@delphijb

delphijb commented Jul 18, 2024

I second the request. This model is likely to become the reference for the 7-12B segment, and fine-tuned versions will indeed appear rapidly. Thanks in advance.

@stduhpf
Contributor

stduhpf commented Jul 18, 2024

They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.

@EliEron

EliEron commented Jul 18, 2024

They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.

The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.

For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. The Gemma-2 and Llama-3 tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
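
To illustrate why pre-tokenizers are easy to get subtly wrong, here is a minimal sketch of regex-based pre-tokenization as used by BPE tokenizers. The pattern shown is the GPT-2-style split, used purely as an illustration (Tekken defines its own pattern); if llama.cpp's regex differs even slightly from the reference tokenizer's, the chunk boundaries and therefore the final token IDs silently diverge.

# Illustration only: regex pre-tokenization splits text into chunks before BPE merges.
# This is the GPT-2-style pattern, not the actual Tekken pattern.
import regex  # third-party 'regex' package, needed for \p{...} classes

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pre_tokenize(text: str) -> list[str]:
    return regex.findall(GPT2_PATTERN, text)

print(pre_tokenize("Mistral-Nemo supports a 128k context!"))
# ['Mistral', '-', 'Nemo', ' supports', ' a', ' 128', 'k', ' context', '!']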

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 18, 2024
@iamlemec
Collaborator

I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with pre-tokenizer and the update script.

The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here #8576 (comment)). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:

PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).

@netrunnereve
Collaborator

If this model works well we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it'll probably end up being memory bound anyway.
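
For context, a rough back-of-envelope sketch of why single-batch decoding tends to be memory bound (the bandwidth and bytes-per-weight numbers below are illustrative assumptions, not measurements):

# At batch size 1, every generated token streams all weights from memory,
# so tokens/s is roughly memory bandwidth divided by model size.
params = 12.2e9          # Mistral-Nemo is ~12B parameters
bytes_per_weight = 1.0   # FP8 / Q8_0-class quantization, ~1 byte per weight
model_bytes = params * bytes_per_weight

bandwidth = 60e9         # assumed ~60 GB/s CPU memory bandwidth (illustrative)
print(f"~{bandwidth / model_bytes:.1f} tok/s if purely memory bound")  # ~4.9 tok/s
# Extra FP8 -> F32 conversion work matters less than the bandwidth ceiling.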

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 19, 2024
Removed uneeded `vocab.tokenizer_clean_spaces` assignment
@iamlemec
Collaborator

iamlemec commented Jul 19, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

@mirek190
Author

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Can't wait to test ;)

@foldl
Contributor

foldl commented Jul 20, 2024

Also, for those who are interested, chatllm.cpp supports this.

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Seems to work perfectly so far. Nice job.

ggerganov added a commit that referenced this issue Jul 20, 2024
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@muhammadyusuf-kurbonov

#8579 is merged

@legraphista
Contributor

Just quantized Mistral-Nemo-Instruct and trying to run it I get the following error:

llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model

Looks like there's a shape mismatch.

According to the config file, the hidden size should be 5120 https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
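
The mismatch comes from Mistral-Nemo decoupling the attention head size from the hidden size: the config also sets head_dim to 128 with 32 attention heads, so the Q projection is 32 × 128 = 4096 wide rather than 5120. A minimal sketch of the arithmetic behind the error above:

# Why llama.cpp expected a (5120, 5120) Q weight but the file has (5120, 4096).
# Values are from mistralai/Mistral-Nemo-Instruct-2407 config.json.
hidden_size = 5120
num_attention_heads = 32
head_dim = 128                            # set explicitly, NOT hidden_size // heads (== 160)

expected_q_out = hidden_size              # what the old Mistral code path assumed
actual_q_out = num_attention_heads * head_dim
print(expected_q_out, actual_q_out)       # 5120 4096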

@mirek190
Author

Where can I find a proper GGUF?

@maziyarpanahi

Hi @legraphista

I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

Hi @legraphista

I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

@iamlemec
Collaborator

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

@maziyarpanahi

Hi @legraphista
I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

I am actually trying to quantize it at the moment; since I saw it was done successfully here, I was wondering.

@mirek190
Author

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model

Your version also doesn't work.

@cwillu

cwillu commented Jul 20, 2024

@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.

@foldl
Contributor

foldl commented Jul 24, 2024

@mirek190 the point is that with SWA, 128k context length won't blow up your memory.

@ehartford

When I run ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf
I get NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

@ehartford

I already saw this line in convert_hf_to_gguf_update.py

{"name": "tekken",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },

I already executed convert_hf_to_gguf_update.py

but it still doesn't work.

As a guess, I tried this:

python ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf --model-name tekken

But that didn't work either.

@iamlemec
Collaborator

@ehartford They made some changes to tokenizer_config.json a day or two after release. See the commit here: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6. Updating yours similarly should make the checksum match up.
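
For background on why the checksum matters: convert_hf_to_gguf.py identifies the pre-tokenizer by tokenizing a fixed reference string with the upstream tokenizer and hashing the result, and convert_hf_to_gguf_update.py regenerates the list of known hashes. A simplified sketch of the idea (the probe string here is illustrative, not the one the real script uses):

# Simplified sketch of the "chkhsh" mechanism, not the exact implementation.
from hashlib import sha256
from transformers import AutoTokenizer

def chkhsh_for(model_dir: str, probe: str = "Hello 'world'! 123") -> str:
    tok = AutoTokenizer.from_pretrained(model_dir)
    return sha256(str(tok.encode(probe)).encode()).hexdigest()

# If tokenizer_config.json changes tokenization behavior (as the linked commit did),
# the hash changes too, and an older convert script no longer recognizes the model.
print(chkhsh_for("mistralai/Mistral-Nemo-Instruct-2407"))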

@ehartford

Thanks, I'll do that.

@rkinas

rkinas commented Jul 25, 2024

Hi @ehartford, have you managed to find a solution for converting a fine-tuned Mistral-Nemo to 16-bit GGUF? I encountered the same problem you described.

@ehartford

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.
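
For anyone converting their own fine-tune, the steps boil down to: add an entry for your model next to the existing "tekken" one in the models list of convert_hf_to_gguf_update.py, re-run that script to regenerate get_vocab_base_pre(), then pass the new name via --model-name when converting. A sketch of such an entry (the name and repo URL below are illustrative placeholders taken from this thread, not values from the actual script):

{"name": "dolphin12b", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b", },

followed by something like python ./convert_hf_to_gguf.py /path/to/dolphin-2.9.3-mistral-nemo-hf --model-name dolphin12b.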

arthw pushed a commit to arthw/llama.cpp that referenced this issue Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@AllyourBaseBelongToUs

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.

Hi Eric, love your Dolphin models.

Are the Hugging Face GGUF models not updated? They produce the error: pre-tokenizer type 'dolphin12b' is not recognized.

The model files also show the dolphin12b pre-tokenizer instead of Tekken :(

Any way we can help you there?

P.S.

Your models work better with your system prompts than with jailbreaks <3

@ehartford

ehartford commented Jul 29, 2024

it works for me, and on ollama too
https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

which I created from this quant https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf

Though I am not a llama.cpp expert.

I have got what I needed from this effort. I am happy to take PRs though.

@AllyourBaseBelongToUs

it works for me, and on ollama too https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

It works in LM Studio too when we use a hex editor to change the pre_tokenizer from "dolphin12b" to "Tekken",

though not with llama-cpp-python itself :/

How long would it take to quantize it ourselves?

EDIT: thank you so much for changing the pre-tokenizer in all your GGUF uploads on HF <3

you're the best!!!

By the way, is there any volunteer work we can do for you?

@ehartford

Yes definitely - I'm totally overwhelmed

@rmusser01

Yes definitely - I'm totally overwhelmed

Would you mind elaborating on what might be helpful for you/how people can help?

@netrunnereve
Collaborator

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.
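
For readers unfamiliar with the two candidates: E4M3 uses 1 sign / 4 exponent / 3 mantissa bits (more precision, max finite value 448), while E5M2 uses 1 sign / 5 exponent / 2 mantissa bits (more range, max finite value 57344). A minimal decoder sketch, assuming the usual OCP FP8 conventions and omitting NaN/Inf special cases:

# Decode an FP8 byte to float; special NaN/Inf encodings are ignored for brevity.
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

decode_e4m3 = lambda b: decode_fp8(b, 4, 3, bias=7)
decode_e5m2 = lambda b: decode_fp8(b, 5, 2, bias=15)
print(decode_e4m3(0x7E), decode_e5m2(0x7B))  # 448.0 (E4M3 max) vs 57344.0 (E5M2 max)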

@Djip007
Contributor

Djip007 commented Aug 31, 2024

I can't find what Mistral used, or how (would be nice to know!!!).
The only thing I can find is:
https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8/tree/main?show_file_info=model-00001-of-00003.safetensors
That uses E4M3 plus weight_scale and input_scale.

@netrunnereve
Collaborator

That seems to be a vLLM/Neural Magic quant format, which they also use for Llama. It's a generic quantization algorithm like our Q8_0 and isn't necessarily the format Mistral trained with.

@HyunjoonCho

I am not sure whether this would be the proper thread to ask this - truly sorry if not,
is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

@EliEron

EliEron commented Sep 12, 2024

I am not sure whether this would be the proper thread to ask this - truly sorry if not, is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

Ollama is an entirely separate project from llama.cpp. While they use llama.cpp for inference, there is no official partnership between the projects, so this is indeed the wrong place to ask.

I'd suggest posting in the Ollama repo instead if you feel strongly about it, but I suspect you won't get a lot of traction. Base models aren't usually considered a high priority, and most people just use third-party uploads.

@Djip007
Contributor

Djip007 commented Oct 19, 2024

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.

I still don't know what they used. But I'm starting to see what you can get with the different formats: fp8.

And for now the "best" is with E3M4 (yes, I know it is not GPU native 😎).

@github-actions github-actions bot removed the stale label Oct 19, 2024
@Djip007
Contributor

Djip007 commented Oct 27, 2024

OK I have a POC for base FP8 support: #10055

@netrunnereve
Collaborator

@Djip007 Very nice. I actually played around with FP8 last month by imagining that the Q8_0 int8 values were actually FP8 and writing ggml_vec_dot routines for it. It ended up being much slower than Q8_0 as I had to:

  1. Turn everything into F32 or F16 first using either bit fiddling or a LUT
  2. Do the math as F32 (which sucks compared to say Q8_0, where the dot product can be computed much more efficiently using an 8-bit madd).

I'm not sure how fast the code in your PR runs with OMP SIMD but I'm reading through your llamafile discussion Mozilla-Ocho/llamafile#549. Fully optimized and with AVX512 it looks like you're getting Q8_0 performance with FP8.
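
For concreteness, a sketch of that fallback path in illustrative Python (not the actual ggml_vec_dot code): each FP8 byte goes through a 256-entry lookup table and the dot product is accumulated in F32, which is exactly the widening step that integer Q8_0 kernels get to skip.

# Build a 256-entry LUT once (reusing decode_fp8 from the earlier sketch), then
# do the dot product in F32: one table lookup + one F32 multiply-add per weight.
FP8_E4M3_LUT = [decode_fp8(b, 4, 3, bias=7) for b in range(256)]

def vec_dot_fp8_f32(x_bytes: bytes, y: list[float]) -> float:
    acc = 0.0
    for xb, yv in zip(x_bytes, y):
        acc += FP8_E4M3_LUT[xb] * yv
    return acc

# Q8_0-style int8 paths instead use SIMD 8-bit multiply-adds that process many
# weights per instruction, which is why the F32 fallback is so much slower.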

@Djip007
Contributor

Djip007 commented Oct 30, 2024

Fully optimized and with AVX512 it looks like you're getting Q8_0 performance with FP8.

I am faster on prompt processing than Q8_0 😉 (on a Zen 4 CPU: 42 pp vs 67 pp).
But to be fair, we can go faster with Q8 (on CPU), but we need a more "complicated" matmul, and possibly a new backend.
Hope to see an avx512_fp8_dot4 on Zen 6 ;)

For the actual PR, omp simd helps (but gcc does not vectorize the fp8_to_float correctly), and it only uses vec_dot, not sgemm, so it's far from what we could have.

Look at https://www.phoronix.com/news/Intel-GCC-Diamond-Rapids-ISA - they will have AMX_FP8...

@netrunnereve
Collaborator

Well, to be fair, you've got hardware BF16 support and the dot product instructions, which I think help a lot. For older computers with only AVX or AVX2 support there's a huge advantage to Q8_0, since you can work on the 8-bit weights using integer operations.

As for FP8 support yeah that's probably coming to CPU eventually, hence why I support getting this ready in llama.cpp. The next step is FP4 and Nvidia's already implementing that in their latest GPUs.

@github-actions github-actions bot added the stale label Dec 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@mitar

mitar commented Dec 16, 2024

Still relevant though.
