Support Mistral-Nemo-Instruct-2407 128K #8577

Closed
4 tasks done
mirek190 opened this issue Jul 18, 2024 · 61 comments
Labels: enhancement (New feature or request), stale

Comments

@mirek190

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Any plans to support Mistral-Nemo-Instruct-2407 128K?

https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407

Motivation

enhancement

Possible Implementation

No response

@mirek190 mirek190 added the enhancement New feature or request label Jul 18, 2024
@0wwafa

0wwafa commented Jul 18, 2024

Yes, please. This one is going to be good, and soon finetunes will start to pop up...

@0wwafa

0wwafa commented Jul 18, 2024

And this: https://huggingface.co/nvidia/Mistral-NeMo-12B-Instruct

@delphijb

delphijb commented Jul 18, 2024

I second the request. This model is likely to become the reference for the 7-12B segment, and fine-tuned versions will indeed appear rapidly. Thanks in advance.

@stduhpf
Contributor

stduhpf commented Jul 18, 2024

They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.

@EliEron

EliEron commented Jul 18, 2024

They claim it can be a drop-in replacement of Mistral 7B, so surely it shouldn't be too much work to make it work with ggml since Mistral 7B works.

The issue is that it uses a custom tokenizer named Tekken. That's not an issue for any program that uses Transformers, since its tokenizer system supports the custom tokenizer, which is why they call it a drop-in replacement.

For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. The Gemma-2 and Llama-3 tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
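
To illustrate why pre-tokenizers are easy to get subtly wrong, here is a minimal sketch of regex-based pre-tokenization as used by BPE tokenizers. The pattern shown is the GPT-2-style split, used purely as an illustration (Tekken defines its own pattern); if llama.cpp's regex differs even slightly from the reference tokenizer's, the chunk boundaries and therefore the final token IDs silently diverge.

# Illustration only: regex pre-tokenization splits text into chunks before BPE merges.
# This is the GPT-2-style pattern, not the actual Tekken pattern.
import regex  # third-party 'regex' package, needed for \p{...} classes

GPT2_PATTERN = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

def pre_tokenize(text: str) -> list[str]:
    return regex.findall(GPT2_PATTERN, text)

print(pre_tokenize("Mistral-Nemo supports a 128k context!"))
# ['Mistral', '-', 'Nemo', ' supports', ' a', ' 128', 'k', ' context', '!']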

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 18, 2024
@iamlemec
Collaborator

I actually think the tokenizer might not be too different from others. It's listed as GPT2Tokenizer in the tokenizer_config.json and it has a pre-tokenizer of the usual form. I was able to add it in the standard fashion with pre-tokenizer and the update script.

The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here #8576 (comment)). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:

PROMPT: What is the capital of South Korea in Hangul?
RESPONSE: The capital of South Korea in Hangul is 서울 (Seoul).

@netrunnereve
Collaborator

If this model works well we should also try to add FP8 support to llama.cpp and make full use of the QAT. Without native FP8 support that will take more compute than Q8_0, but it'll probably end up being memory bound anyway.
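
For context, a rough back-of-envelope sketch of why single-batch decoding tends to be memory bound (the bandwidth and bytes-per-weight numbers below are illustrative assumptions, not measurements):

# At batch size 1, every generated token streams all weights from memory,
# so tokens/s is roughly memory bandwidth divided by model size.
params = 12.2e9          # Mistral-Nemo is ~12B parameters
bytes_per_weight = 1.0   # FP8 / Q8_0-class quantization, ~1 byte per weight
model_bytes = params * bytes_per_weight

bandwidth = 60e9         # assumed ~60 GB/s CPU memory bandwidth (illustrative)
print(f"~{bandwidth / model_bytes:.1f} tok/s if purely memory bound")  # ~4.9 tok/s
# Extra FP8 -> F32 conversion work matters less than the bandwidth ceiling.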

m18coppola added a commit to m18coppola/llama.cpp that referenced this issue Jul 19, 2024
Removed uneeded `vocab.tokenizer_clean_spaces` assignment
@iamlemec
Collaborator

iamlemec commented Jul 19, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

@mirek190
Author

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Can't wait to test ;)

@foldl
Contributor

foldl commented Jul 20, 2024

Also, for those who are interested, chatllm.cpp supports this.

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.

Seems to work perfectly so far. Nice job.

ggerganov added a commit that referenced this issue Jul 20, 2024
* llama : Added support for Tekken pre-tokenizer (#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@muhammadyusuf-kurbonov

#8579 is merged

@legraphista
Contributor

Just quantized Mistral-Nemo-Instruct and trying to run it I get the following error:

llm_load_tensors: ggml ctx size =    0.17 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './Mistral-Nemo-Instruct-2407.Q8_0.gguf'
main: error: unable to load model

Looks like there's a shape mismatch.

According to the config file, the hidden size should be 5120 https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
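
The mismatch comes from Mistral-Nemo decoupling the attention head size from the hidden size: the config also sets head_dim to 128 with 32 attention heads, so the Q projection is 32 × 128 = 4096 wide rather than 5120. A minimal sketch of the arithmetic behind the error above:

# Why llama.cpp expected a (5120, 5120) Q weight but the file has (5120, 4096).
# Values are from mistralai/Mistral-Nemo-Instruct-2407 config.json.
hidden_size = 5120
num_attention_heads = 32
head_dim = 128                            # set explicitly, NOT hidden_size // heads (== 160)

expected_q_out = hidden_size              # what the old Mistral code path assumed
actual_q_out = num_attention_heads * head_dim
print(expected_q_out, actual_q_out)       # 5120 4096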

@mirek190
Author

Where can I find a proper GGUF?

@maziyarpanahi

Hi @legraphista

I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

@stduhpf
Contributor

stduhpf commented Jul 20, 2024

Hi @legraphista

I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

@iamlemec
Collaborator

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

@maziyarpanahi

Hi @legraphista
I have a new build from the main branch with the new PR merged, I am also using convert_hf_to_gguf.py. But I am getting this error:

llama.cpp/gguf-py/gguf/metadata.py", line 233, in get_model_id_components
    if at_start and ((len(t) == 0 and part[0].isalpha()) or "version" in t):
IndexError: string index out of range

Could you please let me know if I am missing something here?

Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.

I am actually trying to quantize it at the moment; since I saw it was done successfully here, I was wondering.

@mirek190
Author

@mirek190 I have a Q5_K of Instruct here: https://huggingface.co/CompendiumLabs/mistral-nemo-instruct-2407-gguf. Happy to add more variants if needed.

ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.34 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'blk.0.attn_q.weight' has wrong shape; expected  5120,  5120, got  5120,  4096,     1,     1
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'models/new3/mistral-nemo-instruct-q5_k.gguf'
main: error: unable to load model

Your version also doesn't work.

@cwillu

cwillu commented Jul 20, 2024

@mirek190 try running a make clean first; the project makefiles don't appear to be 100% reliable.

@foldl
Contributor

foldl commented Jul 24, 2024

@mirek190 the point is that with SWA, 128k context length won't blow up your memory.

@ehartford

When I run ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf
I get NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()

@ehartford

I already saw this line in convert_hf_to_gguf_update.py

{"name": "tekken",         "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/mistralai/Mistral-Nemo-Base-2407", },

I already executed convert_hf_to_gguf_update.py

but it still doesn't work.

As a guess, I tried this:

python ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf --model-name tekken

But that didn't work either.

@iamlemec
Collaborator

@ehartford They made some changes to tokenizer_config.json a day or two after release. See the commit here: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/commit/dac9c9e98f83322b32e32b48c118f079930772d6. Updating yours similarly should make the checksum match up.
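
For background on why the checksum matters: convert_hf_to_gguf.py identifies the pre-tokenizer by tokenizing a fixed reference string with the upstream tokenizer and hashing the result, and convert_hf_to_gguf_update.py regenerates the list of known hashes. A simplified sketch of the idea (the probe string here is illustrative, not the one the real script uses):

# Simplified sketch of the "chkhsh" mechanism, not the exact implementation.
from hashlib import sha256
from transformers import AutoTokenizer

def chkhsh_for(model_dir: str, probe: str = "Hello 'world'! 123") -> str:
    tok = AutoTokenizer.from_pretrained(model_dir)
    return sha256(str(tok.encode(probe)).encode()).hexdigest()

# If tokenizer_config.json changes tokenization behavior (as the linked commit did),
# the hash changes too, and an older convert script no longer recognizes the model.
print(chkhsh_for("mistralai/Mistral-Nemo-Instruct-2407"))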

@ehartford

Thanks, I'll do that.

@rkinas

rkinas commented Jul 25, 2024

Hi @ehartford, have you managed to find a solution for converting a fine-tuned Mistral-Nemo to 16-bit GGUF? I encountered the same problem you described.

@ehartford

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.
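
For anyone converting their own fine-tune, the steps boil down to: add an entry for your model next to the existing "tekken" one in the models list of convert_hf_to_gguf_update.py, re-run that script to regenerate get_vocab_base_pre(), then pass the new name via --model-name when converting. A sketch of such an entry (the name and repo URL below are illustrative placeholders taken from this thread, not values from the actual script):

{"name": "dolphin12b", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b", },

followed by something like python ./convert_hf_to_gguf.py /path/to/dolphin-2.9.3-mistral-nemo-hf --model-name dolphin12b.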

arthw pushed a commit to arthw/llama.cpp that referenced this issue Jul 27, 2024
* llama : Added support for Tekken pre-tokenizer (ggerganov#8577)

Removed uneeded `vocab.tokenizer_clean_spaces` assignment

* llama : fix order of pre-tokenizers

* * Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@AllyourBaseBelongToUs

Yes, I got past it by adding my model to convert_hf_to_gguf_update.py, running it, and then passing --model-name.

Hi Eric, love your Dolphin models.

Are the Hugging Face GGUF models not updated? They produce the error: pre-tokenizer type 'dolphin12b' is not recognized.

The model files also show the dolphin12b pre-tokenizer instead of Tekken :(

Any way we can help you there?

P.S.

Your models work better with your system prompts than with jailbreaks <3

@ehartford

ehartford commented Jul 29, 2024

it works for me, and on ollama too
https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

which I created from this quant https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf

Though I am not a llama.cpp expert.

I have got what I needed from this effort. I am happy to take PRs though.

@AllyourBaseBelongToUs

it works for me, and on ollama too https://ollama.com/CognitiveComputations/dolphin-mistral-nemo

It works in LM Studio too when we use a hex editor to change the pre_tokenizer from "dolphin12b" to "Tekken",

though not with llama-cpp-python itself :/

How long would it take to quantize it ourselves?

EDIT: thank you so much for changing the pre-tokenizer in all your GGUF uploads on HF <3

you're the best!!!

By the way, is there any volunteer work we can do for you?

@ehartford

Yes definitely - I'm totally overwhelmed

@rmusser01

Yes definitely - I'm totally overwhelmed

Would you mind elaborating on what might be helpful for you/how people can help?

@netrunnereve
Collaborator

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.
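
For readers unfamiliar with the two candidates: E4M3 uses 1 sign / 4 exponent / 3 mantissa bits (more precision, max finite value 448), while E5M2 uses 1 sign / 5 exponent / 2 mantissa bits (more range, max finite value 57344). A minimal decoder sketch, assuming the usual OCP FP8 conventions and omitting NaN/Inf special cases:

# Decode an FP8 byte to float; special NaN/Inf encodings are ignored for brevity.
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:  # subnormal
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

decode_e4m3 = lambda b: decode_fp8(b, 4, 3, bias=7)
decode_e5m2 = lambda b: decode_fp8(b, 5, 2, bias=15)
print(decode_e4m3(0x7E), decode_e5m2(0x7B))  # 448.0 (E4M3 max) vs 57344.0 (E5M2 max)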

@Djip007
Contributor

Djip007 commented Aug 31, 2024

I can't find what Mistral used, or how (would be nice to know!!!).
The only thing I can find is:
https://huggingface.co/neuralmagic/Mistral-Nemo-Instruct-2407-FP8/tree/main?show_file_info=model-00001-of-00003.safetensors
That uses E4M3 plus weight_scale and input_scale.

@netrunnereve
Collaborator

That seems to be a vLLM/Neural Magic quant format, which they also use for Llama. It's a generic quantization algorithm like our Q8_0 and isn't necessarily the format Mistral trained with.

@HyunjoonCho

I am not sure whether this would be the proper thread to ask this - truly sorry if not,
is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

@EliEron

EliEron commented Sep 12, 2024

I am not sure whether this would be the proper thread to ask this - truly sorry if not, is there any future plan to support Mistral-NeMo-Base-2407 officially?

I found some user uploaded versions but could not find one from the official page.

Thanks!

Ollama is an entirely separate project from llama.cpp. While they use llama.cpp for inference, there is no official partnership between the projects, so this is indeed the wrong place to ask.

I'd suggest posting in the Ollama repo instead if you feel strongly about it, but I suspect you won't get a lot of traction. Base models aren't usually considered a high priority, and most people just use third-party uploads.

@Djip007
Contributor

Djip007 commented Oct 19, 2024

Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3 but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.

I still don't know what they used. But I'm starting to see what you can get with the different formats: fp8.

And for now the "best" is with E3M4 (yes, I know it is not GPU native 😎).

@github-actions github-actions bot removed the stale label Oct 19, 2024
@Djip007
Contributor

Djip007 commented Oct 27, 2024

OK I have a POC for base FP8 support: #10055

@netrunnereve
Collaborator

@Djip007 Very nice. I actually played around with FP8 last month by imagining that the Q8_0 int8 values were actually FP8 and writing ggml_vec_dot routines for it. It ended up being much slower than Q8_0 as I had to:

  1. Turn everything into F32 or F16 first using either bit fiddling or a LUT
  2. Do the math as F32 (which sucks compared to say Q8_0, where the dot product can be computed much more efficiently using an 8-bit madd).

I'm not sure how fast the code in your PR runs with OMP SIMD but I'm reading through your llamafile discussion Mozilla-Ocho/llamafile#549. Fully optimized and with AVX512 it looks like you're getting Q8_0 performance with FP8.
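
For concreteness, a sketch of that fallback path in illustrative Python (not the actual ggml_vec_dot code): each FP8 byte goes through a 256-entry lookup table and the dot product is accumulated in F32, which is exactly the widening step that integer Q8_0 kernels get to skip.

# Build a 256-entry LUT once (reusing decode_fp8 from the earlier sketch), then
# do the dot product in F32: one table lookup + one F32 multiply-add per weight.
FP8_E4M3_LUT = [decode_fp8(b, 4, 3, bias=7) for b in range(256)]

def vec_dot_fp8_f32(x_bytes: bytes, y: list[float]) -> float:
    acc = 0.0
    for xb, yv in zip(x_bytes, y):
        acc += FP8_E4M3_LUT[xb] * yv
    return acc

# Q8_0-style int8 paths instead use SIMD 8-bit multiply-adds that process many
# weights per instruction, which is why the F32 fallback is so much slower.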

@Djip007
Contributor

Djip007 commented Oct 30, 2024

Fully optimized and with AVX512 it looks like you're getting Q8_0 performance with FP8.

I am faster on prompt processing than Q8_0 😉 (on a Zen 4 CPU: 42 pp vs 67 pp).
But to be fair, we can go faster with Q8 (on CPU), but we need a more "complicated" matmul, and possibly a new backend.
Hope to see an avx512_fp8_dot4 on Zen 6 ;)

For the actual PR, omp simd helps (but gcc does not vectorize the fp8_to_float correctly), and it only uses vec_dot, not sgemm, so it's far from what we could have.

Look at https://www.phoronix.com/news/Intel-GCC-Diamond-Rapids-ISA - they will have AMX_FP8...

@netrunnereve
Collaborator

Well, to be fair, you've got hardware BF16 support and the dot product instructions, which I think help a lot. For older computers with only AVX or AVX2 support there's a huge advantage to Q8_0, since you can work on the 8-bit weights using integer operations.

As for FP8 support yeah that's probably coming to CPU eventually, hence why I support getting this ready in llama.cpp. The next step is FP4 and Nvidia's already implementing that in their latest GPUs.

@github-actions github-actions bot added the stale label Dec 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@mitar

mitar commented Dec 16, 2024

Still relevant though.
