Support Mistral-Nemo-Instruct-2407 128K #8577
Comments
Yes, please. This one is going to be good, and soon finetunes will start to pop up...
I second the request. This model is likely to become the reference for the 7-12B segment, and finetuned versions will indeed appear rapidly. Thanks in advance.
They claim it can be a drop-in replacement for Mistral 7B, so surely it shouldn't be too much work to make it work with ggml, since Mistral 7B works.
The issue is that it uses a custom tokenizer named Tekken. That's not a problem for any program that uses Transformers, since its tokenizer system supports custom tokenizers, which is why they call it a drop-in replacement. For llama.cpp, however, the custom tokenizer has to be implemented manually, and implementing new tokenizers correctly is usually not easy. Gemma-2's and Llama-3's tokenizers, for instance, took quite a while to implement properly, and it took multiple attempts as bugs were found over time.
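For the curious, the "drop-in with Transformers" part is easy to verify, since AutoTokenizer picks up the Tekken tokenizer from the model repo automatically. A minimal sketch (it assumes you have accepted the gated repo and are authenticated with Hugging Face):

```python
# Rough sketch: load the Tekken tokenizer via Transformers and round-trip a string.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
ids = tok.encode("Hello, world!")
print(ids)              # token IDs produced by the Tekken tokenizer
print(tok.decode(ids))  # should round-trip back to the original text
```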
I actually think the tokenizer might not be too different from others. It's listed as The other issue is that the tensor shapes relating to attention are not the sizes expected by the current implementation of Mistral (see my other comment here: #8576 (comment)). I was able to brute-force hack it into at least running, and I'm getting sensible output, which makes me think the tokenizer is doing okay. For example:
If this model works well, we should also try to add FP8 support to llama.cpp and make full use of the QAT. That will take more work to compute compared to Q8_0 without native FP8 support, but it'll probably end up being memory bound anyway.
For those curious, I have a seemingly working (and now non-hacky) inference implementation here: https://github.com/iamlemec/llama.cpp/tree/mistral-nemo. Will push a PR for it once #8579 is merged and I can rebase off of it.
Can't wait to test ;)
Also, for those who are interested, chatllm.cpp supports this.
Seems to work perfectly so far. Nice job.
* llama : Added support for Tekken pre-tokenizer (#8577)
  * Removed unneeded `vocab.tokenizer_clean_spaces` assignment
* llama : fix order of pre-tokenizers
* Tekken pre-tokenizer no longer uses clean_up_tokenization_spaces
* Updated chkhsh for Tekken tokenizer

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
#8579 is merged
Just quantized
Looks like there's a shape mismatch. According to the config file, the hidden size should be 5120: https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407/blob/main/config.json#L10
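For anyone hitting this: the mismatch comes down to Nemo setting an explicit head_dim of 128 instead of deriving it as hidden_size / n_heads, so the attention tensors are narrower than a stock Mistral loader expects. A rough sanity check is sketched below; hidden_size comes from the linked config.json, while the head counts and head_dim are assumptions based on the published config.

```python
# Back-of-the-envelope check of why the old Mistral loading code trips up.
hidden_size = 5120   # from the linked config.json
n_heads     = 32     # assumed from the published config
n_kv_heads  = 8      # assumed from the published config
head_dim    = 128    # explicitly set in the config, NOT hidden_size // n_heads

expected_q_cols = hidden_size            # what a "classic" Mistral loader assumes
actual_q_cols   = n_heads * head_dim     # 4096: what the checkpoint actually contains
actual_kv_cols  = n_kv_heads * head_dim  # 1024

print(expected_q_cols, actual_q_cols, actual_kv_cols)  # 5120 4096 1024
```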
Where can I find a proper GGUF?
Hi @legraphista, I have a new build from the main branch with the new PR merged, and I am also using
Could you please let me know if I am missing something here?
Only the tokenizer support has been added by the PR. If you want to run the model, you can use @iamlemec's fork.
@mirek190 I have a
I am actually trying to quantize it at the moment, since I saw it happened successfully here, so I was wondering.
Your version also doesn't work.
@mirek190 try running a `make clean` first; the project makefiles don't appear to be 100% reliable.
@mirek190 the point is that with SWA, a 128k context length won't blow up your memory.
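For a sense of scale, here is the usual back-of-the-envelope KV-cache estimate at full 128k context. The layer and head figures are assumptions about Nemo's architecture, and a sliding-window scheme would cap n_ctx at the window size instead of the full context length.

```python
# Rough KV-cache size estimate for a 128k context with an f16 cache.
# Layer/head numbers are assumed from Mistral-Nemo's published config; adjust as needed.
n_layers   = 40
n_kv_heads = 8
head_dim   = 128
n_ctx      = 128 * 1024
bytes_per  = 2            # f16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB")  # roughly 20 GiB at the full 128k context
```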
I try to do |
I already saw this line in convert_hf_to_gguf_update.py
I already executed convert_hf_to_gguf_update.py, but it still doesn't work. As a guess, I tried this: `python ./convert_hf_to_gguf.py /workspace/axolotl/dolphin-2.9.3-mistral-nemo-hf --model-name tekken`, but that didn't work either.
@ehartford They made some changes to
Thanks, I'll do that.
Hi @ehartford, have you managed to find a solution for converting fine-tuned Mistral-Nemo to a 16-bit GGUF? I encountered the same problem you described.
Yes, I got past it by adding my model to the -update.py script, running it, and then passing --model-name.
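For anyone else converting a fine-tune, the mechanism behind this is the "chkhsh" check: convert_hf_to_gguf.py hashes the tokenizer's output on a fixed probe string and looks the hash up in a table that convert_hf_to_gguf_update.py generates from its list of registered models. A rough sketch of the idea (the probe string and hash table below are illustrative stand-ins, not the real values from the scripts):

```python
# Minimal sketch of the pre-tokenizer detection used during conversion:
# hash the tokenizer's encoding of a probe string and match it against known hashes.
from hashlib import sha256
from transformers import AutoTokenizer

probe = "Hello world! \t\n 123 ..."  # stand-in for the script's real probe text
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")
chkhsh = sha256(str(tok.encode(probe)).encode()).hexdigest()

known = {"<hash recorded by convert_hf_to_gguf_update.py>": "tekken"}  # illustrative
print(chkhsh, known.get(chkhsh, "unknown pre-tokenizer: run the update script"))
```

Registering the model in the update script is what records the hash, which is why the conversion works afterwards.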
Hi Eric, love your Dolphin models. Are the Hugging Face GGUF models not updated? They're causing the error: pre-tokenizer type 'dolphin12b' not recognized. The model files also show the dolphin12b tokenizer instead of Tekken :( Any way we can help you there? P.S. Your models work better with your system prompts instead of jailbreaks <3
It works for me, and on Ollama too, which I created from this quant: https://huggingface.co/cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b-gguf. Though I am not a llama.cpp expert; I have got what I needed from this effort. I am happy to take PRs though.
It works on LM Studio too when we use a hex editor to change the pre_tokenizer from "dolphin12b" to "Tekken", though not with llama-cpp-python itself :/ How long would it take to quantize it ourselves? EDIT: thank you so much for changing the pre-tokenizer in all your GGUF uploads on HF <3 you're the best!!! By the way, is there any volunteer work we can do for you?
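If you want to check a file without reaching for a hex editor, the gguf Python package bundled with llama.cpp can read the declared pre-tokenizer. This is a read-only sketch; the reader API is recalled from the 2024-era gguf-py and the filename is a placeholder, so treat both as assumptions.

```python
# Sketch: inspect the declared pre-tokenizer of a GGUF file with the gguf package.
# pip install gguf   (the reader API shown here may differ between versions)
from gguf import GGUFReader

reader = GGUFReader("dolphin-2.9.3-mistral-nemo.Q4_K_M.gguf")  # placeholder path
field = reader.fields["tokenizer.ggml.pre"]
# string fields keep their value bytes in the last "part" of the field
print(str(bytes(field.parts[-1]), encoding="utf-8"))  # expect "tekken" after the fix
```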
Yes definitely - I'm totally overwhelmed.
Would you mind elaborating on what might be helpful for you / how people can help?
Does anyone know if the FP8 QAT used by Nemo is in E4M3 or E5M2? My guess is E4M3, but I couldn't find info on that anywhere, with Mistral only saying that they used FP8.
I can't find what Mistral used or how (it would be nice to know!).
That seems to be a vLLM/Neural Magic quant format, which they also use for Llama. It's a generic quantization algorithm like our Q8_0 and isn't necessarily the format Mistral trained with.
I am not sure whether this is the proper thread to ask this - truly sorry if not. I found some user-uploaded versions but could not find one from the official page. Thanks!
Ollama is an entirely separate project from llama.cpp. While they use llama.cpp for inference, there is no official partnership between the projects, so this is indeed the wrong place to ask. I'd suggest posting in the Ollama repo instead if you feel strongly about it, but I suspect you won't get a lot of traction. Base models aren't usually considered a high priority, and most people just use third-party uploads.
I still don't know what they used, but I'm starting to see what you can get with the different FP8 formats. For now the "best" is E3M4 (yes, I know it is not GPU native 😎).
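For anyone comparing these variants, a tiny generic decoder makes the range/precision trade-off between E5M2, E4M3, and E3M4 concrete. This is only an illustration of the bit splits; the bias and special-value rules of the real OCP formats are glossed over.

```python
def fp8_to_float(byte: int, exp_bits: int, man_bits: int) -> float:
    """Decode one FP8 byte with an IEEE-754-style sign/exponent/mantissa split.

    The bias is assumed to be 2**(exp_bits - 1) - 1. Real formats differ in the
    corners (OCP E4M3 has no infinities and reaches +/-448, for example), so this
    only illustrates the range/precision trade-off.
    """
    assert 1 + exp_bits + man_bits == 8
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

for name, e, m in [("E5M2", 5, 2), ("E4M3", 4, 3), ("E3M4", 3, 4)]:
    # Largest value with the top exponent treated as reserved, plus the smallest
    # positive subnormal: more exponent bits buy range, more mantissa bits buy precision.
    top = (((1 << e) - 2) << m) | ((1 << m) - 1)
    print(f"{name}: max ~{fp8_to_float(top, e, m):g}, tiniest ~{fp8_to_float(1, e, m):g}")
```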
OK I have a POC for base FP8 support: #10055
@Djip007 Very nice. I actually played around with FP8 last month by imagining that the Q8_0 int8 values were actually FP8 and writing ggml_vec_dot routines for it. It ended up being much slower than Q8_0 as I had to:
I'm not sure how fast the code in your PR runs with OMP SIMD, but I'm reading through your llamafile discussion Mozilla-Ocho/llamafile#549. Fully optimized and with AVX512, it looks like you're getting Q8_0 performance with FP8.
I am faster on prompt processing than Q8_0 😉 (on a Zen 4 CPU: 42 pp vs 67 pp). For the current PR, OMP SIMD helps (but GCC does not vectorise fp8_to_float correctly), and it only uses vec_dot, not sgemm, so it is far from what we can get. Look at https://www.phoronix.com/news/Intel-GCC-Diamond-Rapids-ISA; they will have AMX_FP8...
Well, to be fair, you've got hardware BF16 support and the dot product instructions, which I think help a lot. For older computers with only AVX or AVX2 support there's a huge advantage to Q8_0, since you can work on the 8-bit weights using integer operations. As for FP8 support, yeah, that's probably coming to CPUs eventually, hence why I support getting this ready in llama.cpp. The next step is FP4, and Nvidia is already implementing that in its latest GPUs.
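For context on why Q8_0 is so CPU-friendly, here is a scalar reference of the block layout and dot product it uses, simplified from ggml's actual kernels: 32 weights per block, one scale, and an inner sum that stays in integers and maps straight onto int8 SIMD.

```python
# Scalar reference of a Q8_0-style block dot product (a simplification of ggml's kernels).
# Each block stores 32 int8 quants plus one scale d; dequantization is d * q.
QK = 32

def quantize_q8_0(xs):
    """Quantize one block of 32 floats to (scale, [int8])."""
    amax = max(abs(x) for x in xs)
    d = amax / 127.0 if amax > 0 else 1.0
    qs = [max(-128, min(127, round(x / d))) for x in xs]
    return d, qs

def dot_q8_0(block_a, block_b):
    """Dot product of two Q8_0 blocks: the inner sum is pure integer math,
    which is what makes this cheap on AVX2-class hardware."""
    da, qa = block_a
    db, qb = block_b
    return da * db * sum(a * b for a, b in zip(qa, qb))

a = quantize_q8_0([0.01 * i for i in range(QK)])
b = quantize_q8_0([0.02 * (QK - i) for i in range(QK)])
print(dot_q8_0(a, b))  # approximates the exact float dot product (~1.09)
```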
This issue was closed because it has been inactive for 14 days since being marked as stale.
Still relevant though.
Prerequisites
Feature Description
Any plans to support Mistral-Nemo-Instruct-2407 128K?
https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407
Motivation
enhancement
Possible Implementation
No response