Need support for GemmaForCausalLM #5635
My bad! 580111d
Gemma already has a GGUF, but I am not able to quantize it for some reason. (I only tried on Colab because of its size, which is 34 GB.)
I pulled the latest from the main branch and rebuilt it. I cannot convert to a 16-bit GGUF and quantize from there; the quantized models fail with:
Not sure if @ggerganov has time to take a look at the problem.
@akarshanbiswas I was able to quantize it by starting from the GGUF weights released on the official repo, with the latest llama.cpp branch. I've uploaded some of the quantized versions to my HF repo.
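For reference, this is roughly what that workflow looks like with llama.cpp's quantize tool; the file names below are assumptions based on the official google/gemma-7b repo layout, not the exact commands used:

```sh
# Build llama.cpp from the latest master (Gemma support was only merged recently)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
make -j

# Start from the official GGUF shipped in the google/gemma-7b repo
# (assumed here to be named gemma-7b.gguf) and requantize it to Q4_K_M
./quantize gemma-7b.gguf gemma-7b.Q4_K_M.gguf Q4_K_M
```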
@rahuldshetty Do you have quants of the 7b instruction model? |
@akarshanbiswas Just uploaded a 4-bit quantized version of the Gemma-7B instruction model.
@rahuldshetty Awesome. Thank you.
@rahuldshetty Hey, thanks for the quick responses! I seem to be unable to run that quant as well. Are you using a specific branch for inference? Tested with both llama.cpp (b2222) and koboldcpp; both fail to load the model.
Spot-checked b2222 with OpenHermes-2.5-Mistral-7B-16k-GGUF (Q4_K_S) and it loads fine, so I'm guessing it's not an error on my part...
According to the technical paper, there are more than 8 billion parameters. I suspect that, including the compute buffer, your 6 GB GPU may not be able to load this Q4-quantized model.
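A rough back-of-the-envelope estimate (assuming ~8.5B parameters and roughly 4.5 bits per weight for a Q4_K-class quant; exact numbers vary with quant type and context size):

```
8.5e9 params × ~4.5 bits ÷ 8 ≈ 4.8 GB of weights
+ KV cache and compute buffers ≈ 1–2 GB at moderate context
≈ 6 GB or more in total, i.e. right at the limit of a 6 GB GPU
```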
I think @postmasters is right. I was able to run inference on Google Colab using the same weights. Refer to: https://colab.research.google.com/drive/1uVIz_Y6mdjjRgnfC7X2jplHz7aUiQw0I?usp=sharing
Hmm. I'll test it again locally in a while. That Colab link does indeed work. It should still be able to run with CPU-only inference though, correct? It seems as if it is a problem on my end. My apologies.
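For a CPU-only run, something like the following should work (a sketch; `-ngl 0` keeps all layers off the GPU, and the model path is assumed):

```sh
./main -m gemma-7b.Q4_K_M.gguf -p "Why is the sky blue?" -n 128 -ngl 0 -t 8
```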
Thank you @rahuldshetty for your work. I guess we can use the official GGUF Google has provided, but there are 2 issues:
I'm getting much worse perplexities with models quantized from the provided one (tried Q4_0, Q4_K_M, IQ4_NL, IQ3_XXS). I imagine this is related to the changed architecture.
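For anyone wanting to reproduce that kind of comparison, the usual approach is llama.cpp's bundled perplexity tool over a standard test set; the wikitext-2 file path below is an assumption:

```sh
./perplexity -m gemma-7b.Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw
```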
OK, quantizing it from the original GGUF model provided by Google works, but man, it's awful! Is it just me?
llama.cpp/main -m google/gemma-7b/gemma-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
@maziyarpanahi Seeing this too with the 7B model. I left a question on their Hacker News "AMA" comment. I'm wondering if there was an issue in how the model was converted to GGUF.
Hey, I'm getting that too on Termux with this 2B Instruct quant.
Regarding the quality of the quantized models, see #5650. There might be more changes, so heads up.
Hey, can someone give me the link to where the GGUF files are? I don't see them on Kaggle. Thanks.
The original and official GGUF release by Google is listed in the same repo as the model itself, for instance: https://huggingface.co/google/gemma-7b/tree/main
Weird. I built llama.cpp from the latest master branch, which includes PR #5651, and it is still complaining about a missing output tensor.
Also, Gemma is 8.5B, not 7B. Edit: never mind. It worked with a different GGUF file, then went OOM in the SYCL backend.
@akarshanbiswas You need this PR #5647 |
For anyone having inconsistent model responses, try |
Everything should work now |
Thanks for the fixes. Maybe it's just me, but using the official GGUF still yields some odd behavior with stop words in chat. Sometimes it just refers to User and Llama as "U" and "L", which is odd and pretty hard to
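For context on the stop-word issue: according to the Gemma model card (an assumption; not confirmed anywhere in this thread), the instruction-tuned checkpoints expect a turn-based prompt format, so a chat client should stop on <end_of_turn> rather than on Llama-style role names:

```
<start_of_turn>user
Write a haiku about llamas.<end_of_turn>
<start_of_turn>model
```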
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.
Google just released Gemma models for 7B and 2B under the GemmaForCausalLM arch: https://huggingface.co/models?other=gemma
Motivation
Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.
Possible Implementation
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
Maybe GemmaForCausalLM can be replicated via Llama-2 or Mistral to convert it to GGUF? Clearly there is a way, since they also offer a GGUF model on the same model card; how it is twice the size of the model is beyond me! (Maybe it's 32-bit.)
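For what it's worth, now that support has been merged, one plausible path is the standard HF-to-GGUF conversion followed by quantization; this is a sketch assuming the current llama.cpp scripts and a local copy of the Hugging Face checkpoint. The size observation is also consistent with a 32-bit export: roughly 8.5B params × 4 bytes ≈ 34 GB, matching the 34 GB file mentioned above.

```sh
# Convert the Hugging Face GemmaForCausalLM checkpoint to a 16-bit GGUF
python convert-hf-to-gguf.py /path/to/gemma-7b --outfile gemma-7b.f16.gguf --outtype f16

# Then quantize down to a smaller type
./quantize gemma-7b.f16.gguf gemma-7b.Q4_K_M.gguf Q4_K_M
```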