Need support for GemmaForCausalLM #5635

Closed · 4 tasks done
maziyarpanahi opened this issue Feb 21, 2024 · 27 comments
Labels: enhancement (New feature or request)

Comments

@maziyarpanahi

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Google just released Gemma models for 7B and 2B under GemmaForCausalLM arch. https://huggingface.co/models?other=gemma

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Maybe GemmaForCausalLM can be replicated via Llama-2 or Mistral to convert it to GGUF? Clearly there is a way, since they also offer a GGUF model on the same model card; how it ends up twice the size of the original model is beyond me! (Maybe it's 32-bit.)
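
For context, a minimal sketch of the usual llama.cpp conversion path is below; the model directory and output file names are placeholders, and whether the convert script actually recognizes the GemmaForCausalLM architecture is exactly what this request is about:

# hypothetical paths; assumes the HF checkpoint was downloaded locally
python3 convert-hf-to-gguf.py ./gemma-7b --outtype f16 --outfile gemma-7b-f16.gguf
# quantize the F16 GGUF down to 4-bit
./quantize gemma-7b-f16.gguf gemma-7b.Q4_K_M.gguf Q4_K_M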

maziyarpanahi added the enhancement (New feature or request) label on Feb 21, 2024
@maziyarpanahi
Author

My bad! 580111d

@qnixsynapse
Contributor

Gemma already has a GGUF, but I am not able to quantize it for some reason! (I only tried on Colab because of its size, which is 34GB.)

@maziyarpanahi
Author

I pulled the latest from the main branch and rebuilt it. I can convert to 16-bit GGUF and quantize from there, but the quantized models fail to load with:

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'quantized/google/gemma-7b/gemma-7b.Q4_K_S.gguf'
main: error: unable to load model

@hannibalhuang

Not sure if @ggerganov could find the time to take a look at the problem.

@rahuldshetty

rahuldshetty commented Feb 21, 2024

@akarshanbiswas I was able to quantize it by starting from the GGUF weights released in the official repo, using the latest llama.cpp branch. I've uploaded some of the quantized versions to my HF repo.

@qnixsynapse
Contributor

@rahuldshetty Do you have quants of the 7b instruction model?

@rahuldshetty

@akarshanbiswas Just uploaded a 4-bit quantized version of the Gemma-7B instruction model.

@qnixsynapse
Contributor

@rahuldshetty Awesome. Thank you.

@remghoost

@rahuldshetty Hey, thanks for the quick responses!

I seem to be unable to run that quant as well...?
It doesn't even try to load the model.

Are you using a specific branch for inference...?

Tested with both llama.cpp (b2222) and koboldcpp; both fail to load the model.

D:\llm\llamacpp\b2222 python 11>.\server.exe -m "D:\llm\llamacpp\gemma-7b-it-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
{"timestamp":1708531713,"level":"INFO","function":"main","line":2533,"message":"build info","build":2222,"commit":"88c46cbd"}
{"timestamp":1708531713,"level":"INFO","function":"main","line":2540,"message":"system info","n_threads":10,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1708531713,"level":"INFO","function":"main","line":2692,"message":"HTTP server listening","hostname":"127.0.0.1","port":"8080"}

D:\llm\llamacpp\b2222 python 11>

Spot checked b2222 with the OpenHermes-2.5-Mistral-7B-16k-GGUF (Q4_K_S) and it loads fine, so I'm guessing it's not an error on my part...

@postmasters
Contributor

According to the technical paper, there are more than 8 billion parameters. I suspect that, including the compute buffer, your 6GB GPU may not be able to load this Q4-quantized model.
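
A rough way to test that theory (illustrative only: the model path is taken from the log above and the layer count is arbitrary) is to offload only part of the model and keep the rest on the CPU:

.\server.exe -m "D:\llm\llamacpp\gemma-7b-it-Q4_K_M.gguf" --n-gpu-layers 10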

@rahuldshetty

I think @postmasters is right. I was able to run the inference on Google Colab using the same weights. Refer: https://colab.research.google.com/drive/1uVIz_Y6mdjjRgnfC7X2jplHz7aUiQw0I?usp=sharing

@remghoost

Hmm. I'll test it again locally in a while.

That colab link does indeed work.

It should still be able to run with CPU only inference though, correct...?
Even when attempting to not offload any of the layers to my GPU, it still didn't want to run...

It seems as if it is a problem on my end though. My apologies.
I'll do more research. Thank you for the information!

@maziyarpanahi
Author

@akarshanbiswas I was able to quantize it by starting from the GGUF weights released in the official repo, using the latest llama.cpp branch. I've uploaded some of the quantized versions to my HF repo.

Thank you @rahuldshetty for your work. I guess we can use the official GGUF Google has provided, but there are two questions:

  • Why is that thing 2.5x larger than the original model? Is it 32-bit? (Out of curiosity; quantizing from 32-bit would be even better, I guess.)
  • How did they make that GGUF file?

@postmasters
Contributor

  1. It is indeed what you suspected. The weights are F32 in the GGUFs.
  2. The conversion was from a Google-internal checkpoint format to GGUF.
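
Since the weights in the official GGUF are already F32, it should be quantizable directly with llama.cpp's quantize tool; a minimal sketch, with placeholder file names:

./quantize gemma-7b.gguf gemma-7b.Q4_K_M.gguf Q4_K_M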

@ristew
Contributor

ristew commented Feb 21, 2024

I'm getting much worse perplexities with models quantized from the provided one (tried Q4_0, Q4_K_M, IQ4_NL, IQ3_XXS). I imagine this is related to the changed architecture.
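
For anyone who wants to reproduce the comparison, llama.cpp ships a perplexity tool; a minimal sketch, assuming a local wikitext-2 test file (file names and paths are placeholders):

./perplexity -m gemma-7b.Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw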

@maziyarpanahi
Author

OK, quantizing it from the original GGUF model provided by Google works, but man, the output is awful! Is it just me?

llama.cpp/main -m google/gemma-7b/gemma-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

Building a website can be done in 10 simple steps:
Step 1: Identify your purpose and goals

If you depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic

@jmorganca
Contributor

@maziyarpanahi I'm seeing this too with the 7B model. I left a question on their Hacker News "AMA" thread. I'm wondering if there was an issue in how the model was converted to GGUF.

@Ar57m

Ar57m commented Feb 21, 2024

OK, quantizing it from the original GGUF model provided by Google works, but man, the output is awful! Is it just me?

Hey, I'm getting that too on Termux with this 2B Instruct quant.

@ggerganov
Owner

Regarding the quality of the quantized models, see: #5650

There might be more changes, so heads up

@hexbinoct

Hey, can someone give me a link to where the GGUF files are? I don't see them on Kaggle. Thanks.

@maziyarpanahi
Author

Hey, can someone give me a link to where the GGUF files are? I don't see them on Kaggle. Thanks.

The original, official GGUF release by Google is listed in the same repo as the model itself, for instance: https://huggingface.co/google/gemma-7b/tree/main
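
One way to fetch just that file (a sketch; assumes the huggingface_hub CLI is installed and that you have accepted the license, since the repo is gated):

# log in once, then download only the GGUF file into the current directory
huggingface-cli login
huggingface-cli download google/gemma-7b gemma-7b.gguf --local-dir .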

@qnixsynapse
Contributor

qnixsynapse commented Feb 22, 2024

Weird. I built llama.cpp from the latest master branch, which includes PR #5651, and it is still complaining about the missing output tensor.

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found

Also, Gemma is 8.5B, not 7B.

Edit: Never mind. It worked with a different GGUF file, then went OOM in the SYCL backend.

@dranger003
Contributor

@akarshanbiswas You need this PR #5647
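
If that PR isn't merged yet, one way to try it is to fetch it directly from GitHub and rebuild (a sketch; the local branch name is arbitrary and "origin" is assumed to point at the upstream llama.cpp repo):

git fetch origin pull/5647/head:gemma-fix
git checkout gemma-fix
make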

@dranger003
Contributor

For anyone having inconsistent model responses, try --repeat-penalty 1.0.
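
For example, reusing the command from earlier in the thread (model path and prompt unchanged), a value of 1.0 disables the repetition penalty entirely:

llama.cpp/main -m google/gemma-7b/gemma-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --repeat-penalty 1.0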

@ggerganov
Owner

Everything should work now

@jboero
Contributor

jboero commented Feb 26, 2024

Thanks for the fixes. Maybe it's just me but using the official GGUF still yields some odd behavior with stop words in chat.

User: Hi Llama how are you?

Llama: I am great! How can i help today user ?
user : Can u tell me about your life story and what have done in the past or present that is worth sharing for others . Do not worry if it's too much to share as i understand all of this. Tell only what you want and need

Sometimes it just refers to User and Llama as "U" and "L", which is odd and pretty hard to follow.

User: Hi Llama how are you?

Llama: I am doing great! How about you user1234567890 ?

