Need support for GemmaForCausalLM #5635

Closed · 4 tasks done
maziyarpanahi opened this issue Feb 21, 2024 · 27 comments
Labels: enhancement (New feature or request)

Comments

@maziyarpanahi

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Google just released Gemma models for 7B and 2B under GemmaForCausalLM arch. https://huggingface.co/models?other=gemma

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Maybe GemmaForCausalLM can be replicated via Llama-2 or Mistral to convert it to GGUF? Clearly there is a way, since they also offer a GGUF model on the same model card; how it ends up twice the size of the original model is beyond me! (Maybe it's 32-bit.)
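
For context, a minimal sketch of the usual llama.cpp conversion path is below; the model directory and output file names are placeholders, and whether the convert script actually recognizes the GemmaForCausalLM architecture is exactly what this request is about:

# hypothetical paths; assumes the HF checkpoint was downloaded locally
python3 convert-hf-to-gguf.py ./gemma-7b --outtype f16 --outfile gemma-7b-f16.gguf
# quantize the F16 GGUF down to 4-bit
./quantize gemma-7b-f16.gguf gemma-7b.Q4_K_M.gguf Q4_K_M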

maziyarpanahi added the enhancement (New feature or request) label on Feb 21, 2024
@maziyarpanahi
Author

My bad! 580111d

@qnixsynapse
Contributor

Gemma already has a GGUF, but I am not able to quantize it for some reason! (I only tried on Colab because of its size, which is 34GB.)

@maziyarpanahi
Author

I pulled the latest from the main branch and rebuilt it. I can convert to 16-bit GGUF and quantize from there, but the quantized models fail to load with:

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'quantized/google/gemma-7b/gemma-7b.Q4_K_S.gguf'
main: error: unable to load model

@hannibalhuang

Not sure if @ggerganov could find the time to take a look at the problem.

@rahuldshetty

rahuldshetty commented Feb 21, 2024

@akarshanbiswas I was able to quantize it by starting from the GGUF weights released in the official repo, using the latest llama.cpp branch. I've uploaded some of the quantized versions to my HF repo.

@qnixsynapse
Contributor

@rahuldshetty Do you have quants of the 7b instruction model?

@rahuldshetty

@akarshanbiswas Just uploaded a 4-bit quantized version of the Gemma-7B instruction model.

@qnixsynapse
Contributor

@rahuldshetty Awesome. Thank you.

@remghoost

@rahuldshetty Hey, thanks for the quick responses!

I seem to be unable to run that quant as well...?
It doesn't even try to load the model.

Are you using a specific branch for inference...?

Tested with both llama.cpp (b2222) and koboldcpp; both fail to load the model.

D:\llm\llamacpp\b2222 python 11>.\server.exe -m "D:\llm\llamacpp\gemma-7b-it-Q4_K_M.gguf"
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1060 6GB, compute capability 6.1, VMM: yes
{"timestamp":1708531713,"level":"INFO","function":"main","line":2533,"message":"build info","build":2222,"commit":"88c46cbd"}
{"timestamp":1708531713,"level":"INFO","function":"main","line":2540,"message":"system info","n_threads":10,"n_threads_batch":-1,"total_threads":12,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | "}

llama server listening at http://127.0.0.1:8080

{"timestamp":1708531713,"level":"INFO","function":"main","line":2692,"message":"HTTP server listening","hostname":"127.0.0.1","port":"8080"}

D:\llm\llamacpp\b2222 python 11>

Spot checked b2222 with the OpenHermes-2.5-Mistral-7B-16k-GGUF (Q4_K_S) and it loads fine, so I'm guessing it's not an error on my part...

@postmasters
Contributor

According to the technical paper, there are more than 8 billion parameters. I suspect that, including the compute buffer, your 6GB GPU may not be able to load this Q4-quantized model.
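
A rough way to test that theory (illustrative only: the model path is taken from the log above and the layer count is arbitrary) is to offload only part of the model and keep the rest on the CPU:

.\server.exe -m "D:\llm\llamacpp\gemma-7b-it-Q4_K_M.gguf" --n-gpu-layers 10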

@rahuldshetty

I think @postmasters is right. I was able to run the inference on Google Colab using the same weights. Refer: https://colab.research.google.com/drive/1uVIz_Y6mdjjRgnfC7X2jplHz7aUiQw0I?usp=sharing

@remghoost

Hmm. I'll test it again locally in a while.

That colab link does indeed work.

It should still be able to run with CPU only inference though, correct...?
Even when attempting to not offload any of the layers to my GPU, it still didn't want to run...

It seems as if it is a problem on my end though. My apologies.
I'll do more research. Thank you for the information!

@maziyarpanahi
Author

@akarshanbiswas I was able to quantize it by starting from the GGUF weights released in the official repo, using the latest llama.cpp branch. I've uploaded some of the quantized versions to my HF repo.

Thank you @rahuldshetty for your work. I guess we can use the official GGUF Google has provided, but there are two questions:

  • Why is that thing 2.5x larger than the original model? Is it 32-bit? (Out of curiosity; quantizing from 32-bit would be even better, I guess.)
  • How did they make that GGUF file?

@postmasters
Contributor

  1. It is indeed what you suspected. The weights are F32 in the GGUFs.
  2. The conversion was from a Google-internal checkpoint format to GGUF.
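
Since the weights in the official GGUF are already F32, it should be quantizable directly with llama.cpp's quantize tool; a minimal sketch, with placeholder file names:

./quantize gemma-7b.gguf gemma-7b.Q4_K_M.gguf Q4_K_M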

@ristew
Contributor

ristew commented Feb 21, 2024

I'm getting much worse perplexities with models quantized from the provided one (tried Q4_0, Q4_K_M, IQ4_NL, IQ3_XXS). I imagine this is related to the changed architecture.
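
For anyone who wants to reproduce the comparison, llama.cpp ships a perplexity tool; a minimal sketch, assuming a local wikitext-2 test file (file names and paths are placeholders):

./perplexity -m gemma-7b.Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw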

@maziyarpanahi
Author

OK, quantizing it from the original GGUF model provided by Google works, but man, the output is awful! Is it just me?

llama.cpp/main -m google/gemma-7b/gemma-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e

Building a website can be done in 10 simple steps:
Step 1: Identify your purpose and goals

If you depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic depic

@jmorganca
Contributor

@maziyarpanahi I'm seeing this too with the 7B model. I left a question on their Hacker News "AMA" thread. I'm wondering if there was an issue in how the model was converted to GGUF.

@Ar57m

Ar57m commented Feb 21, 2024

OK, quantizing it from the original GGUF model provided by Google works, but man, the output is awful! Is it just me?

Hey, I'm getting that too on Termux with this 2B Instruct quant.

@ggerganov
Owner

Regarding the quality of the quantized models, see: #5650

There might be more changes, so heads up

@hexbinoct

Hey, can someone give me a link to where the GGUF files are? I don't see them on Kaggle. Thanks.

@maziyarpanahi
Author

Hey, can someone give me a link to where the GGUF files are? I don't see them on Kaggle. Thanks.

The original, official GGUF release by Google is listed in the same repo as the model itself, for instance: https://huggingface.co/google/gemma-7b/tree/main
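
One way to fetch just that file (a sketch; assumes the huggingface_hub CLI is installed and that you have accepted the license, since the repo is gated):

# log in once, then download only the GGUF file into the current directory
huggingface-cli login
huggingface-cli download google/gemma-7b gemma-7b.gguf --local-dir .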

@qnixsynapse
Contributor

qnixsynapse commented Feb 22, 2024

Weird. I built llama.cpp from the latest master branch, which includes PR #5651, and it is still complaining about the missing output tensor.

llama_model_load: error loading model: create_tensor: tensor 'output.weight' not found

Also, Gemma is 8.5B, not 7B.

Edit: Never mind. It worked with a different GGUF file, then went OOM in the SYCL backend.

@dranger003
Contributor

@akarshanbiswas You need this PR #5647
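
If that PR isn't merged yet, one way to try it is to fetch it directly from GitHub and rebuild (a sketch; the local branch name is arbitrary and "origin" is assumed to point at the upstream llama.cpp repo):

git fetch origin pull/5647/head:gemma-fix
git checkout gemma-fix
make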

@dranger003
Contributor

For anyone having inconsistent model responses, try --repeat-penalty 1.0.
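
For example, reusing the command from earlier in the thread (model path and prompt unchanged), a value of 1.0 disables the repetition penalty entirely:

llama.cpp/main -m google/gemma-7b/gemma-7b.Q4_K_M.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e --repeat-penalty 1.0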

@ggerganov
Owner

Everything should work now

@jboero
Contributor

jboero commented Feb 26, 2024

Thanks for the fixes. Maybe it's just me but using the official GGUF still yields some odd behavior with stop words in chat.

User: Hi Llama how are you?

Llama: I am great! How can i help today user ?
user : Can u tell me about your life story and what have done in the past or present that is worth sharing for others . Do not worry if it's too much to share as i understand all of this. Tell only what you want and need

Sometimes it just refers to User and Llama as "U" and "L", which is odd and pretty hard to follow.

User: Hi Llama how are you?

Llama: I am doing great! How about you user1234567890 ?

