Investigate gemma 2 generation quality #8240

ngxson · 2024-07-01T16:52:28Z

Initial reports can be seen in #8227.

Important

A note for everyone: if you think there's a bug in the llama.cpp tokenizer, please make sure to test with the HF transformers library first (see this comment for example).


qnixsynapse · 2024-07-01T16:55:49Z

Just to confirm, gemma2's window size is hard-coded, right?

ngxson · 2024-07-01T16:55:54Z

Ref comment: #8227 (comment)

Issues with math questions may indicate a problem with the tokenizer. We should first check whether the llama.cpp tokenizer produces the same result as gemma2's tokenizer.

ngxson · 2024-07-01T16:57:24Z

> Just to confirm, gemma2's window size is hard-coded, right?

The default value is hard-coded (so as not to break existing GGUFs), but it will be overridden by the value stored in the GGUF (if you re-convert to get a new GGUF).

Metadata key is gemma2.attention.sliding_window
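
The default-versus-override behaviour described here can be sketched as follows. This is only an illustration in Python, not llama.cpp's actual code: the `resolve_sliding_window` helper and the plain-dict metadata are hypothetical, and the fallback of 4096 is an assumption about the hard-coded default. Only the key name comes from the comment above.

```python
# Assumed fallback value; the thread does not state the hard-coded default.
ASSUMED_DEFAULT_SLIDING_WINDOW = 4096
SLIDING_WINDOW_KEY = "gemma2.attention.sliding_window"


def resolve_sliding_window(metadata: dict) -> int:
    """Use the value stored in the GGUF metadata if present, else the default."""
    return int(metadata.get(SLIDING_WINDOW_KEY, ASSUMED_DEFAULT_SLIDING_WINDOW))


old_gguf = {}                          # file converted before the key existed
new_gguf = {SLIDING_WINDOW_KEY: 2048}  # illustrative re-converted file

print(resolve_sliding_window(old_gguf))  # falls back to the assumed default
print(resolve_sliding_window(new_gguf))  # takes the value from the metadata
```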

TechieUser2517 · 2024-07-01T16:57:59Z

For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.

matteoserva · 2024-07-01T17:00:47Z

I tested all working implementations of the gemma-2-27b inference code.
The implementation in llama.cpp either outputs subpar results or breaks completely.

Reference models:

Compared implementations:

gemma.cpp unquantized (commit: b921cceb06e43a18a10cbcddedd00ffdbe4e10c6 )
chatllm.cpp Q8_0 (commit: 906de3eafe2b37967e4c5ab398ea8c59409000fc )
llama.cpp unquantized (commit: ab2c3de )
ai studio gemma-2-27b, temperature: 1.0

Not tested: hf transformers

launch commands

gemma.cpp:

./gemma --tokenizer ./gemma-tokenizer.spm --model 27b-it --compressed_weights ./gemma-2-27b-it-sfp.sbs --temperature 0.01

chatllm:

./obj/main -m ./gemma-2-27b-it-Q8_0.bin -i

llama.cpp:

$ python3 convert-hf-to-gguf.py ./gemma-2-27b-it/ --outfile ./gemma-2-27b-it.gguf
$ ./llama-server -ngl 15 -t 6 -c 8192 --host 0.0.0.0 -m ./gemma-2-27b-it.gguf --override-kv tokenizer.ggml.add_bos_token=bool:false

Outputs:

gemma.cpp:

tanto va la gatta al lardo che ci lascia lo zampino.

chatllm.cpp at Q8_0:

tanto va la gatta al lardo che ci lascia lo zampino.

ai studio with temperature 1.0:

tanto va la gatta al lardo che ci lascia lo zampino.

llama.cpp at temperature 0.01:

<bos><start_of_turn>user
Completa la frase: tanto va la gatta al lardo che...<end_of_turn>
<start_of_turn>model
... **se la scrofa la ingrassa.** 

Esta es una frase hecha italiana que significa que si alguien insiste [...]

Analysis of results

The model in llama.cpp spits out random Italian words and then starts speaking Spanish.
All the other implementations return the correct answer.
llama.cpp gives incorrect responses even with light quantization or without quantization.
The other implementations give the same correct response at Q8_0 or at high temperature.

I tried many other questions from my benchmarks. The other three implementations all agree on the same correct response. llama.cpp gives a different and incorrect response.

EDIT:
formatting and paths

qnixsynapse · 2024-07-01T17:02:16Z

9B-IT is working great and now I can increase the ctx size. :)

ngxson · 2024-07-01T17:26:20Z

> Issues with math questions may indicate a problem with the tokenizer. We should first check whether the llama.cpp tokenizer produces the same result as gemma2's tokenizer.

I don't know if I'm heading in the right direction or not:

from transformers import AutoTokenizer

# Reference tokenizer from HF transformers
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")

chktxt = 'Repeat the question and then answer it: Matteo has 20 apples, he buys 20 oranges. Then he discards half of his fruits equally. Then he discards a quarter of his fruits equally between apples and oranges. How many apples remain?'

# Skip the first ID (the BOS token) so the list lines up with the llama-server output below
tokenizer(chktxt)['input_ids'][1:]

# [41422, 573, 2872, 578, 1492, 3448, 665, 235292, 100006, 919, 235248, 235284, 235276, 34188, 235269, 693, 58015, 235248, 235284, 235276, 72638, 235265, 5040, 693, 9027, 2050, 3933, 576, 926, 16803, 16404, 235265, 5040, 693, 9027, 2050, 476, 9453, 576, 926, 16803, 16404, 1865, 34188, 578, 72638, 235265, 2250, 1767, 34188, 5822, 235336]

Compared to the llama.cpp output (using llama-server):

{"tokens":[41422,573,2872,578,1492,3448,665,235292,100006,919,235248,235284,235276,34188,235269,693,58015,235248,235284,235276,72638,235265,5040,693,63845,235256,3933,576,926,16803,16404,235265,5040,693,63845,235256,476,9453,576,926,16803,16404,1865,34188,578,72638,235265,2250,1767,34188,5822,235336]}

The word discards is tokenized differently:

original: 9027 "disc", 2050 "ards"
llama.cpp: 63845 "discard", 235256 "s"
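
A mismatch like this is easier to spot programmatically than by eye. The sketch below uses Python's standard `difflib` on an excerpt of the two ID lists above (the tokens around "Then he discards half of"). The `token_diff` helper is illustrative and not part of either library.

```python
from difflib import SequenceMatcher


def token_diff(reference, candidate):
    """Return (reference_span, candidate_span) pairs where the ID lists disagree."""
    matcher = SequenceMatcher(a=reference, b=candidate, autojunk=False)
    return [
        (reference[i1:i2], candidate[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Excerpt of the ID lists from the comment above.
hf_ids = [5040, 693, 9027, 2050, 3933, 576]
llama_ids = [5040, 693, 63845, 235256, 3933, 576]

for ref_span, cand_span in token_diff(hf_ids, llama_ids):
    print("HF:", ref_span, "llama.cpp:", cand_span)
```

Running the same helper on the full lists should isolate both occurrences of "discards" without manual scanning.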

tristandruyen · 2024-07-01T17:27:26Z

I noticed something possibly interesting:

With a GGUF created from scratch from Hugging Face, I get the same wrong result as @matteoserva.
With an old, outdated GGUF from bartowski (from 4 days ago), I get a much closer, but still slightly wrong, answer compared to gemma.cpp, AI Studio, etc.

The old but closer to correct GGUF [Q6_K_L] is from this commit (I matched the sha256 hashes to make sure)

AFAIK these initial versions were not created from scratch by llama.cpp, but were based on the f32 GGUF provided directly by Google on Kaggle, although AFAIK these initial GGUFs had various other issues...

I see 2 possible causes:

Something is still wrong with the conversion code
The official huggingface repo is broken in some way

Logs:

The first curl is with a "new" GGUF
The second curl is with the linked 4-day-old GGUF (both Q6_K_L)


❯ curl http://localhost:8080/v1/chat/completions \
              -H "Content-Type: application/json" \
              -d '{
        "temperature": 0.1,
        "messages": [
          {
            "role": "user",
            "content": "Completa la frase: tanto va la gatta al lardo che..."
          }
        ]
     }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"... **se la scrofa la ingrassa**. \n\nEsta es una frase hecha italiana que significa que si alguien insiste mucho en algo, al final lo conseguirá, aunque sea por casualidad o por la ayuda de alguien más. \n","role":"assistant"}}],"created":1719853875,"model":"unknown","object":"chat.completion","usage":{"completion_tokens":51,"prompt_tokens":24,"total_tokens":75},"id":"chatcmpl-uXDEjiyq0JGjwgg1qTlA2LGqEDhTxxsG"}⏎

❯ curl http://localhost:8080/v1/chat/completions \
              -H "Content-Type: application/json" \
              -d '{
        "temperature": 0.1,
        "messages": [
          {
            "role": "user",
            "content": "Completa la frase: tanto va la gatta al lardo che..."
          }
        ]
     }'
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"...ci si lascia lo zampino. \n<end_of_turn>","role":"assistant"}}],"created":1719853954,"model":"unknown","object":"chat.completion","usage":{"completion_tokens":12,"prompt_tokens":42,"total_tokens":54},"id":"chatcmpl-jKmHo2x1dViomeiWLc8K6F3o1WJRsccT"}⏎

launch command (latest llama.cpp 49122a8):

./llama-server -ngl 999 -c 4000 --host 0.0.0.0 \
    -m path_to.gguf \
    --chat-template gemma2

matteoserva · 2024-07-01T17:51:10Z

@tristandruyen I think the result you provided is still wrong even for the outdated gguf.

The response from outdated gguf is "ci si lascia lo zampino".
The only correct response for that question is "ci lascia lo zampino". I used that test for the exact reason that it doesn't admit any variation in the response.

tristandruyen · 2024-07-01T17:52:07Z

> @tristandruyen I think the result you provided is still wrong even for the outdated gguf.
>
> The response from outdated gguf is "ci si lascia lo zampino". The only correct response for that question is "ci lascia lo zampino". I used that test for the exact reason that it doesn't admit any variation in the response.

My bad, as I do not speak Italian my brain parsed it as correct... It's still kind of interesting that it's much closer to the correct response, though.

bartowski1182 · 2024-07-01T18:33:52Z

We still don't know what the conversion code Google used was, so it's possible that yes there's still something missing...

But the Google one definitely has a bad tokenizer, so if that was somehow fixed we may be able to see the proper performance, if only someone was able to contact them 🥲

ggerganov · 2024-07-01T18:34:08Z

@ngxson This indicates a problem with the tokenizer conversion. I don't understand the details well enough to fix it, but one simple change I found is:

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index 4a7f500f..d7eaf9cd 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -2345,7 +2345,7 @@ class Gemma2Model(Model):
     model_arch = gguf.MODEL_ARCH.GEMMA2
 
     def set_vocab(self):
-        self._set_vocab_llama_hf()
+        self._set_vocab_sentencepiece()
         self.gguf_writer.add_add_space_prefix(False)
 
     def set_gguf_parameters(self):

This would tokenize the word "discards" correctly, but there are other problems: added/special tokens are not added at all. So some fix for the vocabulary conversion is necessary.

JeroenAdam · 2024-07-01T18:39:40Z

For me, Gemma2 27b is going off the rails as soon as 'slot context shift' occurs. I get high quality output until that point.
My config: latest build b3274 CUDA on Quadro P5000, 7K context set and running Q3_K_M (uploaded yesterday by bartowski). Here is an example of Java code abruptly followed by totally unrelated stuff.

**3. Security config

java
@configuration
public class SecurityConfig extends WebSecurityConfigurerAdapter {

@Override
protected void configure(HttpSecurity http) throws Exception {
    http.authorizeRequests().
    addFilter(new ApiKeyAuthenticationFilter());
}

**Exploring the Nature of Light

Introduction:

Light is an essential aspect of our universe, influencing everything from the smallest atom to the largest galaxy.

Understanding the nature of light, how it interacts, and its properties are fundamental to many scientific fields, including physics, astronomy, and biology.

**Wave-Particle Duality: The Double Nature of Light

The nature of light has been a subject of much debate and experimentation.
It was not until the 20th century that a satisfactory explanation of light emerged - the concept of wave-particle duality.

0wwafa · 2024-07-01T18:49:48Z

> For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.

That's because, as I have been trying to explain for 2 weeks, the quantizing is "wrong".
Check my Q5 & Q6 and you will see the difference:
https://huggingface.co/ZeroWw/gemma-2-9b-it-GGUF

tristandruyen · 2024-07-01T18:55:30Z

> > For what it's worth, I have found that Gemma-2-27B quantized to Q6_K often makes mistakes/typos with proper names compared to Gemma-2-9B in Q8_0. I don't think the difference in quantization quality would be so large, but this could be something to watch for.
>
> That's because, as I have been trying to explain for 2 weeks, the quantizing is "wrong". Check my Q5 & Q6 and you will see the difference: https://huggingface.co/ZeroWw/gemma-2-9b-it-GGUF

Bartowski and others already provide GGUFs with output and embed tensors quantized as f16 as _L variants...

Also, I wouldn't call people wrong for providing standard GGUF variants with standard settings.
Your GGUFs are basically a new variant. That's why they got a new name in bartowski's repos...

matteoserva · 2024-07-01T19:02:19Z

From the hf blog.

"Running in float16 may be faster on your hardware, and results should be similar on the 9B model. Do note, however, that the 27B instruction-tuned model produces erratic outputs when using float16: you must use bfloat16 for that model weight."

Could this be relevant? I'm not familiar enough with the llama.cpp codebase to check this myself. The GGUF by Google is in float32 while the HF model is in bf16.

bartowski1182 · 2024-07-01T19:17:32Z

Honestly @matteoserva you may have a point, but I would hope that it's not relevant if we go bf16 to FP32 to fp16. I could try _XL versions where I leave embed and output at f32, LOL, but that better not make any difference; that would be pretty weird.

But yeah, if even converting to f32 doesn't work properly, it's a deeper issue. My guess is Google was referring to taking the bf16 weights and running them on the fly as fp16, which could definitely degrade performance in edge cases (I think we saw this in Qwen2?).

oldgithubman · 2024-07-01T19:20:47Z

"[!WARNING]
Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk."

https://huggingface.co/google/gemma-2-27b-it/discussions/17/files

matteoserva · 2024-07-01T19:37:51Z

@bartowski1182

Bfloat16->float32->float16 is generally an invalid conversion since float16 doesn't have the same range as the other two.

Is there a reason to think that the model weights are in the float16 range even if they are in the bfloat16 format?

qnixsynapse · 2024-07-01T19:44:16Z

Just to mention here: when I was converting the HF gemma2 to a bf16 GGUF, I noticed that the norm tensors were converted to fp16 instead of being copied directly from the HF safetensors, which were in bf16. I found that behaviour quite odd. I even supplied the --outtype bf16 parameter.

ngxson · 2024-07-01T19:49:04Z

> @ngxson This indicates a problem with the tokenizer conversion. I don't fully understand the details to fix it, but a simple observation that I found is using:
>
> This would tokenize correctly the word "discards", but there are other problems with added/special tokens not being added at all. So some fix for the vocabulary conversion is necessary

@ggerganov Simply applying this change, perplexity goes from 9.5613 to 7.7898.

My laptop is a potato, so I only tested with 3 chunks of wiki.test.raw; I don't know if I messed something up or not.

With self._set_vocab_llama_hf()

[1]4.3818,[2]8.5469,[3]9.5613,
Final estimate: PPL = 9.5613 +/- 2.42077

With self._set_vocab_sentencepiece() ==> makes more sense, since gemma 1 uses this

[1]4.4272,[2]8.4867,[3]7.7898,
Final estimate: PPL = 7.7898 +/- 1.78301
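
For reference, perplexity is the exponential of the mean negative log-likelihood per token, so lower is better. A minimal sketch with made-up log-probabilities follows; the `perplexity` helper is illustrative and is not the llama.cpp perplexity tool.

```python
import math


def perplexity(token_logprobs):
    """exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 1/4 to every observed token is exactly as
# uncertain as a uniform choice among 4 options.
uniform_over_four = [math.log(0.25)] * 10
print(perplexity(uniform_over_four))
```

In the logs above, the final estimate equals the last bracketed value, which suggests the bracketed numbers are running estimates after each chunk; with only 3 chunks the error bars are wide.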

arch-btw · 2024-07-01T20:25:48Z

Feel free to ignore this if it's not relevant but I noticed the json is invalid in the tokenizer.json on one line:

The line in question:

bartowski1182 · 2024-07-01T20:31:05Z

@matteoserva it's been shown that upcasting to FP32 before going to fp16 maintains a bit more accuracy than doing the conversion directly, but yes, you lose some of the range, and if Gemma 2 has a ton of extremely important values that fall outside the fp16 range, then I guess that could do it.

Does that really seem likely to be the issue? Especially when quantizing, almost zero and really almost zero are always going to basically be zero.. I'd think it more important to maintain the relationships in the middle of the range rather than the whole range (which probably matters more in training)

I suppose in an ideal world we could keep the embeddings and outputs at bf16, but then we lose GPU support (I think?)

Embeddings at f32 seem excessive for a quantized model, and I'd hope we never need to do that since that would be a huge increase in final size...

Maybe we need to prioritize GPU support of bf16 more, but I'm so far from the expertise required that I'm in no position to push for it lol

Take what I say with a grain of salt please 😅

bartowski1182 · 2024-07-01T20:32:17Z

@ngxson the problem with sentencepiece is it's not tokenizing the start and end tokens correctly, so it may have better PPL but it produces worse results

There's clearly some middle ground we're missing

matteoserva · 2024-07-01T21:01:47Z

@bartowski1182

Sorry for asking so many questions but I'm really missing the reason why you assume that converting to float16 is possible at all.

The maximum value for a float16 is 65504.
The maximum value of a bfloat16 is about 3.4 × 10^38.
The maximum value of a float32 is about 3.4 × 10^38.

I also expect most of the original weights to be greater than 65k since putting a constraint on their value would waste 20% of the bits of a bfloat16 value.

Is there some sort of quantization applied when converting gemma from bfloat16 to float32 to float16? In other words, how are you compressing a number from the range ±10^38 to another format whose range is ±65504? A naive division is not possible.

I suppose that models released directly in float32 format have the additional constraint that their weights are in a small range around 0, that's why the conversion to float16 is possible.
Gemma2 was instead released in bfloat16 format which doesn't allow a trivial conversion to float16.
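
One way to check the range argument empirically is to test whether a given value survives a cast to float16. The sketch below uses only Python's `struct` module. `bf16_truncate` emulates bfloat16 by keeping the top 16 bits of a float32 (truncation rather than round-to-nearest, which is a simplification), and the sample values are made up, not taken from Gemma 2.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value


def bf16_truncate(x: float) -> float:
    """Emulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_in_fp16(x: float) -> bool:
    """True if x can be packed as a finite float16 (struct format 'e')."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


for value in (0.0123, -3.5, FP16_MAX, 70000.0, 1e38):
    bf16_value = bf16_truncate(value)
    print(value, "->", bf16_value, "fits in fp16:", fits_in_fp16(bf16_value))
```

Applying a check like `fits_in_fp16` to every tensor of the bf16 checkpoint would show directly whether any weights fall outside the float16 range. Values that are merely very small do not raise here; they are rounded towards zero instead.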

steampunque · 2024-07-01T21:09:37Z

I ran some bench suites on my own Q6_K non-imatrix quant and the 9b model is doing well on benchmarks. It hits 0.902 on GSM8K, which is the highest I have seen on any model I have ever run, and it averaged 0.653 on BBH, which is quite good. My benches are different from the standard evaluation harness. For MC I require a match on a doublecheck question where I circular-shift all the answers by 1 letter to make sure the model follows the right answer, and I also use custom-prompted CoT where necessary (MCs which require thinking, GSM8K, etc.). I also zero-shot everything except for a couple of 3-shots for BBH categories (dyck languages and word ordering).
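
The doublecheck idea can be sketched roughly like this. It is a guess at the approach, not the actual benchmark code: `rotate_options`, `passes_doublecheck`, and the `ask` callback (which returns the index of the chosen option) are hypothetical names.

```python
def rotate_options(options, shift=1):
    """Circularly shift the answer list so each answer gets a new letter."""
    shift %= len(options)
    return options[-shift:] + options[:-shift]


def passes_doublecheck(ask, options, correct):
    """Credit the model only if it picks the correct answer text both times."""
    rotated = rotate_options(options)
    return options[ask(options)] == correct and rotated[ask(rotated)] == correct


options = ["Paris", "Rome", "Oslo", "Bonn"]


def knows_answer(opts):
    # Stand-in for a model that follows the answer content.
    return opts.index("Paris")


def always_first(opts):
    # Stand-in for a model that is biased towards option A.
    return 0


print(passes_doublecheck(knows_answer, options, "Paris"))
print(passes_doublecheck(always_first, options, "Paris"))
```

A model that only got the first question right through position bias fails the shifted question, which is the point of the check.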

This quant was generated prior to the sliding attention patch, but that shouldn't make a difference since I limit CoT to 2500 tokens.

bench_gemma-2-9b-it.json

Rotatingxenomorph · 2024-07-05T07:13:23Z

Temp 1.0 seems to be a bit too high for Gemma 2 27b. What is the 'natural' temp for this model, does anyone know?

bfroemel · 2024-07-05T07:24:39Z

1.0 is the default temperature set in aistudio. Did you notice any detrimental effect regarding a temp of 1.0?

MoonRide303 · 2024-07-05T07:39:05Z

> Temp 1.0 seems to be a bit too high for Gemma 2 27b. What is the 'natural' temp for this model, does anyone know?

I've noticed both temperature 0 and 1.0 used in Google code (in gemma.cpp repo):

Rotatingxenomorph · 2024-07-05T11:14:54Z

> 1.0 is the default temperature set in aistudio. Did you notice any detrimental effect regarding a temp of 1.0?

It seemed to have some trouble with numbers/math at temp 1.0.

gemma-2-27b-it-Q8_0.gguf --top-k 0 --min-p 0.0 --top-p 1.0 --color -t 5 --temp 1 --repeat_penalty 1 -c 4096 -n -1 -ngl 14 --conversation -i

at temp 1.0 I get this:

how many years did aliens come out before alien 3?

Aliens (the sequel to Alien) came out 7 years before Alien 3.

Here's the breakdown:

Alien: 1979
Aliens: 1986
Alien 3: 1992

at temp 0 I get:

how many years did aliens come out before alien 3?

"Aliens" was released in 1986.

"Alien 3" was released in 1992.

Therefore, "Aliens" came out 6 years before "Alien 3".

Let me know if you have any other movie trivia questions!

Rotatingxenomorph · 2024-07-05T13:59:40Z

I tried it on the Gemini 1.5 Flash API at temp 1 AND temp 0 and it also got it wrong, so I guess llama.cpp is off the hook!

ggerganov · 2024-07-05T14:52:59Z

Btw, for these kinds of queries that require known facts you should always use temp == 0.0f
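
To illustrate why temperature matters for factual queries, here is a minimal sampling sketch. It treats a temperature of 0 as greedy argmax, which is an assumption about what temp == 0 means here, and the `sample_token` helper and the logits are made up for illustration.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index: argmax at temperature 0, softmax sampling otherwise."""
    if temperature <= 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical logits for the tokens "5", "6" and "7"; "6" is only slightly ahead.
logits = [1.0, 2.0, 1.8]
rng = random.Random(0)

print(sample_token(logits, 0.0, rng))                      # always the top token
print([sample_token(logits, 1.0, rng) for _ in range(10)]) # can pick the others
```

With a near-tie like this, sampling at temperature 1.0 picks a non-top token a large fraction of the time, while greedy decoding is deterministic.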

eskeletor97 · 2024-07-05T16:25:30Z

Is huggingface/transformers#31775 relevant to the llama.cpp implementation?

matteoserva · 2024-07-05T16:50:07Z

I think it's already correct in llama.cpp (feel free to correct me if I'm wrong):

llama.cpp/src/llama.cpp

Line 11572 in be20e7f

struct ggml_tensor * KQ_mask_l = (il % 2 == 0) ? KQ_mask_swa : KQ_mask;
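
The line above selects the sliding-window mask on even layers and the full causal mask on odd layers. A small Python sketch of that alternation follows. The helper names are illustrative, and the exact window boundary (a query at position q sees keys with q - k < window) is an assumption rather than something stated in the thread.

```python
def causal_mask(n_tokens):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(n_tokens)] for q in range(n_tokens)]


def sliding_window_mask(n_tokens, window):
    """Causal mask that additionally hides keys more than `window` - 1 steps back."""
    return [[k <= q and q - k < window for k in range(n_tokens)]
            for q in range(n_tokens)]


def mask_for_layer(il, n_tokens, window):
    """Even layers use the sliding-window mask, odd layers the full causal mask."""
    if il % 2 == 0:
        return sliding_window_mask(n_tokens, window)
    return causal_mask(n_tokens)


# Last query row for a 4-token sequence with a window of 2.
print(mask_for_layer(0, 4, 2)[3])
print(mask_for_layer(1, 4, 2)[3])
```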

steampunque · 2024-07-05T17:43:32Z

Use proper nouns, it helps the model know what you are talking about.

lm how many years did Aliens come out before Alien 3?
Here's the breakdown:

* **Aliens** was released in 1986.
* **Alien 3** was released in 1992.

Therefore, **Alien 3** came out **6 years** after **Aliens**.

Or just multiturn, should work fine, model will create proper nouns in context.

bash-5.1$ lm when did movies aliens and  alien 3 come out?
Here are the release dates for the movies you asked about:

* **Alien** - June 25, 1979
* **Aliens** - July 18, 1986
* **Alien 3** - May 11, 1992 


Let me know if you have any other movie release dates you'd like to know! 

bash-5.1$ lmc how many years did aliens come out before alien 3?
Aliens was released in 1986 and Alien 3 in 1992.  

There are **6 years** between the release of Aliens and Alien 3.

Rotatingxenomorph · 2024-07-06T10:24:00Z

> Use proper nouns, it helps the model know what you are talking about.

Hah! I first got the problem in the context of it writing an essay about Alien 3, but I couldn't reproduce it. I think another part of it might be that Alien was released 7 years before Aliens, so maybe that's where the network is getting that urge from?

cuelebra · 2024-07-06T19:21:53Z

Here are two prompts that were run at 0 temp with Gemma 27B Q8_0
https://pastebin.com/9UCkX201
You can remove the last output of the model up to <start_of_turn>model and test it yourself

The difference is that the second prompt has one more paragraph of lorem ipsum, but in fact just adding a line break to the last paragraph causes an identical degradation of formatting and coherence ("Bard" instead of "Gemma", double space instead of single space in two places).

TechieUser2517 · 2024-07-06T21:28:37Z

There have been reports that using a higher f_final_logit_softcapping than the default value of 30 (e.g. 50) may solve certain quality issues on Gemma-2-27B, has anybody tried? It would be useful if this value (and possibly that of f_attn_logit_softcapping as well) could be changed without requantizing the model.
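
For context, logit soft-capping squashes logits smoothly into the interval (-cap, +cap) using tanh. A minimal sketch of the function follows; the `softcap` name is illustrative, and the cap values 30 and 50 are the ones mentioned in the comment above.

```python
import math


def softcap(logit, cap):
    """Smoothly limit a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


for raw in (1.0, 25.0, 100.0):
    print(raw, softcap(raw, 30.0), softcap(raw, 50.0))
```

Small logits pass through almost unchanged, while large ones saturate near the cap, so raising the cap from 30 to 50 only changes the output when raw logits are already large.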

cuelebra · 2024-07-06T22:09:23Z

@BugReporterZ I set final_logit_softcapping to 50 in config.json and, just in case, replaced the default 30.0f with 50.0f in the llama.cpp file, then requantized the model. The output for the above prompts was unaffected.

ggerganov · 2024-07-07T08:17:45Z

For anyone running tests relying on context shift, make sure to try #8348 since there was a bug that affected the quality of context shifts for Gemma2 models

cuelebra · 2024-07-07T12:17:34Z

Someone found out that the llama.cpp gemma2 tokenizer splits a certain word, which is defined as a single token in tokenizer.json, into multiple tokens. Is this expected or not?

curl -X POST -H "Content-Type: application/json" -d '{"content":"[toxicity=0]"}' http://localhost:8080/tokenize
{"tokens":[235309,1373,235293,235276,235307]}

ngxson · 2024-07-07T13:08:29Z

@cuelebra The mentioned token probably isn't used by Gemma (maybe Google reuses the same tokenizer for other models).

HF transformers outputs the same thing:

from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
tokenizer("[toxicity=0]")
# [2, 235309, 1373, 235293, 235276, 235307]

This token needs to be marked as a special token to make it work, but that's not the case; see: https://huggingface.co/google/gemma-2-9b/blob/main/tokenizer_config.json
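
When comparing the two outputs, note that the HF list starts with the BOS token (ID 2) while the llama-server /tokenize response above does not. A small sketch that normalizes for this before comparing; `strip_bos` is an illustrative helper, not part of either API.

```python
BOS_ID = 2  # first ID in the HF transformers output above


def strip_bos(ids, bos_id=BOS_ID):
    """Drop a single leading BOS token, if present."""
    return ids[1:] if ids and ids[0] == bos_id else ids


hf_ids = [2, 235309, 1373, 235293, 235276, 235307]   # HF transformers
server_ids = [235309, 1373, 235293, 235276, 235307]  # llama-server /tokenize

print(strip_bos(hf_ids) == strip_bos(server_ids))
```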

ngxson · 2024-07-07T13:12:17Z

Important

A note for everyone: if you think there's a bug in the llama.cpp tokenizer, please make sure to test with the HF transformers library first (see my comment above for example).

AUTOMATIC1111 · 2024-07-07T16:02:34Z

This is a difference between how the corporate hosted implementation and llamacpp work. If it's different for this particular token, maybe there are other cases for which tokenization is different from how google trained the model. It's entirely possible that the transformers implementation of the tokenizer for gemma is not correct, especially considering they had other bugs with implementation already.

oldgithubman · 2024-07-08T02:34:23Z

> Btw, for this kind of queries that require known facts you should always use temp == 0.0f

What is the 'f' for?

AUTOMATIC1111 · 2024-07-08T04:56:56Z

for letting everyone know that it's a single precision floating point number

compilade · 2024-07-08T05:39:43Z

HTML tags are not yet tokenized correctly by Gemma-2's tokenizer in llama.cpp. I think I managed to fix this in #8228, but it unfortunately requires re-converting Gemma models with the changes from that branch, see #8228 (comment)

oldgithubman · 2024-07-11T22:45:30Z

> HTML tags are not yet tokenized correctly by Gemma-2's tokenizer in llama.cpp. I think I managed to fix this in #8228, but it unfortunately requires re-converting Gemma models with the changes from that branch, see #8228 (comment)

Are you guys planning to merge that branch or am I waiting around like an idiot for nothing? I see related changes happening elsewhere. Just wondering when I can re-convert. Again, let me know if there's anything I can do to help speed it up

progmars · 2024-07-28T15:20:25Z

> Formatting is a serious issue with the model. It really isn't able to predict the correct formatting using previous responses at all in my use case.

That's my experience, too. My instructions had clear directions to use * (asterisks) for actions and I had dialog examples. Gemma stubbornly kept using quotes around speech and did not use asterisks around actions, and kept using double newlines between paragraphs. After a dozen messages (which I corrected manually), Gemma finally stopped using quotes and started using asterisks correctly. However, nothing helped against double newlines. I haven't yet seen such a stubborn LLM when it comes to formatting.

github-actions · 2024-10-16T01:11:06Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

ngxson added the enhancement New feature or request label Jul 1, 2024

ngxson mentioned this issue Jul 1, 2024

gemma2: add sliding window mask #8227

Merged

2 tasks

matteoserva mentioned this issue Jul 1, 2024

Bug: quantized gemma 27b output still wrong after tokenizer fix and soft capping #8183

Closed

ngxson mentioned this issue Jul 1, 2024

Fix gemma2 tokenizer convert #8244

Merged

4 tasks

ngxson mentioned this issue Jul 7, 2024

Bug: Gemma2 tokenization seems incorrect. #8349

Closed

sand-bit mentioned this issue Jul 28, 2024

[model support] Requesting support for Gemma 2 kvcache-ai/ktransformers#10

Closed

github-actions bot added stale and removed stale labels Aug 28, 2024

github-actions bot added the stale label Oct 1, 2024

github-actions bot closed this as completed Oct 16, 2024
