gemma2: add sliding window mask #8227
Conversation
Thanks for your work. I tested your PR by regenerating the gguf from hf with: The model is still unable to solve questions that are easy for aistudio gemma2. It could be that there is something missing in your implementation, or there are other issues besides SWA. Example problem (answer is 7 or 8):
I ran the inference without offloading the entire model to the GPU since I don't have enough VRAM.
@matteoserva I think it's normal for a small model like this one to make a math mistake. What this PR is trying to address is that gemma 2 currently breaks after generating more than 4096 tokens. We could try, for example, inputting a long document (like Shakespeare) and then asking it something related.
Sorry, I was in a hurry and I didn't explain why I made that post. With this PR (and also without it) the model breaks on even simple questions, well before the 4096-token limit. It could be related to how SWA was implemented, but I'm not sure.
@matteoserva I think the bug that you described is unrelated to this PR. The goal here is to make no change if you're generating fewer than 4096 tokens. Probably you should open an issue so other users can share their results (e.g. with different quantizations, sampling settings, etc.)
Can we have one or two test cases (prompt + expected outcome) that work in aistudio and should work with llama.cpp and this PR?
@bfroemel I heard other users report that after 4096 tokens the generation breaks completely (gibberish output), so probably you just need to input 4096 tokens (or more, it doesn't need to be exact), then see if it still speaks English or is drunk. (If someone knows this better, feel free to correct what I said.)
I tested this PR using gemma-9b unquantized. Without this PR:
With this PR:
@ngxson Attached is a test prompt which should be about 6k tokens. I tried it on aistudio (I only have the 27b-it model available), and I get this output:
Of course, I regenerated the output from both aistudio and llama.cpp a couple of times: aistudio always tried to answer the question in the prompt,
Perplexity with 8192 context improves a lot.
Perfect, thanks @slaren @bfroemel. To correct what I said earlier: without SWA, the model does not output gibberish, but repeated output (ref: #8197 (comment)). That explains what @bfroemel got from the master branch. However, even with this PR, it seems like we still have issues with generation quality in general. The test with video transcription seems to be a good idea (better than Shakespeare), so let's keep testing with that.
Uhm, just to correct my report: now I see the same repeated text on the master branch (the thing I saw earlier was polluted by ollama; on pure llama.cpp master I see the repeating mess).
-> OK, also focusing on the video transcript test from now on.
Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
src/llama.cpp
Outdated
```cpp
if (lctx.model.arch == LLM_ARCH_GEMMA2) {
    GGML_ASSERT(lctx.inp_KQ_mask_SWA);
    GGML_ASSERT(hparams.n_sliding > 0);
    data = (float *) lctx.inp_KQ_mask->data;
    data_swa = (float *) lctx.inp_KQ_mask_SWA->data;
    // because layer masks alternate for gemma 2, we only need to take the first 2 layers
}
```
This can be simplified a bit.
```diff
-if (lctx.model.arch == LLM_ARCH_GEMMA2) {
-    GGML_ASSERT(lctx.inp_KQ_mask_SWA);
-    GGML_ASSERT(hparams.n_sliding > 0);
-    data = (float *) lctx.inp_KQ_mask->data;
-    data_swa = (float *) lctx.inp_KQ_mask_SWA->data;
-    // because layer masks alternate for gemma 2, we only need to take the first 2 layers
-}
+if (lctx.inp_KQ_mask_SWA) {
+    data_swa = (float *) lctx.inp_KQ_mask_SWA->data;
+}
```
If I am not mistaken, mistral uses SWA every layer. So maybe this needs to be separated to allow having only inp_KQ_mask_SWA? Will the same implementation work?
I've just looked at the mistral reference implementation; they seem to use a different mask for each layer. Link: https://github.com/mistralai/mistral-inference/blob/main/src/mistral_inference/cache.py
So I think my previous version (using std::vector) can handle that. Do you think I should revert the change?
It surprises me a bit, since mistral's quality doesn't seem to degrade even though it's missing SWA (or does it only break after 4096 tokens?)
I have been looking at this code for a while and reviewing the mistral paper, and I think this is an implementation of the rolling buffer cache rather than sliding window attention. As far as I can tell, mistral has the same sliding window of 4096 tokens on each layer. Knowing that, it is possible to reduce the size of the KV cache to the sliding window size, but that requires some additional housekeeping so that e.g. the rope still receives the absolute positions of the tokens, while the data is actually stored at position pos % sliding_window. But maybe I am misunderstanding something, can you point me to the specific code?
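To make the rolling-buffer idea above concrete, here is a minimal, hypothetical C++ sketch (illustrative names and layout, not llama.cpp's actual KV cache code): keys are stored at slot pos % window, while the absolute position is kept alongside so that RoPE and the attention mask can still see it.

```cpp
// Hypothetical sketch of a rolling-buffer KV cache, not llama.cpp code.
#include <algorithm>
#include <cstdint>
#include <vector>

struct rolling_kv_cache {
    uint32_t             window;   // sliding window size, e.g. 4096
    uint32_t             head_dim; // per-token key size (values omitted for brevity)
    std::vector<float>   k;        // window * head_dim floats
    std::vector<int32_t> pos;      // absolute position stored in each slot, -1 if empty

    rolling_kv_cache(uint32_t window, uint32_t head_dim)
        : window(window), head_dim(head_dim), k(window * head_dim), pos(window, -1) {}

    // store the key of the token at absolute position p, overwriting the slot
    // that previously held position p - window
    void put(int32_t p, const float * key) {
        const uint32_t slot = (uint32_t) p % window;
        std::copy(key, key + head_dim, k.begin() + slot * head_dim);
        pos[slot] = p;
    }

    // a cached slot is visible to the query at absolute position p only if it
    // is filled and lies within the sliding window
    bool visible(uint32_t slot, int32_t p) const {
        return pos[slot] >= 0 && p - pos[slot] < (int32_t) window;
    }
};
```

The extra housekeeping mentioned above is essentially the pos[] bookkeeping: the graph still works with absolute positions, only the storage index wraps.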
Yes, it should be possible. The thing I cannot figure out is how to avoid calling llama_kv_cache_find_slot() per-layer - seems it would be a big waste to do it like this, although it would generalize to support arbitrary KV cache layer sizes.
Yeah, I assume the code is a reference implementation, so not very good quality. Having a rolling buffer would be ideal for llama.cpp, but it seems like too many changes. This is mostly to answer your question earlier: will the same implementation work? Yes, it works with a different sliding window mask per layer, but it will be a waste of memory without a rolling buffer.
How would the mask differ in each layer? My understanding is that the mask would be the same for all the layers, and it relies on the fact that the states in the KV cache depend on all the previous tokens to be able to access information beyond the sliding window.
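A small, hypothetical illustration of that point (not code from the PR): even with the same window W on every layer, information can still propagate roughly layer * W tokens back, because each layer attends to hidden states that already summarize the previous window of the layer below.

```cpp
// Hypothetical illustration: theoretical lower bound of the receptive field
// at a given layer when every layer uses the same sliding window.
#include <algorithm>
#include <cstdio>

static int oldest_reachable_pos(int i, int layer, int window) {
    // position i at this layer can be influenced by input tokens as far back
    // as i - layer * window (clamped at 0)
    return std::max(0, i - layer * window);
}

int main() {
    const int window   = 4096;
    const int layers[] = {1, 2, 4, 8};
    for (int layer : layers) {
        printf("layer %d: token 8191 can be influenced back to position %d\n",
               layer, oldest_reachable_pos(8191, layer, window));
    }
    return 0;
}
```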
I looked deeper into the paper; it seems like I missed something.
Looking at this figure:
And the explanation:
I'd assume that the mask for each layer is shifted by the size of window - 1, for example:
- layer 0:
0, 0, 0, 1, 1
- layer 1:
0, 0, 1, 1, 0
- layer 2:
0, 1, 1, 0, 0
- ...
But then what I don't understand is the phrase "position i of the layer k, hi, attends to all hidden states from the previous layer with positions between i − W and i". On the surface, it seems to explain how layer 1 knows about the tokens that fall outside of its window (which are in layer 0), but then what's not clear to me is how one layer can attend to the previous one.
Also, looking at the HF implementation code, it seems like there is no such thing. They just add the same attention mask for all layers: https://github.com/huggingface/transformers/blob/e65502951593a76844e872fee9c56b805598538a/src/transformers/models/mistral/modeling_mistral.py#L354
This can be simplified a bit.
Changed in ed5496f
I think for now we can keep the implementation this way; I'll need more time to figure out how mistral actually uses SWA.
But then what I don't understand is the phrase "position i of the layer k, hi, attends to all hidden states from the previous layer with positions between i − W and i". On the surface, it seems to explain how layer 1 knows about the tokens that fall outside of its window (which are in layer 0), but then what's not clear to me is how one layer can attend to the previous one.
I think it doesn't directly "attend" to the tokens from the previous layer. It just receives information about those tokens through the output of the previous layer.
I have also been trying to understand this concept for the past 3 days. I did not pay attention to this when Mistral v1 was released, and I remember seeing that Mistral v2 removed SWA.
Do quants need to be redone, or is this just for the inference side?
@Dampfinchen it's recommended to re-generate, but not required. We have a default value for the added metadata, so at least existing ggufs won't break.
The only benefit presumably being that long-context imatrix measurements are more accurate?
src/llama.cpp
Outdated
```diff
@@ -2099,6 +2101,7 @@ struct llama_hparams {
     uint32_t n_ff_shexp = 0;
     uint32_t n_expert_shared = 0;
     float expert_weights_scale = 0.0;
+    uint32_t n_sliding = 0; // sliding window attention (SWA)
```
```diff
-    uint32_t n_sliding = 0; // sliding window attention (SWA)
+    uint32_t n_swa = 0; // sliding window attention (SWA)
```
Changed in ed5496f
src/llama.cpp
Outdated
```diff
@@ -2661,6 +2664,9 @@ struct llama_context {
     struct ggml_tensor * inp_s_mask; // F32 [1, n_kv]
     struct ggml_tensor * inp_s_seq;  // I32 [n_kv, n_batch]
+
+    // KQ mask per layer, used by sliding window attention (gemma 2)
+    struct ggml_tensor * inp_KQ_mask_SWA;
```
```diff
-    struct ggml_tensor * inp_KQ_mask_SWA;
+    struct ggml_tensor * inp_KQ_mask_swa;
```
Changed in ed5496f
src/llama.cpp
Outdated
```cpp
float * data     = (float *) lctx.inp_KQ_mask->data;
float * data_swa = nullptr;
const llama_pos n_keep_swa = hparams.n_sliding - batch.n_tokens;
```
I don't understand the meaning of n_keep_swa. Seems this won't work with batches of multiple sequences.
Yeah, I'm not sure if I'm doing it correctly: it is to emulate the rolling. If we input n_tokens, then we only keep n_sliding - n_tokens tokens in the cache, so the total number of tokens used for attention is n_tokens plus n_sliding - n_tokens, which equals n_sliding.
Seems to me just restricting the position delta to be less than n_swa is enough:
```diff
diff --git a/src/llama.cpp b/src/llama.cpp
index 71b7ef62..fa207234 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -12722,7 +12722,7 @@ static void llama_set_inputs(llama_context & lctx, const llama_batch & batch) {
                     // may need to cut off old tokens for sliding window
                     if (data_swa) {
-                        if (pos - lctx.kv_self.cells[i].pos > n_keep_swa) {
+                        if (pos - lctx.kv_self.cells[i].pos >= hparams.n_sliding) {
                             f = -INFINITY;
                         }
                         data_swa[h*(n_kv*n_tokens) + j*n_kv + i] = f;
```
This way, in SWA layers, the token with position 4096 does not "see" the token with position 0, but does "see" the token at position 1.
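A simplified sketch of what that condition does for a single query row (hypothetical names, not the exact llama.cpp code), just to make the example above concrete:

```cpp
// Hypothetical, simplified mask fill for one query token at absolute position
// `pos`: causal masking plus the sliding-window cutoff from the diff above
// (position delta >= n_swa => masked out).
#include <cmath>
#include <cstddef>
#include <vector>

void fill_swa_mask_row(std::vector<float> & mask_swa,      // one value per KV cell
                       const std::vector<int> & cell_pos,  // absolute position of each KV cell
                       int pos, int n_swa) {
    for (size_t i = 0; i < cell_pos.size(); ++i) {
        float f = (cell_pos[i] > pos) ? -INFINITY : 0.0f;   // causal part
        if (pos - cell_pos[i] >= n_swa) {
            f = -INFINITY;                                   // older than the sliding window
        }
        mask_swa[i] = f;
    }
}
// e.g. with n_swa = 4096 and pos = 4096: the cell at position 0 is masked
// (4096 - 0 >= 4096), while the cell at position 1 is still visible.
```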
OK, thanks, that's clear to me now. I changed this code in ed5496f
I think the purpose of this PR is that right now the context size is fixed at 4K, and this enables sliding window attention to get accurate results at 8K, so it's very important.
Perplexity improved a bit with the latest change.
Looking really good, but I'm still seeing seemingly degraded performance/quality compared to the aistudio Gemma 2 model output :/ I am able to test the 27b-it fp16 model locally (same temperature and top-p). Maybe just expected degradation, because originally the model was bf16? Here is the same perplexity test for the 27b-it fp16:
@bfroemel Degraded quality is not expected - show us the exact commands that you are using; otherwise we mainly ignore such comments, because there are many ways to use the examples incorrectly and in the majority of cases it is a user error.
Long-term we should refactor the KV cache code to support SWA properly and with less memory. For now we can merge this so that we have Gemma2 support.
Let's merge when CI passes.
@ggerganov At first I thought it was something related to longer context and maybe a bug in the SWA implementation, but looking back at @matteoserva's test, it is as simple as this:
Locally with llama.cpp + the applied PR, I get the confused answer of 18 apples, while the model on aistudio correctly answers 8 apples (also set to a temperature of 0). Gemma 2 goes through these reasoning problems step by step, like @matteoserva already showed, and along the way on llama.cpp it probably confused two objects (fruits and apples) and ended up with the wrong result. -> Probably best to open a new issue.
@bfroemel Have you tried it in bf16 instead of fp16?
Ah of course, I can try this out without offloading. /edit: grr, now I am confusing stuff. The test is still ongoing. /edit2: same bad result (18 apples). So it's not bf16.
@bfroemel @qnixsynapse @matteoserva I moved the discussion related to generation quality to #8240, could you copy-paste your results there? (And also move the discussion there.) Thank you.
@bfroemel You have an extra BOS token in your command. No need to add the token explicitly because it is automatically added. Use
(@ggerganov I am feeling a bit dumb now :) Thanks for this hint! Indeed, the extra BOS token degrades the model performance significantly further. With a correct prompt, at least I am getting a good apple count for that particular prompt.)
This is a cherry-pick of ggerganov/llama.cpp#8227
* gemma2: add sliding window mask
* fix data_swa uninitialized
* better naming
* add co-author

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>

* replace list with single tensor
* update
* llama : minor styling
* convert : add sanity check for query_pre_attn_scalar
* fix small typo in README

---------

Co-authored-by: Arlo Phoenix <arlo-phoenix@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This is a hack to support sliding window attention for gemma 2 by masking past tokens.
The goal is to make it work. While the ideal solution is to have per-layer KV cache management (with a different n_kv per layer), this seems to be quite challenging (ref: #3377 (comment)). This implementation is mainly inspired by @arlo-phoenix 's work: arlo-phoenix@265a8f2
(Test & perplexity below in the comment)
Link to working gguf: https://huggingface.co/bartowski/gemma-2-9b-it-GGUF/tree/main
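As a rough sketch of the approach described above (illustrative names only, and the even/odd assignment is an assumption rather than a statement about llama.cpp's actual layer numbering): Gemma 2 alternates sliding-window and full-attention layers, so the graph only needs to pick between two precomputed KQ masks per layer instead of managing a separate KV cache size per layer.

```cpp
// Illustrative sketch, not the actual llama.cpp graph-building code: two masks
// are built once per batch, and each layer selects one of them.
struct ggml_tensor;  // opaque stand-in for the real ggml type

struct kq_masks {
    ggml_tensor * full; // regular causal mask: all past tokens visible
    ggml_tensor * swa;  // causal mask with tokens older than n_swa masked out
};

// assumption: even layers use sliding window attention, odd layers use full
// attention; the important part is only that the choice alternates per layer
static ggml_tensor * mask_for_layer(const kq_masks & m, int il) {
    return (il % 2 == 0) ? m.swa : m.full;
}
```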