[Request] Create llamafile for Gemma 7B #269
Comments
We're very excited about the new release too. This is being worked on.
Eagerly waiting for it, as the latest build of llama.cpp can run the GGUF version of quantized Gemma, though it produces very bad quality outputs. Hopefully the awaited PR merge will make things better. As people have pointed out, this issue is probably responsible for the bad output.
@jart Gemma.cpp is out: https://github.com/google/gemma.cpp
Thanks for pointing out that project @dartharva. They have an elegant codebase. Gemma 7b is able to solve riddles that I've only seen Mixtral solve, if you have an AVX512 machine: google/gemma.cpp#23
I am very excited for this as well. @jart thank you so much for making such a valuable project for the community and, importantly, maintaining it with efforts like this! Perhaps we could create a llamafile zoo repository and accept llamafiles from the community to crowdsource creating these!
I can help in creating a llamafile for Gemma (both 2B and 7B), if it's needed.
I have good news and more good news re: Gemma. In this tutorial, you'll learn (1) how to run Gemma with llamafile, and (2) how to build the official gemma.cpp project as a cross-platform "gemmafile" that you can run on six OSes and two architectures, similar to llamafile.

(1) Running Gemma with llamafile

Having finished synchronizing upstream, I'm now confident that llamafile is able to offer you a first-class Gemma 7b experience. The screenshot above shows llamafile running gemma-7b-it on my Radeon 7900 XTX.

Gemma appears to be a model you can put to work too. In the screenshot above, I asked Gemma to summarize an old rant from USENET with 3,779 words. Gemma's reading speed on my GPU was ~2,000 tokens per second. It successfully produced a summary. The summary is good, but a bit on the enterprisey side compared to Mistral, which tends to be more focused and succinct.

To get started using llamafile with Gemma, you can either wait for the next llamafile release, or you can build llamafile from source today by running the following commands on your preferred OS, which must have a Bourne-compatible shell.
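Roughly, the source build looks like this (a sketch assuming the Mozilla-Ocho/llamafile repository layout; check the project README for the authoritative steps):

```sh
# Sketch: build the llamafile tools from source. The make targets below are
# assumptions based on the repo README; cosmocc may need to be on your PATH.
git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make -j8
sudo make install PREFIX=/usr/local   # optional: puts llamafile on your PATH
```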
Next, go to the Gemma Kaggle page and agree to the EULA for Gemma. Even though it's Kaggle, agreeing to the EULA there will also grant you access to the Hugging Face repo at https://huggingface.co/google/gemma-7b-it/, which is where you need to go to download the GGUF file. Do not be discouraged by the 34.2 GB file size; these are f32 weights. After you've downloaded them, you can use the following command to turn them into a 5.8 GB file:
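Something along these lines should do it (a sketch: the llamafile-quantize tool name and the gemma-7b-it.gguf input filename are assumptions; llama.cpp's quantize takes the same arguments):

```sh
# Sketch: requantize the f32 GGUF down to Q5_K_M, the type used below.
llamafile-quantize gemma-7b-it.gguf ~/weights/gemma-7b-it-Q5_K_M.gguf Q5_K_M
```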
Now, to use Gemma Instruct, you need to follow its specific prompt format. Here's an example of how to do that on the command line. First, we'll define a bash function named gemma:

```sh
gemma() {
  llamafile -m ~/weights/gemma-7b-it-Q5_K_M.gguf \
    -e -p "<bos><start_of_turn>user\n$*<end_of_turn>\n<start_of_turn>model\n" \
    --temp 0 -ngl 999 -c 0
}
```

You can then ask gemma questions or give it instructions as follows:
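For example (the question itself is just an illustration):

```sh
gemma "Why is the sky blue?"
```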
(2) Building gemma.cpp as a gemmafile

While llamafile is able to run the Gemma model, it doesn't run it exactly the same as the gemma.cpp project that was written by Google themselves. For example, one thing I've noticed is that gemma.cpp usually gives much more succinct answers to questions, which means it appears to share even more of Mistral's strengths than I mentioned earlier.

The people who wrote gemma.cpp are also prolific, e.g. Jan Wassenberg, who coded the world's fastest sorting function. He brought his SIMD expertise to Gemma and made the project go as fast on CPU as it's possible for Gemma to go on CPU, faster even than llamafile. Please note, however, that CPU goes much slower than GPU (about ~11 tokens per second), although having good CPU support has many benefits, such as longevity, full support on all OSes, and not requiring non-libre code to be linked into the address space.

The good news is that gemma.cpp supports Cosmopolitan Libc out of the box. I didn't need to do anything to make this happen.

First, you need to install cosmocc v3.3.2 or newer, available here: https://github.com/jart/cosmopolitan/releases/tag/3.3.2
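Installing it is just a matter of unzipping the release somewhere on your PATH (a sketch; the cosmocc-3.3.2.zip asset name is an assumption based on the release tag):

```sh
# Sketch: install the cosmocc toolchain into ~/cosmocc and put it on PATH.
mkdir -p ~/cosmocc && cd ~/cosmocc
wget https://github.com/jart/cosmopolitan/releases/download/3.3.2/cosmocc-3.3.2.zip
unzip cosmocc-3.3.2.zip
export PATH="$HOME/cosmocc/bin:$PATH"
```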
Now that you've installed cosmocc, you can configure gemma.cpp's cmake build system to build your gemmafile as follows:
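Here's a sketch of what that configure/build step can look like (the cosmocc/cosmoc++ wrapper names are standard for the toolchain, but the exact CMake options and the gemma target name are assumptions; adjust to gemma.cpp's current build docs):

```sh
# Sketch: build gemma.cpp with the Cosmopolitan toolchain to get an
# Actually Portable Executable, then call it "gemmafile".
git clone https://github.com/google/gemma.cpp
cd gemma.cpp
cmake -B build \
  -DCMAKE_C_COMPILER=cosmocc \
  -DCMAKE_CXX_COMPILER=cosmoc++
cmake --build build -j8 --target gemma
cp build/gemma gemmafile
```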
Now that that's done, you need to agree to their EULA and download the gemma-7b-it-sfp weights from their Kaggle page: https://www.kaggle.com/models/google/gemma. Once you untar them, you can open up a chatbot CLI interface as follows:
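For instance (the tokenizer.spm and 7b-it-sfp.sbs filenames follow gemma.cpp's documented naming, but treat them as assumptions and use whatever your tarball actually contains):

```sh
# Sketch: launch the gemma.cpp chat CLI against the downloaded weights.
./gemmafile \
  --tokenizer tokenizer.spm \
  --compressed_weights 7b-it-sfp.sbs \
  --model 7b-it
```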
If you want to combine them into a single file, then you can use llamafile's zipalign program:
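A sketch (zipalign's -j0 flag stores the files uncompressed, which is how llamafile normally packs large weights; the filenames are the ones assumed above):

```sh
zipalign -j0 gemmafile 7b-it-sfp.sbs tokenizer.spm
```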
Then create a .args file that defines the default command-line arguments:
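For example, one argument per line (this layout mirrors llamafile's .args convention; the flags are gemma.cpp's and the filenames are the ones assumed above):

```
--tokenizer
tokenizer.spm
--compressed_weights
7b-it-sfp.sbs
--model
7b-it
```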
And add that file to the zip too:
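Again with zipalign (sketch):

```sh
zipalign -j0 gemmafile .args
```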
Now you're done. You should now be able to just run:
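That is, assuming the binary still has its executable bit set:

```sh
chmod +x gemmafile   # only if needed
./gemmafile
```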
And your gemmafile chatbot CLI interface will pop up. Enjoy!

@theonewolf Hugging Face is the official llamafile zoo. Just upload your llamafiles there.
Can someone please upload their gemmafile to huggingface?
+1, I would like it very much!
...which is allegedly better than Llama 2 13B (and Mistral 7B)!
Source: https://blog.google/technology/developers/gemma-open-models/