[Request] Create llamafile for Gemma 7B #269
Comments
We're very excited about the new release too. This is being worked on.
Eagerly waiting for it, as the latest build of llama.cpp can run the GGUF version of quantized Gemma, though it produces very bad quality outputs. Hopefully the awaited PR merge will make things better. As people have pointed out, this issue is probably responsible for the bad output.
@jart Gemma.cpp is out: https://github.com/google/gemma.cpp
Thanks for pointing out that project @dartharva. They have an elegant codebase. Gemma 7b is able to solve riddles that I've only seen Mixtral solve, if you have an AVX512 machine: google/gemma.cpp#23
I am very excited for this as well. @jart thank you so much for making such a valuable project for the community and, importantly, maintaining it with efforts like this! Perhaps we could create a llamafile zoo repository and accept llamafiles from the community to crowdsource creating these!
I can help in creating a llamafile for Gemma (both 2B and 7B), if it's needed.
I have good news and more good news re: Gemma. In this tutorial, you'll learn (1) how to run Gemma with llamafile, and (2) how to build the official gemma.cpp project as a cross-platform "gemmafile" that you can run on six OSes and two architectures, similar to llamafile.

(1) Running Gemma with llamafile

Having finished synchronizing upstream, I'm now confident that llamafile is able to offer you a first-class Gemma 7b experience. The screenshot above shows llamafile running gemma-7b-it on my Radeon 7900 XTX.

Gemma appears to be a model you can put to work too. In the screenshot above, I asked Gemma to summarize an old rant from USENET with 3,779 words. Gemma's reading speed on my GPU was ~2,000 tokens per second. It successfully produced a summary. The summary is good, but a bit on the enterprisey side compared to Mistral, which tends to be more focused and succinct.

To get started using llamafile with Gemma, you can either wait for the next llamafile release, or you can build llamafile from source today by running the following commands on your preferred OS, which must have a Bourne-compatible shell.
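Roughly, the source build looks like this (a sketch assuming the Mozilla-Ocho/llamafile repository layout; check the project README for the authoritative steps):

```sh
# Sketch: build the llamafile tools from source. The make targets below are
# assumptions based on the repo README; cosmocc may need to be on your PATH.
git clone https://github.com/Mozilla-Ocho/llamafile.git
cd llamafile
make -j8
sudo make install PREFIX=/usr/local   # optional: puts llamafile on your PATH
```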
Next, go to the Gemma Kaggle page and agree to the EULA for Gemma. Even though it's Kaggle, agreeing to the EULA there will also grant you access to the Hugging Face repo at https://huggingface.co/google/gemma-7b-it/, which is where you need to go to download the GGUF file. Do not be discouraged by the 34.2 GB file size; these are f32 weights. After you've downloaded them, you can use the following command to turn them into a 5.8 GB file:
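Something along these lines should do it (a sketch: the llamafile-quantize tool name and the gemma-7b-it.gguf input filename are assumptions; llama.cpp's quantize takes the same arguments):

```sh
# Sketch: requantize the f32 GGUF down to Q5_K_M, the type used below.
llamafile-quantize gemma-7b-it.gguf ~/weights/gemma-7b-it-Q5_K_M.gguf Q5_K_M
```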
Now, to use Gemma Instruct, you need to follow its specific prompt format. Here's an example of how to do that on the command line. First, we'll define a bash function named gemma:

```sh
gemma() {
  llamafile -m ~/weights/gemma-7b-it-Q5_K_M.gguf \
    -e -p "<bos><start_of_turn>user\n$*<end_of_turn>\n<start_of_turn>model\n" \
    --temp 0 -ngl 999 -c 0
}
```

You can then ask gemma questions or give it instructions as follows:
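For example (the question itself is just an illustration):

```sh
gemma "Why is the sky blue?"
```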
(2) Building gemma.cpp as a gemmafile

While llamafile is able to run the Gemma model, it doesn't run it exactly the same as the gemma.cpp project that was written by Google themselves. For example, one thing I've noticed is that gemma.cpp usually gives much more succinct answers to questions, which means it appears to share even more of Mistral's strengths than I mentioned earlier.

The people who wrote gemma.cpp are also prolific, e.g. Jan Wassenberg, who coded the world's fastest sorting function. He brought his SIMD expertise to Gemma and made the project go as fast on CPU as it's possible for Gemma to go on CPU, faster even than llamafile. Please note, however, that CPU goes much slower than GPU (about ~11 tokens per second), although having good CPU support has many benefits, such as longevity, full support on all OSes, and not requiring non-libre code to be linked into the address space.

The good news is that gemma.cpp supports Cosmopolitan Libc out of the box. I didn't need to do anything to make this happen.

First, you need to install cosmocc v3.3.2 or newer, available here: https://github.com/jart/cosmopolitan/releases/tag/3.3.2
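Installing it is just a matter of unzipping the release somewhere on your PATH (a sketch; the cosmocc-3.3.2.zip asset name is an assumption based on the release tag):

```sh
# Sketch: install the cosmocc toolchain into ~/cosmocc and put it on PATH.
mkdir -p ~/cosmocc && cd ~/cosmocc
wget https://github.com/jart/cosmopolitan/releases/download/3.3.2/cosmocc-3.3.2.zip
unzip cosmocc-3.3.2.zip
export PATH="$HOME/cosmocc/bin:$PATH"
```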
Now that you've installed cosmocc, you can configure gemma.cpp's cmake build system to build your gemmafile as follows:
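Here's a sketch of what that configure/build step can look like (the cosmocc/cosmoc++ wrapper names are standard for the toolchain, but the exact CMake options and the gemma target name are assumptions; adjust to gemma.cpp's current build docs):

```sh
# Sketch: build gemma.cpp with the Cosmopolitan toolchain to get an
# Actually Portable Executable, then call it "gemmafile".
git clone https://github.com/google/gemma.cpp
cd gemma.cpp
cmake -B build \
  -DCMAKE_C_COMPILER=cosmocc \
  -DCMAKE_CXX_COMPILER=cosmoc++
cmake --build build -j8 --target gemma
cp build/gemma gemmafile
```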
Now that that's done, you need to agree to their EULA and download the gemma-7b-it-sfp weights from their Kaggle page: https://www.kaggle.com/models/google/gemma. Once you untar them, you can open up a chatbot CLI interface as follows:
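For instance (the tokenizer.spm and 7b-it-sfp.sbs filenames follow gemma.cpp's documented naming, but treat them as assumptions and use whatever your tarball actually contains):

```sh
# Sketch: launch the gemma.cpp chat CLI against the downloaded weights.
./gemmafile \
  --tokenizer tokenizer.spm \
  --compressed_weights 7b-it-sfp.sbs \
  --model 7b-it
```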
If you want to combine them into a single file, then you can use llamafile's zipalign program:
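A sketch (zipalign's -j0 flag stores the files uncompressed, which is how llamafile normally packs large weights; the filenames are the ones assumed above):

```sh
zipalign -j0 gemmafile 7b-it-sfp.sbs tokenizer.spm
```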
Then create a .args file that defines the default command-line arguments:
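For example, one argument per line (this layout mirrors llamafile's .args convention; the flags are gemma.cpp's and the filenames are the ones assumed above):

```
--tokenizer
tokenizer.spm
--compressed_weights
7b-it-sfp.sbs
--model
7b-it
```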
And add that file to the zip too:
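Again with zipalign (sketch):

```sh
zipalign -j0 gemmafile .args
```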
Now you're done. You should now be able to just run:
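That is, assuming the binary still has its executable bit set:

```sh
chmod +x gemmafile   # only if needed
./gemmafile
```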
And your gemmafile chatbot CLI interface will pop up. Enjoy!

@theonewolf Hugging Face is the official llamafile zoo. Just upload your llamafiles there.
Can someone please upload their gemmafile to huggingface?
+1, I would like it very much!
...which is allegedly better than Llama 2 13B (and Mistral 7B)!
Source: https://blog.google/technology/developers/gemma-open-models/