llama.cpp allocates way more RAM than ollama #9414
-
Hi, I want to say up front that I'm just an end user who wants the Vulkan/AVX features that ollama doesn't have. I compiled llama.cpp and tried to run https://huggingface.co/bartowski/LongWriter-llama3.1-8b-GGUF/blob/main/LongWriter-llama3.1-8b-IQ3_XS.gguf. On the CPU it did start (although I think it ran slower than ollama; to be fair, I'm comparing against https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, which might use less RAM due to the safetensors format?), and it used almost all of the available RAM. On the GPU, Vulkan was unable to allocate memory. I noticed that the K and V caches are 8 GB each. Does that mean I need more (V)RAM? Setup:
-
By default, llama.cpp allocates the KV cache for the model's full training context (131072 tokens for Llama 3.1), which is why the K and V buffers are so large. Pass a smaller context size with -c / --ctx-size to shrink the allocation.
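As a rough sanity check (assuming the default f16 KV cache and the usual Llama 3.1 8B shape: 32 layers, 8 KV heads, head dim 128), the per-tensor cache size at the full 131072-token context works out to exactly the figure you saw:

32 layers x 131072 tokens x (8 KV heads x 128 head dim) x 2 bytes (f16) = 8,589,934,592 bytes ≈ 8 GiB for K, and the same again for V.

With -c 8192 the same calculation gives about 512 MiB per tensor.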
-
On my 16 GB M1 Pro MacBook, after adding -c 8192 I was able to run the 8B Q4 model. Otherwise I get the "cannot allocate memory" error.
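For reference, a minimal invocation along those lines (the model path, layer count, and prompt here are just placeholder examples, adjust them for your setup):

./llama-cli -m LongWriter-llama3.1-8b-IQ3_XS.gguf -c 8192 -ngl 33 -p "Hello"

-c caps the KV cache at 8192 tokens, and -ngl controls how many layers are offloaded to the GPU; lowering it is another way to fit into limited VRAM.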
-
You might want to use a K-quant (Q3_K or Q4_K) for now. IQ quants are not yet supported in Vulkan and probably fall back to CPU. IQ2 and IQ3 support is being worked on (#11360), but it will take a little time until they are optimized to a similar level as the other quants.
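If you want to try that with the same model, something like the following should work (the exact Q4_K_M filename is an assumption on my part; check the repo's file list):

huggingface-cli download bartowski/LongWriter-llama3.1-8b-GGUF LongWriter-llama3.1-8b-Q4_K_M.gguf --local-dir .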