llama.cpp allocates way more RAM than ollama #9414
-
Hi, I want to say up front that I'm just an end user who wants the Vulkan/AVX features that ollama doesn't have. I compiled llama.cpp and tried to run https://huggingface.co/bartowski/LongWriter-llama3.1-8b-GGUF/blob/main/LongWriter-llama3.1-8b-IQ3_XS.gguf. On the CPU it did start (although I think it ran slower than ollama; to be fair, I'm comparing against https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct, which might use less RAM due to the safetensors format?), and it used almost all of the available RAM. On the GPU, Vulkan was unable to allocate memory. I noticed that the K and V caches are 8 GB each. Does that mean I need more (V)RAM? Setup:
-
By default, llama.cpp allocates the KV cache for the model's full training context (131072 tokens for Llama 3.1), which is why the K and V buffers are so large. Pass a smaller context size with -c / --ctx-size to shrink the allocation.
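As a rough sanity check (assuming the default f16 KV cache and the usual Llama 3.1 8B shape: 32 layers, 8 KV heads, head dim 128), the per-tensor cache size at the full 131072-token context works out to exactly the figure you saw:

32 layers x 131072 tokens x (8 KV heads x 128 head dim) x 2 bytes (f16) = 8,589,934,592 bytes ≈ 8 GiB for K, and the same again for V.

With -c 8192 the same calculation gives about 512 MiB per tensor.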
-
On my 16 GB M1 Pro MacBook, after adding -c 8192 I was able to run the 8B Q4 model. Otherwise I get the "cannot allocate memory" error.
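For reference, a minimal invocation along those lines (the model path, layer count, and prompt here are just placeholder examples, adjust them for your setup):

./llama-cli -m LongWriter-llama3.1-8b-IQ3_XS.gguf -c 8192 -ngl 33 -p "Hello"

-c caps the KV cache at 8192 tokens, and -ngl controls how many layers are offloaded to the GPU; lowering it is another way to fit into limited VRAM.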
-
You might want to use a K-quant (Q3_K or Q4_K) for now. IQ quants are not yet supported in Vulkan and probably fall back to CPU. IQ2 and IQ3 support is being worked on (#11360), but it will take a little time until they are optimized to a similar level as the other quants.
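If you want to try that with the same model, something like the following should work (the exact Q4_K_M filename is an assumption on my part; check the repo's file list):

huggingface-cli download bartowski/LongWriter-llama3.1-8b-GGUF LongWriter-llama3.1-8b-Q4_K_M.gguf --local-dir .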