kubernetes example #6546
Comments
Hi! I will take this up!
Great @OmegAshEnr01n, a few notes:
Ping here if you have questions, good luck! Excited to use it.
Hi @OmegAshEnr01n, are you still working on this issue?
Yes, I still am. I will share a pull request over the weekend when it is completed.
Hi @phymbert, what is the architectural reason for having embeddings live on a separate deployment from the model? Requiring that would mean we would need to make changes to the HTTP server. Instead, we could have an architecture where the model and embeddings are tightly coupled. Something like this
On another note, what is the intended use of Prometheus? Do you need it to live alongside the Helm chart or within it as a subchart? I don't see the value in adding Prometheus as a subchart. Perhaps you can share your view on it as well.
Embedding models are different from generative ones. In a RAG setup you need two models. Prometheus is not required, but if it is present, metrics are exported.
OK, just to clarify: server.cpp has a route for requesting embeddings, but the existing server code doesn't include an option to send embeddings for completions. That will need to be written before the Helm chart can be completed. Kindly correct me if I'm wrong.
Embeddings are meant to be stored in a vector DB for search. They have nothing to do with completions, except for RAG later on.
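The "store embeddings in a vector DB, then search" flow mentioned above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical letter-frequency placeholder standing in for a call to a real embeddings deployment, and `VectorStore` is a hypothetical in-memory stand-in for an actual vector database. Neither comes from llama.cpp or the chart discussed here.

```python
import math


def toy_embed(text):
    """Placeholder embedding: a 26-dim letter-frequency vector.

    Assumption: in a real setup this would be a request to the
    embeddings model, not a local function.
    """
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (text, embedding) pairs

    def add(self, text):
        self.items.append((text, toy_embed(text)))

    def search(self, query, k=1):
        """Return the k stored texts most similar to the query."""
        q = toy_embed(query)
        ranked = sorted(self.items, key=lambda it: cosine(q, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = VectorStore()
store.add("kubernetes helm chart")
store.add("banana smoothie recipe")
top_hit = store.search("helm chart", k=1)
print(top_hit)
```

In a RAG pipeline the retrieved text would then be placed into the prompt sent to the separate generative model, which is why the two models can be deployed independently.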
@OmegAshEnr01n Sir, is the chart ready for production? 🚀🚀🚀🚀
Not yet. Currently testing it on a personal kube cluster with separate node selectors.
@phymbert The project https://github.com/distantmagic/paddler argues in its README.md that simple round-robin load-balancing is not suitable for llama.cpp:
From your experience with your k8s example, is the k8s Service load balancing enough, or would you find it necessary to use a "slot aware" load balancer? /cc @mcharytoniuk
Thanks for the mention. I maintain that point. Of course round robin will work. "Least connections" will be better (though it does not have to reflect how many slots are being used), but the issue is that prompts can take a long, varying time to finish. With round robin it is very possible to distribute the load unevenly (for example, if one of the servers was unlucky and is still processing a few huge prompts). To me the ideal is balancing based on slots, with a request queue on top of that (which I plan to add to paddler btw :)). I love the slots idea because they make the infra really predictable.
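The difference between round robin and slot-based balancing can be sketched in Python. This is a minimal illustration, not paddler's implementation: `Backend`, `pick_round_robin`, and `pick_slot_aware` are hypothetical names, and the slot counts are made-up values that a real balancer would have to read from each server.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of one llama.cpp server instance."""
    name: str
    slots_total: int
    slots_busy: int = 0

    @property
    def slots_idle(self):
        return self.slots_total - self.slots_busy


def pick_round_robin(backends, counter):
    """Classic round robin: ignores how busy each backend is."""
    return backends[counter % len(backends)]


def pick_slot_aware(backends):
    """Pick the backend with the most idle slots.

    Returns None when every slot is busy, which is the point where
    a request queue in front of the balancer would hold the request.
    """
    best = max(backends, key=lambda b: b.slots_idle)
    return best if best.slots_idle > 0 else None


backends = [
    Backend("pod-a", slots_total=4, slots_busy=4),  # stuck on long prompts
    Backend("pod-b", slots_total=4, slots_busy=1),
]

rr_choice = pick_round_robin(backends, counter=0)
slot_choice = pick_slot_aware(backends)
print(rr_choice.name, slot_choice.name)
```

Here round robin still sends the next request to the saturated backend, while the slot-aware choice goes to the one with free capacity.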
Firstly, it's better to use the native llama.cpp KV cache, so if you have k8s nodes with 2-4 A/H100, having one pod per node using all the VRAM and as many slots/cache as possible for the server will give you the maximum performance, but not HA. Maybe an interesting approach would be to prioritize upfront based on input token size. Nonetheless, you cannot predict the output token size. I mainly faced issues with long-lived HTTP connections; IMHO we need a better architecture for this than SSE.
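One way to read "prioritize upfront based on input token size" is a priority queue keyed on prompt length. The sketch below assumes a shortest-first policy and uses a whitespace split as a crude stand-in for a real tokenizer; both are assumptions, and `enqueue` and `dequeue` are hypothetical helpers. As noted above, the output length is unknown at this point, so this only orders requests by what can be measured on arrival.

```python
import heapq
import itertools

# Tie-breaker so prompts of equal size keep their arrival order.
_arrival = itertools.count()


def enqueue(queue, prompt):
    """Push a prompt, prioritized by its approximate input token count."""
    # Assumption: whitespace split approximates the tokenizer.
    n_tokens = len(prompt.split())
    heapq.heappush(queue, (n_tokens, next(_arrival), prompt))


def dequeue(queue):
    """Pop the prompt with the fewest input tokens."""
    _, _, prompt = heapq.heappop(queue)
    return prompt


queue = []
for p in [
    "summarize this very long document please",
    "hi",
    "translate this sentence",
]:
    enqueue(queue, p)

order = [dequeue(queue) for _ in range(3)]
print(order)
```

A strict shortest-first policy can starve long prompts under sustained load, so a real scheduler would likely need some form of aging or a cap on waiting time.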
@phymbert I've made a pull request.
The PR is on my fork: We need to bring it here somehow |
Hope to meet soon |
Motivation
Kubernetes is widely used in the industry to deploy products and applications at scale.
It can be useful for the community to have a llama.cpp helm chart for the server.
I started on this several weeks ago and will continue when I have more time; meanwhile, any help is welcome:
https://github.com/phymbert/llama.cpp/tree/example/kubernetes/examples/kubernetes