From 8b350356b28f782deab63d8b0e9ae103ceb25fcd Mon Sep 17 00:00:00 2001 From: Pierrick Hymbert Date: Sun, 25 Feb 2024 21:46:29 +0100 Subject: [PATCH] server: docs - refresh and tease a little bit more the http server (#5718) * server: docs - refresh and tease a little bit more the http server * Rephrase README.md server doc Co-authored-by: Georgi Gerganov * Update examples/server/README.md Co-authored-by: Georgi Gerganov * Update examples/server/README.md Co-authored-by: Georgi Gerganov * Update README.md --------- Co-authored-by: Georgi Gerganov --- README.md | 3 +++ examples/server/README.md | 18 +++++++++++++++--- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index d61f9171b1b62..d0af5d0b9b077 100644 --- a/README.md +++ b/README.md @@ -114,6 +114,9 @@ Typically finetunes of the base models below are supported as well. - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM) - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL) +**HTTP server** + +[llama.cpp web server](./examples/server) is a lightweight [OpenAI API](https://github.com/openai/openai-openapi) compatible HTTP server that can be used to serve local models and easily connect them to existing clients. **Bindings:** diff --git a/examples/server/README.md b/examples/server/README.md index cb3fd6054095b..0e9bd7fd404ba 100644 --- a/examples/server/README.md +++ b/examples/server/README.md @@ -1,8 +1,20 @@ -# llama.cpp/example/server +# LLaMA.cpp HTTP Server -This example demonstrates a simple HTTP API server and a simple web front end to interact with llama.cpp. +Fast, lightweight, pure C/C++ HTTP server based on [httplib](https://github.com/yhirose/cpp-httplib), [nlohmann::json](https://github.com/nlohmann/json) and **llama.cpp**. -Command line options: +Set of LLM REST APIs and a simple web front end to interact with llama.cpp. + +**Features:** + * LLM inference of F16 and quantum models on GPU and CPU + * [OpenAI API](https://github.com/openai/openai-openapi) compatible chat completions and embeddings routes + * Parallel decoding with multi-user support + * Continuous batching + * Multimodal (wip) + * Monitoring endpoints + +The project is under active development, and we are [looking for feedback and contributors](https://github.com/ggerganov/llama.cpp/issues/4216). + +**Command line options:** - `--threads N`, `-t N`: Set the number of threads to use during generation. - `-tb N, --threads-batch N`: Set the number of threads to use during batch and prompt processing. If not specified, the number of threads will be set to the number of threads used for generation.