Include in Readme how to Pass Custom Arguments to llama_cpp.server in Docker #1029

Open
jaredquekjz opened this issue Dec 19, 2023 · 6 comments
Labels: documentation (Improvements or additions to documentation), question (Further information is requested)

Comments

@jaredquekjz

Title:

Issue with Passing Custom Arguments to llama_cpp.server in Docker

Issue Description:

Hello abetlen,

I've been trying to use your Docker image ghcr.io/abetlen/llama-cpp-python:v0.2.24 for llama_cpp.server, and I encountered some difficulties when attempting to pass custom arguments (--n_gpu_layers 81, --chat_format chatml, --use_mlock False) to the server through Docker.

Steps to Reproduce:

  1. Pull the Docker image: docker pull ghcr.io/abetlen/llama-cpp-python:v0.2.24

  2. Run the container with custom arguments:

    docker run --rm -it -p 8000:8000 \
      -v /home/jaredquek/text-generation-webui/models:/models \
      -e MODEL=/models/tulu-2-dpo-70b.Q5_K_M.gguf \
      --entrypoint uvicorn \
      ghcr.io/abetlen/llama-cpp-python:v0.2.24 \
      --factory llama_cpp.server.app:create_app --host 0.0.0.0 --port 8000 --n_gpu_layers 81 --chat_format chatml --use_mlock False

    This results in an error: Error: No such option: --n_gpu_layers.

Expected Behavior:

I expected to be able to pass these arguments to the llama_cpp.server application inside the Docker container.

Actual Behavior:

The uvicorn command does not recognize these arguments as it's designed for the ASGI server, not the llama_cpp.server application.
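
A minimal sketch of why the extra flags are rejected, using Python's argparse as a stand-in for the launcher's own option parser. This is not uvicorn's actual parser, and the option set below is a hypothetical subset. The point is only that a launcher knows its own options, so anything meant for the application behind --factory is reported as unknown instead of being forwarded.

```python
import argparse

# Stand-in for an ASGI launcher's CLI: it only knows launcher-level options.
# (Hypothetical subset for illustration; not uvicorn's real parser.)
launcher = argparse.ArgumentParser(prog="asgi-launcher")
launcher.add_argument("app")
launcher.add_argument("--factory", action="store_true")
launcher.add_argument("--host", default="127.0.0.1")
launcher.add_argument("--port", type=int, default=8000)

# Same shape as the arguments in the failing docker run command.
argv = [
    "--factory", "llama_cpp.server.app:create_app",
    "--host", "0.0.0.0", "--port", "8000",
    "--n_gpu_layers", "81", "--chat_format", "chatml",
]

# parse_known_args separates what the launcher understands from the rest.
known, unknown = launcher.parse_known_args(argv)
print("launcher options:", vars(known))
print("not understood by the launcher:", unknown)
```

A strict parser would stop with an error at the first unknown option, which matches the "No such option" message reported above.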

Potential Solutions:

  • Modify the Dockerfile or application configuration to accept these arguments.
  • Provide guidance in Readme on how to correctly pass additional arguments or configure the server with these settings.

I would appreciate any assistance or guidance you could provide on this issue.

Thank you for your time and for maintaining this project.

Best regards.

@3x3cut0r

you could try my container:
https://hub.docker.com/r/3x3cut0r/llama-cpp-python

I exposed all supported options as environment variables.
Let me know what you think, and please report any bugs.

@jaredquekjz
Author

jaredquekjz commented Dec 20, 2023

Thanks for your attention. I tried the Docker image, but the GPU isn't being activated even though the uvicorn server starts. This is my docker run command:

    --name llama-cpp-python \
    --cap-add SYS_RESOURCE \
    -v /home/jaredquek/text-generation-webui/models:/models \
    -p 8000:8000/tcp \
    3x3cut0r/llama-cpp-python:latest \
    --model /models/tulu-2-dpo-70b.Q5_K_M.gguf \
    --n_gpu_layers 81 \
    --chat_format chatml \
    --use_mlock False

Does the Docker image use CUDA acceleration by default, or do I have to do something else? Also, do you know which parameter to adjust if I want to handle many concurrent requests through the server? I understand that for the llama.cpp server it's done by ngl: ggerganov/llama.cpp#3228. Thanks for your advice!

@3x3cut0r

Unfortunately, this alpine based image is built with these CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DLLAMA_AVX=OFF -DLLAMA_AVX2=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF".
This means it is optimized for older CPUs and GPU support is deactivated.
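
To make the flag list easier to read, here is a small sketch that parses the CMAKE_ARGS string quoted above into name/value pairs. The helper name parse_cmake_defines is made up for this example, and the check for LLAMA_CUBLAS assumes that was the CUDA switch in llama.cpp builds of that period.

```python
import shlex

# CMAKE_ARGS string quoted from the comment above.
cmake_args = (
    "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS -DLLAMA_AVX=OFF "
    "-DLLAMA_AVX2=OFF -DLLAMA_F16C=OFF -DLLAMA_FMA=OFF"
)

def parse_cmake_defines(args):
    """Turn '-DNAME=VALUE ...' into {'NAME': 'VALUE', ...} (hypothetical helper)."""
    defines = {}
    for token in shlex.split(args):
        if token.startswith("-D") and "=" in token:
            name, value = token[2:].split("=", 1)
            defines[name] = value
    return defines

defines = parse_cmake_defines(cmake_args)
disabled = [name for name, value in defines.items() if value == "OFF"]

# Assumption: LLAMA_CUBLAS=ON would have been needed for a CUDA build.
cuda_requested = defines.get("LLAMA_CUBLAS", "OFF").upper() == "ON"

print("disabled CPU features:", disabled)
print("CUDA requested:", cuda_requested)
```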

An image optimized for GPUs with CUDA would need a different base image anyway. Unfortunately I don't have an Nvidia GPU myself and therefore can't test or deploy anything. But maybe I can look into activating GPU support (without CUDA). Then llama-cpp-python would need to be recompiled after image creation, or I could publish it under another tag. I need to think about that.

I have also not yet dealt with your question about parallel requests, but I would be very interested in that too.

Sorry

@jaredquekjz
Author

jaredquekjz commented Dec 21, 2023

Thanks! Perhaps that's true of abetlen's original image too, i.e. it isn't built for CUDA? I'll have to look into it more deeply. I have already managed to get the non-Docker version of the server working. However, I prefer the stability of Docker, and I still need to find out about parallel requests.

@abetlen
Owner

abetlen commented Dec 22, 2023

@jaredquekjz there are really two options:

  1. Use environment variables instead of CLI args. This is the slightly more idiomatic solution for containers, and every CLI argument has a corresponding environment variable, so --n_gpu_layers is equivalent to N_GPU_LAYERS.
  2. Change your entrypoint to python in the docker command and run with -m llama_cpp.server followed by the CLI args. This should allow you to pass either CLI or environment variable arguments.

The benefit of using the default entrypoint and environment variables with the official image is that it includes a compiler and will rebuild the image for whatever CPU architecture you deploy it to, ensuring that it's as fast as or faster than pre-built binaries.
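
The two options can be sketched as follows. This is a rough illustration that only assembles and prints the two docker run commands. It reuses the image tag, volume path, and settings from the original report, assumes the upper-case flag-to-variable mapping described above, and has not been run against the image. The function names are made up for this example.

```python
import shlex

# Values taken from the original report in this thread.
IMAGE = "ghcr.io/abetlen/llama-cpp-python:v0.2.24"
MODELS_DIR = "/home/jaredquek/text-generation-webui/models"
SETTINGS = {
    "model": "/models/tulu-2-dpo-70b.Q5_K_M.gguf",
    "n_gpu_layers": "81",
    "chat_format": "chatml",
    "use_mlock": "False",
}
BASE = ["docker", "run", "--rm", "-it", "-p", "8000:8000",
        "-v", f"{MODELS_DIR}:/models"]

def with_env_vars(settings):
    """Option 1: keep the default entrypoint, pass each setting as -e NAME=value."""
    env = []
    for name, value in settings.items():
        # Assumed mapping: --n_gpu_layers corresponds to N_GPU_LAYERS.
        env += ["-e", f"{name.upper()}={value}"]
    return BASE + env + [IMAGE]

def with_python_entrypoint(settings):
    """Option 2: override the entrypoint and pass the settings as CLI flags."""
    flags = []
    for name, value in settings.items():
        flags += [f"--{name}", value]
    return BASE + ["--entrypoint", "python", IMAGE, "-m", "llama_cpp.server"] + flags

print(shlex.join(with_env_vars(SETTINGS)))
print(shlex.join(with_python_entrypoint(SETTINGS)))
```

Note that in option 2 everything after the image name is handed to python, so the server flags must come after -m llama_cpp.server and not before the image.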

@abetlen abetlen added documentation Improvements or additions to documentation question Further information is requested labels Dec 22, 2023
@maziyarpanahi

This is pretty cool. Can all the server arguments be set via environment variables (all capitalized)?
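
On the capitalization question, here is a minimal sketch of how such a mapping could work, assuming the server reads its settings through a settings class that matches environment variable names to field names case-insensitively. That is an assumption about the server, not something confirmed in this thread, and the loader below is a standard-library stand-in, not llama-cpp-python's code. DemoSettings and load_from_env are hypothetical names.

```python
from dataclasses import dataclass, fields

@dataclass
class DemoSettings:
    # Hypothetical subset of server settings, for illustration only.
    model: str = ""
    n_gpu_layers: int = 0
    chat_format: str = "llama-2"

def load_from_env(environ):
    """Fill DemoSettings from a mapping of environment variables, ignoring case."""
    lowered = {key.lower(): value for key, value in environ.items()}
    values = {}
    for field in fields(DemoSettings):
        if field.name in lowered:
            # Convert the raw string using the type of the field's default value.
            values[field.name] = type(field.default)(lowered[field.name])
    return DemoSettings(**values)

settings = load_from_env({
    "MODEL": "/models/tulu-2-dpo-70b.Q5_K_M.gguf",
    "N_GPU_LAYERS": "81",
})
print(settings)
```

Under that assumption, upper case is a convention for environment variables, and each setting name maps to exactly one variable name.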

4 participants