How to talk to llamafile's OpenAI API endpoint #53
Here's an example of how you can talk to the OpenAI Completions API provided by your llamafile server. Note: due to a bug in the most recent 0.2.1 release, this example will currently only work if you build llamafile-server at HEAD. You can do that by downloading the cosmocc compiler and putting it on your $PATH as discussed in the README, then running the build.
That produces the server program, which you then run against your model weights.
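As a rough sketch (these exact commands are assumptions rather than quotes from the original post; the make invocation may differ, and the server path and model file match the invocation shown later in this thread):

make -j8                                                   # build llamafile-server at HEAD, with cosmocc on your $PATH
o/llama.cpp/server/server -m mistral-7b-instruct-v0.1.Q4_K_M.gguf   # launch the server on localhost:8080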
You now have a llamafile server running on localhost port 8080, and you can use its completions API. Here is the quickstart tutorial example that OpenAI provides at https://platform.openai.com/docs/quickstart:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'

You could put that in a shell script, for example, and you'll see a chat completion response come back.
You now have your response JSON. It's not very readable on the shell; it's assumed you'd be using your programming language of choice, e.g. Python, with its appropriate HTTP and JSON libraries (or some higher-level OpenAI client library veneer) to do the actual talking to the server. This concludes the tutorial. Thanks for reaching out, and enjoy llamafile!
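(Not part of the original reply, just a minimal illustration assuming python3 is on your PATH: you can pipe the response through a JSON pretty-printer to make it readable on the shell.)

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}' \
  | python3 -m json.tool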
Thank you for the prompt reply. I've tried the above, but the server crashes:

{"timestamp":1701667611,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
I tried this as well, using Mistral weights:

o/llama.cpp/server/server \
-m mistral-7b-instruct-v0.1.Q4_K_M.gguf
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
loading weights...
{"timestamp":1701671568,"level":"INFO","function":"main","line":3045,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62091,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62092,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}} If I try to run the cURL request: curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'

It just sort of hangs there; I don't get any response. If I go over to the llamafile server tab, I see the following output:

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
This works:

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "stop": null, "messages": [{ "role": "user", "content": "tell me history of canada" }] }'
The issue now is CORS, if you want to interact with it programmatically.
@Maxamed As mentioned earlier, that
I am also getting a CORS error if I send a fetch request to
If you can tell me what header we need to add to the server to fix the CORS problem, then I'm happy to add it to the codebase. Thanks!
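(Not from the original thread, just a sketch of how to check, assuming the server is on port 8080: the header a browser looks for is Access-Control-Allow-Origin, and you can inspect what the server currently returns with a preflight-style request.)

curl -i -X OPTIONS http://localhost:8080/v1/chat/completions \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: Content-Type"

If Access-Control-Allow-Origin (and, for the preflight, Access-Control-Allow-Methods and Access-Control-Allow-Headers) is missing from the response headers, browser fetch calls will be blocked even though curl works.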
The context shift issue @mneedham saw originates from this: ggml-org/llama.cpp#3969. There is nothing to do on the llamafile side. I just got this error today.
How do I connect to it using the API? I've installed it and it works great, but I want to connect to it using the API.