How to talk to llamafile's OpenAI API endpoint #53
Here's an example of how you can talk to the OpenAI Completions API provided by your llamafile server. Note: due to a bug in the most recent 0.2.1 release, this example will currently only work if you build llamafile-server at HEAD. You can do that by downloading the cosmocc compiler and putting it on your $PATH as discussed in the README, then running the build.
That produces the server program, which you then run against your model weights.
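As a rough sketch (these exact commands are assumptions rather than quotes from the original post; the make invocation may differ, and the server path and model file match the invocation shown later in this thread):

make -j8                                                   # build llamafile-server at HEAD, with cosmocc on your $PATH
o/llama.cpp/server/server -m mistral-7b-instruct-v0.1.Q4_K_M.gguf   # launch the server on localhost:8080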
You now have a llamafile server running on localhost port 8080, and you can use its completions API. Here is the quickstart tutorial example that OpenAI provides at https://platform.openai.com/docs/quickstart:

curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'

You could put that in a shell script, for example, and you'll see a chat completion response come back.
You now have your response JSON. It's not very readable on the shell; it's assumed you'd be using your programming language of choice, e.g. Python, with its appropriate HTTP and JSON libraries (or some higher-level OpenAI client library veneer) to do the actual talking to the server. This concludes the tutorial. Thanks for reaching out, and enjoy llamafile!
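(Not part of the original reply, just a minimal illustration assuming python3 is on your PATH: you can pipe the response through a JSON pretty-printer to make it readable on the shell.)

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}' \
  | python3 -m json.tool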
Thank you for the prompt reply. I've tried the above, but the server crashes:

{"timestamp":1701667611,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
I tried this as well, using Mistral weights:

o/llama.cpp/server/server \
-m mistral-7b-instruct-v0.1.Q4_K_M.gguf
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
loading weights...
{"timestamp":1701671568,"level":"INFO","function":"main","line":3045,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62091,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62092,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}} If I try to run the cURL request: curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'

It just sort of hangs there; I don't get any response. If I go over to the llamafile server tab, I see the following output:

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
This works:

curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "stop": null, "messages": [{ "role": "user", "content": "tell me history of canada" }] }'
The issue now is CORS, if you want to interact with it programmatically.
@Maxamed As mentioned earlier, that
I am also getting a CORS error if I send a fetch request to
If you can tell me what header we need to add to the server to fix the CORS problem, then I'm happy to add it to the codebase. Thanks!
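(Not from the original thread, just a sketch of how to check, assuming the server is on port 8080: the header a browser looks for is Access-Control-Allow-Origin, and you can inspect what the server currently returns with a preflight-style request.)

curl -i -X OPTIONS http://localhost:8080/v1/chat/completions \
  -H "Origin: http://localhost:3000" \
  -H "Access-Control-Request-Method: POST" \
  -H "Access-Control-Request-Headers: Content-Type"

If Access-Control-Allow-Origin (and, for the preflight, Access-Control-Allow-Methods and Access-Control-Allow-Headers) is missing from the response headers, browser fetch calls will be blocked even though curl works.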
The context shift issue @mneedham saw originates from this: ggml-org/llama.cpp#3969. There is nothing to do on the llamafile side. I just got this error today.
How do I connect to it using the API? I've installed it and it works great, but I want to connect to it using the API.