Replies: 6 comments
-
We would love to, but we need concurrency on CPU-based infrastructure to open it to a group of 10-15 with the benefits of cost, security, and ease.
-
Integration with jupyter-ai, a JupyterLab extension that integrates a code assistant, would be great. It already has Ollama integration, and a Llamafile integration could be done in a more or less similar way, just by defining the base API URL (jupyterlab/jupyter-ai#904, jupyterlab/jupyter-ai#868, jupyterlab/jupyter-ai#389). Spyder is also building a way to use an AI code assistant, but what they have at the moment seems more primitive and harder to adapt (see spyder-ide/spyder#20632). The number of people using these tools today is very high. I'm not a big fan of Ollama, so simplifying AI use with llamafile would be amazing: start llamafile and point the jupyter-ai plugin at the local API URL, roughly as sketched below.
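A minimal sketch of what that could look like, assuming jupyter-ai's LangChain-based providers can take a base URL override; the port (llamafile's default 8080), the dummy API key, and the model name below are illustrative, not the plugin's actual configuration:

```python
# Sketch only: jupyter-ai providers are built on LangChain, so pointing a
# LangChain OpenAI-compatible chat model at llamafile's local server shows
# the idea. Port, key, and model name are assumptions, not plugin config.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",  # llamafile's OpenAI-compatible endpoint
    api_key="sk-no-key-required",         # llamafile does not check the key
    model="LLaMA_CPP",                    # placeholder model name
)

print(llm.invoke("Write a one-line docstring for a bubble sort function.").content)
```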
-
We at least need full API compatibility, including function / tool calling, which llama.cpp still does not offer.
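For reference, this is the kind of request shape full compatibility would have to handle; a minimal sketch against a hypothetical local endpoint (the URL, key, model name, and tool definition are placeholders):

```python
# Illustration of an OpenAI-style tool-calling request that a fully
# compatible local server would need to accept and answer with structured
# tool_calls. Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)

# A compatible server would return structured tool calls here.
print(response.choices[0].message.tool_calls)
```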
-
Should we open an issue for this important piece of the puzzle, @jart?
-
In my AI solutions I need three things: an LLM, an embedding model, and a vector database.
I would love to start these locally and have a web application (running as SaaS) use my locally running LLM, embedding model, and vector database to provide AI capabilities within the web application. That would mean three HTTP endpoints that can be started and configured from the web application. Should these be three separate executables, or combined into a single executable providing three HTTP endpoints? And where should the contents of the vector database be stored? As a file next to the executable?
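A rough sketch of how two of those three endpoints could already be served from one local llamafile process today, assuming the server is started with embeddings enabled; the vector database would still be a separate component. The URL, model name, and the embeddings assumption are illustrative, not a documented setup:

```python
# Sketch: one local llamafile server covering the chat and embedding
# endpoints via its OpenAI-compatible API; the vector DB is out of scope
# here. URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# 1) LLM endpoint
chat = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)

# 2) Embedding endpoint (assumes the server was started with embeddings enabled)
emb = client.embeddings.create(model="LLaMA_CPP", input=["hello world"])

print(chat.choices[0].message.content)
print(len(emb.data[0].embedding), "dimensions")
```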
-
My problem occurs when I switch between different computers. For example, I have a beefy desktop with NVIDIA GPUs and a 12-year-old Lenovo ThinkPad. Additionally, I often want to show others, on their computers, what is possible with open models today. So as a universal solution I want to use and show models between 5b and 15b, because they can run even on older CPUs at decent speed; for example, deepseek-v2-lite 16b gives me about 5 tokens/s on my Lenovo.

In my imagination I have, for example, a well-prepared llamafile on my Hugging Face profile and can download and run it from anywhere, on almost any standard household computer. But that doesn't work so ideally yet, because there is no automatic adjustment of the parameters (e.g. use the GPU if there is one and determine how much ngl; if not, determine the optimal number of CPU threads, determine the maximum context, etc.). Maybe this already works and I just don't know how to do it, but I couldn't find anything about it either. A rough idea of what I mean is sketched below.

In any case, that's what's stopping me personally from using llamafiles more consistently: not yet satisfactory portability, and therefore currently still limited reproducibility. And to make it clear again: I don't expect such a seamless experience for 70b models or similar. I'm simply talking about models that can run at an acceptable speed on modern laptops with 8 GB or 16 GB of RAM, but which automatically recognize and use the available performance on more powerful desktops.
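A sketch of the kind of automatic adjustment meant here; this is not a built-in llamafile feature, just an illustrative wrapper. The flags (-ngl, -t, -c) exist in llamafile/llama.cpp, but the crude detection logic, chosen values, and file name are assumptions:

```python
# Hypothetical launcher: pick GPU offload if an NVIDIA GPU is visible,
# otherwise fall back to all CPU cores, and cap the context so the model
# also fits on smaller machines. Values are illustrative guesses.
import os
import shutil
import subprocess

def launch(llamafile_path: str) -> None:
    args = [llamafile_path]
    if shutil.which("nvidia-smi"):
        args += ["-ngl", "999"]                 # offload as many layers as possible
    else:
        args += ["-t", str(os.cpu_count() or 4)]  # use all available CPU threads
    args += ["-c", "4096"]                      # modest context size for 8 GB machines
    subprocess.run(args, check=True)

launch("./deepseek-v2-lite.llamafile")  # hypothetical file name
```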
-
llamafile's main goal is to make it easier for developers to use open models. The project already greatly simplifies running open models. And since it's built on top of llama.cpp, llamafile also comes with OpenAI-compatible API endpoints.
But we want to do more. We want llamafile to become a viable drop-in replacement for commercial inference APIs, so that developers can easily switch from services like GPT-4 to using open models and the open source ecosystem.
We want your feedback and ideas. What's holding you back today from using open models in your applications, instead of services like GPT-4? What features or capabilities are missing or lacking in tools like llamafile?
(There are likely a number of issues that this project can't directly address, like the quality/performance gap between OpenAI and today's open models. But there are also probably plenty of other ways that we can make a difference!)