Replies: 6 comments
-
We would love to, but we need concurrency on CPU-based infrastructure to open it to a group of 10-15 with the benefits of cost, security, and ease.
-
Integration with jupyter-ai, a JupyterLab extension that integrates a code assistant, would be great. It already has Ollama integration, and a Llamafile integration could be done in a more or less similar way, just by defining the base API URL (jupyterlab/jupyter-ai#904, jupyterlab/jupyter-ai#868, jupyterlab/jupyter-ai#389). Spyder is also building a way to use an AI code assistant, but what they have at the moment seems more primitive and harder to adapt (see spyder-ide/spyder#20632). The number of people using these tools today is very high. I'm not a big fan of Ollama, so simplifying AI use with llamafile would be amazing: start llamafile and point the jupyter-ai plugin at the local API URL, roughly as sketched below.
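A minimal sketch of what that could look like, assuming jupyter-ai's LangChain-based providers can take a base URL override; the port (llamafile's default 8080), the dummy API key, and the model name below are illustrative, not the plugin's actual configuration:

```python
# Sketch only: jupyter-ai providers are built on LangChain, so pointing a
# LangChain OpenAI-compatible chat model at llamafile's local server shows
# the idea. Port, key, and model name are assumptions, not plugin config.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:8080/v1",  # llamafile's OpenAI-compatible endpoint
    api_key="sk-no-key-required",         # llamafile does not check the key
    model="LLaMA_CPP",                    # placeholder model name
)

print(llm.invoke("Write a one-line docstring for a bubble sort function.").content)
```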
-
We at least need full API compatibility, including function / tool calling, which llama.cpp still does not offer.
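For reference, this is the kind of request shape full compatibility would have to handle; a minimal sketch against a hypothetical local endpoint (the URL, key, model name, and tool definition are placeholders):

```python
# Illustration of an OpenAI-style tool-calling request that a fully
# compatible local server would need to accept and answer with structured
# tool_calls. Endpoint, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder model name
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)

# A compatible server would return structured tool calls here.
print(response.choices[0].message.tool_calls)
```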
-
Should we open an issue for this important piece of the puzzle, @jart?
-
In my AI solutions I need three things: an LLM, an embedding model, and a vector database.
I would love to start these locally and have a web application (running as SaaS) use my locally running LLM, embedding model, and vector database to provide AI capabilities within the web application. That would mean three HTTP endpoints that can be started and configured from the web application. Should these be three separate executables, or combined into a single executable providing three HTTP endpoints? And where should the contents of the vector database be stored? As a file next to the executable?
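A rough sketch of how two of those three endpoints could already be served from one local llamafile process today, assuming the server is started with embeddings enabled; the vector database would still be a separate component. The URL, model name, and the embeddings assumption are illustrative, not a documented setup:

```python
# Sketch: one local llamafile server covering the chat and embedding
# endpoints via its OpenAI-compatible API; the vector DB is out of scope
# here. URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

# 1) LLM endpoint
chat = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[{"role": "user", "content": "Summarize RAG in one sentence."}],
)

# 2) Embedding endpoint (assumes the server was started with embeddings enabled)
emb = client.embeddings.create(model="LLaMA_CPP", input=["hello world"])

print(chat.choices[0].message.content)
print(len(emb.data[0].embedding), "dimensions")
```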
-
My problem occurs when I switch between different computers. For example, I have a beefy desktop with NVIDIA GPUs and a 12-year-old Lenovo ThinkPad. Additionally, I often want to show others, on their computers, what is possible with open models today. So as a universal solution I want to use and show models between 5b and 15b, because they can run even on older CPUs at decent speed; for example, deepseek-v2-lite 16b gives me about 5 tokens/s on my Lenovo.

In my imagination I have, for example, a well-prepared llamafile on my Hugging Face profile and can download and run it from anywhere, on almost any standard household computer. But that doesn't work so ideally yet, because there is no automatic adjustment of the parameters (e.g. use the GPU if there is one and determine how much ngl; if not, determine the optimal number of CPU threads, determine the maximum context, etc.). Maybe this already works and I just don't know how to do it, but I couldn't find anything about it either. A rough idea of what I mean is sketched below.

In any case, that's what's stopping me personally from using llamafiles more consistently: not yet satisfactory portability, and therefore currently still limited reproducibility. And to make it clear again: I don't expect such a seamless experience for 70b models or similar. I'm simply talking about models that can run at an acceptable speed on modern laptops with 8 GB or 16 GB of RAM, but which automatically recognize and use the available performance on more powerful desktops.
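A sketch of the kind of automatic adjustment meant here; this is not a built-in llamafile feature, just an illustrative wrapper. The flags (-ngl, -t, -c) exist in llamafile/llama.cpp, but the crude detection logic, chosen values, and file name are assumptions:

```python
# Hypothetical launcher: pick GPU offload if an NVIDIA GPU is visible,
# otherwise fall back to all CPU cores, and cap the context so the model
# also fits on smaller machines. Values are illustrative guesses.
import os
import shutil
import subprocess

def launch(llamafile_path: str) -> None:
    args = [llamafile_path]
    if shutil.which("nvidia-smi"):
        args += ["-ngl", "999"]                 # offload as many layers as possible
    else:
        args += ["-t", str(os.cpu_count() or 4)]  # use all available CPU threads
    args += ["-c", "4096"]                      # modest context size for 8 GB machines
    subprocess.run(args, check=True)

launch("./deepseek-v2-lite.llamafile")  # hypothetical file name
```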
-
llamafile's main goal is to make it easier for developers to use open models. The project already greatly simplifies running open models. And since it's built on top of llama.cpp, llamafile also comes with OpenAI-compatible API endpoints.
But we want to do more. We want llamafile to become a viable drop-in replacement for commercial inference APIs, so that developers can easily switch from services like GPT-4 to using open models and the open source ecosystem.
We want your feedback and ideas. What's holding you back today from using open models in your applications, instead of services like GPT-4? What features or capabilities are missing or lacking in tools like llamafile?
(There are likely a number of issues that this project can't directly address, like the quality/performance gap between OpenAI and today's open models. But there are also probably plenty of other ways that we can make a difference!)