[Proposal] Refactor the mid-level and high-level implementations of LLamaSharp #684
Comments
@AsakusaRinne The overall idea seems good to me. But I have the following observations:
I will begin to provide feedback on the prototype.
Agreed. Currently we ask users to build the web API from the mid-level APIs themselves, and it's difficult for them to apply batched inference.
That's a good idea. However, it seems that #670 and this proposal will consume all my free time, so I'm afraid I won't be available for it in the next 3 months. If you find it helpful to modify some parts of this proposal for it, I'll be more than happy to help and discuss with you. :)
That's mostly because it's not designed to be 😆 I haven't been pushing for anyone to use it until recently because I've only just reached feature parity with the addition of loading/saving individual conversations in #681!
Going from this diagram I would say BatchedExecutor can currently provide:
Thoughts on the other parts of that diagram:

Sampling

There is the entire sampling pipeline API I developed (see here) which I think serves this. A sampling pipeline can be put together by implementing the pipeline interface (a conceptual sketch is included after this comment).

Scheduler

This is a tricky one that I haven't done any work on. I assume you mean something to schedule when inference is run, to maximise the work done in a single batch but minimise the latency? That's probably the hardest part of batched inference: you need to bring together all the work into a batch before calling infer, and it definitely needs some kind of higher level system to help schedule it.

Stopping Criteria

Not something I've worked on much at all, since it comes after inference and sampling, which have been my main focus. Definitely something we need though!

Other Things

I think some other things I would add to the "mid level" API list would be:

- Templating. We need the low level implementation of templating - taking some text and transforming it into alternative text according to the template. We probably also need the higher level implementation (something like ChatSession/ChatHistory) which represents the history in an object oriented way and can be manipulated in ways that make sense at a lower level (e.g. rewind, fork and shift can all be done at the high level and map down into low level KV manipulations).
- Embeddings. There seem to be a lot of changes coming in how llama.cpp handles embeddings - generative models, embedding models, pooling techniques etc. Our current LLamaEmbedder is very primitive; at the very least it could be made into something that uses a batch to generate lots of embeddings at once, much faster than it currently does.

High Level APIs

I think these would probably be better off split into separate packages? Our current high level APIs have become a bit of a mess over time as the low level has shifted underneath them; splitting into separate packages somewhat prevents that becoming an issue in the future. That would leave LLamaSharp providing the core things that everyone needs (low and mid level) and then separate special purpose packages providing other specific use cases, e.g. individual nuget packages for:
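To make the sampling-pipeline idea above concrete, here is a small conceptual sketch: a chain of stages that each transform the logits before a token is selected. The interface and class names are invented for illustration and are not the actual LLamaSharp sampling API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stage interface: each stage transforms the logits in place.
public interface ISamplingStage
{
    void Apply(float[] logits, IReadOnlyList<int> previousTokens);
}

// Scales logits by a temperature before selection.
public sealed class TemperatureStage : ISamplingStage
{
    private readonly float _temperature;
    public TemperatureStage(float temperature) => _temperature = temperature;

    public void Apply(float[] logits, IReadOnlyList<int> previousTokens)
    {
        for (var i = 0; i < logits.Length; i++)
            logits[i] /= _temperature;
    }
}

// Penalises tokens that already appeared in the sequence.
public sealed class RepetitionPenaltyStage : ISamplingStage
{
    private readonly float _penalty;
    public RepetitionPenaltyStage(float penalty) => _penalty = penalty;

    public void Apply(float[] logits, IReadOnlyList<int> previousTokens)
    {
        foreach (var token in previousTokens.Distinct())
            logits[token] = logits[token] > 0 ? logits[token] / _penalty : logits[token] * _penalty;
    }
}

// A pipeline is just an ordered list of stages followed by token selection.
public sealed class SamplingPipeline
{
    private readonly List<ISamplingStage> _stages = new();

    public SamplingPipeline Add(ISamplingStage stage) { _stages.Add(stage); return this; }

    // Greedy selection for simplicity; a real pipeline would usually sample
    // from the softmax distribution of the transformed logits instead.
    public int Sample(float[] logits, IReadOnlyList<int> previousTokens)
    {
        foreach (var stage in _stages)
            stage.Apply(logits, previousTokens);

        var best = 0;
        for (var i = 1; i < logits.Length; i++)
            if (logits[i] > logits[best])
                best = i;
        return best;
    }
}
```

Usage of such a pipeline would then read like `new SamplingPipeline().Add(new RepetitionPenaltyStage(1.1f)).Add(new TemperatureStage(0.8f)).Sample(logits, lastTokens)`.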
Yes, in my prototype, I referred to the implementations of
Yes, and it's also responsible for continuous batching. I think that's important for building LLM servers because requests may come in at any time.
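To make that idea concrete, below is a deliberately simplified, hypothetical sketch of a continuous-batching scheduler: requests can be enqueued at any time, and before each inference step waiting sequences are admitted while the total number of pending tokens stays within a budget. All type and member names are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical request record: a sequence id plus the tokens still to be fed to the model.
public sealed record SequenceRequest(int SequenceId, Queue<int> PendingTokens);

public sealed class ContinuousBatchingScheduler
{
    private readonly Queue<SequenceRequest> _waiting = new();
    private readonly List<SequenceRequest> _running = new();
    private readonly int _maxBatchTokens;

    public ContinuousBatchingScheduler(int maxBatchTokens) => _maxBatchTokens = maxBatchTokens;

    // Requests may be enqueued at any time, e.g. from incoming HTTP calls.
    public void Enqueue(SequenceRequest request) => _waiting.Enqueue(request);

    // Called once per inference step; returns the (sequence, token) pairs
    // that should go into the next batch.
    public IReadOnlyList<(int SequenceId, int Token)> ScheduleNextBatch()
    {
        // Admit waiting sequences while the total pending-token count stays within the budget.
        while (_waiting.Count > 0 &&
               _running.Sum(s => s.PendingTokens.Count) + _waiting.Peek().PendingTokens.Count <= _maxBatchTokens)
        {
            _running.Add(_waiting.Dequeue());
        }

        // Take one token from every running sequence for this step.
        var batch = new List<(int, int)>();
        foreach (var seq in _running.Where(s => s.PendingTokens.Count > 0))
            batch.Add((seq.SequenceId, seq.PendingTokens.Dequeue()));

        // Sequences with nothing left to feed drop out until new tokens arrive.
        _running.RemoveAll(s => s.PendingTokens.Count == 0);
        return batch;
    }
}
```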
I could try to figure out how to make the embedding APIs better as I move forward with this proposal. However, I currently have no idea about the template. To reduce duplicated work and refactoring, I think we'd better keep the prototype in
In my opinion, I would like to keep the text-completion and chat-completion classes in the main package and put the others in separate packages, such as the server Engine, OpenAI-style APIs and RAG. As you can see in #683,
(Just to note I haven't looked at #683 yet. I wasn't suggesting things that should be added to that specific PR, just the general direction of the project overall for the next 12 months!)
Introduction
This proposal requires a lot of work and will introduce many breaking changes. It should be discussed in detail before it's merged into the master branch. Any suggestions will be appreciated! FYI @martindevans @SignalRT
This proposal is inspired by vllm and already has a prototype implementation in #683. Though it's far from complete, the main ideas have been manifested there. If you want to learn more about this proposal, please follow the example in that PR and give it a try. The example does not have a good UI to show the process of parallel inference, but it does execute multiple sequences at the same time. You could set breakpoints in LLM.RunEngine to confirm that.

Motivations
At the very early stage of LLamaSharp, the LLamaModel class was used for dealing with all things of the model, including loading, state, inference and the high-level API. After v0.4.1, it was split into LLamaWeights, LLamaExecutor and ChatSession, in which LLamaExecutor is the mid-level API to run the model and ChatSession is the high-level API.

Though this design once worked well for both developers and users, as time passed, its issues have become increasingly evident. The main problems are described as follows.
Design
The full design is shown below.
(Figure: LLamaSharp Refactor design diagram)
Within it, the llama.cpp backend is shown below (see #670 for the auto-downloading proposal).
(Figure: llama.cpp backend diagram)
The design is still separated into low-level, mid-level and high-level APIs. However, the low-level part contains multiple backends.
Don't get me wrong: I am not going to introduce other backends now (though it's possible). The purpose of this design is to better abstract the llama.cpp-related part. Thus, mid-level implementations will only need to use a few APIs of the llama.cpp model runner, llama.cpp tokenizer and llama.cpp kv-cache manager. Some logic, such as scheduling, sampling and stopping criteria, could be independent of the backend part.

Here is the explanation of the newly introduced components.
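To make the backend abstraction boundary concrete before going into the individual components, here is a minimal sketch of the kind of interfaces the mid-level layer could depend on. The names and members are illustrative assumptions, not the proposed public API.

```csharp
using System.Collections.Generic;

// Illustrative abstraction boundary between mid-level logic and the backend.
// A llama.cpp backend would implement these; other backends could too.
public interface ITokenizer
{
    IReadOnlyList<int> Encode(string text, bool addBos = true);
    string Decode(IReadOnlyList<int> tokens);
}

public interface IModelRunner
{
    // Runs one forward pass for a batch of (sequenceId, token, position) entries
    // and returns the logits of the last scheduled token of each sequence.
    IReadOnlyDictionary<int, float[]> Decode(IReadOnlyList<(int SequenceId, int Token, int Position)> batch);
}

public interface IKvCacheManager
{
    void Remove(int sequenceId, int fromPosition, int toPosition);
    void CopySequence(int sourceSequenceId, int targetSequenceId);
    void Clear();
}
```

Scheduling, sampling and stopping criteria would then be written purely against these interfaces, keeping them independent of the backend.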
Text completion APIs
Here is what the text completion APIs will look like (only the key elements are shown).
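A rough sketch of what these key elements might look like is given below. The member names and signatures are assumptions for illustration, based on the LLM and LLMEngine names used elsewhere in this proposal, rather than the final API.

```csharp
using System.Collections.Generic;
using System.Threading;

// Sketch only: parameter and result types reduced to the essentials.
public sealed class TextCompletionParams
{
    public int MaxTokens { get; init; } = 256;
    public float Temperature { get; init; } = 0.8f;
    public IReadOnlyList<string> StoppingStrings { get; init; } = new List<string>();
}

public sealed class TextCompletionResult
{
    public int RequestId { get; init; }
    public string Text { get; init; } = "";
}

// The mid-level engine: tokenization, scheduling, kv-cache management and
// sampling are hidden behind this type (bodies are stubs in this sketch).
public sealed class LLMEngine
{
    public LLMEngine(string modelPath) { /* stub: load weights and create the context */ }
    public void AddRequest(int requestId, string prompt, TextCompletionParams parameters) { /* stub: hand over to the scheduler */ }
    public bool Step() => false; // stub: would decode one scheduled batch and report whether work remains
    public IReadOnlyList<TextCompletionResult> CollectFinished() => new List<TextCompletionResult>();
}

// The user-facing entry point for text completion.
public sealed class LLM
{
    private readonly LLMEngine _engine;
    private int _nextRequestId;

    public LLM(string modelPath) => _engine = new LLMEngine(modelPath);

    // Completes a batch of prompts; multiple sequences are decoded together.
    public IReadOnlyList<TextCompletionResult> Generate(
        IReadOnlyList<string> prompts,
        TextCompletionParams parameters,
        CancellationToken cancellationToken = default)
    {
        foreach (var prompt in prompts)
            _engine.AddRequest(_nextRequestId++, prompt, parameters);
        return RunEngine(cancellationToken);
    }

    // The loop mentioned earlier in this proposal: keeps stepping the
    // engine until every scheduled request has finished.
    private IReadOnlyList<TextCompletionResult> RunEngine(CancellationToken cancellationToken)
    {
        var results = new List<TextCompletionResult>();
        while (_engine.Step() && !cancellationToken.IsCancellationRequested)
            results.AddRange(_engine.CollectFinished());
        results.AddRange(_engine.CollectFinished());
        return results;
    }
}
```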
When using it, the code will look like the following.
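A hypothetical usage sketch based on the types above (the model path and prompts are placeholders):

```csharp
using System;

// Load a model and complete several prompts in one batched call.
var llm = new LLM("models/example-model.gguf");

var results = llm.Generate(
    new[] { "Write a haiku about spring.", "Explain the KV cache in one sentence." },
    new TextCompletionParams { MaxTokens = 128, Temperature = 0.7f });

foreach (var result in results)
    Console.WriteLine($"[{result.RequestId}] {result.Text}");
```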
For server-related APIs, I'll update them after more investigation.
Conclusion
The proposal refactors most of the current mid-level and high-level designs. Breaking changes are its major risk. However, it seems that the current executors could be implemented with the mid-level APIs provided in this proposal. LLM is actually a StatelessExecutor with a scheduler and better abstractions. As for InteractiveExecutor, it could be implemented with LLMEngine + KvCacheManager, because LLM chatting could be regarded as text completion with roles and kv-cache management (a rough sketch is given below). In this way, it's possible for us to make the changes smoothly.

It will have so many impacts that I won't rush it. I'll leave enough time for the community to discuss it and correct the unreasonable parts. It's completely okay to drop it if most users & developers don't like it.
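As an illustration of that mapping (chat as text completion with roles plus kv-cache reuse), here is a rough sketch built on the hypothetical LLM and TextCompletionParams types from the text completion sketch above; all names are assumptions, not a proposed API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a chat-style executor layered on the sketched LLM type,
// treating a conversation as text completion over the formatted history.
public sealed class ChatExecutor
{
    private readonly LLM _llm;
    private readonly List<(string Role, string Content)> _history = new();

    public ChatExecutor(LLM llm) => _llm = llm;

    public string Chat(string userMessage, TextCompletionParams parameters)
    {
        _history.Add(("user", userMessage));

        // Format the whole history into a prompt; the shared prefix of this
        // prompt is what the kv-cache manager can keep between turns.
        var prompt = string.Join("\n", _history.Select(m => $"{m.Role}: {m.Content}")) + "\nassistant:";

        var results = _llm.Generate(new[] { prompt }, parameters);
        var reply = results.Count > 0 ? results[0].Text : "";
        _history.Add(("assistant", reply));
        return reply;
    }
}
```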
I prefer to aim at making LLamaSharp a library for running LLMs efficiently with easy-to-use APIs, instead of a simple wrapper around llama.cpp. That's also why we spent lots of time on performance improvements and dynamic native library selection. If we can agree on this, I believe we'll work it out soon. :)
Again, any suggestions and discussions about this proposal will be appreciated. 🤗