feat: add experimental refactorings. #683
Conversation
This is my main concern. Assuming this is the highest-level API for running locally and the class to use to implement the web API, I would build an interface similar to the Ollama one. I wouldn't build the interface on string prompts but on "message" prompts, to keep the system extensible and able to support llava or other kinds of models. The message concept will allow support for any type of model.
If we don't think about this kind of interface from the very beginning, I think we will fail the users.
public interface IOllamaApiClient
{
/// <summary>
/// Gets or sets the name of the model to run requests on.
/// </summary>
string SelectedModel { get; set; }
/// <summary>
/// Sends a request to the /api/chat endpoint
/// </summary>
/// <param name="chatRequest">The request to send to Ollama</param>
/// <param name="streamer">
/// The streamer that receives parts of the answer as they are streamed by the Ollama endpoint.
/// Can be used to update the user interface while the answer is still being generated.
/// </param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
/// <returns>List of the returned messages including the previous context</returns>
Task<IEnumerable<Message>> SendChat(ChatRequest chatRequest, IResponseStreamer<ChatResponseStream> streamer, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/copy endpoint to copy a model
/// </summary>
/// <param name="request">The parameters required to copy a model</param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task CopyModel(CopyModelRequest request, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/create endpoint to create a model
/// </summary>
/// <param name="request">The parameters for the model to create</param>
/// <param name="streamer">
/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
/// Can be used to update the user interface while the operation is running.
/// </param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task CreateModel(CreateModelRequest request, IResponseStreamer<CreateStatus> streamer, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/delete endpoint to delete a model
/// </summary>
/// <param name="model">The name of the model to delete</param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task DeleteModel(string model, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/embeddings endpoint to generate embeddings
/// </summary>
/// <param name="request">The parameters to generate embeddings for</param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task<GenerateEmbeddingResponse> GenerateEmbeddings(GenerateEmbeddingRequest request, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/generate endpoint to get a completion
/// </summary>
/// <param name="request">The parameters to generate a completion</param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
/// <returns>
/// A context object that holds the conversation history.
/// Should be reused for further calls to this method to keep a chat going.
/// </returns>
Task<ConversationContextWithResponse> GetCompletion(GenerateCompletionRequest request, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/tags endpoint to get all models that are available locally
/// </summary>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task<IEnumerable<Model>> ListLocalModels(CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/pull endpoint to pull a new model
/// </summary>
/// <param name="request">The request parameters</param>
/// <param name="streamer">
/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
/// Can be used to update the user interface while the operation is running.
/// </param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task PullModel(PullModelRequest request, IResponseStreamer<PullStatus> streamer, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/push endpoint to push a new model
/// </summary>
/// <param name="request">The request parameters</param>
/// <param name="streamer">
/// The streamer that receives status updates as they are streamed by the Ollama endpoint.
/// Can be used to update the user interface while the operation is running.
/// </param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
Task PushModel(PushRequest request, IResponseStreamer<PushStatus> streamer, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/show endpoint to show the information of a model
/// </summary>
/// <param name="model">The name of the model to get the information for</param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
/// <returns>The model information</returns>
Task<ShowModelResponse> ShowModelInformation(string model, CancellationToken cancellationToken = default);
/// <summary>
/// Sends a request to the /api/generate endpoint to get a completion and streams the returned chunks to a given streamer
/// that can be used to update the user interface in real-time.
/// </summary>
/// <param name="request">The parameters to generate a completion for</param>
/// <param name="streamer">
/// The streamer that receives parts of the answer as they are streamed by the Ollama endpoint.
/// Can be used to update the user interface while the answer is still being generated.
/// </param>
/// <param name="cancellationToken">The token to cancel the operation with</param>
/// <returns>
/// A context object that holds the conversation history.
/// Should be reused for further calls to this method to keep a chat going.
/// </returns>
Task<ConversationContext> StreamCompletion(GenerateCompletionRequest request, IResponseStreamer<GenerateCompletionResponseStream> streamer, CancellationToken cancellationToken = default);
}
Yes, the current implementation is simplified to reduce complexity, but multi-modal support is necessary. Which kind of abstraction would you prefer for multi-modal support? I'm considering the following two structures for Sequence (a rough sketch of the first option follows the list):
- Add a MultimodalData member to Sequence. It provides an interface to get the embeddings; the multi-modal model itself, however, would not be included in LLMEngine.
- Integrate the multi-modal model into LLMEngine, or add a class MultiModalLLMEngine. In this way, the input modal data will be the original data, such as images and audio.
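For illustration only, a minimal sketch of the first option, assuming hypothetical shapes for MultimodalData and the relevant part of Sequence (these names and members are placeholders, not the PR's actual definitions):
public interface IMultimodalData
{
    /// <summary>
    /// Pre-computed embeddings that the engine can consume directly, so that
    /// LLMEngine never needs to know whether the source was an image or audio.
    /// </summary>
    float[] GetEmbeddings();
}

// Only the relevant slice of Sequence is shown here.
public sealed class Sequence
{
    /// <summary>
    /// Token ids of the text part of the sequence.
    /// </summary>
    public int[] TokenIds { get; set; } = new int[0];

    /// <summary>
    /// Optional multimodal payload; null for plain-text sequences.
    /// </summary>
    public IMultimodalData? MultimodalData { get; set; }
}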
In my opinion LLMEngine should manage any type of model, and the interface should be able to handle this. Having some abstraction such as Message (the properties below are just examples) would allow us to manage what we have right now (text and images) and to extend it in the future as needed.
public class Message
{
public Message(ChatRole role, string content, string[] images)
{
Role = role;
Content = content;
Images = images;
}
public Message(ChatRole role, string[] images)
{
Role = role;
Images = images;
}
public Message(ChatRole? role, string content)
{
Role = role;
Content = content;
}
// We need this for json deserialization
public Message()
{
}
/// <summary>
/// The role of the message, either system, user or assistant
/// </summary>
[JsonPropertyName("role")]
public ChatRole? Role { get; set; }
/// <summary>
/// The content of the message
/// </summary>
[JsonPropertyName("content")]
public string Content { get; set; }
/// <summary>
/// Base64-encoded images (for multimodal models such as llava)
/// </summary>
[JsonPropertyName("images")]
public string[] Images { get; set; }
}
For me, that kind of wrapper would be better than just text/strings as proposed in this PR:
public void AddRequest(string requestId, string? prompt,...
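For comparison, a hypothetical message-based counterpart to that signature (my own sketch of what is being argued for here, not code from the PR), reusing the Message class above:
using System.Collections.Generic;

public interface ILlmRequestApi
{
    /// <summary>
    /// String-prompt style, roughly what the PR currently proposes.
    /// </summary>
    void AddRequest(string requestId, string? prompt);

    /// <summary>
    /// Message style: the request carries structured messages, so images or
    /// other modalities can ride along without changing the signature.
    /// </summary>
    void AddRequest(string requestId, IReadOnlyList<Message> messages);
}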
Whether we use the same structure or another one in the middle tier of the library is, in my opinion, less important, because I think the main contract for the user will be LLMEngine; anyone working on the middle or lower tiers of the library will need a deeper knowledge of its inner workings anyway. Keeping the upper interface stable, something that can be extended but not changed with each version, is important for being able to build tools on top of LLamaSharp.
It sounds nice to add the abstraction of Message to hold both text and multimodal data. However, I don't think ChatRole should be a part of it. LLMEngine is supposed to deal with any kind of request, including text-completion and chat-completion, single requests and multiple requests. If we add ChatRole to it, that seems to go against the purpose of LLMEngine.
I think that the main contract for the user will be LLMEngine.
In fact, I'm afraid it's still too difficult for users who have little experience with LLMs, especially if they also have little programming experience. I prefer asking them to use higher-level APIs, like the LLM, ChatLLM and ServerEngine in the proposal, and leaving LLMEngine to those who are experienced with LLMs or LLamaSharp and want to customize their own high-level APIs. Either way, I agree that the APIs of LLMEngine should be stable.
To be honest I think this PR needs to be split into many smaller PRs, which we can consider one by one. It's overwhelming at the moment, with so many new and somewhat unrelated things all added at once! (Edit: I will do a full review when I have time later)
@martindevans Actually I don't expect this PR to be merged. It's only a prototype to manifest my proposal. With this work, at least it proves that the proposal can be implemented. It's far from complete (maybe 30%), so I'd like to discuss it with you all to make sure I'm not heading the wrong way before I move on. :) Note: to make the basic text-completion example work ASAP, I used some naive implementations for sampling and detokenization. The main idea is manifested in
/// <summary>
/// Method to sample the model output.
/// </summary>
public interface ISamplingMethod
This looks similar to ISamplingPipeline, is it basically that?
Yes, I renamed it because I was not sure whether it would turn out slightly different. Using a different name avoids introducing modifications to the non-experimental part.
/// <param name="result"></param>
/// <param name="skipSpecialTokens"></param>
/// <returns>The consumed tokens for decoding.</returns>
int ConvertIdsToText(IEnumerable<int> tokenIds, out string result, bool skipSpecialTokens = false);
We can't have an API like this. It's the same problem as the old DeTokenize method, solved by the StreamingTokenDecoder.
A single character may be several tokens. For example, say the tokens [1, 2, 3] produce the character A; then decoding [1, 2] will produce a broken string, and decoding [3] will produce another broken string.
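A minimal sketch of the idea behind a streaming decoder (this is not the actual StreamingTokenDecoder; the token-to-bytes callback is a stand-in for the tokenizer): tokens are pushed in one by one, and text is only emitted once enough bytes have arrived to form complete characters.
using System;
using System.Collections.Generic;
using System.Text;

public sealed class IncrementalUtf8Decoder
{
    private readonly List<byte> _pending = new List<byte>();
    private readonly Func<int, byte[]> _tokenToBytes; // stand-in for the real tokenizer

    public IncrementalUtf8Decoder(Func<int, byte[]> tokenToBytes) => _tokenToBytes = tokenToBytes;

    // Push one token id; returns whatever text became complete, which may be
    // empty if the token only contributed part of a multi-byte character.
    public string Push(int tokenId)
    {
        _pending.AddRange(_tokenToBytes(tokenId));
        var complete = CompletePrefixLength(_pending);
        if (complete == 0)
            return string.Empty;
        var text = Encoding.UTF8.GetString(_pending.GetRange(0, complete).ToArray());
        _pending.RemoveRange(0, complete);
        return text;
    }

    // Length of the longest prefix that ends on a complete UTF-8 sequence.
    private static int CompletePrefixLength(List<byte> bytes)
    {
        for (int i = bytes.Count - 1, back = 0; i >= 0 && back < 4; i--, back++)
        {
            byte b = bytes[i];
            if ((b & 0b1100_0000) == 0b1000_0000)
                continue; // continuation byte, keep walking back to the lead byte
            int needed = b < 0x80 ? 1 : b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : 2;
            return bytes.Count - i >= needed ? bytes.Count : i;
        }
        return bytes.Count;
    }
}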
There is an IncrementalDecodingOffset member on Sequence. I'm trying to find a way to integrate the streaming decoder better into this design but haven't managed it yet. The biggest problem with using it directly is that it rents some memory and needs to be released. Though adding StreamingTokenDecoder as a member of Sequence could make it auto-released, it's a bit weird for Sequence to own a StreamingTokenDecoder because that should be part of the tokenizer. Besides, Sequence, as a mid-level class which may be used frequently by users, is not supposed to deal with logic related to LLamaContext, while a LLamaContext is required to initialize a StreamingTokenDecoder.
{
// TODO: `ApplyChatTemplate` API

// TODO: Batched Encode?
As far as I know there are no batched tokenization APIs in llama.cpp that we could use.
I'd be all in for allowing llama.cpp's built-in tokenizer to be bypassed -- there's no reason to lock the architecture to it as a dependency.
/// <summary>
/// Stores the data, status, and other information of a sequence.
/// </summary>
public sealed class Sequence
Would this be implemented as a higher-level wrapper around a Conversation object?
Though Conversation already provides some abilities for dealing with batched inference and sequences, I prefer not to wrap around it because it requires a BatchedExecutor, which in turn requires a LLamaContext. In this proposal I'd like to make Sequence a class with no coupling to llama.cpp internals. Instead, I put everything related to them in the model runner and the tokenizer.
/// The logprobs of the output token.
/// (Token id -> logP(x_i+1 | x_0, ..., x_i))
/// </summary>
public float[]? Logprobs { get; init; }
If this is used for every output token it would mean a huge number of allocations (an array of VocabSize floats for every token)!
Yes, I'm considering something like this: only fetch it from the native side when it's actually needed.
namespace LLama.Experimental.Core
{
// TODO: This is only the most simple implementation to run the test now. We should replace it in the future.
public class AntipromptStoppingCriteria: IStoppingCriteria
This seems to be the same as the AntiPromptProcessor?
Yes, but IStoppingCriteria is more flexible; it allows the sequence to be stopped because of text, a token, length, or any other reason (as long as the user can implement it).
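For illustration, a hedged sketch of a length-based criterion; the IStoppingCriteria members shown here are my guess for illustration, not the PR's actual definition:
using System.Collections.Generic;

// Hypothetical interface shape, for illustration only.
public interface IStoppingCriteria
{
    bool ShouldStop(IReadOnlyList<int> outputTokenIds, string outputText);
}

// Stops a sequence once it has produced a maximum number of tokens,
// independently of any antiprompt text.
public class MaxTokensStoppingCriteria : IStoppingCriteria
{
    private readonly int _maxTokens;

    public MaxTokensStoppingCriteria(int maxTokens) => _maxTokens = maxTokens;

    public bool ShouldStop(IReadOnlyList<int> outputTokenIds, string outputText)
        => outputTokenIds.Count >= _maxTokens;
}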
namespace LLama.Experimental.Core.LLamaCpp
{
// TODO: This is only the most simple implementation to run the example. It should be replaced in the future.
public class LLamaGreedySamplingMethod: ISamplingMethod
(I think I may have mentioned this already in another reply, but I can't find it now.)
This looks the same as the GreedySamplingPipeline. And more generally, I guess ISamplingMethod is basically the same as ISamplingPipeline?
Yes, they are very similar. I renamed it only because I don't want to change the current implementations over some slight differences.
/// <summary>
/// Input special for <see cref="LLamaCppRunner"/>.
/// </summary>
public class LLamaCppRunnerInput
Is this the same as LLamabatch?
Yes, for the same reason as above: I don't want to break the current implementations. Instead, I want to keep everything in the Experimental namespace.
internal static IEnumerable<T> SkipLastImpl<T>(IEnumerable<T> source, int count)
{
return source.Take(source.Count() - count);
This enumerates source twice.
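A possible single-pass fix, buffering the last count items in a queue so that source is only enumerated once (a sketch of the general technique, not necessarily how the PR should resolve it):
using System.Collections.Generic;

internal static class EnumerableExtensions
{
    internal static IEnumerable<T> SkipLastSinglePass<T>(IEnumerable<T> source, int count)
    {
        // Keep a sliding window of `count` items; an item is only yielded once
        // we know it cannot be among the last `count` elements.
        var buffer = new Queue<T>(count + 1);
        foreach (var item in source)
        {
            buffer.Enqueue(item);
            if (buffer.Count > count)
                yield return buffer.Dequeue();
        }
    }
}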
A few parts here seem to be duplicates of things we've already got, but that's something we can resolve later.
I'm not 100% sure I understand the purpose of the various bits. Here's my understanding:
Please correct me if I'm wrong :)
Thoughts
To me the
With the various bits I've added over the last few months (e.g.
The idea of a scheduler is quite interesting. How to handle the required flow with prompt -> infer -> sample over multiple conversations is a tricky to use part of the
ModelRunner I'm a bit confused about - it seems to be a way to support entirely different backends to llama.cpp? That sounds like it would introduce a huge amount of complexity into the codebase!
I think it's already quite a simple wrapper over the scheduler and the model runner? As you can see here, it only does three things: calling the scheduler, calling the model runner and processing the outputs (the only logic written inside
I'm not sure whether you've misunderstood me; that's also what I think. I believe we'll reach an agreement on the direction and only diverge on some details. 😄
As described in #684, I'm not going to introduce any new backend now, though it would be possible to add such a thing with the proposal. The purpose of the model runner abstraction is to separate things into two kinds: those related to llama.cpp and those not related to llama.cpp. The backbone of our current design, especially the mid-level (executors), mainly follows the llama.cpp examples. TBH I don't think the llama.cpp examples are good enough for us to follow exclusively. They are examples rather than a library with well-designed APIs. Coupling too deeply with them may prevent us from making LLamaSharp better. In this proposal, as you can see here, the
Yes, that will be complicated, so we don't want to mix llama.cpp-related logic into it, which would further increase the complexity. 😆 The proposal is a big change compared with the current design, so please feel free to ask me if anything is unclear to you. I'm open about the design and will only move on once we reach an agreement after some modifications to it. :)
Looks solid! I have a few suggestions:
- Add abstractions for Modals:
public interface ILlamaModal {
bool NeedsEvaluation(IContext context, out int neededTokens);
void Evaluate(IBatch target);
}
This could allow for more straightforward mid-level implementation & flexibility.
Modals can then have standalone functionality shareable between models.
- Oof:
public interface ILLamaSymbol { } // Future-proof & can handle multimodality
public struct Token : ILLamaSymbol { } // Size == 1
public struct Image : ILLamaSymbol { } // Size == model.DecodedImageSize
public struct Embedding : ILLamaSymbol { } // Size == model.EmbeddingSize
// Cache: Dictionary<LLamaPos, ILLamaSymbol> cache;
// Cache: ToString() => [..]
Will allow universal and robust control of the cache with just references (e.g. context.Remove(image)).
Iffy but promising imo, especially considering Meta's multimodality promise.
- Only expose Batch & Cache handling via a core Context class.
public class Context {
ILLamaSymbol[] KVCache; // Contains offsets & is easily convertible to Dictionary.
public ILLamaSymbol[] this[Range range] => ...; // For QOL like `context[512..]`
IReadOnlyList<IBatch> batches; // Name 'Batch' would make more sense instead of 'Sequence' imo
public IBatch AddBatch() { // Could register callbacks to the new batch automatically,
var batch = default(IBatch); // ..handle switch of control (e.g. if modified externally)
// .. initialize, register callbacks, pass references, etc.
return batch; // There should be no other way to register a batch to this context.
}
}
The idea here is that this would be a global handle to allow various 'plays' from different levels, and of course that it would make the whole representation of a lightweight cache accessible in C#.
/// <summary>
/// Method to sample the model output.
/// </summary>
public interface ISamplingMethod
I like the flexibility of SamplingMetaData replacing the current ISamplingPipeline's lastTokens! A lot of info could be passed that way. A slightly more flexible design could be this:
public interface ISampler<T> { T Sample(in Span<float> logits, SamplingMetaData metadata = null); }
public class BaseTokenSampler : ISampler<LLamaToken> { ... }
public interface ISamplerOption {
bool ShouldApply(in Span<float> logits, SamplingMetadata data) => true;
void Apply(ref Span<float> logits, SamplingMetadata data);
}
And for the highest level, something like samplingParams.CreateSampler() could unify these.
Thank you a lot for your suggestions! I like the ILLamaSymbol approach you mentioned. Actually Martin has made some similar changes before (for example LLamaToken and LLamaSeqId). I skipped these abstractions to get this prototype out quickly and will add them later.
Add abstractions for Modals
This looks a bit confusing to me, could you please tell me more about your idea? When, and by whom, should NeedsEvaluation be called?
IReadOnlyList batches; // Name 'Batch' would make more sense instead of 'Sequence' imo
From my point of view they are two entirely different things. A Sequence is a collection of semantically continuous information whose length grows during inference, while a Batch is the data fed into the model in a single inference step. A Batch might contain data from multiple sequences, as you can see in llama.cpp's llama_batch (a rough illustration is below).
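To make the distinction concrete, a simplified illustration (the field names loosely mirror llama.cpp's llama_batch; this is not a type from the PR): one batch carries the next tokens of several sequences for a single decode step.
// Illustrative only: one entry of a single-step batch.
public struct BatchEntry
{
    public int TokenId;    // the token to feed to the model in this step
    public int Position;   // position of the token within its own sequence
    public int SequenceId; // which sequence (conversation) the token belongs to
}

// One decode step might mix entries from two sequences, e.g.:
//   { TokenId = 15043, Position = 7, SequenceId = 0 }
//   { TokenId = 3186,  Position = 8, SequenceId = 0 }
//   { TokenId = 9906,  Position = 3, SequenceId = 1 }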
I'm glad you like it! I actually made a POC for the modals with some draft code, which I'll share tomorrow, but one idea is to call them from the high-level generator that has access to model.modals to digest the prompt.
Modals don't need to have state and would mostly be logic containers (e.g. an ImageModal would process pending image links in the llamabatch, and a TextModal would finally process the tokens).
Regarding Batch and Sequence: from an inference perspective it just makes more sense to me for a 'Batch' to be a 'batch' for the user to process/consume, rather than a 'batch' for the model's internals, but I get that it's an established term for that right now :p Anyway, whenever I wrote 'batch' above I meant what's established as 'sequence'!
It's certainly okay to expose batch to users, but I prefer to expose it in the low-level APIs (ModelRunner in this draft) because batch is directly related to llama.cpp.
SamplingMetaData
I like the idea of SamplingMetaData containing more info than just the raw lastTokens that ISamplingPipeline has at the moment (although I'd probably try to come up with another name). It would be a nice simple PR if someone wants to adjust the ISamplingPipeline interface to accept something like that (even if it only wraps up lastTokens for now, it's a good place to add more things in the future).
samplingParams.CreateSampler()
I don't like this idea though. One of the motivations for creating ISamplingPipeline was that we have/had configurable sampling and it's a mess - see ISamplingParams. Many of those properties don't mean anything depending on the values of other properties. Adding new sampling stages needs even more properties. Stages are not re-orderable, so even with all this it's not powerful enough. New/custom sampling stages cannot be added at all.
In general I don't think designs that try to have a single all-powerful config object are very good. I think it's almost always simpler and more powerful to expose primitives that can be chained together in a nice simple chain of calls like this.
@AsakusaRinne It's certainly okay to expose batch to users, but I prefer to expose it in low-level APIs
We agree ^^ I now regret the naming I used lol.. I meant to suggest we wrap it all up inside a Context class for all the other low-level and middle-level classes.
@martindevans I don't like this idea though. [..] - see ISamplingParams. [..] it's not powerful enough.
You're right! ISamplerOption in a dynamic list would fix that in my mind, but I get where you're coming from. I was also thinking of some chain design, but it would shift control during creation, somewhat like so:
var sampler = p.With(new CustomTemperatureSamplerOption()).With(new BaseTopPSamplerOption(0.95f)).ToSampler();
But I understand how that might be considered boilerplate/bloat, and leaving it up to a pipeline might be best here.
Before I pushed the current sampler API I experimented with a few different APIs. One encapsulated every sampler operation into an object (with all the relevant parameters), so a pipeline would just be a list of sampler stage objects. I didn't find it very good to work with; in reality sampler stages are not always completely isolated, so it was a very leaky abstraction.
@martindevans As an addition to the previous reply, I think we already have multiple backends (I noticed this when I was working on the auto-download proposal).
If I'm not misunderstanding it, what I said before is that I don't want to introduce an entirely different backend, for example rwkv.cpp or TensorRT-LLM. However, if there's an audio-modal model similar to LLaVA, shall we support it too? If the answer is yes, I think we should take that into account.
Rinne, your proposal seems very complicated. This of course does not mean that it is not good, but that it can maybe still be simplified. What I would like to see in the future is an LLM engine which can process (schedule) parallel requests based on some priority rules, based on available (GPU) memory, etc. Each request should have full state info which can be swapped in/out quickly when needed. I see this swapping as simply saving the state into memory (perhaps with an option to save it into a DB/file for later processing) and sending it back to the requester (ref). Then freeing memory and loading the other request's state if any, processing, etc. In the case of the same request, just keep everything going. I would not use stateful/stateless executors anymore because they are redundant with the former logic (one type of 'executor' is enough).
I am not sure what you mean by batching, but if you mean splitting the input into chunks which fit into the context, then I would also do this simply by processing a whole request first (processing all chunks of one request sequentially without stopping) and then allowing the next request. Some libraries do this chunking already (for example, Semantic Kernel). Maybe it is also an option to use SK as a lower-level engine. SK already schedules several tasks (text completion, chat, etc.)... only the final scheduling of the LLM engine is missing (memory management, etc.), which could be added as a plugin to SK.
All details about KV cache management, sampling, etc. should be hidden and manageable by options in the LLM engine (but preferably set to good values with no need to change them). Sequence, conversation, message, text... I think that it would be better to have just one type, for example called 'message' (byte[] ?), because that is what is going to the LLM. Based on the request, this message will be simple text, chat history, an image, etc.
So, as a summary, I think that you should think in reverse (from the point of view of the application which uses the library) and see how to make all of this much simpler and much more efficient.
Please refer to this code of the PR. A scheduler has three queues, which are
I mean the technique of processing multiple sequences (conversations) at the same time to improve throughput, which was introduced into llama.cpp in this PR for the first time. From my point of view the design is not really complicated. As you can see here, only 6 components are abstracted (including
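As a rough illustration of the three-queue idea (the queue names and members below are my own placeholders, since the PR's actual definitions are only linked, not quoted here): sequences wait until there is room, run while scheduled, and can be swapped out under memory pressure.
using System;
using System.Collections.Generic;

// Illustrative only: a scheduler that moves sequences between three queues.
public class ThreeQueueScheduler<TSequence>
{
    private readonly Queue<TSequence> _waiting = new Queue<TSequence>(); // not yet admitted
    private readonly List<TSequence> _running = new List<TSequence>();  // scheduled for the next step
    private readonly List<TSequence> _swapped = new List<TSequence>();  // preempted, state saved aside

    public void Add(TSequence sequence) => _waiting.Enqueue(sequence);

    // Choose the sequences to run in the next model step, admitting waiting
    // sequences for as long as the caller-provided memory check allows it.
    public IReadOnlyList<TSequence> Schedule(Func<TSequence, bool> fitsInMemory)
    {
        while (_waiting.Count > 0 && fitsInMemory(_waiting.Peek()))
            _running.Add(_waiting.Dequeue());
        return _running;
    }

    // Preempt a running sequence when memory runs out; its state would be
    // saved elsewhere and it can later be re-admitted via Add.
    public void SwapOut(TSequence sequence)
    {
        _running.Remove(sequence);
        _swapped.Add(sequence);
    }
}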
Rinne, the scheduler stuff sounds good! About batching, this is a bit confusing. I think that what they are discussing in ggerganov/llama.cpp#3749 is an internal batching of tokens on the GPU and not several requests/conversations. I think that processing multiple conversations at the same time should mean that the scheduler executes the conversations in sequence rather than in parallel, by swapping only the context state (if the model is the same). This is not batching but scheduling, though I may be misunderstanding something.
I have to disagree I'm afraid and it's leaving me very conflicted about this proposal. This is trying to introduce multiple new concepts all at once, with a very high level of abstraction, trying to hide/abstract away important details and it's all built on top of a very unstable foundation (llama.cpp changes fast). I honestly can't imagine this all coming together successfully with the current approach. Rather than trying to build this all at once almost as an entirely new library in the experimental namespace I would rather see this done in very small parts, step by step, building on top of the primitives we have rather than duplicating large amounts of them. For example referencing back to some of my earlier comments:
All of these are almost equivalent. If there are improvements to them to be made, let's develop those improvements in isolation and get them added to the existing interfaces!
I agree something like llava could be regarded as a separate backend; it provides new capabilities that could probably be fenced off into a set of extensions, and the core library wouldn't need to be aware of it at all. What I was really asking about, though, was multiple backends for the same thing, e.g. llama.cpp for text LLMs but then also a separate backend such as exllama for text LLMs. I would be very strongly against this; we would have to abstract away a lot of the details of the backend to produce a "lowest common denominator" API between all the backends.
Let's define some of the various levels we have in LLamaSharp at the moment. At the moment we have:
Given these levels, does it make sense to split LLamaSharp into a new package (e.g.
This user profile actually describes... me! Personally I want decent bindings to llama.cpp so I can build other libraries based on LLMs. The very high level stuff doesn't really interest me that much; it abstracts away too much for my taste.
@zsogitbe "Batching" in this sense means running inference over a number of separate "sequences" all at once; in one pass over the model you can produce continuations for all sequences at once. This is significantly faster than running inference for each sequence individually, because LLMs are normally memory-bandwidth bound, not compute bound (e.g. running 8 sequences at once might cost you about 15% speed, for 8x as many tokens, a very rough estimate because it depends a lot on your hardware). This is all exposed in the
Each sequence needs a certain amount of memory in the KV cache (to store its entire history). So scheduling multiple sequences is all about packing as many of them as possible into that space. This is made extra complex by the fact that sequences can (and often do) share bits of the KV cache! For example, if you evaluate the system prompt (just once) you can then fork that sequence into e.g. 8 conversations which all generate different continuations but only consume the KV cache memory for the system prompt once (see the sketch below). This is made even more complex by the fact that if you save a sequence, unload it from the KV cache, and then immediately reload it, it will no longer share memory, so there's potentially a very high cost to unloading a sequence the first time you do it.
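A hedged sketch of the system-prompt example; the interfaces below are hypothetical stand-ins in the spirit of a batched executor, not the real LLamaSharp API:
// Hypothetical, minimal shapes standing in for the real batched-executor API.
public interface IConversation
{
    void Prompt(int[] tokens);
    IConversation Fork(); // the fork shares the parent's existing KV cache entries
}

public interface IBatchedLlmExecutor
{
    IConversation Create();
    void Infer(); // runs one model pass over all pending conversations
}

public static class KvCacheSharingExample
{
    // Evaluate the shared system prompt once, then fork it n times. Each fork
    // reuses the system prompt's KV cache entries and pays extra memory only
    // for the tokens it generates itself.
    public static IConversation[] ForkAfterSystemPrompt(
        IBatchedLlmExecutor executor, int[] systemPromptTokens, int n)
    {
        var root = executor.Create();
        root.Prompt(systemPromptTokens);
        executor.Infer();

        var forks = new IConversation[n];
        for (var i = 0; i < n; i++)
            forks[i] = root.Fork();
        return forks;
    }
}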
I'm not going to introduce any other LLM backend. It's only, kind of, a design choice to make things clearer (from my point of view).
That's certainly okay. This PR is a draft to manifest my idea.
Don't you think this idea is similar to the abstraction of
TBH I don't think
Edit: making
I believe you are not alone. It seems to be a good idea to split the package, though it requires much work. :)
This sounds good, but I don't like the LLamaSharp.OpenAI part. I would leave KernelMemory and SemanticKernel in separate packages as they are now. I would not go in the direction of OpenAI, because that is a completely different story. KernelMemory and SemanticKernel are not OpenAI related, but can work with OpenAI. It is important that LLamaSharp.Engine is very stable and does not need to change often when llama.cpp changes! The interface between LLamaSharp.Native and LLamaSharp.Engine should be designed in such a clever way that LLamaSharp.Engine can become as stable a product as possible.
I do not believe in batching for desktop applications, because we adjust the context size, the number of layers uploaded to the GPU, etc. to optimize GPU memory usage. In this scenario batching is not important. In a cloud server environment where you have several GPUs and a lot of (expensive!) memory, batching is interesting. It is thus important that batching does not become the default, but a choice of execution. However, I still do not really understand what happens when two requests have different inference parameters (different temperature, topP, etc.) and you process them in one batch together... I expect many errors here.
I believe in automatic scheduling of model processing, which would then also mean automatic (GPU) memory management. I am doing the GPU memory management manually now: freeing the GPU and loading the other model, etc. This should be done automatically and optimally (difficult!). This is where I expect the most from Rinne.
Batching is important for desktop usage too, it provides a speedup in any scenario except:
Even desktop apps with a single user chatting to a model might have multiple sequences (e.g. internal summarisation of history, multiple agents for reasoning, etc.). I think you, and maybe Rinne, might be misunderstanding the BatchedExecutor. It isn't only for batching. It provides a better (safer) API over the low-level primitives of llama.cpp and exposes things like KV cache manipulations as well (rewind, fork, shift, etc.). My long-term design goal as I've been working on it is for it to act as the foundation of all future executors (even ones with a single conversation thread).
Yes, maybe we don't understand what exactly you mean/want to do. What I mean is that I am filling my GPU's memory completely with one request, so there is no more room for an extra sequence. I think that in this case batching (if I understand you well) will only cause overhead which will slow down execution. But maybe you could make a very simple example with two requests in parallel and then we can test it and see what happens: a test where you run the 2 requests sequentially and measure the time, and then run the same with batching and measure the time. We will then have a good idea of whether it is interesting.
@zsogitbe If you're completely filling the KV cache with a single request then batching can't help; it can only parallelise as far as it can fit sequences into the cache. You still get the advantage of a better API though; for example, if your context is full there's a nice simple
Another nice thing about batching is that you can share the cache, so if you're almost filling the cache there are some potential use cases. For example, if you had a long conversation filling 90% of the cache you could fork it (costing no extra memory), ask the LLM for a summary in that fork (using up extra memory only for the summary) and then discard that fork (getting that memory back).
@martindevans It seems that ideas from more people are needed to converge this discussion. Your in-progress work sounds coupled with the current design of the master branch, while mine is relatively independent of it. What about introducing a package
To clarify: I was not saying we should work out two solutions and let one defeat the other. It seems that @martindevans is more concerned with low/mid-level APIs, while I prefer mid/high-level APIs. Some of the work from one of us might be dropped in the end, but the rest could be put together. I believe it won't be a waste of time.
I have the very strong feeling that this should not be called batching but maybe load balancing, and it should incorporate scheduling... the whole automatic management of processing. I also think that this should happen automatically, in order to prevent users from needing a PhD to use it :)
I feel a little bit of a conflict here. It is maybe not my place to suggest a solution, but I will try anyway; you can always ignore my suggestion :). I think that the best way to proceed would be for Rinne to implement scheduling and Martin to implement load balancing (batching, as Martin calls it), and then these two would be merged together into a magic solution which would be the subject of many talks around the world, because it would be such a great improvement to LLamaSharp. Rinne could further improve his experimental solution and we would cherry-pick each improvement and add it to the current solution. If deeper refactoring were needed later because, for example, the code becomes a bit less logical due to the additions, then this could be done later based on the already working modules (which will be much easier than reworking the whole thing at once now, without knowing whether the improvements work well and are necessary).
I will move the work to the experimental package in the future, so I'm going to close this draft PR. However, please feel free to leave comments here. Our opinions differ in many ways, but that is exactly the purpose of opening this draft PR -- to hear different voices and avoid being isolated in my own mind. Thank you a lot for all your comments!
This is a prototype implementation for #684. It demonstrates what the design will look like and the feasibility of the proposal.
TODO:
- LLM.

How to run
I have provided an example named Experimental LLM. It shows how to use a very simple API to run batched inference over multiple inputs. Since streaming hasn't been supported yet, it will take a long time before you see any output on the console.

Plan
I won't ask to merge this PR now, since it's not only not in a good state but will also introduce breaking changes. I'll follow the steps below to move on.
- ... the Experimental namespace first, and try my best to keep other parts unchanged.
- ... ChatSession with the features, then mark them as obsolete if necessary.
- ... the Experimental namespace to the root namespace.

Completing the whole TODO list requires a large amount of work, so I'd like to discuss it with you first and I hope we can agree on it after some improvements to the proposal. It will proceed much faster if you would like to develop it together. :)
About the code
The code is under the namespace Experimental. Note that some parts are not well done and are only implemented to get a simple example running. So, please ignore the naive implementations of LLamaGreedySamplingMethod, LLamaTokenizer and SequenceLengthStoopingCriteria.