Parallel decoding #95
Is this any different from batched inference? That's already supported.
Yes, it's slightly different: in batched inference all requests have to be sent at the same time, whereas in parallel decoding each request can start at a different time, end at a different time and be sent back independently. This is very similar to how vLLM and paged attention work. It also makes it possible to implement StreamingLLM.
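To make the distinction concrete, here is a rough, framework-agnostic sketch of a continuous-decoding loop where requests join and leave between steps. `Request`, `decode_step` and `EOS_TOKEN_ID` are hypothetical placeholders for illustration, not ExLlamaV2 API:

```python
import queue
import time
from dataclasses import dataclass, field

EOS_TOKEN_ID = 2  # hypothetical end-of-sequence token id

@dataclass
class Request:                        # hypothetical per-request state
    prompt_ids: list
    output_ids: list = field(default_factory=list)

def decode_step(requests):
    """Placeholder for one forward pass producing one new token per active sequence."""
    raise NotImplementedError

incoming = queue.Queue()              # clients can enqueue requests at any time
active = []

# Static batching: all requests must be present up front and the batch finishes together.
# Parallel decoding: requests are admitted and retired between individual decode steps.
while True:
    while not incoming.empty():
        active.append(incoming.get())      # a new request joins mid-generation
    if not active:
        time.sleep(0.01)
        continue
    new_tokens = decode_step(active)       # one token per active sequence
    for req, tok in zip(active, new_tokens):
        req.output_ids.append(tok)
    # finished sequences leave immediately; the rest keep decoding
    active = [r for r in active if r.output_ids[-1] != EOS_TOKEN_ID]
```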
See a working example of the multi-request handling here.
This is actually also implemented, there just isn't a manager for it yet. But you can start an inference and then at any point start another one, supplying a second cache (of any length) and then running the two inferences in parallel. Linear layers will be batched, and attention will be performed separately on each sequence. There isn't full paged-attention-style mapping of cache chunks between the sequences, though. And I'm not sure how it relates to StreamingLLM (which so far I'm not really sold on either), but it is a feature I plan to expand on eventually, when there are fewer than a hundred items on the to-do list. Hopefully Torch will have finalized nested tensors by then.
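A minimal sketch of that pattern, assuming `model.forward` accepts a list of per-sequence `ExLlamaV2Cache` objects as the linked example suggests. The model path, cache sizes and keyword arguments are placeholders and may differ between versions:

```python
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"                    # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# First request: its own cache, prefilled up to the last prompt token
ids_a = tokenizer.encode("Once upon a time")
cache_a = ExLlamaV2Cache(model, max_seq_len = 256)
model.forward(ids_a[:, :-1], cache_a, preprocess_only = True)

# A second request arrives later, with a separate cache of a different length
ids_b = tokenizer.encode("The capital of France is")
cache_b = ExLlamaV2Cache(model, max_seq_len = 512)
model.forward(ids_b[:, :-1], cache_b, preprocess_only = True)

# One decoding step for both: the last token of each sequence is stacked into
# a single batch, so linear layers run batched while attention uses each
# sequence's own cache
inputs = torch.cat((ids_a[:, -1:], ids_b[:, -1:]), dim = 0)
logits = model.forward(inputs, [cache_a, cache_b])
next_a = logits[0, -1].argmax().item()
next_b = logits[1, -1].argmax().item()
```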
Ah, I see. Thanks. My primary use case for this would be a server, hence the on-demand starting and stopping.
Does this mean I can have multiple threads start inference with different caches on the same model object, and these concurrent inference requests will be batched under the hood resulting in increased total throughput? |
Yes, pretty much. I haven't tested it much, so I guess it's slightly unfinished, mostly because I got distracted by other things, but also because I kind of want to see how nested tensors turn out in Torch when that feature is finalized. If nested tensors end up working the way I hope, it would change how I'd approach the multi-cache stuff. I've added an example of how you might use the feature as it currently exists. It implements a very basic generator with sampling (but without the more advanced features like healing and stop strings) that you could build on to allow concurrent requests in some sort of streaming backend.
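For the concurrent-request/server use case discussed above, here is a rough sketch of how that example might be wrapped: client threads enqueue prompts, while a single decode thread owns the model and admits new sequences between steps. It reuses `model`, `tokenizer` and `ExLlamaV2Cache` from the sketch above; `submit`, the greedy sampling and the cache size are illustrative choices, not library API:

```python
import threading, queue, time
import torch

requests = queue.Queue()   # (prompt, per-request output queue), filled from any thread
active = []                # [sequence_ids, cache, output queue, tokens generated]

def submit(prompt):
    """Called from any client thread; returns a queue that streams token ids."""
    out = queue.Queue()
    requests.put((prompt, out))
    return out

def decode_loop(max_new_tokens = 128):
    while True:
        # Admit new requests between steps, each with its own cache
        while not requests.empty():
            prompt, out = requests.get()
            ids = tokenizer.encode(prompt)
            cache = ExLlamaV2Cache(model, max_seq_len = 2048)
            model.forward(ids[:, :-1], cache, preprocess_only = True)
            active.append([ids, cache, out, 0])
        if not active:
            time.sleep(0.01)
            continue

        # One step for all active sequences: batched linears, per-cache attention
        inputs = torch.cat([ids[:, -1:] for ids, _, _, _ in active], dim = 0)
        logits = model.forward(inputs, [c for _, c, _, _ in active])

        finished = []
        for i, entry in enumerate(active):
            token = logits[i, -1].argmax().item()     # greedy; swap in real sampling
            entry[2].put(token)                       # stream the token to the caller
            entry[0] = torch.cat((entry[0], torch.tensor([[token]])), dim = 1)
            entry[3] += 1
            if token == tokenizer.eos_token_id or entry[3] >= max_new_tokens:
                entry[2].put(None)                    # end-of-stream marker
                finished.append(entry)
        for entry in finished:
            active.remove(entry)

threading.Thread(target = decode_loop, daemon = True).start()
```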
Hey @turboderp, do you have any updates regarding this feature? 😄
I plan to look into paged attention at some point, but overall I never intended for ExLlama to be an efficient backend for large deployments. |
I agree that ExLlama was not intended for deployments. But given that it's used by almost everyone working with quantization, and given its throughput and memory efficiency, it would be great to add support for deployments (queueing, paged attention, parallel processing).
All of the other deployment stacks are using the ExLlamaV2 GPTQ kernels, and some support EXL2.
Seeing as this is being built from the ground up, I was wondering if it's possible to implement something similar to ggerganov/llama.cpp#3228, where parallel inference is natively possible.