Parallel decoding #95
Is this any different from batched inference? That's already supported.
Yes, it's slightly different: in batched inference all requests have to be sent at the same time, whereas in parallel decoding each request can start at a different time, end at a different time and be sent back independently. This is very similar to how vLLM and paged attention work. It also makes it possible to implement StreamingLLM.
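To make the distinction concrete, here is a rough, framework-agnostic sketch of a continuous-decoding loop where requests join and leave between steps. `Request`, `decode_step` and `EOS_TOKEN_ID` are hypothetical placeholders for illustration, not ExLlamaV2 API:

```python
import queue
import time
from dataclasses import dataclass, field

EOS_TOKEN_ID = 2  # hypothetical end-of-sequence token id

@dataclass
class Request:                        # hypothetical per-request state
    prompt_ids: list
    output_ids: list = field(default_factory=list)

def decode_step(requests):
    """Placeholder for one forward pass producing one new token per active sequence."""
    raise NotImplementedError

incoming = queue.Queue()              # clients can enqueue requests at any time
active = []

# Static batching: all requests must be present up front and the batch finishes together.
# Parallel decoding: requests are admitted and retired between individual decode steps.
while True:
    while not incoming.empty():
        active.append(incoming.get())      # a new request joins mid-generation
    if not active:
        time.sleep(0.01)
        continue
    new_tokens = decode_step(active)       # one token per active sequence
    for req, tok in zip(active, new_tokens):
        req.output_ids.append(tok)
    # finished sequences leave immediately; the rest keep decoding
    active = [r for r in active if r.output_ids[-1] != EOS_TOKEN_ID]
```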
See a working example of the multi-request handling here.
This is actually also implemented, there just isn't a manager for it yet. But you can start an inference and then at any point start another one, supplying a second cache (of any length) and then running the two inferences in parallel. Linear layers will be batched, and attention will be performed separately on each sequence. There isn't full paged-attention-style mapping of cache chunks between the sequences, though. And I'm not sure how it relates to StreamingLLM (which so far I'm not really sold on either), but it is a feature I plan to expand on eventually, when there are fewer than a hundred items on the to-do list. Hopefully Torch will have finalized nested tensors by then.
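A minimal sketch of that pattern, assuming `model.forward` accepts a list of per-sequence `ExLlamaV2Cache` objects as the linked example suggests. The model path, cache sizes and keyword arguments are placeholders and may differ between versions:

```python
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/model"                    # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)

# First request: its own cache, prefilled up to the last prompt token
ids_a = tokenizer.encode("Once upon a time")
cache_a = ExLlamaV2Cache(model, max_seq_len = 256)
model.forward(ids_a[:, :-1], cache_a, preprocess_only = True)

# A second request arrives later, with a separate cache of a different length
ids_b = tokenizer.encode("The capital of France is")
cache_b = ExLlamaV2Cache(model, max_seq_len = 512)
model.forward(ids_b[:, :-1], cache_b, preprocess_only = True)

# One decoding step for both: the last token of each sequence is stacked into
# a single batch, so linear layers run batched while attention uses each
# sequence's own cache
inputs = torch.cat((ids_a[:, -1:], ids_b[:, -1:]), dim = 0)
logits = model.forward(inputs, [cache_a, cache_b])
next_a = logits[0, -1].argmax().item()
next_b = logits[1, -1].argmax().item()
```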
Ah, I see. Thanks. My primary use case for this would be a server, hence the on-demand starting and stopping.
Does this mean I can have multiple threads start inference with different caches on the same model object, and these concurrent inference requests will be batched under the hood resulting in increased total throughput? |
Yes, pretty much. I haven't tested it much, so I guess it's slightly unfinished, mostly because I got distracted by other things, but also because I kind of want to see how nested tensors turn out in Torch when that feature is finalized. If nested tensors end up working the way I hope, it would change how I'd approach the multi-cache stuff. I've added an example of how you might use the feature as it currently exists. It implements a very basic generator with sampling (but without the more advanced features like healing and stop strings) that you could build on to allow concurrent requests in some sort of streaming backend.
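For the concurrent-request/server use case discussed above, here is a rough sketch of how that example might be wrapped: client threads enqueue prompts, while a single decode thread owns the model and admits new sequences between steps. It reuses `model`, `tokenizer` and `ExLlamaV2Cache` from the sketch above; `submit`, the greedy sampling and the cache size are illustrative choices, not library API:

```python
import threading, queue, time
import torch

requests = queue.Queue()   # (prompt, per-request output queue), filled from any thread
active = []                # [sequence_ids, cache, output queue, tokens generated]

def submit(prompt):
    """Called from any client thread; returns a queue that streams token ids."""
    out = queue.Queue()
    requests.put((prompt, out))
    return out

def decode_loop(max_new_tokens = 128):
    while True:
        # Admit new requests between steps, each with its own cache
        while not requests.empty():
            prompt, out = requests.get()
            ids = tokenizer.encode(prompt)
            cache = ExLlamaV2Cache(model, max_seq_len = 2048)
            model.forward(ids[:, :-1], cache, preprocess_only = True)
            active.append([ids, cache, out, 0])
        if not active:
            time.sleep(0.01)
            continue

        # One step for all active sequences: batched linears, per-cache attention
        inputs = torch.cat([ids[:, -1:] for ids, _, _, _ in active], dim = 0)
        logits = model.forward(inputs, [c for _, c, _, _ in active])

        finished = []
        for i, entry in enumerate(active):
            token = logits[i, -1].argmax().item()     # greedy; swap in real sampling
            entry[2].put(token)                       # stream the token to the caller
            entry[0] = torch.cat((entry[0], torch.tensor([[token]])), dim = 1)
            entry[3] += 1
            if token == tokenizer.eos_token_id or entry[3] >= max_new_tokens:
                entry[2].put(None)                    # end-of-stream marker
                finished.append(entry)
        for entry in finished:
            active.remove(entry)

threading.Thread(target = decode_loop, daemon = True).start()
```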
Hey @turboderp, do you have any updates regarding this feature? 😄
I plan to look into paged attention at some point, but overall I never intended for ExLlama to be an efficient backend for large deployments. |
I agree that ExLlama was not intended for deployments. But given that it's used by almost everyone working with quantization, and given its throughput and memory efficiency, it would be great to add support for deployments (queueing, paged attention, parallel processing).
All of the other deployment stacks are using the ExLlamaV2 GPTQ kernels, and some support EXL2.
Seeing as this is being built from the ground up, I was wondering if it's possible to implement something similar to ggerganov/llama.cpp#3228, where parallel inference is natively possible.