Feature Request: Ability to cancel during prompt processing (llama_decode) #10509
Closed
Feature Description
Currently, there is no way to cancel while a prompt is being processed through llama.cpp. For a large prompt, the user must wait for the entire prompt to be processed before cancellation can occur. This bottleneck happens when calling llama_decode on the ctx.

This is being proposed for the server via #9679. However, I believe this behavior should be in the core library as well.
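For context, a minimal sketch of where the bottleneck sits, assuming ctx, tokens, and n_tokens already exist (exact batch helpers vary across llama.cpp versions):

```c
#include "llama.h"

// Sketch only: assumes `ctx` is a valid llama_context and `tokens`/`n_tokens`
// hold an already-tokenized (potentially very large) prompt.
int process_prompt(struct llama_context * ctx, llama_token * tokens, int32_t n_tokens) {
    // Wrap the whole prompt into a single batch.
    struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);

    // This call evaluates the entire batch before returning. There is no
    // point inside it where the caller can request cancellation, so a large
    // prompt blocks until decoding finishes.
    return llama_decode(ctx, batch);
}
```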
Motivation
API servers generally handle many requests at a time (as evidenced by llama-server), so there should be a way to abort a request at any point. The main bottleneck that cannot easily be aborted is decoding. Because of this bottleneck, a desync (race) arises between the server and the client, which causes a segfault and crash once another request is sent.

In addition, the lack of processing cancellation increases load on the system even with a batching server, because extra resources are spent on a request that has already been cancelled.
A cursory look over the llama-cpp-python repo shows that others have the same problem:
Possible Implementation
Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a cancellation callback, similar to the one in llama_model_params. Other opinions are welcome.
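As a rough illustration of the callback idea, here is a sketch modeled on the progress_callback in llama_model_params, where returning false aborts the operation. The names llama_decode_cancel_callback, cancel_callback, and cancel_callback_data are hypothetical and not part of the current API:

```c
#include <stdbool.h>
#include "llama.h"

// Hypothetical: return true to keep decoding, false to request cancellation.
typedef bool (*llama_decode_cancel_callback)(void * user_data);

// Hypothetical extension of the context params; only the new fields are shown.
struct llama_context_params_proposed {
    // ... existing llama_context_params fields ...
    llama_decode_cancel_callback cancel_callback;
    void *                       cancel_callback_data;
};

// Example callback: a server could flip this flag from another thread when
// the client disconnects, and llama_decode would return early.
static volatile bool g_request_cancelled = false;

static bool my_cancel_cb(void * user_data) {
    (void) user_data;
    return !g_request_cancelled; // false => request cancellation
}
```

Checking such a callback between micro-batches inside llama_decode would presumably keep overhead negligible while still letting a server abort a stale request promptly.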