
Feature Request: Ability to cancel during prompt processing (llama_decode) #10509

Closed
kingbri1 opened this issue Nov 26, 2024 · 2 comments
Labels: enhancement (New feature or request), stale

@kingbri1

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently, there is no way to cancel prompt processing in llama.cpp. With a large prompt, the user must wait for the entire prompt to be processed before cancellation can occur. The bottleneck is the call to llama_decode on the context (ctx).

This is being proposed for the server via #9679. However, I believe this capability should exist in the core library as well.
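
To make the bottleneck concrete, here is a minimal sketch (illustrative, not from the original report) of the call that blocks:

```cpp
// Minimal sketch: llama_decode() evaluates the entire batch before returning,
// so a long prompt ties up the calling thread with no way to interrupt it
// part-way through.
#include "llama.h"

int process_prompt(llama_context * ctx, llama_batch batch) {
    const int rc = llama_decode(ctx, batch);   // blocks until the whole batch is processed
    return rc;                                 // non-zero means the decode did not succeed
}
```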

Motivation

API servers generally handle many requests at a time (as evidenced by llama-server). Therefore, there should be a way to abort a request at any point. The main step that cannot easily be aborted is decoding. Because of this bottleneck, the server and client can desync (a race condition), which leads to a segfault and crash once another request is sent.

In addition, the lack of cancellation during prompt processing increases system load even with a batching server, because resources continue to be spent on a request that has already been cancelled.

A cursory look through the llama-cpp-python repo shows that others have run into the same problem.

Possible Implementation

Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a cancellation callback, similar to the one in llama_model_params. Other opinions are welcome.
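
As a sketch of what the callback approach could look like from the caller's side (assuming the existing ggml_abort_callback signature from ggml.h and the llama_set_abort_callback() hook in llama.h; the point of this issue is that every backend would need to poll it during llama_decode):

```cpp
// Sketch only: wire an abort callback to a flag the server can flip when a
// client cancels. Assumes ggml_abort_callback (return true to abort) and
// llama_set_abort_callback() from the public headers are the hook points.
#include <atomic>
#include "llama.h"

static std::atomic<bool> g_cancel_requested{false};

// Polled by the backend during graph computation; returning true requests an abort.
static bool decode_abort_cb(void * /*user_data*/) {
    return g_cancel_requested.load(std::memory_order_relaxed);
}

void attach_cancellation(llama_context * ctx) {
    llama_set_abort_callback(ctx, decode_abort_cb, nullptr);
}

// Called from another thread (e.g. when the HTTP client disconnects).
void request_cancel() {
    g_cancel_requested.store(true, std::memory_order_relaxed);
}
```

With something like this in place, llama_decode() could return early with a non-zero status once the callback fires, instead of running the full prompt to completion.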

kingbri1 added the enhancement (New feature or request) label on Nov 26, 2024
@kingbri1 (Author)

After digging deeper into the code, it looks like ggml_abort_callback is what's required here. However, from what I've seen, it is only honored by the CPU and Metal backends.

Implementing it for all GGML backends would allow this issue to be closed.
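
For reference, the pattern a backend would need is small: poll the callback between graph nodes and return an aborted status. The sketch below is hypothetical (the function and dispatch names are placeholders; only ggml_abort_callback and enum ggml_status come from ggml.h):

```cpp
#include "ggml.h"

// Hypothetical backend compute loop honoring the abort callback. Only
// ggml_abort_callback and enum ggml_status are real ggml types here.
static enum ggml_status backend_compute_with_abort(
        int                 n_nodes,
        ggml_abort_callback abort_callback,
        void *              abort_callback_data) {
    for (int i = 0; i < n_nodes; i++) {
        if (abort_callback && abort_callback(abort_callback_data)) {
            return GGML_STATUS_ABORTED;   // stop between nodes on request
        }
        // dispatch_node(i);              // backend-specific kernel launch (placeholder)
    }
    return GGML_STATUS_SUCCESS;
}
```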


This issue was closed because it has been inactive for 14 days since being marked as stale.
