
Feature Request: Ability to cancel during prompt processing (llama_decode) #10509

Closed
kingbri1 opened this issue Nov 26, 2024 · 2 comments
Labels: enhancement (New feature or request), stale

@kingbri1

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Currently, there is no way to cancel prompt processing in llama.cpp. With a large prompt, the user must wait for the entire prompt to be processed before cancellation can occur. The bottleneck is the call to llama_decode on the context (ctx).

This is being proposed for the server via #9679. However, I believe this capability should exist in the core library as well.
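
To make the bottleneck concrete, here is a minimal sketch (illustrative, not from the original report) of the call that blocks:

```cpp
// Minimal sketch: llama_decode() evaluates the entire batch before returning,
// so a long prompt ties up the calling thread with no way to interrupt it
// part-way through.
#include "llama.h"

int process_prompt(llama_context * ctx, llama_batch batch) {
    const int rc = llama_decode(ctx, batch);   // blocks until the whole batch is processed
    return rc;                                 // non-zero means the decode did not succeed
}
```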

Motivation

API servers generally handle many requests at a time (as evidenced by llama-server). Therefore, there should be a way to abort a request at any point. The main step that cannot easily be aborted is decoding. Because of this bottleneck, the server and client can desync (a race condition), which leads to a segfault and crash once another request is sent.

In addition, the lack of cancellation during prompt processing increases system load even with a batching server, because resources continue to be spent on a request that has already been cancelled.

A cursory look through the llama-cpp-python repo shows that others have run into the same problem.

Possible Implementation

Issue 313 in llama-cpp-python suggests using signals, which is a valid option. Another method would be a cancellation callback, similar to the one in llama_model_params. Other opinions are welcome.
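
As a sketch of what the callback approach could look like from the caller's side (assuming the existing ggml_abort_callback signature from ggml.h and the llama_set_abort_callback() hook in llama.h; the point of this issue is that every backend would need to poll it during llama_decode):

```cpp
// Sketch only: wire an abort callback to a flag the server can flip when a
// client cancels. Assumes ggml_abort_callback (return true to abort) and
// llama_set_abort_callback() from the public headers are the hook points.
#include <atomic>
#include "llama.h"

static std::atomic<bool> g_cancel_requested{false};

// Polled by the backend during graph computation; returning true requests an abort.
static bool decode_abort_cb(void * /*user_data*/) {
    return g_cancel_requested.load(std::memory_order_relaxed);
}

void attach_cancellation(llama_context * ctx) {
    llama_set_abort_callback(ctx, decode_abort_cb, nullptr);
}

// Called from another thread (e.g. when the HTTP client disconnects).
void request_cancel() {
    g_cancel_requested.store(true, std::memory_order_relaxed);
}
```

With something like this in place, llama_decode() could return early with a non-zero status once the callback fires, instead of running the full prompt to completion.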

kingbri1 added the enhancement (New feature or request) label on Nov 26, 2024
@kingbri1 (Author)

After digging deeper into the code, it looks like ggml_abort_callback is what's required here. However, from what I've seen, it is only honored by the CPU and Metal backends.

Implementing it for all GGML backends would allow this issue to be closed.
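
For reference, the pattern a backend would need is small: poll the callback between graph nodes and return an aborted status. The sketch below is hypothetical (the function and dispatch names are placeholders; only ggml_abort_callback and enum ggml_status come from ggml.h):

```cpp
#include "ggml.h"

// Hypothetical backend compute loop honoring the abort callback. Only
// ggml_abort_callback and enum ggml_status are real ggml types here.
static enum ggml_status backend_compute_with_abort(
        int                 n_nodes,
        ggml_abort_callback abort_callback,
        void *              abort_callback_data) {
    for (int i = 0; i < n_nodes; i++) {
        if (abort_callback && abort_callback(abort_callback_data)) {
            return GGML_STATUS_ABORTED;   // stop between nodes on request
        }
        // dispatch_node(i);              // backend-specific kernel launch (placeholder)
    }
    return GGML_STATUS_SUCCESS;
}
```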


This issue was closed because it has been inactive for 14 days since being marked as stale.
