
server: cancel prompt processing & non-streamed requests when connection closed #9679

Closed
wants to merge 19 commits into from

Conversation

ochafik
Collaborator

@ochafik ochafik commented Sep 29, 2024

(superseded by #11285)

WIP: server tests still not green

Fixes #9273 (and supersedes #6941 ) & #6421

Connection liveness is checked in server_context::process_token & server_context::update_slots using a new Response::is_alive() proposed in httplib.

This makes the non-streamed handling of completions use a Sink-based API (set_content_provider), similar to the streaming path, which allows polling for the connection status (the sink.is_alive() proposed in yhirose/cpp-httplib#1956, exposed via a slot.is_alive function).

This allows cancellation of token generation & prompt processing (in update_slots, at batch granularity).
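The cancellation flow described above can be sketched, in very simplified form, as a token loop that polls a liveness callback at per-token (or per-batch) granularity. This is a hypothetical stand-in, not the PR's actual code: `generate_tokens` and the `is_alive` std::function stub stand in for the server loop and the proposed httplib sink/response liveness check.

```cpp
#include <functional>

// Hypothetical sketch: decode up to max_tokens, but poll `is_alive`
// before each token so a closed connection cancels generation early.
// In the actual PR the check wraps the proposed httplib liveness API;
// here it is just a std::function stub.
static int generate_tokens(int max_tokens, const std::function<bool()> & is_alive) {
    int produced = 0;
    for (int i = 0; i < max_tokens; i++) {
        if (!is_alive()) {
            break; // client went away: free the slot instead of finishing
        }
        produced++; // stand-in for decoding one token
    }
    return produced;
}
```

With a callback that starts returning false after a few calls, generation stops at that point instead of running to max_tokens.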

Usage:

# Terminal 1
./llama-server -fa \
    -hfr lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF -hff Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
# Terminal 2

# Long token generation
curl http://localhost:8080/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "Write a long novel about AIs."}]}'
# Then Ctrl+C: server immediately goes back to "update_slots: all slots are idle" in Terminal 1

# Long prompt processing
curl http://localhost:8080/v1/completions \
  -d "$( python -c "import json; print(json.dumps(dict(prompt='abc' * 30000)))" )"

@github-actions github-actions bot added examples python python script changes server labels Sep 29, 2024
@@ -1117,6 +1120,13 @@ struct server_context {
}

bool process_token(completion_token_output & result, server_slot & slot) {
if (!slot.is_alive()) {
Collaborator

We're calling sink.is_writable() from another thread (other than the HTTP thread). Unless the sink object is thread-safe, I don't think this is a safe way to do it.

Collaborator

One possible solution could be to introduce thread-safety support into httplib itself, but I'm not sure how complex that is.

Collaborator Author

Great point, asked on yhirose/cpp-httplib#1952 :-)

Collaborator Author

@yhirose's detailed answer suggests this is probably fine. I'd suggest giving it an optimistic try and watching out for crash reports (we could also hide this behind a define, but the lack of discoverability would reduce feedback opportunities).

Incidentally, I'm keen to explore fuzzing tests at some point, maybe in conjunction w/ k6 / bench (the server would benefit from a bit more hardening)

Collaborator Author

the server would benefit from a bit more hardening

(just discovered we set CPPHTTPLIB_NO_EXCEPTIONS in debug, maybe some of the crashes I've seen were fine haha)

Collaborator

@ngxson ngxson Oct 2, 2024

As someone who often works on systems programming, I think this must be tested very seriously before merging, rather than waiting for bug reports. Thread-safety issues are very nasty to debug: bugs related to thread safety are usually not consistent, so many cases get misdiagnosed.

Furthermore, we don't know whether sink.is_writable() is really atomic on platforms other than Linux/Unix. Also, because sink.is_writable() makes poll and select syscalls under the hood, it can potentially stall the inference thread and thus negatively impact performance.

Collaborator

@ngxson ngxson Oct 2, 2024

Also worth noting: thread-safety violations do not always result in crashes.

In the worst-case scenario, if a data race happens somewhere, you could end up with weird behaviors, for example interlaced output (just take this as an example; it can't happen on a socket in real life, but you get the idea).

Collaborator Author

Opened yhirose/cpp-httplib#1956 to maybe introduce an is_alive checker that might be lighter-weight, as you suggested in the other thread. If it goes through, we could use it on platforms where we're happy with the thread-safety guarantee 🤞

@github-actions github-actions bot added the testing Everything test related label Oct 4, 2024
@ochafik
Collaborator Author

ochafik commented Oct 4, 2024

Added prompt processing cancellation (is_alive check in server_context::update_slots)

@ochafik ochafik changed the title server: cancel non-streamed requests w/ closed connection server: cancel prompt processing & non-streamed requests when connection closed Oct 4, 2024
@ngxson
Collaborator

ngxson commented Oct 4, 2024

I'm re-thinking the life-cycle of the sink object. The biggest risk now is what happens (and whether it's possible) if the sink object is already destroyed by the HTTP thread, but is then called via slot.is_alive(). In that case, it may result in a segfault.

Another way, with stronger guarantees, is to delegate this check to queue_results instead of slot, because it's guaranteed that the HTTP thread calls remove_waiting_task_id(s) before deleting itself.

So a plan would be:

  1. add_waiting_tasks / add_waiting_task_id can optionally accept a sink
  2. queue_results.can_send(int task_id) can be called to check if the task_id is still available for sending. It must check that the task_id is in waiting_task_ids AND sink.is_alive() == true
  3. remove_waiting_task_id(s) must remove both task_id from waiting_task_ids and the sink associated to this task

In the future, if this causes performance trouble, we could easily spin up a thread inside queue_results to check the sink.is_alive() status.
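The three steps above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the struct name `server_response_sketch` is invented, and the sink is modeled as a plain std::function<bool()> instead of an httplib sink, but the method names mirror the plan (add_waiting_task_id, can_send, remove_waiting_task_id), all guarded by one mutex.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sketch: queue_results owns the task_id -> sink mapping,
// protected by its own mutex, so the inference thread never touches a
// sink after the HTTP thread has deregistered the task.
struct server_response_sketch {
    std::mutex mtx;
    std::unordered_set<int> waiting_task_ids;
    std::unordered_map<int, std::function<bool()>> sinks;

    // Step 1: registering a waiting task can optionally carry a sink.
    void add_waiting_task_id(int task_id, std::function<bool()> is_alive = nullptr) {
        std::lock_guard<std::mutex> lock(mtx);
        waiting_task_ids.insert(task_id);
        if (is_alive) sinks[task_id] = std::move(is_alive);
    }

    // Step 2: sendable only if still waiting AND the sink (if any)
    // still reports the connection as alive.
    bool can_send(int task_id) {
        std::lock_guard<std::mutex> lock(mtx);
        if (waiting_task_ids.count(task_id) == 0) return false;
        auto it = sinks.find(task_id);
        return it == sinks.end() || it->second();
    }

    // Step 3: removal drops both the task id and its associated sink,
    // so a destroyed HTTP handler can never be called again.
    void remove_waiting_task_id(int task_id) {
        std::lock_guard<std::mutex> lock(mtx);
        waiting_task_ids.erase(task_id);
        sinks.erase(task_id);
    }
};
```

Because the HTTP thread always calls remove_waiting_task_id before tearing down, can_send never invokes a dangling sink.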

@ochafik
Collaborator Author

ochafik commented Oct 4, 2024

The biggest risk now is what happen (and can it be possible) if sink object is already destroyed by the HTTP thread, but then it's being called by slot.is_alive(). In this case, it may result in segfault.

@ngxson Excellent point!

Another way, more guaranteed, is to delegate this check to queue_results instead of slot. This is because it's guaranteed that HTTP thread calls remove_waiting_task_id(s) before deleting itself.

Unfortunately no intermediate results are sent in non-streaming mode (nor after each batch of prompt processing).

Luckily though, we do have a resource_releaser for both streamed and non streamed branches which is called at a perfect time to mark the sink reference as radioactive (technically it's called in httplib::Response::~Response, which is called at the end of httplib::Server::process_request(Stream &strm, ...), so the stream will still be valid by then).

I already had a shared_ptr<State>; I just shoved a bool is_sink_valid in there, captured a copy of the shared ptr in the is_alive lambda, and only call sink.is_alive() on a valid sink :-) (6297d9f)
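The shared_ptr trick described above can be illustrated with a small sketch. This is hypothetical code, not the commit itself: `make_is_alive` and the `sink_is_alive` std::function are invented names standing in for the lambda and the real sink call; the point is that the lambda keeps the State alive via its captured shared_ptr, and the resource releaser only has to flip the flag.

```cpp
#include <functional>
#include <memory>

// Hypothetical sketch of the approach: the HTTP handler owns a shared
// State whose is_sink_valid flag is flipped off by the resource
// releaser (when the Response is destroyed). The is_alive lambda
// captures a copy of the shared_ptr, so calling it is always safe,
// and it only touches the sink while the flag is still true.
struct State {
    bool is_sink_valid = true;
};

static std::function<bool()> make_is_alive(std::shared_ptr<State> state,
                                           std::function<bool()> sink_is_alive) {
    return [state, sink_is_alive]() {
        // Never touch the sink once the releaser marked it invalid.
        return state->is_sink_valid && sink_is_alive();
    };
}
```

Once `state->is_sink_valid` is set to false by the releaser, the lambda short-circuits and reports the connection as dead without dereferencing the (now destroyed) sink.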

@ngxson
Collaborator

ngxson commented Oct 4, 2024

Unfortunately no intermediate results are sent in non-streaming mode (nor after each batch of prompt processing).

I'm not saying that you need to send the result; I'm talking about adding something like std::map<task_id, sink>

Upon add_waiting_task_id, the task_id --> sink mapping will be added; then, upon disconnect, remove_waiting_task_id will remove the sink from the mapping. Doing it this way has certain advantages:

  • It moves the connection state management away from slot. Indeed, slot should be responsible for managing inference-related state, not HTTP state
  • It can benefit from the std::mutex already present inside queue_results, so it's a bit more thread-safe
  • As mentioned earlier, it could provide a simple escape hatch in case we detect negative impact on performance

std::unordered_set<int> task_ids;
bool is_sink_valid = true;
};
auto state = std::make_shared<State>();
Collaborator

@ngxson ngxson Oct 4, 2024

To be honest, I would avoid using shared_ptr whenever possible. If you search Google for "why shared_ptr is bad", you will see why.

@ngxson
Collaborator

ngxson commented Oct 4, 2024

I'm not saying that you need to send the result, but I'm talking about adding something like std::map<task_id, sink>

To illustrate what I mean, I drafted this PR: ngxson@2e1c355

(Keep in mind that it's a non-working draft, just to show how it can be implemented)

@ochafik
Collaborator Author

ochafik commented Oct 5, 2024

The first approach was fatally flawed: with set_content_provider, the status & headers are always sent first, so it couldn't propagate failures that only occur at the end of a non-streamed request (e.g. "prompt too long"). I've updated the httplib patch to add Response::is_alive(), so the change to server.cpp is much simpler now.

@ochafik
Collaborator Author

ochafik commented Oct 5, 2024

To illustrate what I mean, I drafted this PR: ngxson@2e1c355

Thanks a lot for crafting this! I'll try and digest + integrate this in the coming days.

I've just switched gears re/ set_content_provider (that one red ctx_shift test revealed a core issue!), hope to get some positive signal from httplib's maintainer before committing too much more work :-)

@jurov

jurov commented Nov 2, 2024

Excuse my stupid suggestion... could you send progress info after every batch instead? That would vastly improve UX, and you'd get a connection-liveness check for free. I often find myself looking at the console output to check on prompt-processing progress.

@ochafik
Collaborator Author

ochafik commented Jan 22, 2025

Superseded by #11285

@ochafik ochafik closed this Jan 22, 2025
Labels
examples python python script changes server testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: (Server) Cannot properly cancel a non-stream completion request
3 participants