server: cancel prompt processing & non-streamed requests when connection closed #9679
examples/server/server.cpp
Outdated
@@ -1117,6 +1120,13 @@ struct server_context {
    }

    bool process_token(completion_token_output & result, server_slot & slot) {
        if (!slot.is_alive()) {
We're calling `sink.is_writable()` from another thread (other than the HTTP thread). Unless the `sink` object is thread-safe, I don't think this is safe.
One possible solution could be to introduce thread-safe support into the httplib itself, but I'm not sure how complex it is.
Great point, asked on yhirose/cpp-httplib#1952 :-)
@yhirose's detailed answer suggests this is probably fine. I'd suggest to give it an optimistic try and watch out for crash reports (we could also hide this under a define but the lack of discoverability would reduce feedback opportunities).
Incidentally, I'm keen to explore fuzzing tests at some point, maybe in conjunction w/ k6 / bench (the server would benefit from a bit more hardening)
> the server would benefit from a bit more hardening
(just discovered we set CPPHTTPLIB_NO_EXCEPTIONS in debug, maybe some of the crashes I've seen were fine haha)
As someone who often works with system programming, I think this must be tested very seriously before merging, rather than waiting for bug reports. Thread-safe related stuff is very nasty to debug. Bugs related to thread-safety are usually not consistent, thus many cases are misdiagnosed.
Furthermore, we don't know whether `sink.is_writable()` is really atomic on platforms other than Linux/Unix. Also, because `sink.is_writable()` makes `poll` and `select` syscalls under the hood, it can potentially stall the inference thread and thus hurt performance.
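To make the cost mentioned above concrete, here is a minimal POSIX-only sketch of a non-blocking writability check built on `poll()`. It is not httplib's implementation; the names `socket_is_writable_now` and `demo_socketpair_is_writable` are hypothetical and exist only for illustration. Even with a zero timeout, every call is one syscall on the calling thread.

```cpp
#include <poll.h>        // POSIX poll()
#include <sys/socket.h>  // socketpair()
#include <unistd.h>      // close()

// Hypothetical helper (not part of httplib): returns true if `fd` can accept
// a write right now and no error or hang-up is reported for it.
// A timeout of 0 makes poll() return immediately, but it is still a syscall.
inline bool socket_is_writable_now(int fd) {
    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLOUT;
    const int ready = poll(&pfd, 1, /*timeout_ms=*/0);
    if (ready <= 0) {
        return false; // 0: nothing ready yet, <0: poll() itself failed
    }
    if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
        return false; // error, hang-up, or invalid descriptor
    }
    return (pfd.revents & POLLOUT) != 0;
}

// Usage sketch: a freshly created, connected UNIX socket pair has free buffer
// space, so one end is expected to report as writable.
inline bool demo_socketpair_is_writable() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return false;
    }
    const bool writable = socket_is_writable_now(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return writable;
}
```

Calling something like this once per generated token adds one syscall per token to the inference loop, which is the overhead the comment is worried about. The sketch says nothing about Windows or about whether the underlying object may be touched from a second thread.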
Also want to note that thread-safety violations do not always result in crashes.
In the worst case, if a data race happens somewhere, you could end up with weird behavior, for example interleaved output (take this only as an illustration; it can't happen on a socket in real life, but you get the idea).
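One way to sidestep the data-race concern is to have the inference thread read only a `std::atomic<bool>` that the HTTP thread sets, rather than touching the sink itself. The sketch below assumes hypothetical names (`cancel_flag`, `generate`) and stands in for the real inference loop; it is not code from this PR.

```cpp
#include <atomic>

// Hypothetical cancellation flag shared between two threads:
// the HTTP thread calls cancel(), the inference thread calls is_cancelled().
struct cancel_flag {
    std::atomic<bool> cancelled{false};

    void cancel() { cancelled.store(true, std::memory_order_release); }
    bool is_cancelled() const { return cancelled.load(std::memory_order_acquire); }
};

// Stand-in for the inference loop: produces up to n_tokens "tokens" and stops
// early once the flag is set. Returns the number of tokens produced.
inline int generate(const cancel_flag & flag, int n_tokens) {
    int produced = 0;
    while (produced < n_tokens && !flag.is_cancelled()) {
        ++produced; // a real loop would decode one token here
    }
    return produced;
}
```

Reading an atomic costs no syscall, and its behavior across threads is defined by the C++ memory model on every platform. The trade-off is that something on the HTTP side has to notice the disconnect and call `cancel()`.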
Opened yhirose/cpp-httplib#1956 to maybe introduce an `is_alive` checker that might be lighter-weight, as you suggested in the other thread. If it goes through, we could use it on platforms where we're happy w/ the thread-safety guarantee 🤞
Added prompt generation cancelling ( |
server: cancel non-streamed requests w/ closed connection
server: cancel prompt processing & non-streamed requests when connection closed
I'm re-thinking about the life-cycle of Another way, more guaranteed, is to delegate this check to So a plan would be:
In the future, if this causes performance trouble, we could easily spin up a thread inside
@ngxson Excellent point!
Unfortunately no intermediate results are sent in non-streaming mode (nor after each batch of prompt processing). Luckily though, we do have a I already had a |
I'm not saying that you need to send the result, but I'm talking about adding something like Upon
…after status / headers sent
examples/server/server.cpp
Outdated
    std::unordered_set<int> task_ids;
    bool is_sink_valid = true;
};
auto state = std::make_shared<State>();
To be honest, I would avoid using `shared_ptr` whenever possible. If you search Google for "why shared_ptr is bad", you will see why.
To illustrate what I mean, I drafted this PR: ngxson@2e1c355 (keep in mind that it's a non-working draft, just to show how it can be implemented).
First approach was fatally flawed as (set_content_provider) is always sent after the status & headers, so couldn't propagate failures that only occur at the end of a non-streamed request (E.g. "prompt too long"). I've updated the patch to httplib to add |
Thanks a lot for crafting this! I'll try and digest + integrate this in the coming days. I've just switched gears re/ `set_content_provider` (that one red ctx_shift test revealed a core issue!), hope to get some positive signal from httplib's maintainer before committing too much more work :-)
Excuse my stupid suggestion... can you send progress info after every batch instead? It would vastly improve UX, and you'd get a connection liveness check for free. I often find myself looking at the console output to check on prompt processing progress.
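The suggestion above can be sketched as follows, assuming the response is already being written incrementally so that a failed write signals a closed connection. `write_fn` and `process_prompt` are hypothetical names, and the progress line format is invented for illustration; none of this is the server's actual API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical writer: sends a chunk to the client and returns false once the
// client is gone (the same convention as a write on a closed connection).
using write_fn = std::function<bool(const std::string &)>;

// Processes n_prompt tokens in batches of batch_size, reporting progress after
// each batch. Stops as soon as a progress write fails.
// Returns the number of prompt tokens processed.
inline std::size_t process_prompt(std::size_t n_prompt, std::size_t batch_size,
                                  const write_fn & write) {
    assert(batch_size > 0);
    std::size_t done = 0;
    while (done < n_prompt) {
        const std::size_t n = std::min(batch_size, n_prompt - done);
        // a real server would decode the batch of n tokens here
        done += n;
        const std::string msg = "progress " + std::to_string(done) + "/" +
                                std::to_string(n_prompt) + "\n";
        if (!write(msg)) {
            break; // client disconnected: skip the remaining batches
        }
    }
    return done;
}
```

With this shape, cancellation granularity equals the batch size: a disconnect is only noticed after the batch in flight has finished.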
Superseded by #11285
(superseded by #11285)
WIP: server tests still not green
Fixes #9273 (and supersedes #6941 ) & #6421
Connection liveness is checked in `server_context::process_token` & `server_context::update_slots` using a new `Response::is_alive()` proposed in httplib.

This makes the non-streamed handling of completions use a Sink-based API (`set_content_provider`) similar to the streaming path, which allows polling for connection status (proposed `sink.is_alive()` in yhirose/cpp-httplib#1956, via a `slot.is_alive` function).

This allows cancellation of token generation & (update) prompt processing (batch granularity).
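A minimal sketch of the slot-level check described above, under stated assumptions: `slot_sketch` and `process_token_sketch` are simplified, hypothetical stand-ins for `server_slot` and `server_context::process_token`, and the `is_alive` callback is assumed to be installed by the HTTP handler. The real structures in `server.cpp` carry far more state.

```cpp
#include <functional>
#include <string>

// Hypothetical, simplified stand-in for server_slot.
struct slot_sketch {
    // Installed by the HTTP handler; an empty function means there is no
    // connection to check and the slot is treated as alive.
    std::function<bool()> is_alive;
    std::string generated;
    bool cancelled = false;

    bool alive() const { return !is_alive || is_alive(); }
};

// Mirrors the idea of the early-out at the top of process_token:
// returns false (stop generating) once the connection is gone.
inline bool process_token_sketch(slot_sketch & slot, const std::string & token) {
    if (!slot.alive()) {
        slot.cancelled = true;
        return false;
    }
    slot.generated += token;
    return true;
}
```

The same `alive()` check could be placed between prompt batches to get the batch-granularity cancellation mentioned above. Whether the callback may safely be invoked off the HTTP thread depends entirely on what it wraps.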
Usage:

    # Terminal 1
    ./llama-server -fa \
        -hfr lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF -hff Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf