
server: cancel prompt processing & non-streamed requests when connection closed #9679

Closed
wants to merge 19 commits into from

Conversation

ochafik
Collaborator

@ochafik ochafik commented Sep 29, 2024

(superseded by #11285)

WIP: server tests still not green

Fixes #9273 (and supersedes #6941 ) & #6421

Connection liveness is checked in server_context::process_token & server_context::update_slots using a new Response::is_alive() proposed in httplib.

This makes the non-streamed handling of completions use a Sink-based API (set_content_provider), similar to the streaming path, which allows polling for the connection status (the sink.is_alive() proposed in yhirose/cpp-httplib#1956, exposed via a slot.is_alive function).

This allows cancellation of token generation & prompt processing (in update_slots, at batch granularity).
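The cancellation flow described above can be sketched, in very simplified form, as a token loop that polls a liveness callback at per-token (or per-batch) granularity. This is a hypothetical stand-in, not the PR's actual code: `generate_tokens` and the `is_alive` std::function stub stand in for the server loop and the proposed httplib sink/response liveness check.

```cpp
#include <functional>

// Hypothetical sketch: decode up to max_tokens, but poll `is_alive`
// before each token so a closed connection cancels generation early.
// In the actual PR the check wraps the proposed httplib liveness API;
// here it is just a std::function stub.
static int generate_tokens(int max_tokens, const std::function<bool()> & is_alive) {
    int produced = 0;
    for (int i = 0; i < max_tokens; i++) {
        if (!is_alive()) {
            break; // client went away: free the slot instead of finishing
        }
        produced++; // stand-in for decoding one token
    }
    return produced;
}
```

With a callback that starts returning false after a few calls, generation stops at that point instead of running to max_tokens.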

Usage:

# Terminal 1
./llama-server -fa \
    -hfr lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF -hff Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
# Terminal 2

# Long token generation
curl http://localhost:8080/v1/chat/completions \
  -d '{"messages": [{"role": "user", "content": "Write a long novel about AIs."}]}'
# Then Ctrl+C: server immediately goes back to "update_slots: all slots are idle" in Terminal 1

# Long prompt processing
curl http://localhost:8080/v1/completions \
  -d "$( python -c "import json; print(json.dumps(dict(prompt='abc' * 30000)))" )"

@github-actions github-actions bot added examples python python script changes server labels Sep 29, 2024
@@ -1117,6 +1120,13 @@ struct server_context {
}

bool process_token(completion_token_output & result, server_slot & slot) {
if (!slot.is_alive()) {
Collaborator

We're calling sink.is_writable() from another thread (other than the HTTP thread). Unless the sink object is thread-safe, I don't think this is a safe way to do it.

Collaborator

One possible solution could be to introduce thread-safety support into httplib itself, but I'm not sure how complex that is.

Collaborator Author

Great point, asked on yhirose/cpp-httplib#1952 :-)

Collaborator Author

@yhirose's detailed answer suggests this is probably fine. I'd suggest giving it an optimistic try and watching out for crash reports (we could also hide this behind a define, but the lack of discoverability would reduce feedback opportunities).

Incidentally, I'm keen to explore fuzzing tests at some point, maybe in conjunction w/ k6 / bench (the server would benefit from a bit more hardening)

Collaborator Author

the server would benefit from a bit more hardening

(just discovered we set CPPHTTPLIB_NO_EXCEPTIONS in debug, maybe some of the crashes I've seen were fine haha)

Collaborator

@ngxson ngxson Oct 2, 2024

As someone who often works on systems programming, I think this must be tested very seriously before merging, rather than waiting for bug reports. Thread-safety issues are very nasty to debug: bugs related to thread safety are usually not consistent, so many cases get misdiagnosed.

Furthermore, we don't know whether sink.is_writable() is really atomic on platforms other than Linux/Unix. Also, because sink.is_writable() makes poll and select syscalls under the hood, it can potentially stall the inference thread and thus negatively impact performance.

Collaborator

@ngxson ngxson Oct 2, 2024

Also worth noting: thread-safety violations do not always result in crashes.

In the worst-case scenario, if a data race happens somewhere, you could end up with weird behaviors, for example interlaced output (just take this as an example; it can't happen on a socket in real life, but you get the idea).

Collaborator Author

Opened yhirose/cpp-httplib#1956 to maybe introduce an is_alive checker that might be lighter-weight, as you suggested in the other thread. If it goes through, we could use it on platforms where we're happy with the thread-safety guarantee 🤞

@github-actions github-actions bot added the testing Everything test related label Oct 4, 2024
@ochafik
Collaborator Author

ochafik commented Oct 4, 2024

Added prompt processing cancellation (is_alive check in server_context::update_slots)

@ochafik ochafik changed the title server: cancel non-streamed requests w/ closed connection server: cancel prompt processing & non-streamed requests when connection closed Oct 4, 2024
@ngxson
Collaborator

ngxson commented Oct 4, 2024

I'm re-thinking the life-cycle of the sink object. The biggest risk now is what happens (and whether it's possible) if the sink object is already destroyed by the HTTP thread, but is then called via slot.is_alive(). In that case, it may result in a segfault.

Another way, with stronger guarantees, is to delegate this check to queue_results instead of slot, because it's guaranteed that the HTTP thread calls remove_waiting_task_id(s) before deleting itself.

So a plan would be:

  1. add_waiting_tasks / add_waiting_task_id can optionally accept a sink
  2. queue_results.can_send(int task_id) can be called to check if the task_id is still available for sending. It must check that the task_id is in waiting_task_ids AND sink.is_alive() == true
  3. remove_waiting_task_id(s) must remove both task_id from waiting_task_ids and the sink associated to this task

In the future, if this causes performance trouble, we could easily spin up a thread inside queue_results to check the sink.is_alive() status.
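The three steps above could be sketched roughly as follows. This is a hypothetical illustration, not the PR's code: the struct name `server_response_sketch` is invented, and the sink is modeled as a plain std::function<bool()> instead of an httplib sink, but the method names mirror the plan (add_waiting_task_id, can_send, remove_waiting_task_id), all guarded by one mutex.

```cpp
#include <functional>
#include <mutex>
#include <unordered_map>
#include <unordered_set>

// Hypothetical sketch: queue_results owns the task_id -> sink mapping,
// protected by its own mutex, so the inference thread never touches a
// sink after the HTTP thread has deregistered the task.
struct server_response_sketch {
    std::mutex mtx;
    std::unordered_set<int> waiting_task_ids;
    std::unordered_map<int, std::function<bool()>> sinks;

    // Step 1: registering a waiting task can optionally carry a sink.
    void add_waiting_task_id(int task_id, std::function<bool()> is_alive = nullptr) {
        std::lock_guard<std::mutex> lock(mtx);
        waiting_task_ids.insert(task_id);
        if (is_alive) sinks[task_id] = std::move(is_alive);
    }

    // Step 2: sendable only if still waiting AND the sink (if any)
    // still reports the connection as alive.
    bool can_send(int task_id) {
        std::lock_guard<std::mutex> lock(mtx);
        if (waiting_task_ids.count(task_id) == 0) return false;
        auto it = sinks.find(task_id);
        return it == sinks.end() || it->second();
    }

    // Step 3: removal drops both the task id and its associated sink,
    // so a destroyed HTTP handler can never be called again.
    void remove_waiting_task_id(int task_id) {
        std::lock_guard<std::mutex> lock(mtx);
        waiting_task_ids.erase(task_id);
        sinks.erase(task_id);
    }
};
```

Because the HTTP thread always calls remove_waiting_task_id before tearing down, can_send never invokes a dangling sink.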

@ochafik
Collaborator Author

ochafik commented Oct 4, 2024

The biggest risk now is what happen (and can it be possible) if sink object is already destroyed by the HTTP thread, but then it's being called by slot.is_alive(). In this case, it may result in segfault.

@ngxson Excellent point!

Another way, more guaranteed, is to delegate this check to queue_results instead of slot. This is because it's guaranteed that HTTP thread calls remove_waiting_task_id(s) before deleting itself.

Unfortunately no intermediate results are sent in non-streaming mode (nor after each batch of prompt processing).

Luckily though, we do have a resource_releaser for both streamed and non streamed branches which is called at a perfect time to mark the sink reference as radioactive (technically it's called in httplib::Response::~Response, which is called at the end of httplib::Server::process_request(Stream &strm, ...), so the stream will still be valid by then).

I already had a shared_ptr<State>; I just shoved a bool is_sink_valid in there, captured a copy of the shared ptr in the is_alive lambda, and only call sink.is_alive() on a valid sink :-) (6297d9f)
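The shared_ptr trick described above can be illustrated with a small sketch. This is hypothetical code, not the commit itself: `make_is_alive` and the `sink_is_alive` std::function are invented names standing in for the lambda and the real sink call; the point is that the lambda keeps the State alive via its captured shared_ptr, and the resource releaser only has to flip the flag.

```cpp
#include <functional>
#include <memory>

// Hypothetical sketch of the approach: the HTTP handler owns a shared
// State whose is_sink_valid flag is flipped off by the resource
// releaser (when the Response is destroyed). The is_alive lambda
// captures a copy of the shared_ptr, so calling it is always safe,
// and it only touches the sink while the flag is still true.
struct State {
    bool is_sink_valid = true;
};

static std::function<bool()> make_is_alive(std::shared_ptr<State> state,
                                           std::function<bool()> sink_is_alive) {
    return [state, sink_is_alive]() {
        // Never touch the sink once the releaser marked it invalid.
        return state->is_sink_valid && sink_is_alive();
    };
}
```

Once `state->is_sink_valid` is set to false by the releaser, the lambda short-circuits and reports the connection as dead without dereferencing the (now destroyed) sink.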

@ngxson
Collaborator

ngxson commented Oct 4, 2024

Unfortunately no intermediate results are sent in non-streaming mode (nor after each batch of prompt processing).

I'm not saying that you need to send the result; I'm talking about adding something like std::map<task_id, sink>

Upon add_waiting_task_id, the task_id --> sink mapping will be added; then, upon disconnect, remove_waiting_task_id will remove the sink from the mapping. Doing it this way has certain advantages:

  • It moves the connection state management away from slot. Indeed, slot should be responsible for managing inference-related state, not HTTP state
  • It can benefit from the std::mutex already present inside queue_results, so it's a bit more thread-safe
  • As mentioned earlier, it could provide a simple escape hatch in case we detect negative impact on performance

std::unordered_set<int> task_ids;
bool is_sink_valid = true;
};
auto state = std::make_shared<State>();
Collaborator

@ngxson ngxson Oct 4, 2024

To be honest, I would avoid using shared_ptr whenever possible. If you search Google for "why shared_ptr is bad", you will see why.

@ngxson
Collaborator

ngxson commented Oct 4, 2024

I'm not saying that you need to send the result, but I'm talking about adding something like std::map<task_id, sink>

To illustrate what I mean, I drafted this PR: ngxson@2e1c355

(Keep in mind that it's a non-working draft, just to show how it can be implemented)

@ochafik
Collaborator Author

ochafik commented Oct 5, 2024

The first approach was fatally flawed: with set_content_provider, the status & headers are always sent first, so it couldn't propagate failures that only occur at the end of a non-streamed request (e.g. "prompt too long"). I've updated the httplib patch to add Response::is_alive(), so the change to server.cpp is much simpler now.

@ochafik
Collaborator Author

ochafik commented Oct 5, 2024

To illustrate what I mean, I drafted this PR: ngxson@2e1c355

Thanks a lot for crafting this! I'll try and digest + integrate this in the coming days.

I've just switched gears re/ set_content_provider (that one red ctx_shift test revealed a core issue!), hope to get some positive signal from httplib's maintainer before committing too much more work :-)

@jurov

jurov commented Nov 2, 2024

Excuse my stupid suggestion... could you send progress info after every batch instead? That would vastly improve UX, and you'd get a connection-liveness check for free. I often find myself looking at the console output to check on prompt-processing progress.

@ochafik
Collaborator Author

ochafik commented Jan 22, 2025

Superseded by #11285

@ochafik ochafik closed this Jan 22, 2025
Labels
examples python python script changes server testing Everything test related
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: (Server) Cannot properly cancel a non-stream completion request
3 participants