Server: Improve work queue stability #5710
Conversation
@phymbert Seems like your fix for the […]. I'll wait for you to merge the 2 PRs you mentioned (whenever you want), then I'll update mine with the main branch (expected to have some conflicts, but not a problem for me).
slot.n_past_se += n_tokens;
// TODO @ngxson: What happens if we're retrying with a smaller n_batch?
// By the second time we retry, "grp_attn_shift" has already been called
slot.grp_attn_shift(ctx, n_tokens);
@ggerganov I noticed a potential bug here where llama_kv_cache_seq_shift and llama_kv_cache_seq_div may be called multiple times when we retry llama_decode with a different batch size. Can you please have a look to see if that's the case? Thanks.
(The group attention mechanism is still too complex for me to fully understand; I'm not sure whether what I'm doing here is correct.)
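For illustration, here is a minimal standalone sketch of the concern. All names (try_decode, grp_attn_shift, shift_applied) and the numbers are invented stand-ins, not the real server code or llama.cpp API: if the shift runs inside the retry path, it gets applied once per retry, and a per-chunk flag is one way to make sure it happens at most once.

```cpp
// Standalone sketch, not the real server code: all functions and names below
// are illustrative stand-ins. It shows how a decode retried with a smaller
// n_batch could re-run the group-attention shift, and how a per-chunk flag
// prevents that.
#include <cstdio>

static bool try_decode(int n_tokens, int n_batch) {
    // stand-in for llama_decode(); pretend only small batches succeed
    return n_batch <= 128;
}

static void grp_attn_shift(int n_tokens) {
    // stand-in for the KV-cache shift/div work done by self-extend
    std::printf("group-attention shift applied for %d tokens\n", n_tokens);
}

int main() {
    const int n_tokens      = 512;
    bool      shift_applied = false; // apply the shift at most once per chunk

    for (int n_batch = 512; n_batch >= 1; n_batch /= 2) {
        if (!shift_applied) {
            grp_attn_shift(n_tokens);
            shift_applied = true;
        }
        if (try_decode(n_tokens, n_batch)) {
            std::printf("decode succeeded with n_batch = %d\n", n_batch);
            break;
        }
        // decode failed: retry with a smaller batch, but do not shift again
    }
    return 0;
}
```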
I'll try to reimplement the self-extend logic in the coming days. Even if there is a bug here, we'll fix it in the new implementation, so don't worry about it for now.
Btw, it would be very useful to add a passkey test that works with the server with extended context. This is the command that we run using the passkey example:
make -j && ./passkey ./models/llama-7b/ggml-model-f16.gguf 250 4 50
This generates a prompt of about 6k tokens and puts a number (the "pass key") at the start. It uses self-extend with a factor of 4, so that even a 2k-context model like LLaMA v1 will be able to recall it.
The test would be too heavy for the GH CI, so it should only run locally. Probably a simple curl command that sends a prompt similar to the example I've shown above. It could even be a multi-user test, so we can check that self-extend works with more than one prompt in parallel.
cc @phymbert
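A local-only check could look roughly like the sketch below. This is only a hedged sketch, not a test that exists in this PR: the /completion endpoint, the prompt/n_predict fields and port 8080 follow the server's defaults, while the prompt body, the filler text and the pass key value are placeholders for the real ~6k-token passkey prompt.

```sh
# hypothetical local-only check (not part of this PR): start the server with
# self-extend enabled, send a long passkey-style prompt, and verify that the
# reply contains the pass key; the prompt below is only a placeholder
curl --request POST --url http://localhost:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "The pass key is 42. <about 6k tokens of filler text> What is the pass key?", "n_predict": 16}'
```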
Thanks for the confirmation. I'll leave my TODO here so that I can look into it in the future.
FYI, I also removed some changes to make this PR smaller, because initially I wanted this PR to be more about "fixing bugs" than "refactoring".
It makes the whole thing a lot clearer, thanks.
// early returning without changing the slot state will block the slot forever
// no one at the moment is checking the return value
return false;
send_error(slot, "failed processing images");
👍
Will this change be backported somewhere else?
My idea in this PR:
- Rename update_slots() to run_slots() to better reflect its function. This is the function that really does the inference.
- Fix the places in run_slots() where we call return and skip jobs of other slots; it now returns an error instead (see the sketch after this list).
- Move some logic into llama_client_slot to reduce the complexity of the code in the run_slots() function.

The only test case that currently fails is "Multi users with total number of tokens to predict exceeds the KV Cache size"; I'm still investigating it.

UPDATE: All test cases passed.
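As a rough illustration of the second point (reporting a per-slot error instead of returning early and skipping the remaining slots), here is a standalone sketch; server_slot, process_images, send_error and run_slots are simplified, invented stand-ins, not the actual server API.

```cpp
// Standalone sketch with invented, simplified types: on a per-slot failure,
// report an error for that slot and keep serving the others, instead of
// returning early and skipping the rest of the queue.
#include <cstdio>
#include <string>
#include <vector>

struct server_slot {
    int  id;
    bool has_error;
};

static bool process_images(const server_slot & slot) {
    // stand-in for the real image-processing step; pretend slot 1 fails
    return slot.id != 1;
}

static void send_error(server_slot & slot, const std::string & msg) {
    slot.has_error = true;
    std::printf("slot %d error: %s\n", slot.id, msg.c_str());
}

static void run_slots(std::vector<server_slot> & slots) {
    for (auto & slot : slots) {
        if (!process_images(slot)) {
            // an early `return` here would also skip the slots that come
            // after this one and leave the failing slot blocked
            send_error(slot, "failed processing images");
            continue; // keep processing the other slots
        }
        std::printf("slot %d processed\n", slot.id);
    }
}

int main() {
    std::vector<server_slot> slots = { {0, false}, {1, false}, {2, false} };
    run_slots(slots);
    return 0;
}
```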