Server: Improve work queue stability #5710
Conversation
@phymbert Seems like your fix for the […]. I'll wait for you to merge the 2 PRs you mentioned (whenever you want), then I'll update mine with the main branch (expected to have some conflicts, but not a problem for me).
slot.n_past_se += n_tokens;
// TODO @ngxson: What happens if we're retrying with a smaller n_batch?
// By the second time we retry, "grp_attn_shift" has already been called
slot.grp_attn_shift(ctx, n_tokens);
@ggerganov I noticed a potential bug here where llama_kv_cache_seq_shift and llama_kv_cache_seq_div may be called multiple times when we retry llama_decode with a different batch size. Can you please have a look to see if that's the case? Thanks.
(The group attention mechanism is still too complex for me to fully understand; I'm not sure whether what I'm doing here is correct.)
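For illustration, here is a minimal standalone sketch of the concern. All names (try_decode, grp_attn_shift, shift_applied) and the numbers are invented stand-ins, not the real server code or llama.cpp API: if the shift runs inside the retry path, it gets applied once per retry, and a per-chunk flag is one way to make sure it happens at most once.

```cpp
// Standalone sketch, not the real server code: all functions and names below
// are illustrative stand-ins. It shows how a decode retried with a smaller
// n_batch could re-run the group-attention shift, and how a per-chunk flag
// prevents that.
#include <cstdio>

static bool try_decode(int n_tokens, int n_batch) {
    // stand-in for llama_decode(); pretend only small batches succeed
    return n_batch <= 128;
}

static void grp_attn_shift(int n_tokens) {
    // stand-in for the KV-cache shift/div work done by self-extend
    std::printf("group-attention shift applied for %d tokens\n", n_tokens);
}

int main() {
    const int n_tokens      = 512;
    bool      shift_applied = false; // apply the shift at most once per chunk

    for (int n_batch = 512; n_batch >= 1; n_batch /= 2) {
        if (!shift_applied) {
            grp_attn_shift(n_tokens);
            shift_applied = true;
        }
        if (try_decode(n_tokens, n_batch)) {
            std::printf("decode succeeded with n_batch = %d\n", n_batch);
            break;
        }
        // decode failed: retry with a smaller batch, but do not shift again
    }
    return 0;
}
```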
I'll try to reimplement the self-extend logic in the coming days. Even if there is a bug here, we'll fix it in the new implementation, so don't worry about it for now.
Btw, it would be very useful to add a passkey test that works with the server with extended context. This is the command that we run using the passkey example:
make -j && ./passkey ./models/llama-7b/ggml-model-f16.gguf 250 4 50
This generates a prompt of about 6k tokens and puts a number (the "pass key") at the start. It uses self-extend with a factor of 4, so that even a 2k-context model like LLaMA v1 will be able to recall it.
The test would be too heavy for the GH CI, so it should only run locally. Probably a simple curl command that sends a prompt similar to the example I've shown above. It could even be a multi-user test, so we can check that self-extend works with more than one prompt in parallel.
cc @phymbert
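A local-only check could look roughly like the sketch below. This is only a hedged sketch, not a test that exists in this PR: the /completion endpoint, the prompt/n_predict fields and port 8080 follow the server's defaults, while the prompt body, the filler text and the pass key value are placeholders for the real ~6k-token passkey prompt.

```sh
# hypothetical local-only check (not part of this PR): start the server with
# self-extend enabled, send a long passkey-style prompt, and verify that the
# reply contains the pass key; the prompt below is only a placeholder
curl --request POST --url http://localhost:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "The pass key is 42. <about 6k tokens of filler text> What is the pass key?", "n_predict": 16}'
```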
Thanks for the confirmation. I'll leave my TODO here so that I can look into it in the future.
FYI, I also removed some changes to make this PR smaller, because initially I wanted this PR to be more about "fixing bugs" than "refactoring".
It makes the whole thing a lot clearer, thanks.
// early returning without changing the slot state will block the slot forever
// no one at the moment is checking the return value
return false;
send_error(slot, "failed processing images");
👍
Will this change be backported somewhere else?
My idea in this PR:
- Rename update_slots() to run_slots() to better reflect its function. This is the function that really does the inference.
- Fix the places in run_slots() where we call return and skip jobs of other slots; it now returns an error instead (see the sketch after this list).
- Move some logic into llama_client_slot to reduce the complexity of the code in the run_slots() function.

The only test case that currently fails is "Multi users with total number of tokens to predict exceeds the KV Cache size"; I'm still investigating it.

UPDATE: All test cases passed.
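As a rough illustration of the second point (reporting a per-slot error instead of returning early and skipping the remaining slots), here is a standalone sketch; server_slot, process_images, send_error and run_slots are simplified, invented stand-ins, not the actual server API.

```cpp
// Standalone sketch with invented, simplified types: on a per-slot failure,
// report an error for that slot and keep serving the others, instead of
// returning early and skipping the rest of the queue.
#include <cstdio>
#include <string>
#include <vector>

struct server_slot {
    int  id;
    bool has_error;
};

static bool process_images(const server_slot & slot) {
    // stand-in for the real image-processing step; pretend slot 1 fails
    return slot.id != 1;
}

static void send_error(server_slot & slot, const std::string & msg) {
    slot.has_error = true;
    std::printf("slot %d error: %s\n", slot.id, msg.c_str());
}

static void run_slots(std::vector<server_slot> & slots) {
    for (auto & slot : slots) {
        if (!process_images(slot)) {
            // an early `return` here would also skip the slots that come
            // after this one and leave the failing slot blocked
            send_error(slot, "failed processing images");
            continue; // keep processing the other slots
        }
        std::printf("slot %d processed\n", slot.id);
    }
}

int main() {
    std::vector<server_slot> slots = { {0, false}, {1, false}, {2, false} };
    run_slots(slots);
    return 0;
}
```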