Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Server: Improve work queue stability #5710

Closed
wants to merge 11 commits into from

Conversation

ngxson
Copy link
Collaborator

@ngxson ngxson commented Feb 25, 2024

My idea in this PR:

  • Change function name update_slots() to run_slots() to better reflect its function. This is the function that really does the inference.
  • Add documentations so that it's more clear which function does what.
  • Replace many occurences in run_slots() where we call return that skip jobs of other slots. It now returns an error.
  • Move all group attention logics to llama_client_slot, reduce complexity of code in run_slots() function

The only test case that is currently failed is "Multi users with total number of tokens to predict exceeds the KV Cache size", I'm still investigating it. UPDATE: All test cases passed

@phymbert
Copy link
Collaborator

thanks for letting me know, I will merge #5708 and #5700 as I need them for tomorrow, then I will focus on a kubernetes example.

@ngxson
Copy link
Collaborator Author

ngxson commented Feb 25, 2024

@phymbert Seems like your fix for the condition_results.notify_all does also fix the test case that I failed. Thanks for that. I'm waiting for the CI run to confirm.

I'll wait for you to merge the 2 PRs you mentioned (whenever you want), then I'll update mine with the main branch (expected to have some conflicts, but not a problem for me)

slot.n_past_se += n_tokens;
// TODO @ngxson: What happen if we're retrying with smaller n_batch?
// By the second time we retry, "grp_attn_shift" has already been called
slot.grp_attn_shift(ctx, n_tokens);
Copy link
Collaborator Author

@ngxson ngxson Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggerganov I noticed a potential bug here where llama_kv_cache_seq_shift and llama_kv_cache_seq_div may be called multiple times when we retry llama_decode with different batch size. Can you please have a look to see if that's the case? Thanks.

(The group attention mechanism is still too complex for me to really understand, I'm not sure what I'm doing here is correct or not)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to reimplement the self-extend logic in the following days. Even if there is a bug here, we'll fix it in the new implementation, so don't worry for now

Btw, it would be very useful to add a passkey test that works with server with extended context. This is the command that we run using the passkey example:

make -j && ./passkey ./models/llama-7b/ggml-model-f16.gguf 250 4 50

This generates a prompt of about ~6k tokens and puts a number (the "pass key") at the start. It uses self-extend with a factor of 4, so that even a 2k model like LLaMA v1 will be able to recall it.

The test would be too heavy for the GH CI, so it should only run locally. Probably a simple curl command that sends a similar prompt as the example I've shown above. It could even be a multi-user test, so we can test that self-extend works with more than one prompts in parallel

cc @phymbert

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the confirmation. I'll leave my TODO here so that I can look into it in the future.

Copy link
Collaborator Author

@ngxson ngxson Feb 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, I also remove some changes to make this PR smaller, because initially I wanted this PR to be more about "fixing bugs" than "refectoring"

@ngxson ngxson marked this pull request as ready for review February 25, 2024 20:51
@ngxson ngxson requested review from ggerganov and phymbert February 25, 2024 20:52
Copy link
Collaborator

@phymbert phymbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes the whole thing clearer a lot, thanks

// early returning without changing the slot state will block the slot for ever
// no one at the moment is checking the return value
return false;
send_error(slot, "failed processing images");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this change be backported somewhere else ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants