Add beam search #631
Conversation
You can pass False with an empty string.
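That works because (assuming the server's CLI converts string flag values with Python's built-in bool(), which is my reading of the behaviour described below) any non-empty string is truthy:

```python
# Naive string-to-bool conversion: any non-empty string is truthy,
# so "False" still evaluates to True; only the empty string gives False.
print(bool("False"))  # True
print(bool(""))       # False
```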
@mattpulver Guess now you can un-draft this PR.
Force-pushed from 3b3c1d6 to 3fce944.
When I add …
In the meantime I changed the default setting for …
Testing
If you would like to see how the beams evolve and their probabilities, uncomment the corresponding line in …
Resolves #184
@mattpulver great work here, I'll review this and should have it merged this week. Cheers
@abetlen Thanks. Perhaps the most intrusive integration change is changing the default value of the … parameter. This actually matches the default value in llama.cpp, so in a broader sense this makes sense IMO, but it may break some existing functionality.
@mattpulver just a quick update: I'm going to hold off on merging this until after #771, because that's going to have a big impact on how we use the llama.cpp API internally in the …
beam_search_dictionary = {}

# beam_search_callback() must flag beams when they reach end-of-sentence.
# TODO: Use stop_sequences.
Is this TODO the reason I'm seeing this in the debug output? Note the EOS and BOS. The prompt is not code-related, FWIW.
beams[0] p(0.493342787027359): <0x0A></s><s><0x0A>#include▁<iostream><0x0A>#include▁<cmath.h><0x0A>#include▁<c
beams[1] p(0.5066572427749634): <0x0A></s><s><0x0A>#include▁<iostream><0x0A>#include▁<cmath.h><0x0A>#include▁<vector
I'm not sure what you mean exactly by "Note the EOS and BOS."
The TODO note relates to the is_at_eob() function above. Currently, EOB (end-of-beam) is determined by the token returned by llama_cpp.llama_token_eos(ctx). If EOB is to be generalized to user-defined EOB sequences, then this would be the function to add the logic to.
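For reference, a minimal sketch of that kind of end-of-beam check, assuming token ids are exposed as a Python sequence (an illustration, not the exact helper from the PR):

```python
import llama_cpp

def is_at_eob(ctx, tokens) -> bool:
    # A beam is finished once its last token is the model's end-of-sentence
    # token. (In the bindings used here, llama_token_eos takes the context;
    # newer llama.cpp versions take the model instead.)
    return len(tokens) > 0 and tokens[-1] == llama_cpp.llama_token_eos(ctx)
```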
What I mean is that </s> (EOS) is generated by the model, but the beam search keeps going (onto BOS, and then it starts making up something unrelated). I think this shouldn't happen.
Thanks. To answer your original question, yes, that is exactly what the TODO is talking about.
A good follow-up item would be to add stop_sequences to the class beam_search_callback_data and set them to custom stop sequences (e.g. </s>) when the class is instantiated below. Then pass it to is_at_eob() when called from beam_search_callback().
It may require a bit more logic to accommodate the possibility of stop sequences being split across separate tokens.
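A rough sketch of that follow-up, with a hypothetical stop_sequences field and a decoded-text check layered on top of the existing EOS check (the names are illustrative, not the PR's actual fields):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BeamSearchCallbackData:
    # Hypothetical stand-in for beam_search_callback_data, carrying custom
    # end-of-beam strings, e.g. ["</s>"].
    stop_sequences: List[str] = field(default_factory=list)

def hits_stop_sequence(cb_data: BeamSearchCallbackData, beam_text: str) -> bool:
    # Check the accumulated decoded text of the beam rather than a single
    # token, because a stop sequence can be split across separate tokens.
    return any(beam_text.endswith(s) for s in cb_data.stop_sequences)

# Inside is_at_eob(), this would be OR-ed with the existing
# llama_token_eos() check before flagging the beam as finished.
```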
To be clear, this is not a custom stop sequence; it is just the regular EOS token (AFAIK), which is rendered this way in the output. You say EOB is determined by llama_token_eos, but that doesn't seem to work for me.
One way to debug this is to modify the line above,
string += llama_cpp.llama_token_get_text(ctx, beam_view.tokens[i]).decode("utf-8")
to something that appends both the numeric token id beam_view.tokens[i] and the decoded substring. If you're really encountering the llama_token_eos() token then, last I checked, it should have token id 2.
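Something along these lines, for example (the exact format is arbitrary; ctx and the token id come from the existing loop):

```python
import llama_cpp

def token_debug_str(ctx, token_id: int) -> str:
    # Render a token as "(id)text" so EOS/BOS ids are visible in the beam
    # debug output instead of being hidden by their text rendering.
    piece = llama_cpp.llama_token_get_text(ctx, token_id).decode("utf-8")
    return f"({token_id}){piece}"

# In the beam loop, replace the original append with:
#     string += token_debug_str(ctx, beam_view.tokens[i])
```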
Oops! Sorry, this one is my fault. llama_token_eos takes the model now, not ctx, and I missed that when I merged this PR into my local branch. Unfortunately, ctypes gives no indication of pointer type mismatches. Normally I would use mypy, but it seems as though llama-cpp-python is not tested against it; there are many type errors and other complaints.
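For reference, the corrected lookup on newer bindings would look roughly like this (assuming llama_get_model is exposed by llama_cpp, mirroring upstream llama.cpp):

```python
import llama_cpp

def eos_token_id(ctx) -> int:
    # Newer llama.cpp expects the model handle rather than the context,
    # so fetch the model from the context first.
    model = llama_cpp.llama_get_model(ctx)
    return llama_cpp.llama_token_eos(model)
```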
If I wait a little longer, I consistently hit the following:

// beam is not at end-of-sentence, so branch with next top_k tokens.
if (!beam.tokens.empty()) {
    llama_decode(ctx, llama_batch_get_one(beam.tokens.data(), beam.tokens.size(), n_past, 0));
}

Backtrace (line numbers may not be accurate): …

I can reproduce it within 30 seconds or so with a 7B model on CUDA on commit 1a1c3dc. On earlier commits, it seems to just hang. Fixing the EOS issue on my end does not resolve this.
Force-pushed from 8c93cf8 to cc0fe43.
Thanks @mattpulver for the PR (both here and in the llama.cpp repo). Curious if abetlen, cebtenzzre, or someone else knows if/when this will get merged? I've found beam search extremely helpful for code generation, and would love to know if it'll be supported with the main …
Hey @rishsriv, I'm still planning to merge this; however, I'm currently grinding through batch processing support first, as it requires a bunch of internal refactoring. After that I was planning on coming back and merging this. Can't give an ETA though; likely in the next few weeks if I had to guess.
Got it – thank you for the response! If there are things you think external contributors will be able to fix, please do open an issue and I'll be happy to help.
Possibly related: ggerganov/llama.cpp#6664
Any update on this?
Alright, so I copied the changes into a local clone and it seems to be working, or running at least. First it spends a crazy amount of time on …

HEAD MASTER:
llama_print_timings: load time = 154.96 ms
llama_print_timings: sample time = 1534.47 ms / 548 runs ( 2.80 ms per token, 357.13 tokens per second)
llama_print_timings: prompt eval time = 371.38 ms / 1661 tokens ( 0.22 ms per token, 4472.54 tokens per second)
llama_print_timings: eval time = 4816.73 ms / 547 runs ( 8.81 ms per token, 113.56 tokens per second)

This fork:
llama_print_timings: load time = 127.43 ms
llama_print_timings: sample time = 59429.62 ms / 1 runs (59429.62 ms per token, 0.02 tokens per second)
llama_print_timings: prompt eval time = 338.67 ms / 1661 tokens ( 0.20 ms per token, 4904.55 tokens per second)
llama_print_timings: eval time = 58430.05 ms / 6530 runs ( 8.95 ms per token, 111.76 tokens per second)

59429.62 ms per token.
Invoke by adding "beam_width": 2 (for example) to the /v1/completions POST body.
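For example, a request against a locally running server (default port 8000 assumed; the prompt and other parameters are placeholders):

```python
import requests

# Assumes the llama-cpp-python server is running locally on its default port.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "prompt": "Q: Name the planets in the solar system. A:",
        "max_tokens": 64,
        "beam_width": 2,  # parameter added by this PR
    },
)
print(resp.json()["choices"][0]["text"])
```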
This PR will be moved out of Draft mode after ggerganov/llama.cpp#2267 is merged. Closes #145 Closes #340 Closes #185
In the meantime, there is a question: how does one specify --logits_all False when invoked from the command line? … results in settings.logits_all=True on startup.