speculative: Ensure draft and target model vocab matches #3812

Merged
Changes from 1 commit
23 changes: 22 additions & 1 deletion examples/speculative/speculative.cpp
@@ -64,6 +64,26 @@ int main(int argc, char ** argv) {
    params.n_gpu_layers = params.n_gpu_layers_draft;
    std::tie(model_dft, ctx_dft) = llama_init_from_gpt_params(params);

    {
        int n_vocab_tgt = llama_n_vocab(model_tgt);
        if (n_vocab_tgt != llama_n_vocab(model_dft)) {
@ggerganov (Owner) commented on Oct 27, 2023:
This is not ideal. Codellama 7B and 13B have a vocab size of 32000, while Codellama 34B has a vocab size of 32016. It's the same vocab but with some extra tokens.

We should not disallow such cases. Maybe just print errors / warnings, but still continue?
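
Not part of this commit, but for illustration, a warn-and-continue version of the size check might look roughly like this (a sketch only, reusing the model_tgt / model_dft handles and llama_n_vocab() from the diff above):

    // Sketch only: warn on a vocab size mismatch instead of failing hard,
    // so e.g. Codellama 7B/13B (32000 tokens) can still draft Codellama 34B (32016).
    {
        const int n_vocab_tgt = llama_n_vocab(model_tgt);
        const int n_vocab_dft = llama_n_vocab(model_dft);
        if (n_vocab_tgt != n_vocab_dft) {
            fprintf(stderr, "%s: warning: target vocab size %d != draft vocab size %d - "
                            "speculation may be unreliable if the vocabs really differ\n",
                    __func__, n_vocab_tgt, n_vocab_dft);
            // no early return: keep going and let the user decide
        }
    }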

Collaborator (PR author) replied:

How about we error if the size differs by more than 100, and also check the content of min(n_vocab_tgt, n_vocab_dft) tokens? Maybe even start at 5 or something, to allow for cases where something like BOS or EOS has different content.

100 and 5 are just random numbers I plucked out of the air. The actual values can be whatever you prefer.
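
A rough sketch of that compromise, using the placeholder thresholds from this comment (100 and 5) and the llama_* accessors already used in the diff; <algorithm>, <cstdlib> and <cstring> are assumed to be included:

    // Sketch of the proposed compromise, not the final PR code.
    {
        const int n_vocab_tgt = llama_n_vocab(model_tgt);
        const int n_vocab_dft = llama_n_vocab(model_dft);

        // hard error only when the vocab sizes are wildly different
        // (100 is the placeholder threshold from the discussion)
        if (std::abs(n_vocab_tgt - n_vocab_dft) > 100) {
            fprintf(stderr, "%s: error: draft vocab size %d differs from target vocab size %d by more than 100 tokens\n",
                    __func__, n_vocab_dft, n_vocab_tgt);
            return 1;
        }

        // compare the shared part of the vocab, skipping the first few
        // entries so that BOS/EOS-style tokens are allowed to differ
        const int n_check = std::min(n_vocab_tgt, n_vocab_dft);
        for (int i = 5; i < n_check; i++) {
            const char * text_tgt = llama_token_get_text(model_tgt, i);
            const char * text_dft = llama_token_get_text(model_dft, i);
            if (std::strcmp(text_tgt, text_dft) != 0) {
                fprintf(stderr, "%s: error: token %d content differs - target '%s', draft '%s'\n",
                        __func__, i, text_tgt, text_dft);
                return 1;
            }
        }
    }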

@ggerganov (Owner) replied:

Ok, I think it would work.

Collaborator (PR author) replied:

I changed it. Also made the token content mismatch message a bit more helpful. For example, trying to use Orca 3B to draft Mistral 7B:

main: error: draft model vocab must match target model to use speculation but token 259 content differs - target '  ', draft ' t'
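
The reworded check might look roughly like this (a sketch of the message only, reusing token_text_tgt / token_text_dft from the loop in the diff below; the diff itself still shows the original commit):

    // Sketch: include both token texts so the user can see how the vocabs differ.
    if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
        fprintf(stderr, "%s: error: draft model vocab must match target model to use speculation but "
                        "token %d content differs - target '%s', draft '%s'\n",
                __func__, i, token_text_tgt, token_text_dft);
        return 1;
    }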

            fprintf(stderr, "%s: error: draft model vocab must match target model to use speculation but ", __func__);
            fprintf(stderr, "target vocab size %d does not match draft vocab size %d\n",
                    n_vocab_tgt, llama_n_vocab(model_dft));
            return 1;
        }

        for (int i = 0; i < n_vocab_tgt; i++) {
            const char * token_text_tgt = llama_token_get_text(model_tgt, i);
            const char * token_text_dft = llama_token_get_text(model_dft, i);
            if (std::strcmp(token_text_tgt, token_text_dft) != 0) {
                fprintf(stderr, "%s: error: draft model vocab must match target model to use speculation but ", __func__);
                fprintf(stderr, "token %d content differs\n", i);
                return 1;
            }
        }
    }

    // tokenize the prompt
    std::vector<llama_token> inp;
    inp = ::llama_tokenize(ctx_tgt, params.prompt, true);
@@ -227,6 +247,7 @@ int main(int argc, char ** argv) {
llama_batch_add (batch_dft, id, n_past_dft, { 0 }, true);

llama_kv_cache_seq_rm(ctx_dft, 0, n_past_dft, -1);
// LOG("dft batch: %s\n", LOG_BATCH_TOSTR_PRETTY(ctx_dft, batch_dft).c_str());
llama_decode (ctx_dft, batch_dft);

++n_past_dft;
@@ -370,7 +391,7 @@ int main(int argc, char ** argv) {
llama_kv_cache_seq_cp(ctx_tgt, 0, s, -1, -1);
}

//LOG("target batch: %s\n", LOG_BATCH_TOSTR_PRETTY(ctx_tgt, batch_tgt));
// LOG("target batch: %s\n", LOG_BATCH_TOSTR_PRETTY(ctx_tgt, batch_tgt).c_str());
llama_decode(ctx_tgt, batch_tgt);
++n_past_tgt;
}