Support GQA export, better run.c, Support tinyllama-1.1B #410
base: master
Conversation
Backported from karpathy/llama2.c#410
Current chat schemas in run.c are based on Llama 2:

```c
// render user/system prompts into the Llama 2 Chat schema
if (pos == 0 && system_prompt[0] != '\0') {
    char system_template[] = "[INST] <<SYS>>\n%s\n<</SYS>>\n\n%s [/INST]";
    sprintf(rendered_prompt, system_template, system_prompt, user_prompt);
} else {
    char user_template[] = "[INST] %s [/INST]";
    sprintf(rendered_prompt, user_template, user_prompt);
}
```

But you may want to use tinyllama's ones instead:
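For illustration only, here is a hedged sketch of rendering the Zephyr-style template documented for TinyLlama-1.1B-Chat v1.0 (the `<|system|>` / `<|user|>` / `<|assistant|>` markers and `</s>` separators come from that model card, not from this PR, and earlier TinyLlama chat checkpoints used different formats):

```c
// Sketch only (assumes the TinyLlama-1.1B-Chat v1.0 Zephyr-style template);
// mirrors the structure of the Llama 2 snippet above.
if (pos == 0 && system_prompt[0] != '\0') {
    char system_template[] = "<|system|>\n%s</s>\n<|user|>\n%s</s>\n<|assistant|>\n";
    sprintf(rendered_prompt, system_template, system_prompt, user_prompt);
} else {
    char user_template[] = "<|user|>\n%s</s>\n<|assistant|>\n";
    sprintf(rendered_prompt, user_template, user_prompt);
}
```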
In general, chat templates should be tied to the loaded pre-trained model, so maybe they should be a configuration parameter in the .bin file.
```diff
@@ -368,11 +368,12 @@ def load_hf_model(model_path):
     config.dim = hf_model.config.hidden_size
     config.n_layers = hf_model.config.num_hidden_layers
     config.n_heads = hf_model.config.num_attention_heads
-    config.n_kv_heads = hf_model.config.num_attention_heads
+    config.n_kv_heads = hf_model.config.num_key_value_heads
```
?
For an MHA model, the number of KV heads equals the number of query heads.
However, for GQA models such as Llama 2 70B and TinyLlama 1.1B, the number of KV heads and query heads differ.
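To make the distinction concrete, here is a minimal sketch (not the PR's exact code) of how GQA maps onto a per-head attention loop in the spirit of run.c: several query heads share one KV head, so query head `h` reads from KV head `h / kv_mul`. The function name and layout comment are assumptions for illustration.

```c
#include <math.h>

// Sketch: compute raw attention scores with grouped-query attention.
// Assumed key_cache layout: [timestep][n_kv_heads * head_size].
void gqa_scores(const float *q_all, const float *key_cache, float *att,
                int pos, int dim, int n_heads, int n_kv_heads) {
    int head_size = dim / n_heads;               // per-head dimension
    int kv_dim = (dim * n_kv_heads) / n_heads;   // KV width per timestep
    int kv_mul = n_heads / n_kv_heads;           // query heads per KV head (1 for MHA)
    for (int h = 0; h < n_heads; h++) {
        const float *q = q_all + h * head_size;  // query vector for head h
        for (int t = 0; t <= pos; t++) {
            // key vector of the KV head that serves query head h, at timestep t
            const float *k = key_cache + t * kv_dim + (h / kv_mul) * head_size;
            float score = 0.0f;
            for (int i = 0; i < head_size; i++) { score += q[i] * k[i]; }
            att[h * (pos + 1) + t] = score / sqrtf((float)head_size);
        }
    }
}
```

With `n_kv_heads == n_heads`, `kv_mul` is 1 and this degenerates to ordinary MHA indexing.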
```diff
@@ -451,7 +455,12 @@ void safe_printf(char *piece) {

 int str_lookup(char *str, TokenIndex *sorted_vocab, int vocab_size) {
     // efficiently find the perfect match for str in vocab, return its index or -1 if not found
     TokenIndex tok = { .str = str }; // acts as the key to search for
+    char *input = "<0x0A>";
```
Why is this delta here?
I'm not sure whether I converted the tokenizer correctly. After converting the tinyllama-1.1B tokenizer, run.c outputs `<0x0A>` instead of `\n`. I'm trying to figure out how to convert the tokenizer better so that these lines can be removed.
Besides, I noticed that our run.c cannot deal with `\n` in the input (for the tinystories 260K/15M/110M models); it treats it as the two characters `\` and `n`.
In llama.cpp, they hardcode the conversion of `\\n` to `\n`.
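For context, here is a minimal sketch of that kind of hardcoded handling (rewriting the literal two-character sequence `\` `n` in the prompt into a real newline before tokenization). The function name is hypothetical and this is neither llama.cpp's nor this PR's code.

```c
#include <string.h>

// Sketch only: collapse the escape sequence '\' 'n' into a real newline, in place.
void unescape_newlines(char *s) {
    char *src = s, *dst = s;
    while (*src) {
        if (src[0] == '\\' && src[1] == 'n') {
            *dst++ = '\n';      // "\n" typed as two characters becomes one newline byte
            src += 2;
        } else {
            *dst++ = *src++;    // copy everything else unchanged
        }
    }
    *dst = '\0';
}
```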
This is cool, I wasn't aware of the TinyLlama 1.1B run. Sounds very nice and useful for this repo to support.
There aren't notable architectural changes.
Add support for tinyllama-1.1B
Add support for converting GQA models (learned from ggerganov/llama.cpp#3364)
Better run.c