-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update gpt2 preprocess and add deepseek coder preprocess #4070
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just a cursory pass, I'm not sure if C++ tokenizer changes are acceptable, since I'm not very familiar with how tokenizers work. I'll leave review of that part to someone else.
current bpe_gpt2_preprocess behave different from falcon model's tokenizer(not sure if it is your target). src: '99000'
res: '99000'
tok: 2512 1321
main : failed test: '99000'
main : detokenized to: '99000' instead of '99000'
main : expected tokens: 30825, 527,
main : got tokens: 2512, 1321, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python looks alright to me, as I said I'm not familiar with tokenization stuff so I'll leave the C++ part to someone else
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the tokenization time for wikitext 2 before and after this PR?
Run the perplexity
tool - it is printed at the start
@ggerganov |
I'd love to try out if you have any suggestions on imporving the performance |
The performance is also important. This performance hit of 28x is quite big, so we cannot accept it like this. |
Hi, I just finished the regex optimization by changing the utf-8 string and regex to wchar(utf32), now it can finish the in 2040ms |
@ggerganov Would you please review this pr or assign some one to review this pr, it is required for gguf to work with models with hf tokenizer which cannot provide setencepiece tokenizer https://github.com/deepseek-ai/DeepSeek-Coder#7-qa, thanks |
llama.cpp
Outdated
@@ -2599,7 +2605,7 @@ static void llm_load_print_meta(llama_model_loader & ml, llama_model & model) { | |||
// hparams | |||
LLAMA_LOG_INFO("%s: format = %s\n", __func__, llama_file_version_name(ml.fver)); | |||
LLAMA_LOG_INFO("%s: arch = %s\n", __func__, LLM_ARCH_NAMES.at(model.arch).c_str()); | |||
LLAMA_LOG_INFO("%s: vocab type = %s\n", __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : "BPE"); // TODO: fix | |||
LLAMA_LOG_INFO("%s: vocab type = %s\n", __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : (vocab.type == LLAMA_VOCAB_TYPE_BPE ? "BPE" : "DEEPSEEKCODER")); // TODO: fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is SPM?
>>> from transformers import AutoTokenizer
>>> model_directory = "models/deepseek-ai/deepseek-coder-6.7b-instruct"
>>> tokenizer = AutoTokenizer.from_pretrained(model_directory)
>>> tokenizer.vocab_files_names
{'vocab_file': 'tokenizer.model', 'tokenizer_file': 'tokenizer.json'}
The tokenizer tests for falcon and deepseek-coder are not passing: make tests We need to fix this before merging |
make test all the tests are passed. Could you check if your HEAD is the same as the image above? |
It works on Linux, but it fails on Mac OS and Windows: https://github.com/ggerganov/llama.cpp/actions/runs/6952586370/job/19127179619 |
It seems like wchar_t on windows is implemented with char16_t |
There's another active llama.cpp PR which enables making DeepSeek GGUFs, and it's the one I used to make my DeepSeek GGUFs on HF: #3633 This other PR uses Hugging Face AutoTokenizer, so the GGUF is produced directly from I don't know if there will be any difference in output from the PR I used, vs this PR here. Anyone got any thoughts on that? |
@ggerganov @TheBloke Hi guys, I just take a brief look on #3633 , that PR mainly focus on loading the vocab from tokenizer.json which has no difference with this PR. The mismatched problem originates from the pre-tokenizer mechanism of HuggingFace Tokenizer https://huggingface.co/docs/tokenizers/api/pre-tokenizers. It will split the input string to substrings and token across substrings will not be merged. Take the Falcon tokenizer I mentioned before as an example, the pretokenzier will split the string "99000" to "990" and "00" according to the regex pre-tokenizer {
"type": "Split",
"pattern": {
"Regex": "[0-9][0-9][0-9]"
},
"behavior": "Isolated",
"invert": false
} and it will be tokenized to "990" -> 30825, "00"-> 527. In this process function, 99000 will not be splitted and BPE will give result of "99" "000" since "000" has higher priority compared with "990", which leads to the unmatched tokenizer encode result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a big change in the tokenizer and I'm a little worried it might break something.
Tested LLaMA and Falcon:
- SPM (LLaMA) produces the same results
- BPE (Falcon) tokenization is quite slower, but the PPL for wikitext is better, so it should be OK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The CI is still failing - need to fix this before merging
@DOGEwbx , How did you convert the utf-8 regexes given by the RE-flex library to wchar ones? These wchar regexes are in utf32 format. But i guess, if they can be changed to utf16 format, then they can be used in platforms having wchar_t size with 16bit. |
@dragnil1 Sorry for the late reply. It has been a while since last time I run the code. I checked my workspace and found a piece of code that might help. It seem's like I copied the unicode range from cpp_ranges = """
static const std::vector<std::pair<uint32_t, uint32_t>> punctuation_ranges = {
{0x21, 0x23}, {0x25, 0x2A}, {0x2C, 0x2F}, {0x3A, 0x3B}, {0x3F, 0x40}, {0x5B, 0x5D}, {0x5F, 0x5F}, {0x7B, 0x7B}, {0x7D, 0x7D}, {0xA1, 0xA1}, {0xA7, 0xA7}, {0xAB, 0xAB}, {0xB6, 0xB7}, {0xBB, 0xBB},
{0xBF, 0xBF}, {0x37E, 0x37E}, {0x387, 0x387}, {0x55A, 0x55F}, {0x589, 0x58A}, {0x5BE, 0x5BE}, {0x5C0, 0x5C0}, {0x5C3, 0x5C3}, {0x5C6, 0x5C6}, {0x5F3, 0x5F4}, {0x609, 0x60A}, {0x60C, 0x60D},
{0x61B, 0x61B}, {0x61E, 0x61F}, {0x66A, 0x66D}, {0x6D4, 0x6D4}, {0x700, 0x70D}, {0x7F7, 0x7F9}, {0x830, 0x83E}, {0x85E, 0x85E}, {0x964, 0x965}, {0x970, 0x970}, {0x9FD, 0x9FD}, {0xA76, 0xA76},
{0xAF0, 0xAF0}, {0xC77, 0xC77}, {0xC84, 0xC84}, {0xDF4, 0xDF4}, {0xE4F, 0xE4F}, {0xE5A, 0xE5B}, {0xF04, 0xF12}, {0xF14, 0xF14}, {0xF3A, 0xF3D}, {0xF85, 0xF85}, {0xFD0, 0xFD4}, {0xFD9, 0xFDA},
{0x104A, 0x104F}, {0x10FB, 0x10FB}, {0x1360, 0x1368}, {0x1400, 0x1400}, {0x166E, 0x166E}, {0x169B, 0x169C}, {0x16EB, 0x16ED}, {0x1735, 0x1736}, {0x17D4, 0x17D6}, {0x17D8, 0x17DA}, {0x1800, 0x180A},
{0x1944, 0x1945}, {0x1A1E, 0x1A1F}, {0x1AA0, 0x1AA6}, {0x1AA8, 0x1AAD}, {0x1B5A, 0x1B60}, {0x1BFC, 0x1BFF}, {0x1C3B, 0x1C3F}, {0x1C7E, 0x1C7F}, {0x1CC0, 0x1CC7}, {0x1CD3, 0x1CD3}, {0x2010, 0x2027},
{0x2030, 0x2043}, {0x2045, 0x2051}, {0x2053, 0x205E}, {0x207D, 0x207E}, {0x208D, 0x208E}, {0x2308, 0x230B}, {0x2329, 0x232A}, {0x2768, 0x2775}, {0x27C5, 0x27C6}, {0x27E6, 0x27EF}, {0x2983, 0x2998},
{0x29D8, 0x29DB}, {0x29FC, 0x29FD}, {0x2CF9, 0x2CFC}, {0x2CFE, 0x2CFF}, {0x2D70, 0x2D70}, {0x2E00, 0x2E2E}, {0x2E30, 0x2E4F}, {0x2E52, 0x2E52}, {0x3001, 0x3003}, {0x3008, 0x3011}, {0x3014, 0x301F},
{0x3030, 0x3030}, {0x303D, 0x303D}, {0x30A0, 0x30A0}, {0x30FB, 0x30FB}, {0xA4FE, 0xA4FF}, {0xA60D, 0xA60F}, {0xA673, 0xA673}, {0xA67E, 0xA67E}, {0xA6F2, 0xA6F7}, {0xA874, 0xA877}, {0xA8CE, 0xA8CF},
{0xA8F8, 0xA8FA}, {0xA8FC, 0xA8FC}, {0xA92E, 0xA92F}, {0xA95F, 0xA95F}, {0xA9C1, 0xA9CD}, {0xA9DE, 0xA9DF}, {0xAA5C, 0xAA5F}, {0xAADE, 0xAADF}, {0xAAF0, 0xAAF1}, {0xABEB, 0xABEB}, {0xFD3E, 0xFD3F},
{0xFE10, 0xFE19}, {0xFE30, 0xFE52}, {0xFE54, 0xFE61}, {0xFE63, 0xFE63}, {0xFE68, 0xFE68}, {0xFE6A, 0xFE6B}, {0xFF01, 0xFF03}, {0xFF05, 0xFF0A}, {0xFF0C, 0xFF0F}, {0xFF1A, 0xFF1B}, {0xFF1F, 0xFF20},
{0xFF3B, 0xFF3D}, {0xFF3F, 0xFF3F}, {0xFF5B, 0xFF5B}, {0xFF5D, 0xFF5D}, {0xFF5F, 0xFF65}, {0x10100, 0x10102}, {0x1039F, 0x1039F}, {0x103D0, 0x103D0}, {0x1056F, 0x1056F}, {0x10857, 0x10857},
{0x1091F, 0x1091F}, {0x1093F, 0x1093F}, {0x10A50, 0x10A58}, {0x10A7F, 0x10A7F}, {0x10AF0, 0x10AF6}, {0x10B39, 0x10B3F}, {0x10B99, 0x10B9C}, {0x10EAD, 0x10EAD}, {0x10F55, 0x10F59}, {0x11047, 0x1104D},
{0x110BB, 0x110BC}, {0x110BE, 0x110C1}, {0x11140, 0x11143}, {0x11174, 0x11175}, {0x111C5, 0x111C8}, {0x111CD, 0x111CD}, {0x111DB, 0x111DB}, {0x111DD, 0x111DF}, {0x11238, 0x1123D}, {0x112A9, 0x112A9},
{0x1144B, 0x1144F}, {0x1145A, 0x1145B}, {0x1145D, 0x1145D}, {0x114C6, 0x114C6}, {0x115C1, 0x115D7}, {0x11641, 0x11643}, {0x11660, 0x1166C}, {0x1173C, 0x1173E}, {0x1183B, 0x1183B}, {0x11944, 0x11946},
{0x119E2, 0x119E2}, {0x11A3F, 0x11A46}, {0x11A9A, 0x11A9C}, {0x11A9E, 0x11AA2}, {0x11C41, 0x11C45}, {0x11C70, 0x11C71}, {0x11EF7, 0x11EF8}, {0x11FFF, 0x11FFF}, {0x12470, 0x12474}, {0x16A6E, 0x16A6F},
{0x16AF5, 0x16AF5}, {0x16B37, 0x16B3B}, {0x16B44, 0x16B44}, {0x16E97, 0x16E9A}, {0x16FE2, 0x16FE2}, {0x1BC9F, 0x1BC9F}, {0x1DA87, 0x1DA8B}, {0x1E95E, 0x1E95F},
};
"""
# convert to python
digit_ranges = []
for line in cpp_ranges.splitlines():
if not line.startswith("{"):
continue
line = line.strip("{}")
line = line.replace(" ", "")
line = line.replace("0x", "")
line = line.split(",")
for first, second in zip(line[::2],line[1::2]):
digit_ranges.append((int(first.strip("{}"), 16), int(second.strip("{}"), 16)))
# print the symbols as UTF-32 regex
print(("".join(["\\U%08x-\\U%08x" % (first, second) for first, second in digit_ranges])).upper()) |
@dragnil1 There are also some code piece for the failed re-flex tries, it will outoput utf-8 format but it's too slow, hope it'll help. #include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
#include <reflex/stdmatcher.h>
#include <locale>
#include <regex>
#include <iostream>
#include <string>
#include <cstring>
#include <vector>
int main()
{
std::vector<std::string> regex_list = {
"[\\p{P}\\$\\+<=>\\^~\\|]+",
"'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
"[0-9][0-9][0-9]"
"\\s?\\p{L}+",
"\\s?\\p{P}+",
"\\p{N}",
};
for(auto s: regex_list){
std::string regex_s = reflex::StdMatcher::convert(s, reflex::convert_flag::unicode);
std::cout<<s<<std::endl<<regex_s<<std::endl;
}
} |
I get the deepseekcoder vocab with
The regex in unicode.h is extract with the brilliant repo https://github.com/Genivia/RE-flex with following script