Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update gpt2 preprocess and add deepseek coder preprocess #4070

Open
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

DOGEwbx
Copy link

@DOGEwbx DOGEwbx commented Nov 14, 2023

  1. Revise the initial GPT-2 preprocessing for compatibility with Falcon's behavior by utilizing Regex to ensure simplicity.
  2. Incorporate DeepSeek coder pre-processing and enable other models based on HuggingFace tokenizer to adhere to the same procedure to endorse their individual models.
  3. Implement tests for the modifications and update the hf-to-gguf conversion script.

I get the deepseekcoder vocab with

python convert-hf-to-gguf.py <MODEL_PATH> --outfile ./models/ggml-vocab-deepseek-coder.gguf --model-name DeepseekCoder --vocab-only

The regex in unicode.h is extract with the brilliant repo https://github.com/Genivia/RE-flex with following script

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
#include <reflex/stdmatcher.h>
#include <locale>
#include <regex>
#include <iostream>
#include <string>
#include <cstring>
#include <vector>


int main()
{
    std::vector<std::string> regex_list = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]"
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",

    };
    for(auto s: regex_list){
        std::string regex_s = reflex::StdMatcher::convert(s, reflex::convert_flag::unicode);
        std::cout<<s<<std::endl<<regex_s<<std::endl;
        }    
}

Copy link
Collaborator

@Galunid Galunid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just a cursory pass, I'm not sure if C++ tokenizer changes are acceptable, since I'm not very familiar with how tokenizers work. I'll leave review of that part to someone else.

convert-hf-to-gguf.py Show resolved Hide resolved
convert-hf-to-gguf.py Outdated Show resolved Hide resolved
convert-hf-to-gguf.py Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
llama.cpp Outdated Show resolved Hide resolved
convert-hf-to-gguf.py Show resolved Hide resolved
@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 14, 2023

current bpe_gpt2_preprocess behave different from falcon model's tokenizer(not sure if it is your target).
For example, to string "99000", falcon tokenizer will pretokenize it to "990" and "00", will current prerocess will not cut it, result in different behaviour

src: '99000'
res: '99000'
tok: 2512 1321 
main : failed test:    '99000'
main : detokenized to: '99000' instead of '99000'
main : expected tokens:  30825,    527, 
main : got tokens:        2512,   1321, 

convert-hf-to-gguf.py Outdated Show resolved Hide resolved
@DOGEwbx DOGEwbx requested a review from Galunid November 16, 2023 07:55
Copy link
Collaborator

@Galunid Galunid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python looks alright to me, as I said I'm not familiar with tokenization stuff so I'll leave the C++ part to someone else

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the tokenization time for wikitext 2 before and after this PR?
Run the perplexity tool - it is printed at the start

@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 16, 2023

@ggerganov
Before the pr, the tokenization time is 1021.49 ms
After the pr, the tokenization time is 28982.1 ms
Seems like there is a 28x performance gap due to the complicated regex (I can't find a better solution to use regex with unicode in cpp without adding extra dependecies like RE-flex)
However, I think the correctness and extendability is much more important for tokenizer didn't a long time compared with inference.

@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 16, 2023

I'd love to try out if you have any suggestions on imporving the performance

@ggerganov
Copy link
Owner

However, I think the correctness and extendability is much more important for tokenizer didn't a long time compared with inference.

The performance is also important. This performance hit of 28x is quite big, so we cannot accept it like this.
I immediately see some unnecessary string copies that should be avoided. Probably things can be optimized further.

@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 21, 2023

Hi, I just finished the regex optimization by changing the utf-8 string and regex to wchar(utf32), now it can finish the in 2040ms

@DOGEwbx DOGEwbx requested a review from ggerganov November 21, 2023 12:38
llama.cpp Outdated Show resolved Hide resolved
@yhyu13
Copy link

yhyu13 commented Nov 25, 2023

@ggerganov Would you please review this pr or assign some one to review this pr, it is required for gguf to work with models with hf tokenizer which cannot provide setencepiece tokenizer https://github.com/deepseek-ai/DeepSeek-Coder#7-qa, thanks

llama.cpp Outdated
@@ -2599,7 +2605,7 @@ static void llm_load_print_meta(llama_model_loader & ml, llama_model & model) {
// hparams
LLAMA_LOG_INFO("%s: format = %s\n", __func__, llama_file_version_name(ml.fver));
LLAMA_LOG_INFO("%s: arch = %s\n", __func__, LLM_ARCH_NAMES.at(model.arch).c_str());
LLAMA_LOG_INFO("%s: vocab type = %s\n", __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : "BPE"); // TODO: fix
LLAMA_LOG_INFO("%s: vocab type = %s\n", __func__, vocab.type == LLAMA_VOCAB_TYPE_SPM ? "SPM" : (vocab.type == LLAMA_VOCAB_TYPE_BPE ? "BPE" : "DEEPSEEKCODER")); // TODO: fix
Copy link
Contributor

@teleprint-me teleprint-me Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is SPM?

>>> from transformers import AutoTokenizer
>>> model_directory = "models/deepseek-ai/deepseek-coder-6.7b-instruct"
>>> tokenizer = AutoTokenizer.from_pretrained(model_directory)
>>> tokenizer.vocab_files_names
{'vocab_file': 'tokenizer.model', 'tokenizer_file': 'tokenizer.json'}

@ggerganov
Copy link
Owner

The tokenizer tests for falcon and deepseek-coder are not passing:

make tests

We need to fix this before merging

@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 29, 2023

image
@ggerganov I just checkout to my branch and run

make test

all the tests are passed. Could you check if your HEAD is the same as the image above?

@ggerganov
Copy link
Owner

It works on Linux, but it fails on Mac OS and Windows:

https://github.com/ggerganov/llama.cpp/actions/runs/6952586370/job/19127179619

@DOGEwbx
Copy link
Author

DOGEwbx commented Nov 29, 2023

It works on Linux, but it fails on Mac OS and Windows:

https://github.com/ggerganov/llama.cpp/actions/runs/6952586370/job/19127179619

It seems like wchar_t on windows is implemented with char16_t

@TheBloke
Copy link
Contributor

TheBloke commented Dec 1, 2023

There's another active llama.cpp PR which enables making DeepSeek GGUFs, and it's the one I used to make my DeepSeek GGUFs on HF: #3633

This other PR uses Hugging Face AutoTokenizer, so the GGUF is produced directly from tokenizer.json, ie it should in theory tokenise identically to the source model.

I don't know if there will be any difference in output from the PR I used, vs this PR here. Anyone got any thoughts on that?

@ggerganov
Copy link
Owner

@TheBloke Thanks for pointing this out. AFAIU this PR is still needed in addition to #3633 in order to fix some incorrectly tokenized strings (like certain multi-digit numbers, etc.). @DOGEwbx can you confirm this is the case?

@DOGEwbx
Copy link
Author

DOGEwbx commented Dec 2, 2023

@ggerganov @TheBloke Hi guys, I just take a brief look on #3633 , that PR mainly focus on loading the vocab from tokenizer.json which has no difference with this PR. The mismatched problem originates from the pre-tokenizer mechanism of HuggingFace Tokenizer https://huggingface.co/docs/tokenizers/api/pre-tokenizers. It will split the input string to substrings and token across substrings will not be merged.
image

Take the Falcon tokenizer I mentioned before as an example, the pretokenzier will split the string "99000" to "990" and "00" according to the regex pre-tokenizer

{
        "type": "Split",
        "pattern": {
          "Regex": "[0-9][0-9][0-9]"
        },
        "behavior": "Isolated",
        "invert": false
      }

and it will be tokenized to "990" -> 30825, "00"-> 527.
However, in current implementation of pre-tokenizer, this mechanisim is not supported and all the Byte-level BPE share the same preporcess method.

In this process function, 99000 will not be splitted and BPE will give result of "99" "000" since "000" has higher priority compared with "990", which leads to the unmatched tokenizer encode result.

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a big change in the tokenizer and I'm a little worried it might break something.

Tested LLaMA and Falcon:

  • SPM (LLaMA) produces the same results
  • BPE (Falcon) tokenization is quite slower, but the PPL for wikitext is better, so it should be OK

Copy link
Owner

@ggerganov ggerganov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is still failing - need to fix this before merging

@dragnil1
Copy link
Contributor

Hi, I just finished the regex optimization by changing the utf-8 string and regex to wchar(utf32), now it can finish the in 2040ms

@DOGEwbx , How did you convert the utf-8 regexes given by the RE-flex library to wchar ones? These wchar regexes are in utf32 format. But i guess, if they can be changed to utf16 format, then they can be used in platforms having wchar_t size with 16bit.

@DOGEwbx
Copy link
Author

DOGEwbx commented Apr 18, 2024

@dragnil1 Sorry for the late reply. It has been a while since last time I run the code. I checked my workspace and found a piece of code that might help. It seem's like I copied the unicode range from unicoe.h and convert it o utf-32 to replace regex groups like \p{L}, \p{P}

cpp_ranges = """
static const std::vector<std::pair<uint32_t, uint32_t>> punctuation_ranges = {
{0x21, 0x23}, {0x25, 0x2A}, {0x2C, 0x2F}, {0x3A, 0x3B}, {0x3F, 0x40}, {0x5B, 0x5D}, {0x5F, 0x5F}, {0x7B, 0x7B}, {0x7D, 0x7D}, {0xA1, 0xA1}, {0xA7, 0xA7}, {0xAB, 0xAB}, {0xB6, 0xB7}, {0xBB, 0xBB},
{0xBF, 0xBF}, {0x37E, 0x37E}, {0x387, 0x387}, {0x55A, 0x55F}, {0x589, 0x58A}, {0x5BE, 0x5BE}, {0x5C0, 0x5C0}, {0x5C3, 0x5C3}, {0x5C6, 0x5C6}, {0x5F3, 0x5F4}, {0x609, 0x60A}, {0x60C, 0x60D},
{0x61B, 0x61B}, {0x61E, 0x61F}, {0x66A, 0x66D}, {0x6D4, 0x6D4}, {0x700, 0x70D}, {0x7F7, 0x7F9}, {0x830, 0x83E}, {0x85E, 0x85E}, {0x964, 0x965}, {0x970, 0x970}, {0x9FD, 0x9FD}, {0xA76, 0xA76},
{0xAF0, 0xAF0}, {0xC77, 0xC77}, {0xC84, 0xC84}, {0xDF4, 0xDF4}, {0xE4F, 0xE4F}, {0xE5A, 0xE5B}, {0xF04, 0xF12}, {0xF14, 0xF14}, {0xF3A, 0xF3D}, {0xF85, 0xF85}, {0xFD0, 0xFD4}, {0xFD9, 0xFDA},
{0x104A, 0x104F}, {0x10FB, 0x10FB}, {0x1360, 0x1368}, {0x1400, 0x1400}, {0x166E, 0x166E}, {0x169B, 0x169C}, {0x16EB, 0x16ED}, {0x1735, 0x1736}, {0x17D4, 0x17D6}, {0x17D8, 0x17DA}, {0x1800, 0x180A},
{0x1944, 0x1945}, {0x1A1E, 0x1A1F}, {0x1AA0, 0x1AA6}, {0x1AA8, 0x1AAD}, {0x1B5A, 0x1B60}, {0x1BFC, 0x1BFF}, {0x1C3B, 0x1C3F}, {0x1C7E, 0x1C7F}, {0x1CC0, 0x1CC7}, {0x1CD3, 0x1CD3}, {0x2010, 0x2027},
{0x2030, 0x2043}, {0x2045, 0x2051}, {0x2053, 0x205E}, {0x207D, 0x207E}, {0x208D, 0x208E}, {0x2308, 0x230B}, {0x2329, 0x232A}, {0x2768, 0x2775}, {0x27C5, 0x27C6}, {0x27E6, 0x27EF}, {0x2983, 0x2998},
{0x29D8, 0x29DB}, {0x29FC, 0x29FD}, {0x2CF9, 0x2CFC}, {0x2CFE, 0x2CFF}, {0x2D70, 0x2D70}, {0x2E00, 0x2E2E}, {0x2E30, 0x2E4F}, {0x2E52, 0x2E52}, {0x3001, 0x3003}, {0x3008, 0x3011}, {0x3014, 0x301F},
{0x3030, 0x3030}, {0x303D, 0x303D}, {0x30A0, 0x30A0}, {0x30FB, 0x30FB}, {0xA4FE, 0xA4FF}, {0xA60D, 0xA60F}, {0xA673, 0xA673}, {0xA67E, 0xA67E}, {0xA6F2, 0xA6F7}, {0xA874, 0xA877}, {0xA8CE, 0xA8CF},
{0xA8F8, 0xA8FA}, {0xA8FC, 0xA8FC}, {0xA92E, 0xA92F}, {0xA95F, 0xA95F}, {0xA9C1, 0xA9CD}, {0xA9DE, 0xA9DF}, {0xAA5C, 0xAA5F}, {0xAADE, 0xAADF}, {0xAAF0, 0xAAF1}, {0xABEB, 0xABEB}, {0xFD3E, 0xFD3F},
{0xFE10, 0xFE19}, {0xFE30, 0xFE52}, {0xFE54, 0xFE61}, {0xFE63, 0xFE63}, {0xFE68, 0xFE68}, {0xFE6A, 0xFE6B}, {0xFF01, 0xFF03}, {0xFF05, 0xFF0A}, {0xFF0C, 0xFF0F}, {0xFF1A, 0xFF1B}, {0xFF1F, 0xFF20},
{0xFF3B, 0xFF3D}, {0xFF3F, 0xFF3F}, {0xFF5B, 0xFF5B}, {0xFF5D, 0xFF5D}, {0xFF5F, 0xFF65}, {0x10100, 0x10102}, {0x1039F, 0x1039F}, {0x103D0, 0x103D0}, {0x1056F, 0x1056F}, {0x10857, 0x10857},
{0x1091F, 0x1091F}, {0x1093F, 0x1093F}, {0x10A50, 0x10A58}, {0x10A7F, 0x10A7F}, {0x10AF0, 0x10AF6}, {0x10B39, 0x10B3F}, {0x10B99, 0x10B9C}, {0x10EAD, 0x10EAD}, {0x10F55, 0x10F59}, {0x11047, 0x1104D},
{0x110BB, 0x110BC}, {0x110BE, 0x110C1}, {0x11140, 0x11143}, {0x11174, 0x11175}, {0x111C5, 0x111C8}, {0x111CD, 0x111CD}, {0x111DB, 0x111DB}, {0x111DD, 0x111DF}, {0x11238, 0x1123D}, {0x112A9, 0x112A9},
{0x1144B, 0x1144F}, {0x1145A, 0x1145B}, {0x1145D, 0x1145D}, {0x114C6, 0x114C6}, {0x115C1, 0x115D7}, {0x11641, 0x11643}, {0x11660, 0x1166C}, {0x1173C, 0x1173E}, {0x1183B, 0x1183B}, {0x11944, 0x11946},
{0x119E2, 0x119E2}, {0x11A3F, 0x11A46}, {0x11A9A, 0x11A9C}, {0x11A9E, 0x11AA2}, {0x11C41, 0x11C45}, {0x11C70, 0x11C71}, {0x11EF7, 0x11EF8}, {0x11FFF, 0x11FFF}, {0x12470, 0x12474}, {0x16A6E, 0x16A6F},
{0x16AF5, 0x16AF5}, {0x16B37, 0x16B3B}, {0x16B44, 0x16B44}, {0x16E97, 0x16E9A}, {0x16FE2, 0x16FE2}, {0x1BC9F, 0x1BC9F}, {0x1DA87, 0x1DA8B}, {0x1E95E, 0x1E95F},
};
"""

# convert to python
digit_ranges = []
for line in cpp_ranges.splitlines():
    if not line.startswith("{"):
        continue
    line = line.strip("{}")
    line = line.replace(" ", "")
    line = line.replace("0x", "")
    line = line.split(",")
    for first, second in zip(line[::2],line[1::2]):
        digit_ranges.append((int(first.strip("{}"), 16), int(second.strip("{}"), 16)))

# print the symbols as UTF-32 regex
print(("".join(["\\U%08x-\\U%08x" % (first, second) for first, second in digit_ranges])).upper())

@DOGEwbx
Copy link
Author

DOGEwbx commented Apr 18, 2024

@dragnil1 There are also some code piece for the failed re-flex tries, it will outoput utf-8 format but it's too slow, hope it'll help.

#include <reflex/matcher.h> // reflex::Matcher, reflex::Input, reflex::Pattern
#include <reflex/stdmatcher.h>
#include <locale>
#include <regex>
#include <iostream>
#include <string>
#include <cstring>
#include <vector>


int main()
{
    std::vector<std::string> regex_list = {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]"
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",

    };
    for(auto s: regex_list){
        std::string regex_s = reflex::StdMatcher::convert(s, reflex::convert_flag::unicode);
        std::cout<<s<<std::endl<<regex_s<<std::endl;
        }    
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants