llama : fix pre-tokenization of non-special added tokens #8228

Merged
14 commits merged into master on Jul 14, 2024

Conversation

compilade
Collaborator

@compilade compilade commented Jun 30, 2024

MPT and OLMo use a NeoX-style tokenizer, which has pre-normalized spaces in its added_tokens, while Gemma uses encoded (non-normalized) spaces in its added_tokens, and also has some HTML tags in there.

In the current state of src/llama.cpp, the tokenizer tests fail for MPT and OLMo when using parse_special == false.

HuggingFace's tokenizers library pre-tokenizes the added_tokens before any other pre-tokenization happens, see https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/tokenizer/mod.rs#L726

I've changed tokenizer_st_partition so that it is called even when parse_special is false, in order to pre-tokenize user-defined tokens, but it processes control tokens only when parse_special is true. This allows pre-tokenizing non-special added tokens correctly at all times, while still making it possible to safely skip control tokens when parse_special is false.

This also fixes Gemma's tokenization of HTML tags, but it requires re-conversion, because the added tokens were previously not using the correct types, and because llama.cpp assumes user-defined tokens use bare spaces while Gemma uses "▁" (normal tokens are not affected).
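
Conceptually, the partitioning can be sketched like this (illustrative Python, not the actual C++ of tokenizer_st_partition; the helper and its names are made up for the sketch):

import re

def partition(text, user_defined, control, parse_special):
    # Added tokens are split out before the regex pre-tokenizer runs.
    # User-defined tokens are always isolated; control tokens only when parse_special is true.
    specials = sorted(user_defined + (control if parse_special else []), key=len, reverse=True)
    if not specials:
        return [text]
    pattern = "|".join(map(re.escape, specials))
    # Matched tokens become their own fragments; the remaining raw-text fragments
    # continue on to the regular regex-based pre-tokenization.
    return [frag for frag in re.split(f"({pattern})", text) if frag]

print(partition("<h1>Hello</h1>", ["<h1>", "</h1>"], ["<bos>"], parse_special=False))
# ['<h1>', 'Hello', '</h1>']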

TODO

  • Tests should pass
    • ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mpt.gguf
      • Note that NFC normalization is still a problem to be solved, but everything else works.
    • ./build/bin/test-tokenizer-0 ./models/ggml-vocab-olmo.gguf
    • using tests/test-tokenizer-random.py with Gemma-2's tokenizer, there's no difference whatsoever to the reference tokenizer.model

@compilade compilade added the "bugfix" (fixes an issue or bug) and "Review Complexity : Low" (Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix) labels on Jun 30, 2024
@compilade compilade requested a review from jaime-m-p June 30, 2024 18:59
@jaime-m-p
Collaborator

@compilade
Can you confirm that the commit 68220fe still fails with MPT and OLMO?

I can see problems related to NFD/NFC normalization, but I can't find any problems with spaces.
All the combinations of the following brute-force added tokens with whitespaces seem to pass correctly (tokenizer and detokenizer):

def generator_added_lr_strip(tokenizer: TokenizerGroundtruth) -> Iterator[str]:
    WHITESPACES = ["", " ", "  ", "\n", "\r\n", "\n\n", "\t", "\t\t"]
    all_tokens = list(sorted(set(tokenizer.special_tokens + tokenizer.added_tokens)))
    for token in all_tokens:
        for lstrip in WHITESPACES:
            for rstrip in WHITESPACES:
                yield lstrip + token + rstrip
                yield "a" + lstrip + token + rstrip
                yield lstrip + token + rstrip + "z"
                yield "a" + lstrip + token + rstrip + "z"

Please, can you provide me a failure case so I can check?

src/llama.cpp Outdated
@@ -13985,6 +14002,10 @@ struct llm_tokenizer_bpe {
void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
int final_prev_index = -1;

// FIXME: pre-tokenize added_tokens (user-defined tokens) before other pre-tokenization
Collaborator

I think you are looking for this:

if (parse_special) tokenizer_st_partition(vocab, fragment_buffer);

If enabled (parse_special), all added tokens are pre-tokenized even before regex splits.

Collaborator Author
@compilade compilade Jul 1, 2024

Hmm, yes! This is it. But it makes me think there should be a distinction between special tokens and user-defined tokens. I think parse_special=false should still pre-tokenize non-special user-defined tokens before the regex splits.

@compilade
Collaborator Author

compilade commented Jul 1, 2024

Can you confirm that the commit 68220fe still fails with MPT and OLMO?

@jaime-m-p With that commit, with the MPT tokenizer, I can see that the spaces do get handled correctly with tests/test-tokenizer-0.cpp, although it seems to instead fail on the apostrophe in I've been (and others like I'll make), splitting it into I ' ve been instead of using the 've token, which was not a problem here.

However, correcting the regex_exprs of LLAMA_VOCAB_PRE_TYPE_MPT makes it work correctly in all the cases (at least for tests/test-tokenizer-0.cpp; I can't get tests/test-tokenizer-random.py to start correctly):

diff --git a/llama.cpp b/llama.cpp
index ab8620ec..58b84839 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -13222,13 +13222,7 @@ struct llm_tokenizer_bpe {
                 };
                 break;
             case LLAMA_VOCAB_PRE_TYPE_MPT:
-                // TODO: MPT pre-tokenization regexes are unknown
-                //       the following are close, but not exact. run the following:
-                //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
-                GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
                 regex_exprs = {
-                    "\\s?\\p{L}+",
-                    "\\s?\\p{P}+",
                     "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                 };
                 break;
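
As a quick sanity check of the contraction handling (a sketch using the third-party Python regex package, which supports \p{L}; not part of llama.cpp's tests):

import regex  # pip install regex

GPT2_STYLE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)"
print(regex.findall(GPT2_STYLE, "I've been"))
# ['I', "'ve", ' been']  -- the 've contraction stays together instead of splitting on the apostrophe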

What you've done is definitely better than this PR, since this means MPT and OLMo and other NeoX-style tokenizers can simply directly use the GPT-2 pre-tokenizer, like they should.

I think I'll close this in favor of #8039.

@compilade
Collaborator Author

compilade commented Jul 1, 2024

From further testing, it seems like passing parse_special=true to llama_tokenize causes the correct multi-space tokens to be used for NeoX-style tokenizers. That's what fixed it in 68220fe. However, when not using parse_special, this doesn't work anymore and the other multi-space tokens are used instead.

This PR makes it still pick the special spaces even when parse_special is false. Not sure if that's desired, maybe not. It depends on how models based on that kind of tokenizer behave with the non-special multi-space tokens.

Seems like NeoX tokenizers complicate the decision of whether user input should be exempt from special tokens or not.

Or maybe user-defined tokens should not be considered special as in parse_special, but still pre-tokenized separately?

@compilade compilade marked this pull request as draft July 1, 2024 23:52
@jaime-m-p
Collaborator

@compilade

MPT and OLMo and other NeoX-style tokenizers can simply directly use the GPT-2 pre-tokenizer, like they should.

Oh, this seems to fix all MPT apostrophe problems.

This PR makes it still pick the special spaces even when parse_special is false. Not sure if that's desired, maybe not. It depends on how models based on that kind of tokenizer behave with the non-special multi-space tokens.
Seems like NeoX tokenizers complicate the decision of whether user input should be exempt from special tokens or not.
Or maybe user-defined tokens should not be considered special as in parse_special, but still pre-tokenized separately?

Note that AutoTokenizer does not even offer the optional parse_special feature.
I think the common use case is to always use parse_special = true (default True in AutoTokenizer) and then detokenize with unparse_special = true (skip_special_tokens = False in AutoTokenizer).
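
For reference, a minimal HF-side check of that default behavior (a sketch; the model path is a placeholder and it assumes the tokenizer defines <|EOT|> as an added token):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/some-model")  # placeholder path

# Added tokens in the input text are always matched; there is no parse_special switch:
ids = tok.encode("<|EOT|> hello", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))

# Special tokens are kept on decode unless skip_special_tokens=True:
print(tok.decode(ids, skip_special_tokens=False))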

@compilade compilade closed this Jul 6, 2024
@compilade compilade reopened this Jul 7, 2024
@github-actions github-actions bot added the "testing" label (Everything test related) Jul 7, 2024
@github-actions github-actions bot added the "python" label (python script changes) Jul 7, 2024
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens
@compilade compilade changed the title llama : fix mpt and olmo pre-tokenizer llama : fix pre-tokenization of non-special added tokens Jul 8, 2024
@compilade
Collaborator Author

compilade commented Jul 8, 2024

In trying to fix the pre-tokenization for non-special added tokens for MPT and OLMo, I think I've also fixed Gemma and Gemma-2's pre-tokenization.

Gemma on master doesn't pre-tokenize HTML tags correctly, while in this branch it does! Unfortunately, it requires re-conversion because the user-defined tokens from tokenizer_config.json were not set to the right token types on master, and user-defined tokens in the rest of the codebase are assumed to use bare spaces instead of "▁".

On master:

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
     2 -> '<bos>'
235322 -> '<'
   973 -> 'blockquote'
  2577 -> '><'
235259 -> 'h'
235274 -> '1'
235313 -> '>'
  4521 -> 'Hello'
   727 -> '</'
235259 -> 'h'
235274 -> '1'
  3119 -> '></'
   973 -> 'blockquote'
235313 -> '>'

On this branch (after re-converting the model):

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
     2 -> '<bos>'
   191 -> '<blockquote>'
   185 -> '<h1>'
  4521 -> 'Hello'
   192 -> '</h1>'
   198 -> '</blockquote>'

There's a weird behavior of the HF tokenizer for Gemma which seems to prefer the 16-space token (id: 153) over longer ones:

INFO:test-tokenizer-random:generator_custom_text: ini
ERROR:test-tokenizer-random: Expected: [2, 153, 153, 155, 132133, 14130] ['<bos>', '                ', '                ', '                  ', 'fifty', ' spaces']
ERROR:test-tokenizer-random:   Result: [2, 168, 156, 132133, 14130] ['<bos>', '                               ', '                   ', 'fifty', ' spaces']

But when using tokenizer.model instead of tokenizer.json for the reference tokenizer, this error does not happen.

So I assume what llama.cpp is now doing on this branch is the right thing, since I've yet to find a tokenization difference with the reference tokenizer.model.
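
For anyone wanting to reproduce the comparison against the reference tokenizer.model, a minimal sketch using the sentencepiece package (the model path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="path/to/tokenizer.model")  # placeholder path
# Should match llama.cpp's behavior on this branch for long runs of spaces:
print(sp.encode(" " * 50 + "fifty spaces", out_type=str))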

@compilade compilade added the "Review Complexity : Medium" (Generally require more time to grok but manageable by beginner to medium expertise level) and "generation quality" (Quality of model output) labels, and removed the "Review Complexity : Low" (Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix) label, on Jul 8, 2024
@compilade compilade marked this pull request as ready for review July 8, 2024 05:19
Comment on lines -2402 to +2445
-        tokens, scores, toktypes = self._create_vocab_sentencepiece()
-        # hack: This is required so that we can properly use start/end-of-turn for chat template
-        for i in range(108):
-            # including <unusedX>, <start_of_turn>, <end_of_turn>
-            toktypes[i] = SentencePieceTokenTypes.CONTROL
-        self.gguf_writer.add_tokenizer_model("llama")
-        self.gguf_writer.add_tokenizer_pre("default")
-        self.gguf_writer.add_token_list(tokens)
-        self.gguf_writer.add_token_scores(scores)
-        self.gguf_writer.add_token_types(toktypes)
-
-        special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
-        special_vocab.add_to_gguf(self.gguf_writer)
+        self._set_vocab_sentencepiece()
Collaborator Author

The "hack" from #8244 is no longer required because the control tokens are now identified with tokenizer_config.json.

src/llama.cpp Outdated
-            if (!(vocab.id_to_token[id].attr & LLAMA_TOKEN_ATTR_NORMAL)) {
+            if (vocab.id_to_token[id].attr & (LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
Collaborator Author

It's necessary to limit the special token cache to CONTROL and USER_DEFINED tokens, because otherwise BYTE tokens would be pre-tokenized incorrectly (e.g. the literal string <0x20> would get tokenized as a single space).

But maybe it's a bad idea to exclude other types of tokens, like UNUSED padding tokens which might in some cases (when?) be desired to tokenize to themselves?
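
To illustrate the BYTE case (a minimal sketch, assuming BYTE tokens use the usual <0xNN> spelling):

import re

def byte_token_value(token_text: str):
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", token_text)
    return int(m.group(1), 16) if m else None

print(byte_token_value("<0x20>"))  # 32, i.e. a space once detokenized
# If BYTE tokens were kept in the special-token cache, the literal text "<0x20>"
# typed by a user would be matched and replaced by a single space.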

Collaborator

I agree with LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED, but I think we need LLAMA_TOKEN_ATTR_UNKNOWN too.

I think it is correct to drop UNUSED tokens, but we need to parse UNKNOWN tokens.
It is LLAMA_TOKEN_ATTR_NORMAL because currently UNUSED and UNKNOWN are wrongly mixed.

Collaborator Author
@compilade compilade Jul 9, 2024

Should UNKNOWN tokens only be parsed when parse_special == true, or all the time like USER_DEFINED tokens?

I'm not sure if UNKNOWN tokens should be specially parsed at all.

I would tend toward only parsing them with parse_special == true, like CONTROL tokens. (this is also the current behavior of master)

I'm trying to figure out where UNKNOWN tokens are used and if it's useful to specially parse them. But this might differ from HF's tokenizers, so I need a model with UNKNOWN tokens to test this out. If I don't find one, I could modify an existing one.

Collaborator Author

but we need to parse UNKNOWN tokens.

I've fixed this in 98edea6. UNKNOWN tokens are parsed when parse_special == true, as on master.

Comment on lines 586 to +587
             tokens.append(f"[PAD{i}]")
-            toktypes.append(gguf.TokenType.USER_DEFINED)
+            toktypes.append(gguf.TokenType.UNUSED)
Collaborator Author

Padding tokens are set as UNUSED to reflect how it was already done in _set_vocab_sentencepiece, and also to avoid wrongly (pre-)tokenizing strings which happen to correspond to a padding token. (since USER_DEFINED tokens are now always pre-tokenized specially)

seems_special = seems_special or (token_text.startswith("<|") and token_text.endswith("|>")) # deepseek-coder

# TODO: should these be marked as UNUSED instead? (maybe not)
seems_special = seems_special or (token_text.startswith("<unused") and token_text.endswith(">")) # gemma{,-2}
Contributor

should things like this be defined in the conversion script under the specific model, to avoid accidental false hits? if, for some weird reason, a model comes around with a non-special token that starts with <|, it would be annoying to avoid that

maybe does_token_look_special should take in 2 lists: 1 list of strings of known special tokens, and a list of tuples of starts/ends with tokens

So for gemma2, we'd call it with:

special_tokens = ["<mask>", "<2mass>", "[@BOS@]"]
special_tags = [("<unused", "|>")]

self.does_token_look_special(token, special_tokens, special_tags)

and then here we'd have:

seems_special = token_text in special_tokens
for start_tag, end_tag in special_tags:
  seems_special = seems_special or (token_text.startswith(start_tag) and token_text.endswith(end_tag))

return seems_special

Contributor

oh i realize I misread the structure of the code, hmmm.. still not impossible but would have to be passed at a higher level

Collaborator Author
@compilade compilade Jul 8, 2024

if, for some weird reason, a model comes around with a non-special token that starts with <|, would be annoying to avoid that

This only affects added tokens either from added_tokens in tokenizer.json or from added_tokens_decoder in tokenizer_config.json, so it does not affect normal tokens starting with <| in any way. Not all tokens of the vocab are checked with this, only the ones part of added_tokens (which are treated specially by HF tokenizers too anyway). And added tokens starting with <| and ending with |> are arguably always control tokens; this was added pretty much because some model makers wrongly marked those as non-special (notably, <|User|>, <|Assistant|> and <|EOT|> in deepseek-coder are supposedly non-special. Same with <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> for command-r).

I did not yet notice any conflict in the added_tokens to justify making model-specific checks instead of always checking for all known "special-but-aren't-marked-special" tokens.

Also, this is a method of Model, so it technically can be overridden by subclasses should there ever be a model with conflicting added_tokens.
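
For readers following along, the heuristic discussed here roughly amounts to the following (an illustrative sketch using the examples mentioned in this thread, not the exact convert_hf_to_gguf.py code):

def looks_special(token_text: str) -> bool:
    # Added tokens known to be control-like despite not being marked special
    if token_text in ("<mask>", "<2mass>", "[@BOS@]"):
        return True
    if token_text.startswith("<|") and token_text.endswith("|>"):      # deepseek-coder, command-r
        return True
    if token_text.startswith("<unused") and token_text.endswith(">"):  # gemma{,-2}
        return True
    return False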

The order was previously wrong, which caused errors in some tests.
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024


llama : fix mpt and olmo pre-tokenizer

llama : pre-tokenize non-special user-defined tokens first

llama : fix detection of control-like user-defined tokens

convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

llama : fix command-r detokenization
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
@oldgithubman

Which models are going to need re-conversion? Just gemma2? deepseek2?

@compilade
Collaborator Author

compilade commented Jul 9, 2024

Which models are going to need re-conversion? Just gemma2? deepseek2?

@oldmanjk

Yes, Gemma, Gemma-2, Command-R, Command-R-Plus, and deepseek-coder (not sure which of their models use that pre-tokenizer) need re-conversion, because (except for Gemma and Gemma 2) this changes the type of some of their USER_DEFINED tokens to CONTROL for a safer tokenization when parse_special == false (e.g. with deepseek-coder, <|EOT|> was previously a USER_DEFINED token while it should have been marked as a CONTROL token). This is only relevant if you're tokenizing with parse_special == false, but most of the ways you're likely using the models use parse_special == true.

It's mostly Gemma and Gemma-2 which really need re-conversion for correctness (unfortunately), because all non-special added_tokens were previously marked as NORMAL in the GGUF models, while they should have been USER_DEFINED, so that pre-tokenization works correctly. (this mostly affects HTML tag tokenization in the case of Gemma (and Gemma 2))

@oldgithubman

Which models are going to need re-conversion? Just gemma2? deepseek2?

@oldmanjk

Yes, Gemma, Gemma-2, Command-R, Command-R-Plus, and deepseek-coder (not sure which of their models use that pre-tokenizer) need re-conversion, because (except for Gemma and Gemma 2) this changes the type of some of their USER_DEFINED tokens to CONTROL for a safer tokenization when parse_special == false (e.g. with deepseek-coder, <|EOT|> was previously a USER_DEFINED token while it should have been marked as a CONTROL token). This is only relevant if you're tokenizing with parse_special == false, but most of the ways you're likely using the models use parse_special == true.

It's mostly Gemma and Gemma-2 which really need re-conversion for correctness (unfortunately), because all non-special added_tokens were previously marked as NORMAL in the GGUF models, while they should have been USER_DEFINED, so that pre-tokenization works correctly. (this mostly affects HTML tag tokenization in the case of Gemma (and Gemma 2))

So deepseek only really needs re-conversion if you're "tokenizing with parse_special == false?" What does that mean? When might someone do that? I'm really hoping I don't have to re-convert the ~terabyte behemoth that is deepseek-coder-v2 f32 for the fourth time. Each time takes my 13900K and 4090 about four days. The fans are driving my family insane and I'd love to finish this project before the heat death of the universe...

@compilade
Collaborator Author

So deepseek only really needs re-conversion if you're "tokenizing with parse_special == false?" What does that mean? When might someone do that?

This only happens when using --in-prefix and/or --in-suffix with llama-cli (aka main.cpp), because the other ways chat templates are handled always use parse_special == true.

llama-server always uses parse_special == true for now, so nothing really changes over there.

deepseek-coder-v2

I checked, and https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct is a bit less affected by this than the older deepseek-coder models. At least this one sets <|EOT|> as special, but there are still some other special-looking added_tokens which are not marked special (like <|User|> and <|Assistant|>), but which should arguably have been.

Each time takes my 13900K and 4090 about four days. The fans are driving my family insane and I'd love to finish this project before the heat death of the universe...

@oldmanjk It's not necessary to reconvert, especially if your use-case is not affected, and especially if it's that demanding. You're likely fine with your existing deepseek-coder-v2 model.

Makes me think there should really be a way to update the metadata of GGUF models without rebuilding them all over again. (gguf_set_metadata.py has its limits, and gguf_new_metadata.py writes a copy of the model, but they both don't allow "applying" what convert_hf_to_gguf.py would do. This does seem like something to think about.)

@oldgithubman

Thanks for the fast and thorough responses! I really appreciate it.

This only happens when using --in-prefix and/or --in-suffix with llama-cli (aka main.cpp), because the other ways chat templates are handled always use parse_special == true.

llama-server always uses parse_special == true for now, so nothing really changes over there.

I think this rules out my personal use case. Unfortunately, it sounds like, to release these imatrices and/or quants, one should still probably re-convert. My motivation to do so is dying. Luckily, there are others, but I feel for them as well.

@oldmanjk It's not necessary to reconvert, especially if your use-case is not affected, and especially if it's that demanding. You're likely fine with your existing deepseek-coder-v2 model.

Makes me think there should really be a way to update the metadata of GGUF models without rebuilding them all over again. (gguf_set_metadata.py has its limits, and gguf_new_metadata.py writes a copy of the model, but they both don't allow "applying" what convert_hf_to_gguf.py would do. This does seem like something to think about.)

To be clear, most of the time/resources is/are spent creating imatrices. I don't know if this changes anything. I assume they'd need to be recreated either way, right? (I don't mean for my personal use case, but for both release and if there was a way to update the metadata). bf16 cuda support can't come fast enough. It would cut these re-conversion penalties at least in half

@compilade
Collaborator Author

compilade commented Jul 9, 2024

to release these imatrices and/or quants, one should still probably re-convert.

imatrix uses parse_special == true, so it's likely not affected either.

EDIT: it uses parse_special == false,

std::vector<llama_token> tokens = ::llama_tokenize(ctx, params.prompt, true);

(the fourth argument (missing) is parse_special, and defaults to false)

But in any case, the calibration datasets usually don't contain special tokens (I think?); if they do, then this line in imatrix.cpp does seem wrong.

But this won't really change the importance matrix much, not in a noticeable way anyway (especially if there are no special tokens in it). So the existing imatrices can likely still be used.

@oldgithubman

oldgithubman commented Jul 9, 2024

@compilade
What exactly are the special tokens? I can search all my calibration datasets (which include most of the popular ones) and report back

@compilade
Collaborator Author

What exactly are the special tokens?

@oldmanjk It's the added_tokens in tokenizer.json. Some are USER_DEFINED, some are CONTROL tokens. But they're all treated differently than NORMAL tokens. To get a list of them, assuming you have jq, and grep on a Unix-like system:

$ jq .added_tokens < tokenizer.json

If you want to grep for special tokens which look like control tokens in a particular text file:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt

While if you want to search for any special token, simply remove the middle grep -z '^<' like so:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
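
If your jq doesn't support --raw-output0 (it is a relatively recent option), the same check can be done with a short Python script (a sketch; the paths are placeholders):

import json

with open("path/to/tokenizer.json", encoding="utf-8") as f:
    added = [t["content"] for t in json.load(f)["added_tokens"]]

with open("path/to/dataset.txt", encoding="utf-8") as f:
    data = f.read()

for tok in added:
    if tok in data:
        print(f"found {tok!r} in the dataset")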

@oldgithubman

oldgithubman commented Jul 9, 2024

$ jq .added_tokens < deepseek-ai_DeepSeek-Coder-V2-Instruct/tokenizer.json

[
  {
    "id": 100000,
    "content": "<|begin▁of▁sentence|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100001,
    "content": "<|end▁of▁sentence|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100002,
    "content": "<|fim▁hole|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100003,
    "content": "<|fim▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100004,
    "content": "<|fim▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100005,
    "content": "<|completion|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100006,
    "content": "<|User|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100007,
    "content": "<|Assistant|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100008,
    "content": "<|EOT|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100009,
    "content": "<|tool▁calls▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100010,
    "content": "<|tool▁calls▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100011,
    "content": "<|tool▁call▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100012,
    "content": "<|tool▁call▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100013,
    "content": "<|tool▁outputs▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100014,
    "content": "<|tool▁outputs▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100015,
    "content": "<|tool▁output▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100016,
    "content": "<|tool▁output▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100017,
    "content": "<|tool▁sep|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  }
]

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
$ jq --raw-output0 .added_tokens[].content < deepseek-ai_DeepSeek-Coder-V2-Instruct/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} dataset.txt
jq: Unknown option --raw-output0
Use jq --help for help with command-line options,
or see the jq manpage, or online docs  at https://stedolan.github.io/jq

I manually searched for the string, "<|", which is present in all the deepseek-coder-v2 special tokens, in my datasets. The popular "code.txt" dataset does contain them ("<|system_prompt|>" and "<|user_prompt|>"), but, luckily, those don't exist in deepseek-coder-v2's special tokens. The popular "tiny.txt" dataset contains 316 "<|endoftext|>" tokens, which, again, luckily, doesn't exist in deepseek-coder-v2's special tokens. None of the other datasets I use, including the popular "badwords.txt", "technical.txt", "groups_merged-enhancedV3.txt", and "c4.txt", contain any instances of "<|".
Am I right to conclude this means I don't have to re-create my imatrices?

Edit - This also makes me think - should we be modifying our datasets to conform to the particular model? For example, should I be replacing "<|endoftext|>" with "<|EOT|>" when creating an imatrix for deepseek-coder-v2 from tiny.txt?

@oldgithubman

Is there anything I can help with to speed this up?

This makes the changes from #8321 more consistent
with the other changes made here.
@oldgithubman oldgithubman left a comment

I don't really know what I'm doing, so forgive me if this isn't helpful

@@ -373,6 +373,29 @@ def from_model_architecture(cls, arch: str) -> type[Model]:
        except KeyError:
            raise NotImplementedError(f'Architecture {arch!r} not supported!') from None

    def does_token_look_special(self, token: str | bytes) -> bool:

"Method 'does_token_look_special' may be 'static'"

Collaborator Author

I prefer not to make it a @staticmethod, to allow overriding it in the subclasses of Model if needed.

src/llama.cpp (resolved)
@@ -143,7 +143,7 @@ def __init__(self, dir_tokenizer: str):
         self.vocab = list(sorted(self.vocab))
         # tokens and lists
         self.special_tokens = list(self.model.all_special_tokens)
-        self.added_tokens   = list(self.model.added_tokens_encoder)
+        self.added_tokens   = self.model.batch_decode(self.model.added_tokens_encoder.values(), skip_special_tokens=False)

"PEP 8: E221 multiple spaces before operator" (just FYI - looks good to me)
"PEP 8: E501 line too long (122 > 120 characters)"

tests/test-tokenizer-random.py (Outdated, resolved)
-        logger.error(" Expected: " + str(ids1))
-        logger.error(" Result: " + str(ids2))
+        logger.error(" Expected: " + str(ids1) + f" {[tokenizer1.decode([id]) for id in ids1]}")
+        logger.error(" Result: " + str(ids2) + f" {[tokenizer2.decode([id]) for id in ids2]}")

"Shadows built-in name 'id'"

Collaborator Author
@compilade compilade Jul 13, 2024

I've reverted these two lines in 59ce853, even though they were useful for debugging, because this is a test script anyway, and because these lines will also unnecessarily conflict with #8379

@ggerganov ggerganov requested a review from jaime-m-p July 12, 2024 08:05
* test-tokenizer-random : add a failing edge case for falcon
@compilade compilade added the "merge ready" label (indicates that this may be ready to merge soon and is just holding out in case of objections) on Jul 13, 2024
@compilade compilade merged commit fa79495 into master Jul 14, 2024
57 checks passed
@jaime-m-p
Collaborator

jaime-m-p commented Jul 14, 2024

Sorry for the review delay.

There is only one problem I can't see how to fix, related to pre-normalization in the convert script:

token = token.replace(b"\xe2\x96\x81".decode("utf-8"), " ") # pre-normalize user-defined spaces

INFO VOCABFILE: 'models/ggml-vocab-gemma-2.gguf'
INFO compare_vocabs: ini
ERROR  detokenize=False id=139 expected='▁▁' result='  '
ERROR  detokenize=False id=140 expected='▁▁▁' result='   '
ERROR  detokenize=False id=141 expected='▁▁▁▁' result='    '
ERROR  detokenize=False id=142 expected='▁▁▁▁▁' result='     '
ERROR  detokenize=False id=143 expected='▁▁▁▁▁▁' result='      '
ERROR  detokenize=False id=144 expected='▁▁▁▁▁▁▁' result='       '
ERROR  detokenize=False id=145 expected='▁▁▁▁▁▁▁▁' result='        '
ERROR  detokenize=False id=146 expected='▁▁▁▁▁▁▁▁▁' result='         '
ERROR  detokenize=False id=147 expected='▁▁▁▁▁▁▁▁▁▁' result='          '
INFO compare_vocabs: end

Note that this is not a real error; it only shows that the vocabs are different (AutoTokenizer vs llama.cpp). Tokenizing and detokenizing seem correct.

Instead of modifying the vocab words, there is an alternative (doing it the other way around).
Just pre-normalize the input so added tokens with spaces can be found in the raw added tokens:

static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab, std::string raw_text, bool add_special, bool parse_special) {
    std::vector<llama_vocab::id> output;
    std::forward_list<fragment_buffer_variant> fragment_buffer;

+    // tokenizer.json: "normalizer": { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
+    if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {  //TODO: use vocab.tokenizer_escape_whitespaces ?
+        llama_escape_whitespace(raw_text);
+    }

    if (!raw_text.empty()) {
        fragment_buffer.emplace_front(raw_text, 0, raw_text.length());
        tokenizer_st_partition(vocab, fragment_buffer, parse_special);
    }

...

                        // prefix with space if previous is special
                        if (vocab.tokenizer_add_space_prefix && is_prev_special) {
-                            raw_text = " " + raw_text;
+                            raw_text = "\xe2\x96\x81" + raw_text;
                        }

...

                        llm_tokenizer_spm tokenizer(vocab);
-                       llama_escape_whitespace(raw_text);
                        tokenizer.tokenize(raw_text, output);
                        is_prev_special = false;

I think this follows the tokenizer pipeline better and generalizes using the config flag vocab.tokenizer_escape_whitespaces.

I'm still doing experiments, but it leads to simplifications and the removal of special cases (an indication that the path is correct):

llama.cpp/src/llama.cpp

Lines 21170 to 21188 in 1caa20f

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
    case LLAMA_VOCAB_TYPE_WPM:
    case LLAMA_VOCAB_TYPE_SPM:
    case LLAMA_VOCAB_TYPE_UGM: {
        // NOTE: we accept all unsupported token types,
        // suppressing them like CONTROL tokens.
        if (attr & (attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
            return _try_copy(token_text.data(), token_text.size());
        } else if (attr & LLAMA_TOKEN_ATTR_NORMAL) {
            std::string result = token_text;
            llama_unescape_whitespace(result);
            return _try_copy(result.data(), result.size());
        } else if (attr & LLAMA_TOKEN_ATTR_BYTE) {
            char byte = (char) llama_token_to_byte(model->vocab, token);
            return _try_copy((char*) &byte, 1);
        }
        break;
    }

I guess this will converge to something like:

        const std::string & token_text = model->vocab.id_to_token[token].text;
        switch (llama_vocab_get_type(model->vocab)) {
            case LLAMA_VOCAB_TYPE_WPM:
            case LLAMA_VOCAB_TYPE_SPM:
            case LLAMA_VOCAB_TYPE_UGM: {
                std::string unscaped_text = llama_unescape_whitespace(token_text);
                return _try_copy(unscaped_text.data(), unscaped_text.size());
            }
            case LLAMA_VOCAB_TYPE_BPE: {
                std::string decoded_text = llama_decode_text(token_text);
                return _try_copy(decoded_text.data(), decoded_text.size());
            }
            ...

@compilade
Collaborator Author

compilade commented Jul 14, 2024

Instead of modifying the vocab words, there is an alternative (doing it the other way around).
Just pre-normalize the input so added tokens with spaces can be found in the raw added tokens:

I agree it would be better to keep the vocab text verbatim. But unconditionally pre-escaping the input would lose the distinction between "\x20" and "\xe2\x96\x81". I think this can only be fixed with token-level attributes like LLAMA_TOKEN_ATTR_NORMALIZED once they're included in GGUF models, and then conditionally normalizing the tokens when using the vocab. Otherwise this would break tokenization for deepseek-coder.

(EDIT: on further thought, deepseek-coder uses a BPE tokenizer, not SPM, so I think pre-escaping the input might be correct for SPM (try to tokenize the string ▁▁▁▁ with Gemma-2, it sees it as spaces, but llama.cpp doesn't merge them for some reason (saw the same problem with falcon with \x0b\x20, so this space problem might not be SPM-specific)))
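
The ambiguity can be seen with a tiny sketch (assuming SPM-style escaping of " " to "▁"):

def escape_whitespace(s: str) -> str:
    return s.replace(" ", "\u2581")  # ' ' -> '▁'

a = "<|begin\u2581of\u2581sentence|>"  # added token stored with a literal '▁' (deepseek-coder style)
b = "<|begin of sentence|>"            # user text written with plain spaces
print(escape_whitespace(a) == escape_whitespace(b))  # True: the distinction is lost after escaping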

            case LLAMA_VOCAB_TYPE_WPM:
            case LLAMA_VOCAB_TYPE_SPM:
            case LLAMA_VOCAB_TYPE_UGM: {
                std::string unscaped_text = llama_unescape_whitespace(token_text);
                return _try_copy(unscaped_text.data(), unscaped_text.size());
            }

For control tokens, I'm not sure if the whitespaces should be unescaped.
For example, deepseek-coder uses <|begin▁of▁sentence|>, with the "\xe2\x96\x81" spaces. I just ran a test with that token in a string, as well as <|fim▁begin|>, and the escaped spaces are kept with HF tokenizers (same as current llama.cpp behavior). The same text without the escaped spaces does not match these tokens. (These tokens are marked as "normalized": true in the added_tokens)

I also don't see a way around special-casing BYTE tokens with llama_token_to_byte. These tokens are usually in the form <0x[0-9A-Fa-f]{2}>

I think it should be more like:

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
    case LLAMA_VOCAB_TYPE_WPM:
    case LLAMA_VOCAB_TYPE_SPM:
    case LLAMA_VOCAB_TYPE_UGM: {
        // NOTE: we accept all unsupported token types,
        // suppressing them like CONTROL tokens.
        if (attr & (attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED) && (attr & LLAMA_TOKEN_ATTR_NORMALIZED)) {
            return _try_copy(token_text.data(), token_text.size());
        } else if (attr & LLAMA_TOKEN_ATTR_BYTE) {
            char byte = (char) llama_token_to_byte(model->vocab, token);
            return _try_copy((char*) &byte, 1);
        } else if (attr & (LLAMA_TOKEN_ATTR_NORMAL | attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
            std::string result = token_text;
            llama_unescape_whitespace(result);
            return _try_copy(result.data(), result.size());
        }
        break;
    }
    ...

But I'm not sure, since this doesn't really look simpler.

If no SPM tokenizers use pre-normalized added_tokens (I've yet to find a counter-example), then your approach (if also considering BYTE tokens) would work, though.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
llama : fix pre-tokenization of non-special added tokens (#8228)

* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from ggerganov#8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with ggerganov#8379

* test-tokenizer-random : add a failing edge case for falcon
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024