llama : fix pre-tokenization of non-special added tokens #8228

Merged
14 commits merged into master on Jul 14, 2024

Conversation

compilade
Collaborator

@compilade compilade commented Jun 30, 2024

MPT and OLMo use a NeoX-style tokenizer, which has pre-normalized spaces in its added_tokens, while Gemma uses encoded (non-normalized) spaces in its added_tokens, and also has some HTML tags in there.

In the current state of src/llama.cpp, the tokenizer tests fail for MPT and OLMo when using parse_special == false.

HuggingFace's tokenizers library pre-tokenizes the added_tokens before any other pre-tokenization happens, see https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/tokenizer/mod.rs#L726

I've changed tokenizer_st_partition so that it is called even when parse_special is false, in order to pre-tokenize user-defined tokens, but it processes control tokens only when parse_special is true. This allows pre-tokenizing non-special added tokens correctly at all times, while still making it possible to safely skip control tokens when parse_special is false.

This also fixes Gemma's tokenization of HTML tags, but it requires re-conversion, because the added tokens were previously not using the correct types, and because llama.cpp assumes user-defined tokens use bare spaces while Gemma uses "▁" (normal tokens are not affected).
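
Conceptually, the partitioning can be sketched like this (illustrative Python, not the actual C++ of tokenizer_st_partition; the helper and its names are made up for the sketch):

import re

def partition(text, user_defined, control, parse_special):
    # Added tokens are split out before the regex pre-tokenizer runs.
    # User-defined tokens are always isolated; control tokens only when parse_special is true.
    specials = sorted(user_defined + (control if parse_special else []), key=len, reverse=True)
    if not specials:
        return [text]
    pattern = "|".join(map(re.escape, specials))
    # Matched tokens become their own fragments; the remaining raw-text fragments
    # continue on to the regular regex-based pre-tokenization.
    return [frag for frag in re.split(f"({pattern})", text) if frag]

print(partition("<h1>Hello</h1>", ["<h1>", "</h1>"], ["<bos>"], parse_special=False))
# ['<h1>', 'Hello', '</h1>']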

TODO

  • Tests should pass
    • ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mpt.gguf
      • Note that NFC normalization is still a problem to be solved, but everything else works.
    • ./build/bin/test-tokenizer-0 ./models/ggml-vocab-olmo.gguf
    • using tests/test-tokenizer-random.py with Gemma-2's tokenizer, there's no difference whatsoever to the reference tokenizer.model

@compilade compilade added the "bugfix" (fixes an issue or bug) and "Review Complexity : Low" (Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix) labels on Jun 30, 2024
@compilade compilade requested a review from jaime-m-p June 30, 2024 18:59
@jaime-m-p
Collaborator

@compilade
Can you confirm that the commit 68220fe still fails with MPT and OLMO?

I can see problems related to NFD/NFC normalization, but I can't find any problems with spaces.
All the combinations of the following brute-force added tokens with whitespaces seem to pass correctly (tokenizer and detokenizer):

def generator_added_lr_strip(tokenizer: TokenizerGroundtruth) -> Iterator[str]:
    WHITESPACES = ["", " ", "  ", "\n", "\r\n", "\n\n", "\t", "\t\t"]
    all_tokens = list(sorted(set(tokenizer.special_tokens + tokenizer.added_tokens)))
    for token in all_tokens:
        for lstrip in WHITESPACES:
            for rstrip in WHITESPACES:
                yield lstrip + token + rstrip
                yield "a" + lstrip + token + rstrip
                yield lstrip + token + rstrip + "z"
                yield "a" + lstrip + token + rstrip + "z"

Please, can you provide me a failure case so I can check?

src/llama.cpp Outdated
@@ -13985,6 +14002,10 @@ struct llm_tokenizer_bpe {
void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
int final_prev_index = -1;

// FIXME: pre-tokenize added_tokens (user-defined tokens) before other pre-tokenization
Collaborator

I think you are looking for this:

if (parse_special) tokenizer_st_partition(vocab, fragment_buffer);

If enabled (parse_special), all added tokens are pre-tokenized even before regex splits.

Collaborator Author
@compilade compilade Jul 1, 2024

Hmm, yes! This is it. But it makes me think there should be a distinction between special tokens and user-defined tokens. I think parse_special=false should still pre-tokenize non-special user-defined tokens before the regex splits.

@compilade
Collaborator Author

compilade commented Jul 1, 2024

Can you confirm that the commit 68220fe still fails with MPT and OLMO?

@jaime-m-p With that commit, with the MPT tokenizer, I can see that the spaces do get handled correctly with tests/test-tokenizer-0.cpp, although it seems to instead fail on the apostrophe in I've been (and others like I'll make), splitting it into I ' ve been instead of using the 've token, which was not a problem here.

However, correcting the regex_exprs of LLAMA_VOCAB_PRE_TYPE_MPT makes it work correctly in all the cases (at least for tests/test-tokenizer-0.cpp; I can't get tests/test-tokenizer-random.py to start correctly):

diff --git a/llama.cpp b/llama.cpp
index ab8620ec..58b84839 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -13222,13 +13222,7 @@ struct llm_tokenizer_bpe {
                 };
                 break;
             case LLAMA_VOCAB_PRE_TYPE_MPT:
-                // TODO: MPT pre-tokenization regexes are unknown
-                //       the following are close, but not exact. run the following:
-                //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
-                GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
                 regex_exprs = {
-                    "\\s?\\p{L}+",
-                    "\\s?\\p{P}+",
                     "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                 };
                 break;
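
As a quick sanity check of the contraction handling (a sketch using the third-party Python regex package, which supports \p{L}; not part of llama.cpp's tests):

import regex  # pip install regex

GPT2_STYLE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)"
print(regex.findall(GPT2_STYLE, "I've been"))
# ['I', "'ve", ' been']  -- the 've contraction stays together instead of splitting on the apostrophe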

What you've done is definitely better than this PR, since this means MPT and OLMo and other NeoX-style tokenizers can simply directly use the GPT-2 pre-tokenizer, like they should.

I think I'll close this in favor of #8039.

@compilade
Collaborator Author

compilade commented Jul 1, 2024

From further testing, it seems like passing parse_special=true to llama_tokenize causes the correct multi-space tokens to be used for NeoX-style tokenizers. That's what fixed it in 68220fe. However, when not using parse_special, this doesn't work anymore and the other multi-space tokens are used instead.

This PR makes it still pick the special spaces even when parse_special is false. Not sure if that's desired, maybe not. It depends on how models based on that kind of tokenizer behave with the non-special multi-space tokens.

Seems like NeoX tokenizers complicate the decision of whether user input should be exempt from special tokens or not.

Or maybe user-defined tokens should not be considered special as in parse_special, but still pre-tokenized separately?

@compilade compilade marked this pull request as draft July 1, 2024 23:52
@jaime-m-p
Collaborator

@compilade

MPT and OLMo and other NeoX-style tokenizers can simply directly use the GPT-2 pre-tokenizer, like they should.

Oh, this seems to fix all MPT apostrophe problems.

This PR makes it still pick the special spaces even when parse_special is false. Not sure if that's desired, maybe not. It depends on how models based on that kind of tokenizer behave with the non-special multi-space tokens.
Seems like NeoX tokenizers complicate the decision of whether user input should be exempt from special tokens or not.
Or maybe user-defined tokens should not be considered special as in parse_special, but still pre-tokenized separately?

Note that AutoTokenizer does not even offer the optional parse_special feature.
I think the common use case is to always use parse_special = true (default True in AutoTokenizer) and then detokenize with unparse_special = true (skip_special_tokens = False in AutoTokenizer).
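
For reference, a minimal HF-side check of that default behavior (a sketch; the model path is a placeholder and it assumes the tokenizer defines <|EOT|> as an added token):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/some-model")  # placeholder path

# Added tokens in the input text are always matched; there is no parse_special switch:
ids = tok.encode("<|EOT|> hello", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))

# Special tokens are kept on decode unless skip_special_tokens=True:
print(tok.decode(ids, skip_special_tokens=False))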

@compilade compilade closed this Jul 6, 2024
@compilade compilade reopened this Jul 7, 2024
@github-actions github-actions bot added the "testing" label (Everything test related) Jul 7, 2024
@github-actions github-actions bot added the "python" label (python script changes) Jul 7, 2024
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens
@compilade compilade changed the title llama : fix mpt and olmo pre-tokenizer llama : fix pre-tokenization of non-special added tokens Jul 8, 2024
@compilade
Collaborator Author

compilade commented Jul 8, 2024

In trying to fix the pre-tokenization for non-special added tokens for MPT and OLMo, I think I've also fixed Gemma and Gemma-2's pre-tokenization.

Gemma on master doesn't pre-tokenize HTML tags correctly, while in this branch it does! Unfortunately, it requires re-conversion because the user-defined tokens from tokenizer_config.json were not set to the right token types on master, and user-defined tokens in the rest of the codebase are assumed to use bare spaces instead of "▁".

On master:

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
     2 -> '<bos>'
235322 -> '<'
   973 -> 'blockquote'
  2577 -> '><'
235259 -> 'h'
235274 -> '1'
235313 -> '>'
  4521 -> 'Hello'
   727 -> '</'
235259 -> 'h'
235274 -> '1'
  3119 -> '></'
   973 -> 'blockquote'
235313 -> '>'

On this branch (after re-converting the model):

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
     2 -> '<bos>'
   191 -> '<blockquote>'
   185 -> '<h1>'
  4521 -> 'Hello'
   192 -> '</h1>'
   198 -> '</blockquote>'

There's a weird behavior of the HF tokenizer for Gemma which seems to prefer the 16-space token (id: 153) over longer ones:

INFO:test-tokenizer-random:generator_custom_text: ini
ERROR:test-tokenizer-random: Expected: [2, 153, 153, 155, 132133, 14130] ['<bos>', '                ', '                ', '                  ', 'fifty', ' spaces']
ERROR:test-tokenizer-random:   Result: [2, 168, 156, 132133, 14130] ['<bos>', '                               ', '                   ', 'fifty', ' spaces']

But when using tokenizer.model instead of tokenizer.json for the reference tokenizer, this error does not happen.

So I assume what llama.cpp is now doing on this branch is the right thing, since I've yet to find a tokenization difference with the reference tokenizer.model.
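
For anyone wanting to reproduce the comparison against the reference tokenizer.model, a minimal sketch using the sentencepiece package (the model path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="path/to/tokenizer.model")  # placeholder path
# Should match llama.cpp's behavior on this branch for long runs of spaces:
print(sp.encode(" " * 50 + "fifty spaces", out_type=str))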

@compilade compilade added the "Review Complexity : Medium" (Generally require more time to grok but manageable by beginner to medium expertise level) and "generation quality" (Quality of model output) labels, and removed the "Review Complexity : Low" (Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix) label, on Jul 8, 2024
@compilade compilade marked this pull request as ready for review July 8, 2024 05:19
Comment on lines -2402 to +2445
-        tokens, scores, toktypes = self._create_vocab_sentencepiece()
-        # hack: This is required so that we can properly use start/end-of-turn for chat template
-        for i in range(108):
-            # including <unusedX>, <start_of_turn>, <end_of_turn>
-            toktypes[i] = SentencePieceTokenTypes.CONTROL
-        self.gguf_writer.add_tokenizer_model("llama")
-        self.gguf_writer.add_tokenizer_pre("default")
-        self.gguf_writer.add_token_list(tokens)
-        self.gguf_writer.add_token_scores(scores)
-        self.gguf_writer.add_token_types(toktypes)
-
-        special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
-        special_vocab.add_to_gguf(self.gguf_writer)
+        self._set_vocab_sentencepiece()
Collaborator Author

The "hack" from #8244 is no longer required because the control tokens are now identified with tokenizer_config.json.

src/llama.cpp Outdated
-            if (!(vocab.id_to_token[id].attr & LLAMA_TOKEN_ATTR_NORMAL)) {
+            if (vocab.id_to_token[id].attr & (LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
Collaborator Author

It's necessary to limit the special token cache to CONTROL and USER_DEFINED tokens, because otherwise BYTE tokens would be pre-tokenized incorrectly (e.g. the literal string <0x20> would get tokenized as a single space).

But maybe it's a bad idea to exclude other types of tokens, like UNUSED padding tokens which might in some cases (when?) be desired to tokenize to themselves?
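
To illustrate the BYTE case (a minimal sketch, assuming BYTE tokens use the usual <0xNN> spelling):

import re

def byte_token_value(token_text: str):
    m = re.fullmatch(r"<0x([0-9A-Fa-f]{2})>", token_text)
    return int(m.group(1), 16) if m else None

print(byte_token_value("<0x20>"))  # 32, i.e. a space once detokenized
# If BYTE tokens were kept in the special-token cache, the literal text "<0x20>"
# typed by a user would be matched and replaced by a single space.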

Collaborator

I agree with LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED, but I think we need LLAMA_TOKEN_ATTR_UNKNOWN too.

I think it is correct to drop UNUSED tokens, but we need to parse UNKNOWN tokens.
It is LLAMA_TOKEN_ATTR_NORMAL because currently UNUSED and UNKNOWN are wrongly mixed.

Collaborator Author
@compilade compilade Jul 9, 2024

Should UNKNOWN tokens only be parsed when parse_special == true, or all the time like USER_DEFINED tokens?

I'm not sure if UNKNOWN tokens should be specially parsed at all.

I would tend toward only parsing them with parse_special == true, like CONTROL tokens. (this is also the current behavior of master)

I'm trying to figure out where UNKNOWN tokens are used and if it's useful to specially parse them. But this might differ from HF's tokenizers, so I need a model with UNKNOWN tokens to test this out. If I don't find one, I could modify an existing one.

Collaborator Author

but we need to parse UNKNOWN tokens.

I've fixed this in 98edea6. UNKNOWN tokens are parsed when parse_special == true, as on master.

Comment on lines 586 to +587
             tokens.append(f"[PAD{i}]")
-            toktypes.append(gguf.TokenType.USER_DEFINED)
+            toktypes.append(gguf.TokenType.UNUSED)
Collaborator Author

Padding tokens are set as UNUSED to reflect how it was already done in _set_vocab_sentencepiece, and also to avoid wrongly (pre-)tokenizing strings which happen to correspond to a padding token. (since USER_DEFINED tokens are now always pre-tokenized specially)

seems_special = seems_special or (token_text.startswith("<|") and token_text.endswith("|>")) # deepseek-coder

# TODO: should these be marked as UNUSED instead? (maybe not)
seems_special = seems_special or (token_text.startswith("<unused") and token_text.endswith(">")) # gemma{,-2}
Contributor

should things like this be defined in the conversion script under the specific model, to avoid accidental false hits? if, for some weird reason, a model comes around with a non-special token that starts with <|, it would be annoying to avoid that

maybe does_token_look_special should take in 2 lists: 1 list of strings of known special tokens, and a list of tuples of starts/ends with tokens

So for gemma2, we'd call it with:

special_tokens = ["<mask>", "<2mass>", "[@BOS@]"]
special_tags = [("<unused", "|>")]

self.does_token_look_special(token, special_tokens, special_tags)

and then here we'd have:

seems_special = token_text in special_tokens
for start_tag, end_tag in special_tags:
  seems_special = seems_special or (token_text.startswith(start_tag) and token_text.endswith(end_tag))

return seems_special

Contributor

oh i realize I misread the structure of the code, hmmm.. still not impossible but would have to be passed at a higher level

Collaborator Author
@compilade compilade Jul 8, 2024

if, for some weird reason, a model comes around with a non-special token that starts with <|, would be annoying to avoid that

This only affects added tokens either from added_tokens in tokenizer.json or from added_tokens_decoder in tokenizer_config.json, so it does not affect normal tokens starting with <| in any way. Not all tokens of the vocab are checked with this, only the ones part of added_tokens (which are treated specially by HF tokenizers too anyway). And added tokens starting with <| and ending with |> are arguably always control tokens; this was added pretty much because some model makers wrongly marked those as non-special (notably, <|User|>, <|Assistant|> and <|EOT|> in deepseek-coder are supposedly non-special. Same with <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> for command-r).

I did not yet notice any conflict in the added_tokens to justify making model-specific checks instead of always checking for all known "special-but-aren't-marked-special" tokens.

Also, this is a method of Model, so it technically can be overridden by subclasses should there ever be a model with conflicting added_tokens.
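
For readers following along, the heuristic discussed here roughly amounts to the following (an illustrative sketch using the examples mentioned in this thread, not the exact convert_hf_to_gguf.py code):

def looks_special(token_text: str) -> bool:
    # Added tokens known to be control-like despite not being marked special
    if token_text in ("<mask>", "<2mass>", "[@BOS@]"):
        return True
    if token_text.startswith("<|") and token_text.endswith("|>"):      # deepseek-coder, command-r
        return True
    if token_text.startswith("<unused") and token_text.endswith(">"):  # gemma{,-2}
        return True
    return False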

The order was previously wrong, which caused errors in some tests.
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024


llama : fix mpt and olmo pre-tokenizer

llama : pre-tokenize non-special user-defined tokens first

llama : fix detection of control-like user-defined tokens

convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

llama : fix command-r detokenization
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 9, 2024
@oldgithubman

Which models are going to need re-conversion? Just gemma2? deepseek2?

@compilade
Collaborator Author

compilade commented Jul 9, 2024

Which models are going to need re-conversion? Just gemma2? deepseek2?

@oldmanjk

Yes, Gemma, Gemma-2, Command-R, Command-R-Plus, and deepseek-coder (not sure which of their models use that pre-tokenizer) need re-conversion, because (except for Gemma and Gemma 2) this changes the type of some of their USER_DEFINED tokens to CONTROL for a safer tokenization when parse_special == false (e.g. with deepseek-coder, <|EOT|> was previously a USER_DEFINED token while it should have been marked as a CONTROL token). This is only relevant if you're tokenizing with parse_special == false, but most of the ways you're likely using the models use parse_special == true.

It's mostly Gemma and Gemma-2 which really need re-conversion for correctness (unfortunately), because all non-special added_tokens were previously marked as NORMAL in the GGUF models, while they should have been USER_DEFINED, so that pre-tokenization works correctly. (this mostly affects HTML tag tokenization in the case of Gemma (and Gemma 2))

@oldgithubman

Which models are going to need re-conversion? Just gemma2? deepseek2?

@oldmanjk

Yes, Gemma, Gemma-2, Command-R, Command-R-Plus, and deepseek-coder (not sure which of their models use that pre-tokenizer) need re-conversion, because (except for Gemma and Gemma 2) this changes the type of some of their USER_DEFINED tokens to CONTROL for a safer tokenization when parse_special == false (e.g. with deepseek-coder, <|EOT|> was previously a USER_DEFINED token while it should have been marked as a CONTROL token). This is only relevant if you're tokenizing with parse_special == false, but most of the ways you're likely using the models use parse_special == true.

It's mostly Gemma and Gemma-2 which really need re-conversion for correctness (unfortunately), because all non-special added_tokens were previously marked as NORMAL in the GGUF models, while they should have been USER_DEFINED, so that pre-tokenization works correctly. (this mostly affects HTML tag tokenization in the case of Gemma (and Gemma 2))

So deepseek only really needs re-conversion if you're "tokenizing with parse_special == false?" What does that mean? When might someone do that? I'm really hoping I don't have to re-convert the ~terabyte behemoth that is deepseek-coder-v2 f32 for the fourth time. Each time takes my 13900K and 4090 about four days. The fans are driving my family insane and I'd love to finish this project before the heat death of the universe...

@compilade
Collaborator Author

So deepseek only really needs re-conversion if you're "tokenizing with parse_special == false?" What does that mean? When might someone do that?

This only happens when using --in-prefix and/or --in-suffix with llama-cli (aka main.cpp), because the other ways chat templates are handled always use parse_special == true.

llama-server always uses parse_special == true for now, so nothing really changes over there.

deepseek-coder-v2

I checked, and https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct is a bit less affected by this than the older deepseek-coder models. At least this one sets <|EOT|> as special, but there are still some other special-looking added_tokens which are not marked special (like <|User|> and <|Assistant|>), but which should arguably have been.

Each time takes my 13900K and 4090 about four days. The fans are driving my family insane and I'd love to finish this project before the heat death of the universe...

@oldmanjk It's not necessary to reconvert, especially if your use-case is not affected, and especially if it's that demanding. You're likely fine with your existing deepseek-coder-v2 model.

Makes me think there should really be a way to update the metadata of GGUF models without rebuilding them all over again. (gguf_set_metadata.py has its limits, and gguf_new_metadata.py writes a copy of the model, but they both don't allow "applying" what convert_hf_to_gguf.py would do. This does seem like something to think about.)

@oldgithubman

Thanks for the fast and thorough responses! I really appreciate it.

This only happens when using --in-prefix and/or --in-suffix with llama-cli (aka main.cpp), because the other ways chat templates are handled always use parse_special == true.

llama-server always uses parse_special == true for now, so nothing really changes over there.

I think this rules out my personal use case. Unfortunately, it sounds like, to release these imatrices and/or quants, one should still probably re-convert. My motivation to do so is dying. Luckily, there are others, but I feel for them as well.

@oldmanjk It's not necessary to reconvert, especially if your use-case is not affected, and especially if it's that demanding. You're likely fine with your existing deepseek-coder-v2 model.

Makes me think there should really be a way to update the metadata of GGUF models without rebuilding them all over again. (gguf_set_metadata.py has its limits, and gguf_new_metadata.py writes a copy of the model, but they both don't allow "applying" what convert_hf_to_gguf.py would do. This does seem like something to think about.)

To be clear, most of the time/resources is/are spent creating imatrices. I don't know if this changes anything. I assume they'd need to be recreated either way, right? (I don't mean for my personal use case, but for both release and if there was a way to update the metadata). bf16 cuda support can't come fast enough. It would cut these re-conversion penalties at least in half

@compilade
Collaborator Author

compilade commented Jul 9, 2024

to release these imatrices and/or quants, one should still probably re-convert.

imatrix uses parse_special == true, so it's likely not affected either.

EDIT: it uses parse_special == false,

std::vector<llama_token> tokens = ::llama_tokenize(ctx, params.prompt, true);

(the fourth argument (missing) is parse_special, and defaults to false)

But in any case, the calibration datasets usually don't contain special tokens (I think?); if they do, then this line in imatrix.cpp does seem wrong.

But this won't really change the importance matrix much, not in a noticeable way anyway (especially if there are no special tokens in it). So the existing imatrices can likely still be used.

@oldgithubman

oldgithubman commented Jul 9, 2024

@compilade
What exactly are the special tokens? I can search all my calibration datasets (which include most of the popular ones) and report back

@compilade
Collaborator Author

What exactly are the special tokens?

@oldmanjk It's the added_tokens in tokenizer.json. Some are USER_DEFINED, some are CONTROL tokens. But they're all treated differently than NORMAL tokens. To get a list of them, assuming you have jq, and grep on a Unix-like system:

$ jq .added_tokens < tokenizer.json

If you want to grep for special tokens which look like control tokens in a particular text file:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt

While if you want to search for any special token, simply remove the middle grep -z '^<' like so:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
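
If your jq doesn't support --raw-output0 (it is a relatively recent option), the same check can be done with a short Python script (a sketch; the paths are placeholders):

import json

with open("path/to/tokenizer.json", encoding="utf-8") as f:
    added = [t["content"] for t in json.load(f)["added_tokens"]]

with open("path/to/dataset.txt", encoding="utf-8") as f:
    data = f.read()

for tok in added:
    if tok in data:
        print(f"found {tok!r} in the dataset")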

@oldgithubman

oldgithubman commented Jul 9, 2024

$ jq .added_tokens < deepseek-ai_DeepSeek-Coder-V2-Instruct/tokenizer.json

[
  {
    "id": 100000,
    "content": "<|begin▁of▁sentence|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100001,
    "content": "<|end▁of▁sentence|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100002,
    "content": "<|fim▁hole|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100003,
    "content": "<|fim▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100004,
    "content": "<|fim▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100005,
    "content": "<|completion|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100006,
    "content": "<|User|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100007,
    "content": "<|Assistant|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100008,
    "content": "<|EOT|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": true
  },
  {
    "id": 100009,
    "content": "<|tool▁calls▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100010,
    "content": "<|tool▁calls▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100011,
    "content": "<|tool▁call▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100012,
    "content": "<|tool▁call▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100013,
    "content": "<|tool▁outputs▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100014,
    "content": "<|tool▁outputs▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100015,
    "content": "<|tool▁output▁begin|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100016,
    "content": "<|tool▁output▁end|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  },
  {
    "id": 100017,
    "content": "<|tool▁sep|>",
    "single_word": false,
    "lstrip": false,
    "rstrip": false,
    "normalized": true,
    "special": false
  }
]

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
$ jq --raw-output0 .added_tokens[].content < deepseek-ai_DeepSeek-Coder-V2-Instruct/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} dataset.txt
jq: Unknown option --raw-output0
Use jq --help for help with command-line options,
or see the jq manpage, or online docs  at https://stedolan.github.io/jq

I manually searched for the string, "<|", which is present in all the deepseek-coder-v2 special tokens, in my datasets. The popular "code.txt" dataset does contain them ("<|system_prompt|>" and "<|user_prompt|>"), but, luckily, those don't exist in deepseek-coder-v2's special tokens. The popular "tiny.txt" dataset contains 316 "<|endoftext|>" tokens, which, again, luckily, doesn't exist in deepseek-coder-v2's special tokens. None of the other datasets I use, including the popular "badwords.txt", "technical.txt", "groups_merged-enhancedV3.txt", and "c4.txt", contain any instances of "<|".
Am I right to conclude this means I don't have to re-create my imatrices?

Edit - This also makes me think - should we be modifying our datasets to conform to the particular model? For example, should I be replacing "<|endoftext|>" with "<|EOT|>" when creating an imatrix for deepseek-coder-v2 from tiny.txt?

@oldgithubman

Is there anything I can help with to speed this up?

This makes the changes from #8321 more consistent
with the other changes made here.
@oldgithubman oldgithubman left a comment

I don't really know what I'm doing, so forgive me if this isn't helpful

@@ -373,6 +373,29 @@ def from_model_architecture(cls, arch: str) -> type[Model]:
        except KeyError:
            raise NotImplementedError(f'Architecture {arch!r} not supported!') from None

    def does_token_look_special(self, token: str | bytes) -> bool:

"Method 'does_token_look_special' may be 'static'"

Collaborator Author

I prefer not to make it a @staticmethod, to allow overriding it in the subclasses of Model if needed.

src/llama.cpp (resolved)
@@ -143,7 +143,7 @@ def __init__(self, dir_tokenizer: str):
         self.vocab = list(sorted(self.vocab))
         # tokens and lists
         self.special_tokens = list(self.model.all_special_tokens)
-        self.added_tokens   = list(self.model.added_tokens_encoder)
+        self.added_tokens   = self.model.batch_decode(self.model.added_tokens_encoder.values(), skip_special_tokens=False)

"PEP 8: E221 multiple spaces before operator" (just FYI - looks good to me)
"PEP 8: E501 line too long (122 > 120 characters)"

tests/test-tokenizer-random.py (Outdated, resolved)
-        logger.error(" Expected: " + str(ids1))
-        logger.error(" Result: " + str(ids2))
+        logger.error(" Expected: " + str(ids1) + f" {[tokenizer1.decode([id]) for id in ids1]}")
+        logger.error(" Result: " + str(ids2) + f" {[tokenizer2.decode([id]) for id in ids2]}")

"Shadows built-in name 'id'"

Collaborator Author
@compilade compilade Jul 13, 2024

I've reverted these two lines in 59ce853, even though they were useful for debugging, because this is a test script anyway, and because these lines will also unnecessarily conflict with #8379

@ggerganov ggerganov requested a review from jaime-m-p July 12, 2024 08:05
* test-tokenizer-random : add a failing edge case for falcon
@compilade compilade added the "merge ready" label (indicates that this may be ready to merge soon and is just holding out in case of objections) on Jul 13, 2024
@compilade compilade merged commit fa79495 into master Jul 14, 2024
57 checks passed
@jaime-m-p
Collaborator

jaime-m-p commented Jul 14, 2024

Sorry for the review delay.

There is only one problem I can't see how to fix, related to pre-normalization in the convert script:

token = token.replace(b"\xe2\x96\x81".decode("utf-8"), " ") # pre-normalize user-defined spaces

INFO VOCABFILE: 'models/ggml-vocab-gemma-2.gguf'
INFO compare_vocabs: ini
ERROR  detokenize=False id=139 expected='▁▁' result='  '
ERROR  detokenize=False id=140 expected='▁▁▁' result='   '
ERROR  detokenize=False id=141 expected='▁▁▁▁' result='    '
ERROR  detokenize=False id=142 expected='▁▁▁▁▁' result='     '
ERROR  detokenize=False id=143 expected='▁▁▁▁▁▁' result='      '
ERROR  detokenize=False id=144 expected='▁▁▁▁▁▁▁' result='       '
ERROR  detokenize=False id=145 expected='▁▁▁▁▁▁▁▁' result='        '
ERROR  detokenize=False id=146 expected='▁▁▁▁▁▁▁▁▁' result='         '
ERROR  detokenize=False id=147 expected='▁▁▁▁▁▁▁▁▁▁' result='          '
INFO compare_vocabs: end

Note that this is not a real error; it only shows that the vocabs are different (AutoTokenizer vs llama.cpp). Tokenizing and detokenizing seem correct.

Instead of modifying the vocab words, there is an alternative (doing it the other way around).
Just pre-normalize the input so added tokens with spaces can be found in the raw added tokens:

static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab, std::string raw_text, bool add_special, bool parse_special) {
    std::vector<llama_vocab::id> output;
    std::forward_list<fragment_buffer_variant> fragment_buffer;

+    // tokenizer.json: "normalizer": { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
+    if (vocab.type == LLAMA_VOCAB_TYPE_SPM) {  //TODO: use vocab.tokenizer_escape_whitespaces ?
+        llama_escape_whitespace(raw_text);
+    }

    if (!raw_text.empty()) {
        fragment_buffer.emplace_front(raw_text, 0, raw_text.length());
        tokenizer_st_partition(vocab, fragment_buffer, parse_special);
    }

...

                        // prefix with space if previous is special
                        if (vocab.tokenizer_add_space_prefix && is_prev_special) {
-                            raw_text = " " + raw_text;
+                            raw_text = "\xe2\x96\x81" + raw_text;
                        }

...

                        llm_tokenizer_spm tokenizer(vocab);
-                       llama_escape_whitespace(raw_text);
                        tokenizer.tokenize(raw_text, output);
                        is_prev_special = false;

I think this follows the tokenizer pipeline better and generalizes using the config flag vocab.tokenizer_escape_whitespaces.

I'm still doing experiments, but it leads to simplifications and the removal of special cases (an indication that the path is correct):

llama.cpp/src/llama.cpp

Lines 21170 to 21188 in 1caa20f

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
    case LLAMA_VOCAB_TYPE_WPM:
    case LLAMA_VOCAB_TYPE_SPM:
    case LLAMA_VOCAB_TYPE_UGM: {
        // NOTE: we accept all unsupported token types,
        // suppressing them like CONTROL tokens.
        if (attr & (attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
            return _try_copy(token_text.data(), token_text.size());
        } else if (attr & LLAMA_TOKEN_ATTR_NORMAL) {
            std::string result = token_text;
            llama_unescape_whitespace(result);
            return _try_copy(result.data(), result.size());
        } else if (attr & LLAMA_TOKEN_ATTR_BYTE) {
            char byte = (char) llama_token_to_byte(model->vocab, token);
            return _try_copy((char*) &byte, 1);
        }
        break;
    }

I guess this will converge to something like:

        const std::string & token_text = model->vocab.id_to_token[token].text;
        switch (llama_vocab_get_type(model->vocab)) {
            case LLAMA_VOCAB_TYPE_WPM:
            case LLAMA_VOCAB_TYPE_SPM:
            case LLAMA_VOCAB_TYPE_UGM: {
                std::string unscaped_text = llama_unescape_whitespace(token_text);
                return _try_copy(unscaped_text.data(), unscaped_text.size());
            }
            case LLAMA_VOCAB_TYPE_BPE: {
                std::string decoded_text = llama_decode_text(token_text);
                return _try_copy(decoded_text.data(), decoded_text.size());
            }
            ...

@compilade
Collaborator Author

compilade commented Jul 14, 2024

Instead of modifying the vocab words, there is an alternative (doing it the other way around).
Just pre-normalize the input so added tokens with spaces can be found in the raw added tokens:

I agree it would be better to keep the vocab text verbatim. But unconditionally pre-escaping the input would lose the distinction between "\x20" and "\xe2\x96\x81". I think this can only be fixed with token-level attributes like LLAMA_TOKEN_ATTR_NORMALIZED once they're included in GGUF models, and then conditionally normalizing the tokens when using the vocab. Otherwise this would break tokenization for deepseek-coder.

(EDIT: on further thought, deepseek-coder uses a BPE tokenizer, not SPM, so I think pre-escaping the input might be correct for SPM (try to tokenize the string ▁▁▁▁ with Gemma-2, it sees it as spaces, but llama.cpp doesn't merge them for some reason (saw the same problem with falcon with \x0b\x20, so this space problem might not be SPM-specific)))
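
The ambiguity can be seen with a tiny sketch (assuming SPM-style escaping of " " to "▁"):

def escape_whitespace(s: str) -> str:
    return s.replace(" ", "\u2581")  # ' ' -> '▁'

a = "<|begin\u2581of\u2581sentence|>"  # added token stored with a literal '▁' (deepseek-coder style)
b = "<|begin of sentence|>"            # user text written with plain spaces
print(escape_whitespace(a) == escape_whitespace(b))  # True: the distinction is lost after escaping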

            case LLAMA_VOCAB_TYPE_WPM:
            case LLAMA_VOCAB_TYPE_SPM:
            case LLAMA_VOCAB_TYPE_UGM: {
                std::string unscaped_text = llama_unescape_whitespace(token_text);
                return _try_copy(unscaped_text.data(), unscaped_text.size());
            }

For control tokens, I'm not sure if the whitespaces should be unescaped.
For example, deepseek-coder uses <|begin▁of▁sentence|>, with the "\xe2\x96\x81" spaces. I just ran a test with that token in a string, as well as <|fim▁begin|>, and the escaped spaces are kept with HF tokenizers (same as current llama.cpp behavior). The same text without the escaped spaces does not match these tokens. (These tokens are marked as "normalized": true in the added_tokens)

I also don't see a way around special-casing BYTE tokens with llama_token_to_byte. These tokens are usually in the form <0x[0-9A-Fa-f]{2}>

I think it should be more like:

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
    case LLAMA_VOCAB_TYPE_WPM:
    case LLAMA_VOCAB_TYPE_SPM:
    case LLAMA_VOCAB_TYPE_UGM: {
        // NOTE: we accept all unsupported token types,
        // suppressing them like CONTROL tokens.
        if (attr & (attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED) && (attr & LLAMA_TOKEN_ATTR_NORMALIZED)) {
            return _try_copy(token_text.data(), token_text.size());
        } else if (attr & LLAMA_TOKEN_ATTR_BYTE) {
            char byte = (char) llama_token_to_byte(model->vocab, token);
            return _try_copy((char*) &byte, 1);
        } else if (attr & (LLAMA_TOKEN_ATTR_NORMAL | attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
            std::string result = token_text;
            llama_unescape_whitespace(result);
            return _try_copy(result.data(), result.size());
        }
        break;
    }
    ...

But I'm not sure, since this doesn't really look simpler.

If no SPM tokenizers use pre-normalized added_tokens (I've yet to find a counter-example), then your approach (if also considering BYTE tokens) would work, though.

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Jul 15, 2024
llama : fix pre-tokenization of non-special added tokens (#8228)

* llama : fix mpt and olmo pre-tokenizer

* llama : pre-tokenize non-special user-defined tokens first

* llama : fix detection of control-like user-defined tokens

* convert_hf : identify which user-defined tokens are control tokens

Only used in _set_vocab_gpt2() for now.

* convert_hf : identify more added control tokens for SPM tokenizers

This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly,
including HTML tags and consecutive spaces,
but it unfortunately requires model re-conversion.

There seems to be a weird behavior of the HF tokenizer for Gemma,
which prefers to use the 16-space token over more lengthy space tokens,
while using the SentencePiece tokenizer does not do this.
(the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens

* llama : fix Viking pre-tokenizer regex

The order was previously wrong, which caused errors in some tests.

* llama : fix command-r detokenization

* convert_hf : reduce usages of the UNKNOWN token type

* llama : add UNKNOWN tokens in the special tokens cache

* convert_hf : reduce usages of UNKNOWN for InternLM2

This makes the changes from ggerganov#8321 more consistent
with the other changes made here.

* test-tokenizer-random : reduce potential conflicts with ggerganov#8379

* test-tokenizer-random : add a failing edge case for falcon
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jul 27, 2024