llama : fix pre-tokenization of non-special added tokens #8228
Conversation
@compilade I can see problems related to NFD/NFC normalization, but I can't find any problems with spaces. llama.cpp/tests/test-tokenizer-random.py Lines 272 to 281 in 68220fe
Please, can you provide me a failure case so I can check? |
src/llama.cpp
Outdated
@@ -13985,6 +14002,10 @@ struct llm_tokenizer_bpe {
    void tokenize(const std::string & text, std::vector<llama_vocab::id> & output) {
        int final_prev_index = -1;

        // FIXME: pre-tokenize added_tokens (user-defined tokens) before other pre-tokenization
I think you are looking for this (line 14789 in db2ffd5):

    if (parse_special) tokenizer_st_partition(vocab, fragment_buffer);

If enabled (parse_special), all added tokens are pre-tokenized even before regex splits.
Hmm, yes! This is it. But it makes me think there should be a distinction between special tokens and user-defined tokens. I think parse_special=false should still pre-tokenize non-special user-defined tokens before the regex splits.
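A rough Python sketch of that ordering (the names and lists below are illustrative placeholders, not llama.cpp's internals): user-defined added tokens always partition the raw text first, control tokens only join the partitioning when parse_special is enabled, and the regex pre-tokenizer would only ever see the remaining plain-text fragments.

import re

# illustrative stand-ins; these lists are not llama.cpp's actual data structures
USER_DEFINED = ["<h1>", "</h1>"]   # non-special added tokens
CONTROL      = ["<bos>", "<eos>"]  # special (control) tokens

def partition(text: str, parse_special: bool) -> list[tuple[str, bool]]:
    # split the raw text on added tokens BEFORE any regex pre-tokenization;
    # only the fragments marked False would go on to the regex splits and BPE merges
    specials = USER_DEFINED + (CONTROL if parse_special else [])
    splitter = re.compile("(" + "|".join(map(re.escape, specials)) + ")")
    return [(frag, frag in specials) for frag in splitter.split(text) if frag]

print(partition("<bos><h1>Hello</h1>", parse_special=False))
# [('<bos>', False), ('<h1>', True), ('Hello', False), ('</h1>', True)]
print(partition("<bos><h1>Hello</h1>", parse_special=True))
# [('<bos>', True), ('<h1>', True), ('Hello', False), ('</h1>', True)]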
@jaime-m-p With that commit, with the MPT tokenizer, I can see that the spaces do get handled correctly. However, correcting the MPT pre-tokenizer:

diff --git a/llama.cpp b/llama.cpp
index ab8620ec..58b84839 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -13222,13 +13222,7 @@ struct llm_tokenizer_bpe {
                };
                break;
            case LLAMA_VOCAB_PRE_TYPE_MPT:
-                // TODO: MPT pre-tokenization regexes are unknown
-                //       the following are close, but not exact. run the following:
-                //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf
-                GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed");
                regex_exprs = {
-                    "\\s?\\p{L}+",
-                    "\\s?\\p{P}+",
                    "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
                };
                break;

What you've done is definitely better than this PR, since this means MPT and OLMo and other NeoX-style tokenizers can simply directly use the GPT-2 pre-tokenizer, like they should. I think I'll close this in favor of #8039.
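For reference, the GPT-2-style pattern kept by the diff above can be exercised on its own with Python's third-party regex module (a standalone sketch, not the llama.cpp implementation); note how the 's/'ll alternatives keep contractions attached to their apostrophes, which is what fixes the MPT apostrophe problems mentioned just below.

import regex  # third-party "regex" package; the std "re" module lacks \p{...} classes

GPT2_PRE = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)"""

text = "I'll say it's    fine, MPT-style."
print(regex.findall(GPT2_PRE, text))
# ['I', "'ll", ' say', ' it', "'s", '   ', ' fine', ',', ' MPT', '-', 'style', '.']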
From further testing, it seems like passing […]. This PR makes it still pick the special spaces even when […]. It seems like NeoX tokenizers complicate the decision of whether user input should be exempt from special tokens or not. Or maybe user-defined tokens should not be considered special, as in […].
Oh, this seems to fix all MPT apostrophe problems.
Note that AutoTokenizer does not even offer the optional […].
Only used in _set_vocab_gpt2() for now.
This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece)

* llama : fix wrong pre-tokenization of byte tokens
In trying to fix the pre-tokenization for non-special added tokens for MPT and OLMo, I think I've also fixed Gemma and Gemma-2's pre-tokenization. On master:

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
2 -> '<bos>'
235322 -> '<'
973 -> 'blockquote'
2577 -> '><'
235259 -> 'h'
235274 -> '1'
235313 -> '>'
4521 -> 'Hello'
727 -> '</'
235259 -> 'h'
235274 -> '1'
3119 -> '></'
973 -> 'blockquote'
235313 -> '>'

On this branch (after re-converting the model):

$ ./build/bin/llama-tokenize -m models/ggml-vocab-gemma-2.gguf -p "<blockquote><h1>Hello</h1></blockquote>"
...
2 -> '<bos>'
191 -> '<blockquote>'
185 -> '<h1>'
4521 -> 'Hello'
192 -> '</h1>'
198 -> '</blockquote>'

There's a weird behavior of the HF tokenizer for Gemma which seems to prefer the 16-space token (id: 153) over longer ones:
But when using […]. So I assume what […].
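The preference can be probed directly with the HF tokenizer by encoding runs of spaces (a quick sketch; the checkpoint name and access to the gated model are assumptions):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-2-9b")  # assumed checkpoint name
for n in (4, 16, 20, 31):
    ids = tok.encode(" " * n, add_special_tokens=False)
    # print how each run of spaces is split into space tokens of various lengths
    print(n, ids, [len(tok.decode([i])) for i in ids])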
-        tokens, scores, toktypes = self._create_vocab_sentencepiece()
-        # hack: This is required so that we can properly use start/end-of-turn for chat template
-        for i in range(108):
-            # including <unusedX>, <start_of_turn>, <end_of_turn>
-            toktypes[i] = SentencePieceTokenTypes.CONTROL
-        self.gguf_writer.add_tokenizer_model("llama")
-        self.gguf_writer.add_tokenizer_pre("default")
-        self.gguf_writer.add_token_list(tokens)
-        self.gguf_writer.add_token_scores(scores)
-        self.gguf_writer.add_token_types(toktypes)
-
-        special_vocab = gguf.SpecialVocab(self.dir_model, n_vocab=len(tokens))
-        special_vocab.add_to_gguf(self.gguf_writer)
+        self._set_vocab_sentencepiece()
The "hack" from #8244 is no longer required because the control tokens are now identified with tokenizer_config.json
.
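For reference, the special flag lives in added_tokens_decoder inside tokenizer_config.json, so a check along these lines is enough to tell control tokens apart from other added tokens (a sketch; the model path is an assumption):

import json
from pathlib import Path

model_dir = Path("models/gemma-2")  # assumed path to the HF model directory
cfg = json.loads((model_dir / "tokenizer_config.json").read_text())

# entries in the JSON look like {"2": {"content": "<bos>", "special": true, ...}, ...}
added = cfg.get("added_tokens_decoder", {})
control_ids = {int(tok_id) for tok_id, info in added.items() if info.get("special")}
print(sorted(control_ids))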
src/llama.cpp
Outdated
-        if (!(vocab.id_to_token[id].attr & LLAMA_TOKEN_ATTR_NORMAL)) {
+        if (vocab.id_to_token[id].attr & (LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
It's necessary to limit the special token cache to CONTROL and USER_DEFINED tokens, because otherwise BYTE tokens would be pre-tokenized incorrectly (e.g. the string <0x20> would get tokenized as ' ' (a space)).
But maybe it's a bad idea to exclude other types of tokens, like UNUSED padding tokens, which might in some cases (when?) be desired to tokenize to themselves?
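A toy illustration of that filter (plain Python with placeholder flags, not llama.cpp's enum): only CONTROL and USER_DEFINED entries enter the cache, so a literal "<0x20>" in the input text stays ordinary text instead of collapsing into the byte token for a space.

from enum import Flag, auto

class Attr(Flag):  # placeholder attribute flags, not llama.cpp's values
    NORMAL = auto()
    CONTROL = auto()
    USER_DEFINED = auto()
    BYTE = auto()
    UNUSED = auto()

vocab_attrs = {
    "<bos>":   Attr.CONTROL,
    "<h1>":    Attr.USER_DEFINED,
    "<0x20>":  Attr.BYTE,     # the byte token for ' '
    "[PAD42]": Attr.UNUSED,
}

special_cache = {text for text, attr in vocab_attrs.items()
                 if attr & (Attr.CONTROL | Attr.USER_DEFINED)}
print(special_cache)  # only {'<bos>', '<h1>'}; "<0x20>" is never matched in raw text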
I agree with LLAMA_TOKEN_ATTR_CONTROL | LLAMA_TOKEN_ATTR_USER_DEFINED, but I think we need LLAMA_TOKEN_ATTR_UNKNOWN too.
I think it is correct to drop UNUSED tokens, but we need to parse UNKNOWN tokens.
It is LLAMA_TOKEN_ATTR_NORMAL because currently UNUSED and UNKNOWN are wrongly mixed.
Should UNKNOWN tokens only be parsed when parse_special == true, or all the time like USER_DEFINED tokens?
I'm not sure if UNKNOWN tokens should be specially parsed at all. I would tend toward only parsing them with parse_special == true, like CONTROL tokens. (this is also the current behavior of master)
I'm trying to figure out where UNKNOWN tokens are used and if it's useful to specially parse them. But this might differ from HF's tokenizers, so I need a model with UNKNOWN tokens to test this out. If I don't find one, I could modify an existing one.
> but we need to parse UNKNOWN tokens.

I've fixed this in 98edea6. UNKNOWN tokens are parsed when parse_special == true, as on master.
tokens.append(f"[PAD{i}]") | ||
toktypes.append(gguf.TokenType.USER_DEFINED) | ||
toktypes.append(gguf.TokenType.UNUSED) |
Padding tokens are set as UNUSED to reflect how it was already done in _set_vocab_sentencepiece, and also to avoid wrongly (pre-)tokenizing strings which happen to correspond to a padding token (since USER_DEFINED tokens are now always pre-tokenized specially).
seems_special = seems_special or (token_text.startswith("<|") and token_text.endswith("|>")) # deepseek-coder | ||
|
||
# TODO: should these be marked as UNUSED instead? (maybe not) | ||
seems_special = seems_special or (token_text.startswith("<unused") and token_text.endswith(">")) # gemma{,-2} |
should things like this be defined in the conversion script under the specific model to avoid accidental false hits? if, for some weird reason, a model comes around with a non-special token that starts with <|, would be annoying to avoid that
maybe does_token_look_special should take in 2 lists: 1 list of strings of known special tokens, and a list of tuples of starts/ends with tokens
So for gemma2, we'd call it with:
special_tokens = ["<mask>", "<2mass>", "[@BOS@]"]
special_tags = [("<unused", "|>")]
self.does_token_look_special(token, special_tokens, special_tags)
and then here we'd have:
seems_special = token_text in special_tokens
for start_tag, end_tag in special_tags:
    seems_special = seems_special or (token_text.startswith(start_tag) and token_text.endswith(end_tag))
return seems_special
oh i realize I misread the structure of the code, hmmm.. still not impossible but would have to be passed at a higher level
> if, for some weird reason, a model comes around with a non-special token that starts with <|, would be annoying to avoid that
This only affects added tokens either from added_tokens in tokenizer.json or from added_tokens_decoder in tokenizer_config.json, so it does not affect normal tokens starting with <| in any way. Not all tokens of the vocab are checked with this, only the ones part of added_tokens (which are treated specially by HF tokenizers too anyway). And added tokens starting with <| and ending with |> are arguably always control tokens; this was added pretty much because some model makers wrongly marked those as non-special (notably, <|User|>, <|Assistant|> and <|EOT|> in deepseek-coder are supposedly non-special. Same with <|START_OF_TURN_TOKEN|> and <|END_OF_TURN_TOKEN|> for command-r).
I did not yet notice any conflict in the added_tokens to justify making model-specific checks instead of always checking for all known "special-but-aren't-marked-special" tokens.
Also, this is a method of Model, so it technically can be overridden by subclasses should there ever be a model with conflicting added_tokens.
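For instance, a hypothetical converter in convert_hf_to_gguf.py could narrow the heuristic like this (a sketch only: "FooModel", its conflicting token, and the omission of the usual @Model.register / model_arch boilerplate are all illustrative):

class FooModel(Model):  # hypothetical subclass living in convert_hf_to_gguf.py
    def does_token_look_special(self, token: str | bytes) -> bool:
        # suppose this vocab legitimately uses "<|note|>" as a plain added token:
        # exempt it from the generic "<|...|>" heuristic of the base class
        if token == "<|note|>":
            return False
        return super().does_token_look_special(token)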
The order was previously wrong, which caused errors in some tests.
* llama : fix mpt and olmo pre-tokenizer
* llama : pre-tokenize non-special user-defined tokens first
* llama : fix detection of control-like user-defined tokens
* convert_hf : identify which user-defined tokens are control tokens
  Only used in _set_vocab_gpt2() for now.
* convert_hf : identify more added control tokens for SPM tokenizers
  This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece)
* llama : fix wrong pre-tokenization of byte tokens
* llama : fix Viking pre-tokenizer regex
  The order was previously wrong, which caused errors in some tests.
* llama : fix command-r detokenization
* llama : add UNKNOWN tokens in the special tokens cache
Which models are going to need re-conversion? Just gemma2? deepseek2? |
@oldmanjk Yes, Gemma, Gemma-2, Command-R, Command-R-Plus, and deepseek-coder (not sure which of their models use that pre-tokenizer) need re-conversion, because (except for Gemma and Gemma 2) this changes the type of some of their added tokens. It's mostly Gemma and Gemma-2 which really need re-conversion for correctness (unfortunately), because all non-special added tokens […].
So deepseek only really needs re-conversion if you're "tokenizing with […]"?
This only happens when using […].
I checked, and https://huggingface.co/deepseek-ai/DeepSeek-Coder-V2-Instruct is a bit less affected by this than the older […].
@oldmanjk It's not necessary to reconvert, especially if your use-case is not affected, and especially if it's that demanding. You're likely fine with your existing […]. Makes me think there should really be a way to update the metadata of GGUF models without rebuilding them all over again.
Thanks for the fast and thorough responses! I really appreciate it.
I think this rules out my personal use case. Unfortunately, it sounds like, to release these imatrices and/or quants, one should still probably re-convert. My motivation to do so is dying. Luckily, there are others, but I feel for them as well.
To be clear, most of the time and resources are spent creating imatrices. I don't know if this changes anything. I assume they'd need to be recreated either way, right? (I don't mean for my personal use case, but for both release and if there was a way to update the metadata.) bf16 CUDA support can't come fast enough. It would cut these re-conversion penalties at least in half.
EDIT: it uses llama.cpp/examples/imatrix/imatrix.cpp Line 443 in 7d0e23d
(the fourth argument (missing) is […].) But in any case the calibration datasets usually don't contain special tokens (I think?); if they do, this does seem like this line in […]. But this won't really change the importance matrix much, not in a noticeable way anyway (especially if there are no special tokens in it). So the existing imatrices can likely still be used.
@compilade |
@oldmanjk It's the added_tokens list in tokenizer.json:

$ jq .added_tokens < tokenizer.json

If you want to […]:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | grep -z '^<' | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt

While if you want to search for any special token, simply remove the middle grep:

$ jq --raw-output0 .added_tokens[].content < ./path/to/tokenizer.json | xargs -I{} -0 grep --color -F {} ./path/to/dataset.txt
$ jq .added_tokens < deepseek-ai_DeepSeek-Coder-V2-Instruct/tokenizer.json
I manually searched for the string, "<|", which is present in all the deepseek-coder-v2 special tokens, in my datasets. The popular "code.txt" dataset does contain them ("<|system_prompt|>" and "<|user_prompt|>"), but, luckily, those don't exist in deepseek-coder-v2's special tokens. The popular "tiny.txt" dataset contains 316 "<|endoftext|>" tokens, which, again, luckily, doesn't exist in deepseek-coder-v2's special tokens. None of the other datasets I use, including the popular "badwords.txt", "technical.txt", "groups_merged-enhancedV3.txt", and "c4.txt", contain any instances of "<|". Edit - This also makes me think - should we be modifying our datasets to conform to the particular model? For example, should I be replacing "<|endoftext|>" with "<|EOT|>" when creating an imatrix for deepseek-coder-v2 from tiny.txt? |
Is there anything I can help with to speed this up? |
This makes the changes from #8321 more consistent with the other changes made here.
I don't really know what I'm doing, so forgive me if this isn't helpful
@@ -373,6 +373,29 @@ def from_model_architecture(cls, arch: str) -> type[Model]:
        except KeyError:
            raise NotImplementedError(f'Architecture {arch!r} not supported!') from None

    def does_token_look_special(self, token: str | bytes) -> bool:
"Method 'does_token_look_special' may be 'static'"
I prefer not to make it a @staticmethod, to allow overriding it in the subclasses of Model if needed.
@@ -143,7 +143,7 @@ def __init__(self, dir_tokenizer: str):
         self.vocab = list(sorted(self.vocab))
         # tokens and lists
         self.special_tokens = list(self.model.all_special_tokens)
-        self.added_tokens   = list(self.model.added_tokens_encoder)
+        self.added_tokens   = self.model.batch_decode(self.model.added_tokens_encoder.values(), skip_special_tokens=False)
"PEP 8: E221 multiple spaces before operator" (just FYI - looks good to me)
"PEP 8: E501 line too long (122 > 120 characters)"
tests/test-tokenizer-random.py
Outdated
logger.error(" Expected: " + str(ids1)) | ||
logger.error(" Result: " + str(ids2)) | ||
logger.error(" Expected: " + str(ids1) + f" {[tokenizer1.decode([id]) for id in ids1]}") | ||
logger.error(" Result: " + str(ids2) + f" {[tokenizer2.decode([id]) for id in ids2]}") |
"Shadows built-in name 'id'"
* test-tokenizer-random : add a failing edge case for falcon
Sorry for the review delay. There is only one problem I can't see how to fix related to pre-normalization in the convert script: llama.cpp/convert_hf_to_gguf.py Line 424 in 1caa20f
Note that this is not a real error; it only indicates that the vocabs are different (AutoTokenizer vs llama.cpp). Tokenizing and detokenizing seems correct. Instead of modifying the vocab words, there is an alternative (doing it the other way around):

static std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab, std::string raw_text, bool add_special, bool parse_special) {
std::vector<llama_vocab::id> output;
std::forward_list<fragment_buffer_variant> fragment_buffer;
+ // tokenizer.json: "normalizer": { "type": "Replace", "pattern": { "String": " " }, "content": "▁" }
+ if (vocab.type == LLAMA_VOCAB_TYPE_SPM) { //TODO: use vocab.tokenizer_escape_whitespaces ?
+ llama_escape_whitespace(raw_text);
+ }
if (!raw_text.empty()) {
fragment_buffer.emplace_front(raw_text, 0, raw_text.length());
tokenizer_st_partition(vocab, fragment_buffer, parse_special);
}
...
// prefix with space if previous is special
if (vocab.tokenizer_add_space_prefix && is_prev_special) {
- raw_text = " " + raw_text;
+ raw_text = "\xe2\x96\x81" + raw_text;
}
...
llm_tokenizer_spm tokenizer(vocab);
- llama_escape_whitespace(raw_text);
tokenizer.tokenize(raw_text, output);
    is_prev_special = false;

I think this follows the tokenizer pipeline better and generalizes using the config flag tokenizer_escape_whitespaces. I'm still doing experiments, but it leads to simplifications and the removal of special cases (an indication that the path is correct): lines 21170 to 21188 in 1caa20f.

I guess this will converge to something like:

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
case LLAMA_VOCAB_TYPE_WPM:
case LLAMA_VOCAB_TYPE_SPM:
case LLAMA_VOCAB_TYPE_UGM: {
std::string unscaped_text = llama_unescape_whitespace(token_text);
return _try_copy(unscaped_text.data(), unscaped_text.size());
}
case LLAMA_VOCAB_TYPE_BPE: {
std::string decoded_text = llama_decode_text(token_text);
return _try_copy(decoded_text.data(), decoded_text.size());
}
...
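As an aside, the whitespace escaping being moved around in this proposal is just the SentencePiece convention of swapping spaces for U+2581; a minimal Python stand-in for llama_escape_whitespace / llama_unescape_whitespace, for reference:

SPIECE_UNDERLINE = "\u2581"  # '▁' (LOWER ONE EIGHTH BLOCK), i.e. "\xe2\x96\x81" in UTF-8

def escape_whitespace(text: str) -> str:
    return text.replace(" ", SPIECE_UNDERLINE)

def unescape_whitespace(text: str) -> str:
    return text.replace(SPIECE_UNDERLINE, " ")

assert escape_whitespace("Hello world") == "Hello\u2581world"
assert unescape_whitespace("\u2581Hello") == " Hello"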
I agree it would be better to keep the vocab text verbatim. But unconditionally pre-escaping the input would lose the distinction between […]. (EDIT: on further thought, […])
For control tokens, I'm not sure if the whitespaces should be unescaped. I also don't see a way around special-casing BYTE tokens with […]. I think it should be more like:

const std::string & token_text = model->vocab.id_to_token[token].text;
switch (llama_vocab_get_type(model->vocab)) {
case LLAMA_VOCAB_TYPE_WPM:
case LLAMA_VOCAB_TYPE_SPM:
case LLAMA_VOCAB_TYPE_UGM: {
// NOTE: we accept all unsupported token types,
// suppressing them like CONTROL tokens.
if (attr & (attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED) && (attr & LLAMA_TOKEN_ATTR_NORMALIZED)) {
return _try_copy(token_text.data(), token_text.size());
} else if (attr & LLAMA_TOKEN_ATTR_BYTE) {
char byte = (char) llama_token_to_byte(model->vocab, token);
return _try_copy((char*) &byte, 1);
} else if (attr & (LLAMA_TOKEN_ATTR_NORMAL | attr_special | LLAMA_TOKEN_ATTR_USER_DEFINED)) {
std::string result = token_text;
llama_unescape_whitespace(result);
return _try_copy(result.data(), result.size());
}
break;
}
... But I'm not sure, since this doesn't really look simpler. If no SPM tokenizers use pre-normalized added_tokens (I've yet to find a counter-example), then your approach (if also considering BYTE tokens) would work, though. |
* llama : fix mpt and olmo pre-tokenizer
* llama : pre-tokenize non-special user-defined tokens first
* llama : fix detection of control-like user-defined tokens
* convert_hf : identify which user-defined tokens are control tokens
  Only used in _set_vocab_gpt2() for now.
* convert_hf : identify more added control tokens for SPM tokenizers
  This makes Gemma and Gemma-2 tokenize pretty much EVERYTHING correctly, including HTML tags and consecutive spaces, but it unfortunately requires model re-conversion. There seems to be a weird behavior of the HF tokenizer for Gemma, which prefers to use the 16-space token over more lengthy space tokens, while using the SentencePiece tokenizer does not do this. (the implementation in llama.cpp has the same behavior as SentencePiece)
* llama : fix wrong pre-tokenization of byte tokens
* llama : fix Viking pre-tokenizer regex
  The order was previously wrong, which caused errors in some tests.
* llama : fix command-r detokenization
* convert_hf : reduce usages of the UNKNOWN token type
* llama : add UNKNOWN tokens in the special tokens cache
* convert_hf : reduce usages of UNKNOWN for InternLM2
  This makes the changes from ggerganov#8321 more consistent with the other changes made here.
* test-tokenizer-random : reduce potential conflicts with ggerganov#8379
* test-tokenizer-random : add a failing edge case for falcon
MPT and OLMo use a NeoX-style tokenizer, which has pre-normalized spaces in its added_tokens, while Gemma uses encoded (non-normalized) spaces in its added_tokens, and also has some HTML tags in there.
In the current state of src/llama.cpp, the tokenizer tests fail for MPT and OLMo when using parse_special == false.

HuggingFace's tokenizers pre-tokenizes the added_tokens before other pre-tokenization happens, see https://github.com/huggingface/tokenizers/blob/fdd26ba9a3f0c133427aab0423888cbde91362d7/tokenizers/src/tokenizer/mod.rs#L726

I've changed tokenizer_st_partition to be called even if parse_special is false to pre-tokenize user-defined tokens, but it processes control tokens only when parse_special is true. This allows pre-tokenizing non-special added tokens correctly at all times, while still allowing to safely not tokenize control tokens when parse_special is false.

This also fixes Gemma's tokenization of HTML tags, but it requires re-conversion because the added tokens were previously not using the correct types, and also because user-defined tokens are assumed to use bare spaces by llama.cpp, but Gemma uses "▁" (normal tokens are not affected).

TODO

* ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mpt.gguf
* ./build/bin/test-tokenizer-0 ./models/ggml-vocab-olmo.gguf
* tests/test-tokenizer-random.py with Gemma-2's tokenizer: there's no difference whatsoever to the reference tokenizer.model