MPT: tokenization crashes #3604
I'm also seeing the same crash with mpt-30b-instruct (no surprise, since it uses the same vocab/tokenizer).
@goerch Please see above. With this "obvious" hack/fix it does not crash:
But I don't understand how the particular hardcoded character ranges in unicode_to_bytes_map_bpe were selected, so I'm not sure whether this is the correct approach.
This happens for the reasons described in #3405. I have not yet figured out how to port the changes from that pull request to goerch's new tokenizer code. How do we currently differentiate between normal and added tokens for MPT in the C++ code?
Looking at
There is no validation before the lookup; I think it should be at least something along the lines of:
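Something along those lines, as a minimal sketch (assumed names and types, not the actual code):

```cpp
// Hypothetical sketch: validate the reverse byte-map lookup instead of calling .at(),
// which throws std::out_of_range for any piece that is not in the map.
#include <cstdint>
#include <map>
#include <string>

// Returns true and writes the byte if the piece is in the map; false otherwise,
// letting the caller decide how to handle pieces that were never byte-encoded.
static bool try_unicode_to_byte(const std::map<std::string, uint8_t> & map,
                                const std::string & piece, uint8_t & out) {
    const auto it = map.find(piece);
    if (it == map.end()) {
        return false;
    }
    out = it->second;
    return true;
}
```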
I think that's wrong. This code assumes that all tokens in the vocabulary are specially encoded so that they are valid JSON strings (which makes me wonder why we now save them this way in the GGUF files, since GGUF does not have the same restrictions). But added tokens are not specially encoded, so we must not call unicode_to_bytes_bpe on them; only normal tokens are guaranteed to be encoded this way.
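As a rough sketch of that distinction (illustration only, not the actual llama.cpp API; `byte_decode` stands in for the unicode_to_bytes step, and the added/normal split is assumed to be known by the caller):

```cpp
// Illustration: apply the byte-level decoding only to normal tokens; added tokens
// are stored as-is in tokenizer.json and must not go through unicode_to_bytes_bpe.
#include <functional>
#include <string>

std::string token_to_piece_sketch(const std::string & text, bool is_added_token,
                                  const std::function<std::string(const std::string &)> & byte_decode) {
    if (is_added_token) {
        return text; // added tokens (e.g. "<|endoftext|>") are returned untouched
    }
    return byte_decode(text); // only normal tokens carry the byte-level encoding
}
```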
@cebtenzzre I meant that not as a fix but as a general point: the map doesn't contain all possible strings, so map.at() will crash if anything that isn't in the map gets passed as the argument. It's not that important, and it doesn't seem to be the cause here. I think I might know what is going on. The Unicode codepoint for space is the same as ASCII 0x20 (32).
When the map is created, space is added by the last for loop, at an index with an offset of 256. Except that the space comes from https://github.com/ggerganov/llama.cpp/blob/1e0e873c373c33989beb6bc64d83cd572ab7fe2b/llama.cpp#L9437, which doesn't use any offsets and returns a normal space for that codepoint. But tracing this back even further, the " " comes directly from the vocab. Tracing it even further back, wouldn't that mean the conversion should have encoded it? @cebtenzzre, is that what you meant? Am I understanding this correctly? Even if I'm wrong here, I think this bit of code could really use some, even rudimentary, comments.
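For reference, a self-contained sketch of the GPT-2-style byte-to-codepoint map being described (not the exact llama.cpp code): bytes in the printable ranges keep their own codepoint, and every other byte, including space (0x20), is remapped to 256 plus a running offset, which is why space ends up as codepoint 0x120 ('Ġ') on the encoded side.

```cpp
// Sketch of the GPT-2 byte-level BPE byte -> codepoint map (illustration only).
#include <cstdio>
#include <map>

static std::map<int, int> bytes_to_unicode_sketch() {
    std::map<int, int> m;
    auto keep = [](int b) {
        // printable ranges that map to themselves: '!'..'~', 0xA1..0xAC, 0xAE..0xFF
        return (b >= '!' && b <= '~') || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
    };
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        m[b] = keep(b) ? b : 256 + n++; // non-printable bytes are shifted past 255
    }
    return m;
}

int main() {
    const auto m = bytes_to_unicode_sketch();
    std::printf("space (0x20) -> U+%04X\n", (unsigned) m.at(0x20)); // prints U+0120, i.e. 'Ġ'
    return 0;
}
```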
Not anymore, because goerch recently moved all of the vocab conversion to llama.cpp; it is no longer the conversion script's responsibility. The different parts of the JSON encode the vocabulary in different ways: added tokens are stored as-is, whereas normal tokens need bytes_to_unicode/unicode_to_bytes. My PR changed the conversion script to only decode the normal tokens, but since we no longer decode any tokens in the conversion script, that will not work. I believe in the
I was testing something else when I noticed that the entire vocab gets marked as normal. So if all (BPE) tokens are marked as normal in the .gguf, how can we tell which tokens should get the bytes_to_unicode/unicode_to_bytes treatment? I'm not sure if I'm missing something or if I got this wrong.
I suspect you're right, and the convert scripts (not just for MPT, but for any model with a BPE tokenizer) should be updated to classify tokens (based on #3538 (comment)) as CONTROL (if found in added_tokens with special: true in tokenizer.json), USER_DEFINED (if found in added_tokens with special: false in tokenizer.json), UNUSED (if artificially added just to make the vocab match model.vocab_size, like the MPT pad tokens), and NORMAL otherwise. As it stands, though, CONTROL is used for all BPE added_tokens in convert.py (regardless of the "special" flag), and all the other convert scripts just output all tokens as NORMAL...
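As a compact sketch of the proposed rule (written in C++ purely for illustration; the actual change would live in the Python convert scripts, and the inputs are assumed to come from tokenizer.json and the model config):

```cpp
// Proposed conversion-time classification of BPE vocab entries (illustration only).
enum class gguf_token_type { NORMAL, CONTROL, USER_DEFINED, UNUSED };

gguf_token_type classify_token(bool in_added_tokens, bool special_flag, bool is_pad_filler) {
    if (is_pad_filler) {
        return gguf_token_type::UNUSED;       // e.g. MPT pad tokens added only to reach model.vocab_size
    }
    if (in_added_tokens) {
        return special_flag ? gguf_token_type::CONTROL        // in added_tokens with "special": true
                            : gguf_token_type::USER_DEFINED;  // in added_tokens with "special": false
    }
    return gguf_token_type::NORMAL;           // everything else
}
```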
I don't think the merge was too soon - at least in the sense that it doesn't make matters worse than things were without it. The bug is most likely in the convert scripts, and these were already merged in earlier (I suspect I didn't catch the crash in my prior MPT testing because I wasn't generating random enough content to make the special token appear in the output). I think it would make sense if @goerch also had a look at this issue, because convert-mpt was essentially me copying the convert-gptneox script with small adjustments.
With #3728 and
I get
Not sure if we have a different problem, though.
@goerch With that particular model, you sometimes have to let it spin for quite a while to trigger the problem, and at some point it will produce
Thanks, the original report states:
and I let it run for the requested 1024 tokens. Not sure how many times I should repeat that? Edit: BTW, the patch should make sure that added tokens are
I can confirm that the code from #3728 no longer crashes. If you wish to reproduce the crash, you can do so using llama.cpp code from master (22c69a2), but only if you convert the model without USER_DEFINED tokens, using only NORMAL tokens (i.e. using master's version of the conversion script). Note that code on master does not crash when running with a model converted by #3728. And interestingly, #3728 does not crash even if only NORMAL tokens are output.
Then the following old code in
could be the culprit.
Not sure if we should revert the changes in
Just another smoke test:
@goerch I ran into the same issue in #3586. A simple way to reproduce it is to just pass
It doesn't seem so.
Without looking into
Oh, I forgot about that: the tokenized prompt gets immediately detokenized when it's printed to the log. This seems like a nice, fast way to reproduce it without having to let it generate. The problem (this particular one) is still with the detokenizer rather than the tokenizer. Edit: Vocab preprocessing adds a single line to the output with special token stats and the verification result; you can use it to quickly see whether a model's vocab is correct or not:
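As a minimal sketch of that round-trip check (the tokenize/detokenize callables here are hypothetical stand-ins for the real API):

```cpp
// Hypothetical round-trip check: if detokenize(tokenize(prompt)) crashes or does not
// reproduce the prompt, the vocab/detokenizer is mishandling some token.
#include <string>
#include <vector>

template <typename Tok, typename Detok>
bool vocab_round_trips(const std::string & prompt, Tok && tokenize, Detok && detokenize) {
    const std::vector<int> tokens = tokenize(prompt);
    // Note: exact equality can be too strict if the tokenizer normalizes whitespace or
    // prepends a BOS piece; a crash here, though, is exactly the bug discussed in this issue.
    return detokenize(tokens) == prompt;
}
```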
Or with token classification in the conversion?
That seems to be the case. tokenizer.json (added tokens only):
"added_tokens": [
{
"id": 0,
"content": "<|endoftext|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 1,
"content": "<|padding|>",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 50254,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50255,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50256,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50257,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50258,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50259,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50260,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50261,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50262,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50263,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50264,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50265,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50266,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50267,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50268,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50269,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50270,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50271,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50272,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50273,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50274,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50275,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
{
"id": 50276,
"content": " ",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
}
],
From the looks of it, #3728 could be applicable.
From what I understand, yes, that would be the root cause, and the detokenizer is where it actually becomes a problem.
I looked at this code and have doubts about whether we should count CONTROL and BYTE tokens as special tokens, for example. No need to change this yet, just a heads-up. I'll have to understand which tokens you add to the special tokens cache.
That bit of code is just a simple sanity check; currently it is only used to print to the log whether the token labels in the vocab match the manually extracted special tokens. I still consider this approach of manually extracting special tokens from the vocab as nothing more than a stable workaround and an eventual fallback solution. The method I used is quite simple: iterate over the entire vocab and take each token's string representation; split that string in two at every position (after the 1st byte, 2nd byte, ..., strlen-1 byte) and check whether, for any of those splits, both halves have a matching token in the vocab; if not, mark it as a special token. That's pretty much it.
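A self-contained sketch of that heuristic (assumed shape, not the actual implementation):

```cpp
// Heuristic special-token detection as described above: a token whose text cannot be
// split into two substrings that both exist in the vocab can never be composed by BPE
// merges from other tokens, so it is likely a special token (e.g. "<|endoftext|>").
#include <string>
#include <unordered_set>
#include <vector>

std::vector<std::string> find_special_candidates(const std::vector<std::string> & vocab) {
    const std::unordered_set<std::string> pieces(vocab.begin(), vocab.end());
    std::vector<std::string> special;
    for (const std::string & tok : vocab) {
        if (tok.size() < 2) {
            continue; // single-byte tokens cannot be split, so they are never flagged
        }
        bool decomposable = false;
        for (size_t i = 1; i < tok.size() && !decomposable; ++i) {
            // split after byte i-1 and check both halves against the vocab
            decomposable = pieces.count(tok.substr(0, i)) > 0 && pieces.count(tok.substr(i)) > 0;
        }
        if (!decomposable) {
            special.push_back(tok); // no split has both halves in the vocab -> mark as special
        }
    }
    return special;
}
```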
Are we good here for now? I intend to follow up on the multiple space issue. Should we track it here?
This issue was closed because it has been inactive for 14 days since being marked as stale.
Testing against 5974d61 and #3538
While running
It consistently crashes after a few (~hundred) tokens with this backtrace:
(The GGUF/vocab was exported using convert-mpt-hf-to-gguf.py from the aforementioned commit.)