Bug: SPM tokenization breaks in at least one specific case. #7629
Comments
The 2 tokens … Looks like a mistake in the model configuration to me.
That is strange. I'll convert the model again with that fixed. I'm not sure why that'd cause this problem though. Because it's …
I found more known problems. Fails with:

"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": true, <--
"single_word": false,
"special": false
},

"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false, <--
"single_word": false,
"special": true
},

In phi-3 all added tokens have … I don't see any way to solve this correctly without having per-token flags/properties. The only thing I can do is to initialize …
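For reference, these per-token flags can be inspected directly; a minimal sketch, assuming a reasonably recent transformers version and the dolphin model discussed in this thread:

from transformers import AutoTokenizer

# Print the added-token flags (lstrip/rstrip/normalized/special) referred to above.
# Requires a transformers version that exposes added_tokens_decoder as AddedToken objects.
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.8-mistral-7b-v02")

for token_id, tok in sorted(tokenizer.added_tokens_decoder.items()):
    print(token_id, repr(tok.content),
          "lstrip:", tok.lstrip, "rstrip:", tok.rstrip,
          "normalized:", tok.normalized, "special:", tok.special)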
It seems that if we want to be able to handle arbitrary models then we need to level up the tokenizer to take all of the data into account. Hard-coding model names in the tokenizer might be a quick fix, but it's gonna need ongoing maintenance. That's fragile and messy. But it's better than it just plain not working I guess. In my case, I'm skipping llama_tokenize and going right to the low-level tokenizer to take control over the process.
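As an illustration of that kind of manual control, here is a rough Python analogue using the HF tokenizer rather than llama.cpp's internals; the model name and the splitting logic are illustrative, and the ids for plain-text pieces still depend on the tokenizer's prefix-space handling:

import re
from transformers import AutoTokenizer

# Sketch: map special markers to their ids directly and encode the remaining
# text pieces separately, instead of encoding the whole string in one call.
tokenizer = AutoTokenizer.from_pretrained("cognitivecomputations/dolphin-2.8-mistral-7b-v02")
specials = ["<|im_start|>", "<|im_end|>"]
pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"

def tokenize_piecewise(text):
    ids = []
    for piece in re.split(pattern, text):
        if piece in specials:
            ids.append(tokenizer.convert_tokens_to_ids(piece))
        elif piece:
            ids.extend(tokenizer.encode(piece, add_special_tokens=False))
    return ids

text = "<|im_start|><|im_end|>\n"
print(tokenizer.encode(text, add_special_tokens=False))  # whole string at once
print(tokenize_piecewise(text))                          # piece by piece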
I agree, meanwhile #7685.
@jaime-m-p @giladgd I'm verifying now.
The problem persists in 3b38d4. No change.
@snichols Actually this is the unexpected output. The correct output seems to have an additional space:

from transformers import AutoTokenizer

model = "./models/tokenizers/dolphin-2.8-mistral-7b-v02"
tokenizer = AutoTokenizer.from_pretrained(model)
text = "<|im_start|><|im_end|>\n"
ids = tokenizer.encode(text, add_special_tokens=False)
re = tokenizer.decode(ids)
print(repr(text)) # '<|im_start|><|im_end|>\n'
print(repr(re)) # '<|im_start|><|im_end|> \n'
print(ids) # [32001, 32000, 28705, 13]
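For context, a quick check of what the extra id maps to; a sketch assuming the same local tokenizer files as above:

from transformers import AutoTokenizer

# Show what id 28705 maps to in this vocabulary; it should be the SentencePiece
# whitespace piece "▁", which is where the extra decoded space comes from.
tokenizer = AutoTokenizer.from_pretrained("./models/tokenizers/dolphin-2.8-mistral-7b-v02")
print(tokenizer.convert_ids_to_tokens([28705]))  # expected: ['▁']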
@jaime-m-p Well, then all is as expected. Woot!
@jaime-m-p What is considered "correct" is not universal. For example, if I add legacy=False:

from transformers import AutoTokenizer
model = "cognitivecomputations/dolphin-2.8-mistral-7b-v02"
tokenizer = AutoTokenizer.from_pretrained(model, legacy=False)
text = "<|im_start|><|im_end|>\n"
ids = tokenizer.encode(text, add_special_tokens=False)
re = tokenizer.decode(ids)
print(repr(text)) # '<|im_start|><|im_end|>\n'
print(repr(re)) # '<|im_start|><|im_end|>\n'
print(ids) # [32001, 32000, 13]

The situation with handling of spaces in the SPM tokenizer is unfortunate, see #3664. My position, based on a practical approach, is that instead of trying to follow some standard here, it is better to give each model what it expects. If a particular model works better with a space after …
I think I understand what you are stating here, but isn't it contradictory?
Somehow this better alternative should be stored in the config json files.
Then we need to manage this config parameter too (currently not implemented in llama.cpp). Also note, in this particular case for … If this model was trained with …
I think this is the way, but you only require manual editing if you want to override model training config values. Actually we have only these tokenizer flags implemented (lines 2608 to 2615 in f1948f1).
Can you achieve what you expect using these flags? Or by adding another config JSON flag? Maybe by implementing a function like llama_config_tokenizer(...) to override these tokenizer flags.
If we add a non-standard flag, how can we test it for all models?
With regard to small details like this, a model's configuration files can contain errors and sub-optimal parameters. For any particular model, I'm not sure what exact tokenization quirks were in effect during training, so I would not be able to find an example of a model that performs better with different rules for inserting spaces between training and inference. But I can give a theoretical example where trying to match training parameters doesn't make sense: a merged model where the source models used different modes. I think such a model should work well both with and without extra spaces, but still, which mode is better can only be found out via testing. And here, I suspect, different quantization types may favor different options.

If we can store in the model file all the information needed to reproduce the desired variation of tokenization, then in some cases that information may need to be written after quantization is done. Given that the creator, converter, and user of a model may be 3 different entities, it's good to give the user an option to easily select which variation to use. As to how to store that information, I'm not sure what would be the best way, but here is one:
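For illustration, a minimal sketch of the general shape such per-token metadata could take; the field names are hypothetical and not an existing llama.cpp or GGUF key:

# Hypothetical per-token tokenization options (illustrative names, not an
# implemented format): lets a user choose, per special token, whether a space
# is inserted after it, independent of how the model was converted.
token_overrides = {
    "<|im_start|>": {"space_after": True},
    "<|im_end|>":   {"space_after": False},
}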
This allows for flexibility like adding a space after im_start, but not after im_end. I don't know if any model would benefit from it, but if one does, it will be covered.
As a side note, this issue concerns the case when … When #3664 is resolved, if ever, and when a better way to handle special tokens in prompt formats/templates is implemented (this is planned), this will need to be considered again.
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Consider this code snippet:
With the latest version this is generating the following output:
In an earlier version of llama.cpp, the correct tokenization was generated:
This work is based on https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02.
If I tokenize each component separately, I get the correct results for each token. However, tokenizing <|im_end|>\n results in an extra 28705 token in the output. Interestingly enough, <|im_start|>\n is also correct. There's something extra special about this <|im_end|>. I haven't methodically gone over previous commits to see when this problem was introduced. Let me know if that'll help narrow the cause down.

I'm pretty confident that I can work around this problem just by tokenizing each element separately. I'll do that and run the model through some tests. That being said, there may be some other tokenization issues in the code that are being surfaced by this.
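One possible way to check this from Python; a sketch assuming the llama-cpp-python bindings (with Llama(..., vocab_only=True) and tokenize(..., special=True) available) and an illustrative GGUF path, not the original snippet:

from llama_cpp import Llama

# Load only the vocabulary; no weights are needed for tokenization checks.
llm = Llama(model_path="./dolphin-2.8-mistral-7b-v02.gguf", vocab_only=True)

# Compare the pieces on their own with the combined "<|im_end|>\n", which is
# where the extra 28705 (a bare space token) is reported to appear.
print(llm.tokenize(b"<|im_start|>", add_bos=False, special=True))
print(llm.tokenize(b"<|im_end|>", add_bos=False, special=True))
print(llm.tokenize(b"<|im_end|>\n", add_bos=False, special=True))
print(llm.tokenize(b"<|im_start|><|im_end|>\n", add_bos=False, special=True))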
Name and Version
This is a custom app using tag b3040 of llama.cpp: https://github.com/ggerganov/llama.cpp/releases/tag/b3040

What operating system are you seeing the problem on?
Linux
Relevant log output
No response