Fix detokenization of non-special added-tokens #3760
Conversation
return -result.length();
}
memcpy(buf, result.c_str(), result.length());
return result.length();
} else if (llama_is_control_token(model->vocab, token)) {
Seems redundant, since we ignore the token regardless of whether it's a control token or not.
Yes. But these are the token types I'd expect to see. In an older version the code asserted that no other token type could occur, but I'm not sure how we should react (assertion, exception, or simply ignoring the token).
I tested it with #3586 and it works when passing
When checking this, I compared against the behaviour of the HF tokenizers and got different results in several cases. This indicates to me that our special token handling (CONTROL) is different and that our handling of padding tokens could be problematic.
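For anyone who wants to repeat the comparison, a minimal sketch along these lines should do (the model name and token IDs below are placeholders, not the exact ones used above):

```python
# Minimal sketch of comparing HF tokenizer decoding for special vs. added
# tokens. Model name and token IDs are placeholders for illustration only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

ids = [1, 15043, 3186, 2]  # roughly: <s>, two text pieces, </s> (placeholder IDs)

# With skip_special_tokens=True, HF drops CONTROL-like tokens (BOS/EOS/PAD)...
print(tok.decode(ids, skip_special_tokens=True))
# ...while with skip_special_tokens=False they are rendered literally.
print(tok.decode(ids, skip_special_tokens=False))
```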
Referencing a comment with my understanding of the situation, I see a couple of options. I'll need some guidance here on how best to proceed.
Weren't the padding tokens originally a hack anyway? I don't know that they should be included in the actual tokenizer vocabulary. They only exist because GGUF doesn't (currently) have a way to specify vocab_size.

llama.cpp/convert-gptneox-hf-to-gguf.py, lines 121 to 122 in ad93962
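For context, the padding pattern in the convert scripts is roughly the following sketch (a paraphrase, not the exact lines referenced above; the helper name is made up):

```python
# Rough paraphrase of how the convert scripts pad the vocab up to
# config.vocab_size; not the exact code referenced above.
from transformers import AutoTokenizer

def build_padded_vocab(model_dir: str, vocab_size: int) -> list[str]:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    reverse_vocab = {idx: tok for tok, idx in tokenizer.get_vocab().items()}

    tokens: list[str] = []
    for i in range(vocab_size):
        if i in reverse_vocab:
            tokens.append(reverse_vocab[i])
        else:
            # Slot exists in config.vocab_size but not in the tokenizer:
            # fabricate a placeholder so the GGUF vocab matches the
            # embedding tensor's row count.
            tokens.append(f"[PAD{i}]")
    return tokens
```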
I don't think that's the case, though I'm not sure about the origin of it. Quite often the Python HF models are released with tensors and a vocab_size larger than the real vocabulary. OpenAssistant (Falcon), for example, has 12 tokens missing compared to the tensor size and the specified vocab_size. I believe the correct way is to generate PAD tokens; they cannot be produced by the model anyway, since they are just placeholders.
As far as I understand, HF transformers does not need these PAD tokens in the vocabulary, so why not keep things simple and try to be as close to their behavior as we can? We are trying to get the tokenizer to ignore tokens that don't really exist in the first place. We could use the UNUSED token type, but it seems like unnecessary effort to generate placeholders just so they can be totally ignored by the implementation.
Yes, it'd be awesome if there was a fix that meant these extra tokens weren't needed. Recently CausalLM released with a whopping 213 missing tokens, comparing config.vocab_size to the supplied vocabulary. My 'fix' for that until now has been to add a bunch of padding tokens.

I agree with @cebtenzzre that it'd be really great if llama.cpp could just ignore these 'phantom' tokens, like transformers does.

As to why models do this: in some cases it's because they're rounding the vocab size up to a multiple so that tensor parallelism works, e.g. models that have a vocab size of 32032 or 32128. Their actual vocab might be, say, 32003, but that breaks TP, so they round up to the next nearest multiple. Though that's not why OpenAssistant did it.
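For illustration, the round-up being described is just the following (the numbers and multiples are examples, not taken from any specific model):

```python
# Round a vocab size up to the next multiple, as some models do so that
# tensor parallelism divides evenly. Values are illustrative only.
def round_up(n: int, multiple: int) -> int:
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(32003, 32))   # -> 32032
print(round_up(32003, 128))  # -> 32128
```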
You are right, ignoring the mismatch is the better way to solve it. But I'd guess the stored vocab size in the converted model should actually reflect the true vocabulary size and not carry over the wrong one. Basically it's misbehavior on the HF config side to claim a false vocabulary size, most likely to satisfy something in the transformers libraries, which then in turn ignore the wrong number or internally generate PADs. Adding PAD tokens is/was a quick fix for a wrong config, so the options are either adding PAD tokens or correcting the wrong size and relying on the tensor size for calculations. The latter probably needs a fix in all eval routines and checks that rely on vocab size.
Continued here: #4916
Checked with
Before:
After:
Also fixes build problems with `clang` on Windows (not detected by CI?). We can't link with `clip` and `common` here due to duplicate symbol errors.