-
Notifications
You must be signed in to change notification settings - Fork 10.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenization is not equal to Meta's tokenization. #2310
Comments
There are known problems with the tokenizer, |
because Tokenizer.tokenize() always add a space in front of the string passed in. 29871 is the space |
just to explain what I meant by "space", see here: https://github.com/google/sentencepiece/blob/master/README.md#whitespace-is-treated-as-a-basic-symbol |
I don't think it does that, because I just tested it here. But may be this happens elsewhere. |
Running the tests I see the Meta tokens now. But I surely need guidance on how to integrate the necessary changes into BTW: I'd expect a couple of more problems for other tests. |
Waiting for the fallout ...
@viniciusarruda and others ; nice find, indeed! Now that I've learned a bit about the tokenizer: this kind of problem could not only occur with |
i am curious, from openai repo, How_to_count_tokens_with_tiktoken,it says Encodings specify how text is converted into tokens. Different models use different encodings,so i wonder if we are on the right way? did we only need to match the meta? |
I'd like to add to this discussion that the real llama tokenizer eats spaces at the beginning making it very bad for continuous generation. Llamacpp's tokenizer was previously tested to give the same results as the HF tokenizer with our workaround. So for continuous generation the behavior might be better. If this is changed in the code it would be nice if this would be an optional change since if the tokenizer begins eating spaces this breaks use cases. The tokenizer eating spaces is only desirable for applications with a seperate output window. |
Do you have a possible test scenario for the proposed PR in mind? I don't know how to check that otherwise. |
Yes, and the upstream tokenizer workarounds are not perfect so I will give you a method to test the ideal scenario. So imagine a sentence that does not end with a space such as "This is an example." in a UI that lets users add to that text. The correct scenario would be this generation " It works very well." which combined forms "This is an example. This works very well." The original Llama tokenizer at least on HF doesn't do this because they trained sentencepiece in a mode where it swallows the first space. So you'd get this "It works very well." resulting in "This is an example.It works very well". Likewise you also don't want to always insert the space if you don't have to, "My favorite sport is foot" should auto complete to "ball and I play it every week" not " ball and I play it every week". Adhering to that behavior would make it usable for continuous generation use cases because then the text flows naturally and the tokenizer picks the most appropriate token for the end result, rather than never picking a space or always picking a space. On the huggingface side we force the tokenizers hand, we insert a random token such as a comma at the front of the input, and then after its tokenized we remove the first fake token which results in it doing a proper continuous generation. The behavior we typically observed from Llamacpp is very similar to that workaround. And we originally implemented workarounds like this because Llamacpp did a better job and its helped us in the past to identify things were wrong. |
Ad 1.
With main:
With PR:
Ad 2.
With main:
With PR:
|
I believe to understand the current |
Maybe this warrants further investigation. I just ran
Main:
With PR:
|
Found this which seems to fix at least the test cases I have. I don't know why and I didn't test setting I updated the code to produce the results by adding the It might be helpful to understand the behavior of sentencepiece. In this notebook cell ("Load model from byte stream"), it includes a |
@viniciusarruda : tokenizer fixes are in via the |
In the meantime I tried to understand possible invariants for
What would you expect to see?
9601 >▁< > <
2: >tcNSM8fcwo< >nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6< >tcNSM8fcwo nmDL5G 2xEwwnFiX9zbp4E XZlKuKbViNTGRNMcd yI31iwR6LKa6<
[snip]
When rechecking |
When I added the BPE tokenizer for Falcon, I disabled |
A fast moving project, I see. I'm working on fixes, now that I have some kind of test bed. But this looks like a tough one, so don't expect results immediately. |
@goerch Thank you for looking into it. Things are indeed moving fast and mistakes can happen. The top priority is having the SPM tokenizer work 100% correct. After #2810 I thought we have achieved this, but based on your findings, it might not be the case yet. The |
@ggerganov : I believe some of the remaining issues could be related to this comment. @howard0su , @klosax : AFAIU nobody looked into this so far, any reason for it? |
Nice. I'll try this one next to cover the Unicode character set (that should at least demonstrate my 'high byte'(>0xff) problem) . Any thoughts on Unicode normalization as mentioned here? |
Let's create a PR with a few failing cases and will try to give it more visibility so we get some help with it. I'm afraid I'm out of my depth here and won't be able to understand how the normalization works and what it takes to implement it, so hopefully people familiar with it will help out |
I was mistaken: testing high byte output with
and
shows identical output. |
If @viniciusarruda and @ggerganov don't disagree : I think we should close this issue and wait for possible Unicode normalization issues to show up (otherwise I need help to design a test case). |
I'm comparing the tokenization between original Meta repo and llama.cpp with LLaMA (also had same issue with LLaMA v2).
For example, tokenizing the prompt "Hello world" and " Hello world" gives the following:
Exploring the tokens, doing the detokenization, I got:
Exploring each token above with the
id_to_piece
functionality:But, using the detokenizer in each id individually:
The code used to produce this results can be seen here.
Use this file for the Meta tokenizer.
The model
ggml-model-f16.bin
is the 7B LLaMA model after using theconvert.py
script as mentioned here.The text was updated successfully, but these errors were encountered: