Fix detokenization of non-special added-tokens #3760
Conversation
return -result.length();
}
memcpy(buf, result.c_str(), result.length());
return result.length();
} else if (llama_is_control_token(model->vocab, token)) {
Seems redundant, since we ignore the token regardless of whether it's a control token or not.
Yes. But these are the token types I'd expect to see. In an older version the code asserted that no other token type could occur, but I'm not sure how we should react (assertion, exception, or simply ignoring the token).
I tested it with #3586 and it works when passing
When checking this, I compared against the behaviour of the HF tokenizers and got different results in several cases. This indicates to me that our special token handling (CONTROL) is different and that our handling of padding tokens could be problematic.
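For anyone who wants to repeat the comparison, a minimal sketch along these lines should do (the model name and token IDs below are placeholders, not the exact ones used above):

```python
# Minimal sketch of comparing HF tokenizer decoding for special vs. added
# tokens. Model name and token IDs are placeholders for illustration only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

ids = [1, 15043, 3186, 2]  # roughly: <s>, two text pieces, </s> (placeholder IDs)

# With skip_special_tokens=True, HF drops CONTROL-like tokens (BOS/EOS/PAD)...
print(tok.decode(ids, skip_special_tokens=True))
# ...while with skip_special_tokens=False they are rendered literally.
print(tok.decode(ids, skip_special_tokens=False))
```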
Referencing a comment with my understanding of the situation, I see a couple of options. I'll need some guidance here on how best to proceed.
Weren't the padding tokens originally a hack anyway? I don't know that they should be included in the actual tokenizer vocabulary. They only exist because GGUF doesn't (currently) have a way to specify vocab_size.

llama.cpp/convert-gptneox-hf-to-gguf.py, lines 121 to 122 in ad93962
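For context, the padding pattern in the convert scripts is roughly the following sketch (a paraphrase, not the exact lines referenced above; the helper name is made up):

```python
# Rough paraphrase of how the convert scripts pad the vocab up to
# config.vocab_size; not the exact code referenced above.
from transformers import AutoTokenizer

def build_padded_vocab(model_dir: str, vocab_size: int) -> list[str]:
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    reverse_vocab = {idx: tok for tok, idx in tokenizer.get_vocab().items()}

    tokens: list[str] = []
    for i in range(vocab_size):
        if i in reverse_vocab:
            tokens.append(reverse_vocab[i])
        else:
            # Slot exists in config.vocab_size but not in the tokenizer:
            # fabricate a placeholder so the GGUF vocab matches the
            # embedding tensor's row count.
            tokens.append(f"[PAD{i}]")
    return tokens
```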
I don't think that's the case, though I'm not sure about the origin of it. Quite often the Python HF models are released with tensors and a vocab_size larger than the real vocabulary. OpenAssistant (Falcon), for example, has 12 tokens missing compared to the tensor size and the specified vocab_size. I believe the correct way is to generate PAD tokens; they cannot be produced by the model anyway, since they are just placeholders.
As far as I understand, HF transformers does not need these PAD tokens in the vocabulary, so why not keep things simple and try to be as close to their behavior as we can? We are trying to get the tokenizer to ignore tokens that don't really exist in the first place. We could use the UNUSED token type, but it seems like unnecessary effort to generate placeholders just so they can be totally ignored by the implementation.
Yes, it'd be awesome if there was a fix that meant these extra tokens weren't needed. Recently CausalLM released with a whopping 213 missing tokens, comparing config.vocab_size to the supplied vocabulary. My 'fix' for that until now has been to add a bunch of padding tokens.

I agree with @cebtenzzre that it'd be really great if llama.cpp could just ignore these 'phantom' tokens, like transformers does.

As to why models do this: in some cases it's because they're rounding the vocab size up to a multiple so that tensor parallelism works, e.g. models that have a vocab size of 32032 or 32128. Their actual vocab might be, say, 32003, but that breaks TP, so they round up to the next nearest multiple. Though that's not why OpenAssistant did it.
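For illustration, the round-up being described is just the following (the numbers and multiples are examples, not taken from any specific model):

```python
# Round a vocab size up to the next multiple, as some models do so that
# tensor parallelism divides evenly. Values are illustrative only.
def round_up(n: int, multiple: int) -> int:
    return ((n + multiple - 1) // multiple) * multiple

print(round_up(32003, 32))   # -> 32032
print(round_up(32003, 128))  # -> 32128
```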
You are right, ignoring the mismatch is the better way to solve it. But I'd guess the stored vocab size in the converted model should actually reflect the true vocabulary size and not carry over the wrong one. Basically it's misbehavior on the HF config side to claim a false vocabulary size, most likely to satisfy something in the transformers libraries, which then in turn ignore the wrong number or internally generate PADs. Adding PAD tokens is/was a quick fix for a wrong config, so the options are either adding PAD tokens or correcting the wrong size and relying on the tensor size for calculations. The latter probably needs a fix in all eval routines and checks that rely on vocab size.
Continued here: #4916
Checked with
Before:
After:
Also fixes build problems with `clang` on Windows (not detected by CI?). We can't link with `clip` and `common` here due to duplicate symbol errors.