Correctly identify LF token for GPT-2 style BPE tokenizer #11496

mgroeber9110 · 2025-01-29T19:45:43Z

This fix makes the reported linefeed token for models such as Llama-3 and Qwen consistent with what other sources say (verified with https://huggingface.co/spaces/Xenova/the-tokenizer-playground).

This bug is unlikely to have any impact on current llama.cpp functionality, because the llama_vocab_nl() API function is never called. However, the incorrect token IDs appear to have caused some confusion, e.g. here: https://www.reddit.com/r/LocalLLaMA/comments/1cpv7np/why_does_llama3_use_lf_token_128_%C3%A4.

The change is necessary because the llama_vocab::impl::tokenize() method takes the original text in UTF-8 as input, not the modified internal representation used by the BPE vocabulary (U+010A for "\n"). That internal representation is only created in unicode_byte_to_utf8_map().
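To illustrate the difference between the two representations, here is a minimal sketch of the GPT-2 style byte-to-code-point mapping. It is not the llama.cpp implementation: the helper names (is_kept_byte, gpt2_byte_to_codepoint, codepoint_to_utf8, gpt2_vocab_spelling) are hypothetical, and the only assumption is the standard GPT-2 rule that printable bytes map to themselves and all other bytes are moved to code points starting at U+0100.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bytes that GPT-2 style byte-level BPE keeps unchanged: '!'..'~', 0xA1..0xAC, 0xAE..0xFF.
static bool is_kept_byte(int b) {
    return (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
}

// Code point that stands for raw byte `b` in the vocabulary (hypothetical helper).
// Bytes that are not kept are numbered in increasing order starting at U+0100.
static uint32_t gpt2_byte_to_codepoint(uint8_t b) {
    if (is_kept_byte(b)) {
        return b;
    }
    uint32_t n = 0;
    for (int i = 0; i < b; ++i) {
        if (!is_kept_byte(i)) {
            ++n;
        }
    }
    return 256 + n;
}

// UTF-8 encoding for the small code points this mapping can produce (all below U+0800).
static std::string codepoint_to_utf8(uint32_t cp) {
    assert(cp < 0x800);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Vocabulary spelling of raw UTF-8 text: each input byte is remapped individually.
static std::string gpt2_vocab_spelling(const std::string & raw) {
    std::string out;
    for (unsigned char c : raw) {
        out += codepoint_to_utf8(gpt2_byte_to_codepoint(c));
    }
    return out;
}
```

Under these assumptions, gpt2_vocab_spelling("\n") yields 'Ċ' (U+010A). Passing the two UTF-8 bytes of 'Ċ' through the same mapping yields 'ÄĬ', which matches the Qwen output in the "Before" trace. That is what one would expect if the vocabulary spelling were handed to a tokenizer that expects raw text.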

With the fix, the correct token IDs are shown in the trace:

Before

Llama-3:
print_info: LF token = 128 'Ä'
Qwen:
print_info: LF token = 148848 'ÄĬ'

After

Llama-3:
print_info: LF token = 198 'Ċ'
Qwen2:
print_info: LF token = 198 'Ċ'

Correctly identify LF token for GPT-2 style BPE tokenizer

fe8d4df

ggerganov approved these changes Jan 30, 2025

ggerganov merged commit ffd0821 into ggerganov:master Jan 30, 2025
45 checks passed
