
Correctly identify LF token for GPT-2 style BPE tokenizer #11496

Merged (1 commit) on Jan 30, 2025

Conversation

mgroeber9110
Contributor

This fix makes the reported linefeed token for models such as Llama-3 and Qwen consistent with what other sources say (verified with https://huggingface.co/spaces/Xenova/the-tokenizer-playground).

This bug is unlikely to have any impact on current llama.cpp functionality, because the llama_vocab_nl() API function is never called. However, the incorrect token IDs appear to have caused some confusion, e.g. here: https://www.reddit.com/r/LocalLLaMA/comments/1cpv7np/why_does_llama3_use_lf_token_128_%C3%A4.

The change is necessary because the llama_vocab::impl::tokenize() method takes the original orthography in UTF-8 as input, rather than the modified internal representation of the BPE vocabulary (U+010A for "\n"), which is only created in unicode_byte_to_utf8_map().
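For context, here is a minimal sketch of the GPT-2 style byte-to-unicode remapping (a simplified stand-in for what unicode_byte_to_utf8_map() builds, not the actual llama.cpp code): printable bytes map to themselves, and all other bytes are shifted onto code points starting at U+0100, which is how the byte 0x0A ("\n") ends up stored in the BPE vocabulary as U+010A ("Ċ").

```cpp
// Simplified illustration of the GPT-2 style byte-to-unicode remapping.
// A stand-in for unicode_byte_to_utf8_map(); not the real llama.cpp code.
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>

static std::map<uint8_t, std::string> byte_to_utf8_map() {
    std::map<uint8_t, std::string> m;
    int n = 0;
    for (int b = 0; b < 256; ++b) {
        // "printable" byte ranges kept as-is by GPT-2: '!'..'~', '¡'..'¬', '®'..'ÿ'
        const bool printable =
            (b >= 0x21 && b <= 0x7E) || (b >= 0xA1 && b <= 0xAC) || (b >= 0xAE && b <= 0xFF);
        // everything else is shifted to unused code points starting at U+0100
        const int cp = printable ? b : 256 + n++;
        std::string s;
        if (cp < 0x80) {
            s += (char) cp;
        } else {
            // all code points involved are below U+0800, so 2-byte UTF-8 suffices
            s += (char) (0xC0 | (cp >> 6));
            s += (char) (0x80 | (cp & 0x3F));
        }
        m[(uint8_t) b] = s;
    }
    return m;
}

int main() {
    // 0x0A ("\n") is the 11th non-printable byte, so it maps to U+010A ("Ċ")
    const auto m = byte_to_utf8_map();
    printf("\"\\n\" is stored in the BPE vocab as \"%s\"\n", m.at(0x0A).c_str());
}
```

Because tokenize() is fed the original "\n" rather than this remapped form, resolving the linefeed token through the tokenizer (as described above) is what produces the correct ID.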

This leads to the correct token IDs being shown in the trace:

Before

Llama-3:
print_info: LF token = 128 'Ä'
Qwen2:
print_info: LF token = 148848 'ÄĬ'

After

Llama-3:
print_info: LF token = 198 'Ċ'
Qwen2:
print_info: LF token = 198 'Ċ'
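To connect these traces to the explanation above, here is a toy, self-contained demonstration (the two-entry vocabulary and the whole program are invented for illustration, with IDs copied from the Llama-3 trace; this is not the actual fix): the raw "\n" string is never a key in a GPT-2 style BPE vocabulary, so the linefeed ID has to come from the tokenizer, which remaps the byte to "Ċ" and yields token 198.

```cpp
// Toy demonstration, not the llama.cpp implementation: the raw "\n" string is
// not a key in a GPT-2 style BPE vocabulary, so a naive lookup cannot find the
// linefeed token; the remapped form "Ċ" (U+010A) is what the vocabulary stores.
// The two-entry vocabulary is invented; the IDs are taken from the Llama-3 trace.
#include <cstdio>
#include <string>
#include <unordered_map>

int main() {
    const std::unordered_map<std::string, int> token_to_id = {
        { "\xC4\x8A", 198 },  // "Ċ": the remapped form of "\n" actually stored in the vocab
        { "\xC3\x84", 128 },  // "Ä": the token the old trace misreported as LF
    };

    // the original orthography "\n" is not in the vocabulary at all ...
    printf("raw \"\\n\" in vocab: %zu\n", token_to_id.count("\n"));          // prints 0

    // ... so the LF ID must come from the tokenizer, which remaps 0x0A to "Ċ"
    printf("LF token = %d '%s'\n", token_to_id.at("\xC4\x8A"), "\xC4\x8A");  // prints 198 'Ċ'
}
```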

ggerganov merged commit ffd0821 into ggerganov:master on Jan 30, 2025
45 checks passed