"Qwen2Tokenizer" cannot encode some words correctly #199

Line290 · 2024-03-22T07:26:06Z

Here is the bug, reproduced with transformers==4.37.1:

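from transformers import AutoTokenizer

# Slow (pure Python) tokenizer, i.e. Qwen2Tokenizer
t = AutoTokenizer.from_pretrained("Qwen1.5-72B-Chat", trust_remote_code=True, use_fast=False)
# Fast (Rust-backed) tokenizer
t_fast = AutoTokenizer.from_pretrained("Qwen1.5-72B-Chat", trust_remote_code=True, use_fast=True)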

t.encode("#######")
# Out: [2, 2, 2, 2, 2, 2, 2]

t_fast.encode("#######")
# Out: [97864]

More than 120 tokens in the vocabulary are affected by the same bug.
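One way to list the affected tokens is to decode every vocabulary id back to text and compare what the slow and fast tokenizers produce for it. The snippet below is only a rough sketch of that idea: the helper names `find_encoding_mismatches` and `scan_vocabulary` are made up for illustration, and it assumes that `transformers` is installed and that entries decoding to the Unicode replacement character can be skipped.

```python
from typing import Callable, Iterable, List, Tuple

Encoder = Callable[[str], List[int]]


def find_encoding_mismatches(
    encode_slow: Encoder, encode_fast: Encoder, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, slow_ids, fast_ids) for every text the two encoders disagree on."""
    mismatches = []
    for text in texts:
        slow_ids = encode_slow(text)
        fast_ids = encode_fast(text)
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches


def scan_vocabulary(model_path: str):
    """Compare the slow and fast tokenizers on every decodable vocabulary entry.

    Hypothetical helper: assumes `model_path` points to a local directory
    or a Hub repository that ships both tokenizer variants.
    """
    # Imported lazily so the comparison helper above has no dependencies.
    from transformers import AutoTokenizer

    slow = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    fast = AutoTokenizer.from_pretrained(model_path, use_fast=True)

    # Decode each token id back to text. Entries that are not valid UTF-8
    # on their own decode to the replacement character and are skipped.
    texts = []
    for token_id in sorted(fast.get_vocab().values()):
        text = fast.decode([token_id])
        if text and "\ufffd" not in text:
            texts.append(text)

    return find_encoding_mismatches(
        lambda s: slow.encode(s, add_special_tokens=False),
        lambda s: fast.encode(s, add_special_tokens=False),
        texts,
    )
```

Calling `scan_vocabulary("Qwen1.5-72B-Chat")` would return one tuple per disagreeing entry. A mismatch only shows that the two tokenizers differ, not which one is right, so each reported entry still needs a manual look.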

jklj077 · 2024-03-22T15:54:05Z

For the issues related to "#" in the slow tokenizer (Qwen2Tokenizer), please try this branch and see whether it fixes the problem. There should be 121 tokens affected by this.

Line290 · 2024-03-25T02:19:26Z

The problem was fixed after I followed your branch and modified the code here. Thank you.

github-actions · 2025-03-08T08:02:28Z

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

Line290 closed this as completed Mar 25, 2024

jklj077 mentioned this issue Mar 28, 2024

Fix Qwen2Tokenizer huggingface/transformers#29929 (merged)

github-actions bot locked as resolved and limited conversation to collaborators Mar 8, 2025
