[ `TokenizationLlama`] fix the way we convert tokens to strings to keep leading spaces 🚨 breaking fix #29453

ArthurZucker · 2024-03-05T05:59:49Z

What does this PR do?

Fixes #29452. Previously, the prefix space added by `add_dummy_prefix_space` was eaten by the decode function. This happened silently and is not expected, so it should not happen.

`input_ids = tokenizer.encode("hello", add_special_tokens=True)` would produce the ids for `["<s>", "▁hello"]`, but these were decoded as `"<s>hello"`, dropping the leading space.
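To make the difference concrete, here is a minimal, self-contained sketch of the two behaviors. It is not the `transformers` implementation: `decode_tokens`, `SPECIAL_TOKENS` and the `strip_prefix_space` flag are hypothetical names made up for illustration, and the "after" output is an assumption based on the PR title (keep leading spaces).

```python
SPIECE_UNDERLINE = "\u2581"  # "▁", the SentencePiece word-boundary marker
SPECIAL_TOKENS = {"<s>", "</s>"}  # assumed special tokens, for illustration only


def decode_tokens(tokens, strip_prefix_space):
    """Hypothetical helper: join tokens into a string.

    Runs of regular tokens between special tokens are converted together.
    With strip_prefix_space=True, the leading space of each run is dropped
    (the behavior described as the bug); with False, it is kept.
    """
    pieces = []
    chunk = []

    def flush():
        # Convert the pending run of regular tokens to text.
        if chunk:
            text = "".join(chunk).replace(SPIECE_UNDERLINE, " ")
            if strip_prefix_space and text.startswith(" "):
                text = text[1:]
            pieces.append(text)
            chunk.clear()

    for token in tokens:
        if token in SPECIAL_TOKENS:
            flush()
            pieces.append(token)
        else:
            chunk.append(token)
    flush()
    return "".join(pieces)


tokens = ["<s>", SPIECE_UNDERLINE + "hello"]
before = decode_tokens(tokens, strip_prefix_space=True)   # "<s>hello"
after = decode_tokens(tokens, strip_prefix_space=False)   # "<s> hello"
print(repr(before), repr(after))
```

Anyone relying on the old output (for example, comparing decoded strings exactly) would see a leading space appear after the special token, which is why the fix is flagged as breaking.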

HuggingFaceDocBuilderDev · 2024-03-05T06:22:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…lama-decode

amyeroberts

Thanks for fixing!

nit

7c81726

ArthurZucker changed the title ~~nit~~ [ TokenizationLlama] fix the way we convert tokens to strings to not always strip prefix space Mar 5, 2024

ArthurZucker added 3 commits March 7, 2024 16:16

Merge branch 'main' of github.com:huggingface/transformers into fix-l…

d23128d

…lama-decode

update test and fix test

fd3a244

fixup

de2ba08

ArthurZucker marked this pull request as ready for review March 7, 2024 07:29

ArthurZucker changed the title ~~[ TokenizationLlama] fix the way we convert tokens to strings to not always strip prefix space~~ [ TokenizationLlama] fix the way we convert tokens to strings to keep leading spaces Mar 7, 2024

ArthurZucker requested a review from LysandreJik March 7, 2024 09:15

ArthurZucker requested a review from amyeroberts March 22, 2024 09:17

amyeroberts approved these changes Mar 22, 2024


ArthurZucker changed the title ~~[ TokenizationLlama] fix the way we convert tokens to strings to keep leading spaces~~ [ TokenizationLlama] fix the way we convert tokens to strings to keep leading spaces 🚨 breaking fix Mar 28, 2024

ArthurZucker merged commit a2a7f71 into main Mar 28, 2024
19 checks passed

ArthurZucker deleted the fix-llama-decode branch March 28, 2024 12:58
