Fix Llama 3 TikToken conversion #33538

pcuenca · 2024-09-17T13:38:51Z

What does this PR do?

Fixes a tokenizer conversion problem for Llama 3, possibly introduced in #31656

How to reproduce:

The following fails in main and works with this PR:

python src/transformers/models/llama/convert_llama_weights_to_hf.py \
--input_dir path_to_original_checkpoint \
--model_size tokenizer_only \
--output_dir converted_tokenizer \
--llama_version 3.1

The same issue may happen with Gemma models as well. I didn't test yet, but can verify once we agree on a suitable approach.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

@itazap @ArthurZucker

pcuenca · 2024-09-17T13:40:22Z

src/transformers/models/llama/convert_llama_weights_to_hf.py

@@ -332,7 +332,7 @@ def permute(w, n_heads, dim1=dim, dim2=dim):

 class Llama3Converter(TikTokenConverter):
    def __init__(self, vocab_file, special_tokens=None, instruct=False, model_max_length=None, **kwargs):
-        super().__init__(vocab_file, **kwargs)
+        super().__init__(vocab_file, additional_special_tokens=special_tokens, **kwargs)


Line 348 below could possibly be removed, as the superclass will handle it.

Yes agreed it can be removed! Thanks

ArthurZucker

Good catch and indeed since now it is natively supported let's use super's functionalities!

HuggingFaceDocBuilderDev · 2024-09-18T12:09:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

* Fix Llama 3 TikToken conversion * No need to add tokens again

Fix Llama 3 TikToken conversion

01cc9cd

pcuenca commented Sep 17, 2024

View reviewed changes

ArthurZucker reviewed Sep 18, 2024

View reviewed changes

ArthurZucker approved these changes Sep 18, 2024

View reviewed changes

No need to add tokens again

1a0506b

itazap approved these changes Sep 19, 2024

View reviewed changes

pcuenca merged commit 0c718f1 into main Sep 19, 2024
9 checks passed

pcuenca deleted the fix-llama-tokenizer-conversion branch September 19, 2024 23:28

BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024

Fix Llama 3 TikToken conversion (huggingface#33538)

432d58c

* Fix Llama 3 TikToken conversion * No need to add tokens again

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Llama 3 TikToken conversion #33538

Fix Llama 3 TikToken conversion #33538

pcuenca commented Sep 17, 2024

pcuenca Sep 17, 2024

itazap Sep 17, 2024

ArthurZucker left a comment

HuggingFaceDocBuilderDev commented Sep 18, 2024

Fix Llama 3 TikToken conversion #33538

Fix Llama 3 TikToken conversion #33538

Conversation

pcuenca commented Sep 17, 2024

What does this PR do?

How to reproduce:

Before submitting

Who can review?

pcuenca Sep 17, 2024

Choose a reason for hiding this comment

itazap Sep 17, 2024

Choose a reason for hiding this comment

ArthurZucker left a comment

Choose a reason for hiding this comment

HuggingFaceDocBuilderDev commented Sep 18, 2024