
Tokenizer bug #10

Open
bratao opened this issue Jan 29, 2025 · 1 comment

Comments


bratao commented Jan 29, 2025

Hello!

Thank you for this awesome project. T5 still packs a punch!

I was using your code to train a Brazilian T5, and the tokenizer was performing really badly.

After some debugging, I found the bug:

examples/fat5-fr/train_tokenizer.py

pre_tokenizer = Sequence([Split(pattern=pat_str, behavior="isolated")])

Should be

pre_tokenizer = Sequence([Split(pattern=Regex(pat_str), behavior="isolated")])

After this change the results were much better. But looking at https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json, it seems you used a different tokenizer.
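
For anyone hitting the same issue, here is a minimal sketch of the difference (not the repo's exact code; `pat_str` below is a hypothetical stand-in for the pattern defined in `train_tokenizer.py`). With the `tokenizers` library, `Split` treats a plain `str` pattern as a literal string, so a regex has to be wrapped in `tokenizers.Regex` to be interpreted as a regular expression:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence, Split

# Hypothetical stand-in for the pat_str defined in examples/fat5-fr/train_tokenizer.py
pat_str = r"\w+|[^\w\s]+"

# Buggy: the pattern is matched as a literal string, so the text is barely split at all.
buggy = Sequence([Split(pattern=pat_str, behavior="isolated")])

# Fixed: wrapping the pattern in Regex makes Split interpret it as a regular expression.
fixed = Sequence([Split(pattern=Regex(pat_str), behavior="isolated")])

print(buggy.pre_tokenize_str("Olá, mundo!"))  # one big chunk (the literal pattern never matches)
print(fixed.pre_tokenize_str("Olá, mundo!"))  # split into word / punctuation pieces
```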

@bourdoiscatie
Contributor

Hi!

Sorry for the delay; strangely, I didn't get any notification that your issue had been opened.

You're right: the tokenizer code currently in the repo contains an error, and the one in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json is indeed a different one.
We had three versions of the tokenizer for our model.
The first version used SentencePiece, as in the original T5.
For the second, we decided to use a BPE instead (this is the code currently available in the repo), but it contained the error you point out.
The third is the one used in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json, but we forgot to push it to the repo. My former colleague Boris wrote that code, so I'll have to check with him to make sure he didn't make any other changes besides the one you're pointing out.

Note that in this third version we also forgot to add the `<s>` sentence-start token, which complicates fine-tuning on certain tasks (QA in particular). I'll add it when I push the third version of the tokenizer.
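
For reference, a minimal sketch of how the missing token could be included when the BPE tokenizer is trained; the vocab size and sentinel count below are illustrative, not necessarily the values used in the repo:

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())

# Include <s> alongside the usual T5-style special and sentinel tokens.
special_tokens = ["<s>", "</s>", "<unk>", "<pad>"]
special_tokens += [f"<extra_id_{i}>" for i in range(100)]

trainer = trainers.BpeTrainer(vocab_size=32128, special_tokens=special_tokens)
# tokenizer.train(files=["corpus.txt"], trainer=trainer)
```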

Otherwise, I'm glad you're building a Brazilian T5. I'd love to hear if it leads to anything conclusive.
