
Tokenizer bug #10

Open
bratao opened this issue Jan 29, 2025 · 1 comment

Comments


bratao commented Jan 29, 2025

Hello!

Thank you for this awesome project. T5 still packs a punch!

I was using your code to train a Brazilian T5, and the tokenizer was performing really badly.

After some debugging, I found the bug:

examples/fat5-fr/train_tokenizer.py

pre_tokenizer = Sequence([Split(pattern=pat_str, behavior="isolated")])

Should be

pre_tokenizer = Sequence([Split(pattern=Regex(pat_str), behavior="isolated")])

After this change the results were much better. But looking at https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json, it seems you used a different tokenizer.
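
For anyone hitting the same issue, here is a minimal sketch of the difference (not the repo's exact code; `pat_str` below is a hypothetical stand-in for the pattern defined in `train_tokenizer.py`). With the `tokenizers` library, `Split` treats a plain `str` pattern as a literal string, so a regex has to be wrapped in `tokenizers.Regex` to be interpreted as a regular expression:

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence, Split

# Hypothetical stand-in for the pat_str defined in examples/fat5-fr/train_tokenizer.py
pat_str = r"\w+|[^\w\s]+"

# Buggy: the pattern is matched as a literal string, so the text is barely split at all.
buggy = Sequence([Split(pattern=pat_str, behavior="isolated")])

# Fixed: wrapping the pattern in Regex makes Split interpret it as a regular expression.
fixed = Sequence([Split(pattern=Regex(pat_str), behavior="isolated")])

print(buggy.pre_tokenize_str("Olá, mundo!"))  # one big chunk (the literal pattern never matches)
print(fixed.pre_tokenize_str("Olá, mundo!"))  # split into word / punctuation pieces
```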

@bourdoiscatie
Contributor

Hi!

Sorry for the delay; strangely, I didn't get any notification that your issue had been opened.

You're right: the tokenizer code currently in the repo contains an error, and the one in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json is indeed a different one.
We had three versions of the tokenizer for our model.
The first version used SentencePiece, as in the original T5.
For the second, we decided to use a BPE instead (this is the code currently available in the repo), but it contained the error you point out.
The third is the one used in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json, but we forgot to push it to the repo. My former colleague Boris wrote that code, so I'll have to check with him to make sure he didn't make any other changes besides the one you're pointing out.

Note that in this third version we also forgot to add the `<s>` sentence-start token, which complicates fine-tuning on certain tasks (QA in particular). I'll add it when I push the third version of the tokenizer.
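
For reference, a minimal sketch of how the missing token could be included when the BPE tokenizer is trained; the vocab size and sentinel count below are illustrative, not necessarily the values used in the repo:

```python
from tokenizers import Tokenizer, models, trainers

tokenizer = Tokenizer(models.BPE())

# Include <s> alongside the usual T5-style special and sentinel tokens.
special_tokens = ["<s>", "</s>", "<unk>", "<pad>"]
special_tokens += [f"<extra_id_{i}>" for i in range(100)]

trainer = trainers.BpeTrainer(vocab_size=32128, special_tokens=special_tokens)
# tokenizer.train(files=["corpus.txt"], trainer=trainer)
```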

Otherwise, I'm glad you're building a Brazilian T5. I'd love to hear if it leads to anything conclusive.
