Sorry for the delay; strangely, I didn't get any notification that your issue was opened.
You're right: the tokenizer code currently in the repo contains an error, and the tokenizer in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json is indeed a different one.
We had three versions of the tokenizer for our model.
The first version used SentencePiece, as in the original T5.
The second used a BPE instead (this is the code currently in the repo), but it contained the error you point out.
The third is the one used in https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json, but we forgot to push its code to the repo. My ex-colleague Boris wrote that code, so I'll have to check with him that he didn't make any changes beyond the one you're pointing out.
Note that in this third version we also forgot to add the `<s>` sentence-start token, which complicates fine-tuning on some tasks (QA in particular). I'll add it when I push the third version of the tokenizer.
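In the meantime, a missing sentence-start token can be registered on an existing `tokenizers` tokenizer with `add_special_tokens`. A minimal sketch, using a toy BPE stand-in rather than the real FAT5 vocabulary:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy stand-in for the model's tokenizer (illustrative, not the real vocab).
tok = Tokenizer(BPE(unk_token="<unk>"))
tok.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["<unk>", "</s>", "<pad>"], vocab_size=200)
tok.train_from_iterator(["le chat dort", "bonjour le monde"] * 50, trainer=trainer)

assert tok.token_to_id("<s>") is None  # sentence-start token is missing

# Register <s> after the fact; it receives the next free id.
tok.add_special_tokens(["<s>"])
assert tok.token_to_id("<s>") is not None
```

This only adds the token to the vocabulary; any post-processor that should prepend `<s>` during fine-tuning still has to be configured separately.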
Otherwise, I'm glad you're building a Brazilian T5. I'd love to know if it leads to anything conclusive.
Hello!
Thank you for this awesome project. T5 still packs a punch!
I was using your code to train a Brazilian T5, and the tokenizer was really bad.
After some debugging I found a bug in
`examples/fat5-fr/train_tokenizer.py`:

```python
pre_tokenizer = Sequence([Split(pattern=pat_str, behavior="isolated")])
```

should be

```python
pre_tokenizer = Sequence([Split(pattern=Regex(pat_str), behavior="isolated")])
```
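The reason the fix matters: in the `tokenizers` library, `Split` treats a plain string pattern as a literal substring to match, while wrapping it in `Regex` makes it a regular expression. A minimal sketch of the difference (the `\s+` pattern here is illustrative, not the actual `pat_str`):

```python
from tokenizers import Regex
from tokenizers.pre_tokenizers import Split

pat = r"\s+"  # illustrative pattern, not the real pat_str

# A bare string is matched as the literal text "\s+", which never occurs
# in the input, so the text is left as a single piece.
literal = Split(pattern=pat, behavior="isolated")
print(literal.pre_tokenize_str("le chat"))   # one piece: "le chat"

# Wrapped in Regex, the same string splits on runs of whitespace.
regexed = Split(pattern=Regex(pat), behavior="isolated")
print(regexed.pre_tokenize_str("le chat"))   # pieces: "le", " ", "chat"
```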
After this change the results were so much better. But looking at https://huggingface.co/CATIE-AQ/FAT5-small/raw/main/tokenizer.json it seems that you used another tokenizer.