A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.
- 02/10/2025 0.0.2: 🤗 Initial release!
- Compatible with all HF `Transformers`-recognized tokenizers.
- Auto-fixes `models` that do not set a `padding_token`.
- Auto-fixes `models` released with the wrong `padding_token`: many `models` incorrectly reuse `eos_token` as `pad_token`, which leads to subtle and hidden errors in post-training and inference whenever `batching` is used, which is almost always (see the first sketch after this list).
- Zero external dependencies outside of `Transformers`.
- Add `automatic` tokenizer validation to `model` `training` and subsequent `inference` so that not only the tokenizer config but the actual `decode`/`encode` behavior is 100% re-validated on model load. It is often the case that `inference` and `training` engines modify the stock tokenizers, causing subtle and inaccurate output when `inference` is performed on a platform disjointed from the `trainer` (see the second sketch after this list).
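
To see why reusing `eos_token` as `pad_token` is harmful, consider label masking in causal-LM fine-tuning: padding positions are identified by the pad token id, so when that id is also the EOS id, every real end-of-sequence token gets masked out of the loss and the model is never trained to stop. Below is a minimal sketch with plain `Transformers`; `gpt2` is used only as a familiar example of a model shipped without a `pad_token`, and none of this is Tokenicer code:

```python
from transformers import AutoTokenizer

# Illustrative only: simulate a model released without a pad_token.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # the common (broken) workaround

batch = tok(["short", "a much longer example sentence"], padding=True)

# Typical label masking for causal-LM training: ignore pad positions (-100).
labels = [
    [tid if tid != tok.pad_token_id else -100 for tid in ids]
    for ids in batch["input_ids"]
]

# Because pad_token_id == eos_token_id, any *real* EOS in the data is also
# masked to -100, so the model never learns to emit EOS and generation can
# fail to terminate.
```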
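The planned validation is, at its core, an encode/decode round trip executed at load time. Here is a hypothetical sketch of the idea; the `validate_round_trip` helper is not Tokenicer API, and exact round-trip equality is a simplification (a real check would compare against reference outputs captured at training time):

```python
from transformers import AutoTokenizer

def validate_round_trip(tokenizer, probes):
    """Re-validate actual encode/decode behavior, not just config metadata."""
    for text in probes:
        ids = tokenizer.encode(text, add_special_tokens=False)
        decoded = tokenizer.decode(ids)
        # Simplification: real validation would diff (ids, decoded) against
        # reference pairs recorded by the training engine.
        if decoded != text:
            raise ValueError(f"round-trip drift: {text!r} -> {ids} -> {decoded!r}")

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
validate_round_trip(tok, ["Hello, world!", "tokenizers can drift 🤗"])
```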
```bash
# pip
pip install -v tokenicer

# uv
uv pip install -v tokenicer

# install from source: clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer
# compile
pip install -v .
```
- Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`.
```python
# Replace `AutoTokenizer.from_pretrained()`:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`:
from tokenicer import Tokenicer

tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batching` and `masks`.
print(f"pad_token: `{tokenizer.pad_token}`")
```
```bibtex
@misc{tokenicer,
  author       = {ModelCloud.ai and qubitium@modelcloud.ai},
  title        = {Toke(n)icer},
  year         = {2025},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/modelcloud/tokenicer}},
  note         = {Contact: qubitium@modelcloud.ai}
}
```