Toke(n)icer

A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.


News

  • 02/10/2025 0.0.2: 🤗 Initial release!

Features:

  • Compatible with all tokenizers recognized by HF Transformers
  • Auto-fixes models that do not set padding_token
  • Auto-fixes models released with the wrong padding_token: many models incorrectly use eos_token as pad_token, which leads to subtle, hidden errors in post-training and inference whenever batching is used (which is almost always). See the sketch after this list.
  • Zero external dependencies beyond Transformers
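
To see why the wrong pad_token matters, here is a minimal sketch using plain HF Transformers (not Tokenicer); the model name is the same one used in the Usage section below:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# Many released configs leave pad_token unset or alias it to eos_token:
print(tokenizer.pad_token, tokenizer.eos_token)

# If pad_token == eos_token, padded positions in a batch become
# indistinguishable from real end-of-sequence tokens whenever masks are
# rebuilt from token ids, silently corrupting losses and batched generation.
if tokenizer.pad_token is None or tokenizer.pad_token == tokenizer.eos_token:
    print("pad_token is missing or unsafe for batching")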

Upcoming Features:

  • Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual encode/decode behavior is fully re-validated on model load. Inference and training engines often modify the original tokenizer, causing subtle, inaccurate output when inference is performed on a platform disjoint from the trainer. A hypothetical sketch of such a check follows below.
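
As a hypothetical sketch only (none of these names are Tokenicer API), the planned re-validation could amount to a round-trip check: record the ids a set of probe strings encoded to at training time, then re-encode the same probes on load and compare.

# Hypothetical sketch: PROBES, expected_ids, and validate_tokenizer are
# illustrative names, not part of Tokenicer.
PROBES = ["Hello, world!", "emoji 🤗 test", "multi\nline input"]

def validate_tokenizer(tokenizer, expected_ids):
    """Return True if the tokenizer still encodes every probe exactly as it
    did when expected_ids was recorded (e.g. at training time)."""
    for text in PROBES:
        if tokenizer(text)["input_ids"] != expected_ids.get(text):
            return False
    return True

# At training time: expected_ids = {t: tokenizer(t)["input_ids"] for t in PROBES}
# On model load:    assert validate_tokenizer(tokenizer, expected_ids)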

Install

PIP/UV

pip install -v tokenicer
uv pip install -v tokenicer

Install from source

# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer

# install from source
pip install -v .

Usage

  • Replace all calls to AutoTokenizer.from_pretrained() with Tokenizer.load(): args are 100% compatible with AutoTokenizer
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')

# With `Tokenicer.load()`
from tokenicer import Tokenicer
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')

# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.
print(f"pad_token: `{tokenizer.pad_token}`")

Citation

@misc{tokenicer,
    author = {ModelCloud.ai and qubitium@modelcloud.ai},
    title = {Toke(n)icer},
    year = {2025},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/modelcloud/tokenicer}},
    note = {Contact: qubitium@modelcloud.ai}
}
