Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The technology, i.e., registering, is introduced in our paper.
This is the repository for reproducing the data mining and pre-training described in our paper.
Note:
Given that partial works are done during Zhi Qu's internship at ASTREC of NICT, Japan, the codes in this repository are under the open-source procedure of NICT.
Once the procedure is finished, we will release all codes immediately.
However!!! You can move to our HuggingFace pages, MITRE_466M and MITRE_913M, where we have already released another version of our codes and pre-trained models with the exactly same performance.
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog;Filipino (tl)
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
@misc{qu2025registeringsourcetokenstarget,
title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation},
author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
year={2025},
eprint={2501.02979},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.02979},
}