Tokenizer while indexing #1262

chenyn66 · 2022-09-12T00:59:52Z

Hi,
I'm wondering how to change the default tokenizer while indexing documents. More specifically, I want to index my documents with a T5 tokenizer. Is it possible to do that?
Thanks!

Answered by lintool

lintool (Maintainer) · 2022-09-12T01:09:19Z

The current solution is to generate a new version of the corpus that has already been tokenized, and then use the -pretokenized option during indexing. Note that queries need to be pretokenized in the same way, with the same tokenizer.

For a full example, see https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-wp.md, which covers MS MARCO with WordPiece tokenization.
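As a rough illustration of the pretokenization step, here is a minimal Python sketch. It assumes the corpus is in JSON Lines form with "id" and "contents" fields, and that a pretokenized index expects the tokens joined by single spaces. The helper names (pretokenize_record, pretokenize_jsonl, pretokenize_query) and the toy tokenizer are made up for this example and are not part of Anserini. For T5 you would probably swap in the tokenize method of a Hugging Face tokenizer, as noted in the comments, but that part is untested here.

```python
import json
from typing import Callable, Iterable, List

Tokenize = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer for the sketch: lowercase and split on whitespace."""
    return text.lower().split()


# Assumption: for T5, something like the following could replace toy_tokenize
# (requires the transformers and sentencepiece packages; not run here):
#   from transformers import AutoTokenizer
#   t5_tokenize = AutoTokenizer.from_pretrained("t5-base").tokenize


def pretokenize_record(record: dict, tokenize: Tokenize) -> dict:
    """Return a new record whose "contents" is the space-joined token string."""
    tokens = tokenize(record["contents"])
    return {"id": record["id"], "contents": " ".join(tokens)}


def pretokenize_jsonl(lines: Iterable[str], tokenize: Tokenize) -> List[str]:
    """Pretokenize every non-empty JSON line and return the new JSON lines."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = pretokenize_record(json.loads(line), tokenize)
        out.append(json.dumps(record, ensure_ascii=False))
    return out


def pretokenize_query(query: str, tokenize: Tokenize) -> str:
    """Queries must go through the same tokenizer as the documents."""
    return " ".join(tokenize(query))
```

The output lines can then be written to a new corpus file and indexed with the -pretokenized option, and each query string passed through pretokenize_query before searching.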

1 reply

chenyn66 (Author) · Sep 12, 2022

Thanks for the instructions!
