Hi,
The current solution is to generate a new version of the corpus that has already been tokenized, and then use the `-pretokenized` option during indexing. Note that queries need to be similarly pretokenized.

For a full example, see https://github.com/castorini/anserini/blob/master/docs/regressions-msmarco-passage-wp.md, which covers MS MARCO with WordPiece tokenization.
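
As a rough illustration of the pretokenization step, here is a minimal Python sketch. It assumes the corpus is a JSONL file with `id` and `contents` fields, and it uses a hypothetical `toy_tokenize` function as a stand-in for a real tokenizer; for WordPiece you would swap in an actual WordPiece tokenizer. The function names are made up for this example and are not part of Anserini.

```python
import json
import re
from typing import Callable, Iterable, Iterator, List

Tokenizer = Callable[[str], List[str]]


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: lowercase, then split into
    word tokens and single punctuation tokens. Replace with a real
    tokenizer (e.g. WordPiece) in practice."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def pretokenize(text: str, tokenize: Tokenizer) -> str:
    """Tokenize text and join the tokens with single spaces, so that
    splitting on whitespace later recovers exactly these tokens."""
    return " ".join(tokenize(text))


def pretokenize_jsonl(lines: Iterable[str], tokenize: Tokenizer) -> Iterator[str]:
    """Rewrite the 'contents' field of each JSONL document (assumed
    layout: {"id": ..., "contents": ...}) with its pretokenized form."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        doc["contents"] = pretokenize(doc["contents"], tokenize)
        yield json.dumps(doc, ensure_ascii=False)


if __name__ == "__main__":
    corpus = ['{"id": "0", "contents": "Hello, World!"}']
    for out in pretokenize_jsonl(corpus, toy_tokenize):
        print(out)
    # Queries must go through the same function before searching.
    print(pretokenize("What is WordPiece?", toy_tokenize))
```

The key design point is that documents and queries pass through the same `pretokenize` call, so the tokens seen at search time match the tokens that were indexed.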