Subdocuments while Indexing #1372

vrdn-23 · 2022-12-06T17:28:40Z


Is there a way to handle or split documents into subdocuments or passages using Pyserini? I can see that during search we have an option to return the passage with the max score. Could someone point me to any documentation or examples showing how to deal with subdocuments or passages for a really long document?

Answered by lintool


lintool · 2022-12-07T13:32:51Z

Maintainer

Hi @vrdn-23 - you'll have to chop up the documents yourself into passages (i.e., create a variant corpus and index that). Number the ids like "doc1#0", "doc1#1", "doc1#2"... and then you can use the max passage feature in Pyserini. See https://castorini.github.io/pyserini/2cr/msmarco-v1-doc.html for example invocations, under the "doc segmented" conditions.
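For anyone landing here later, the chopping step might look roughly like the sketch below. It is not Pyserini code: the helper names (`split_into_passages`, `write_passage_corpus`, `max_passage`), the word-window sizes, and the output path are all made up for illustration, and it assumes the usual JSONL corpus layout with `id` and `contents` fields. The `max_passage` function only illustrates what keeping the best-scoring passage per document means; Pyserini's own max passage feature should do that step for you at search time.

```python
import json


def split_into_passages(doc_id, text, window=100, stride=50):
    """Yield (passage_id, passage_text) pairs from overlapping word windows.

    Passage ids follow the "docid#n" pattern from the answer above.
    window and stride are illustrative values, not recommended settings.
    """
    words = text.split()
    if not words:
        return
    start, seg = 0, 0
    while True:
        yield f"{doc_id}#{seg}", " ".join(words[start:start + window])
        if start + window >= len(words):
            break  # the last window reached the end of the document
        start += stride
        seg += 1


def write_passage_corpus(docs, path, window=100, stride=50):
    """Write one JSON object per passage (assumed fields: id, contents)."""
    with open(path, "w", encoding="utf-8") as f:
        for doc_id, text in docs:
            for pid, ptext in split_into_passages(doc_id, text, window, stride):
                f.write(json.dumps({"id": pid, "contents": ptext}) + "\n")


def max_passage(hits, delimiter="#"):
    """Collapse (passage_id, score) hits to the best score per parent document."""
    best = {}
    for pid, score in hits:
        doc_id = pid.rsplit(delimiter, 1)[0]  # strip the "#n" suffix
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `write_passage_corpus(docs, "passages.jsonl")` with an iterable of `(doc_id, text)` pairs gives you the variant corpus to index. Overlapping windows are just one way to segment; splitting on sentences or paragraphs works equally well as long as the ids keep the `docid#n` shape.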
