Add SimpleIndexer bindings for on-the-fly indexing #1344

Merged
merged 2 commits into from
Nov 16, 2022

Conversation

lintool
Member

@lintool lintool commented Nov 15, 2022

This allows "on-the-fly" indexing - so we don't need to first write files to disk and then index.
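As a rough illustration of the idea (not the bindings added in this PR), the sketch below generates JSON documents in memory and hands each one directly to an `add` callable, with no intermediate files. The names `generate_docs` and `index_on_the_fly` are hypothetical, and a plain list stands in for a real indexer.

```python
import json
from typing import Callable, Iterable, Iterator, List, Tuple


def generate_docs(records: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield one JSON string per (doc_id, text) record, without touching disk."""
    for doc_id, text in records:
        yield json.dumps({"id": doc_id, "contents": text})


def index_on_the_fly(docs: Iterable[str], add: Callable[[str], None]) -> int:
    """Feed each JSON document straight to `add` and return how many were sent."""
    count = 0
    for doc in docs:
        add(doc)
        count += 1
    return count


# Stand-in for a real indexer (an assumption): it only collects documents in a list.
indexed: List[str] = []
records = [("0", "first passage"), ("1", "second passage")]
total = index_on_the_fly(generate_docs(records), indexed.append)
print(f"Total {total} docs indexed")
```

Because `generate_docs` is a generator, only one document is held in memory at a time, which is the property that makes skipping the write-to-disk step practical for large collections.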

Very bare bones... no tests, no documentation, etc.

But, works on MS MARCO passage:

python scripts/msmarco-passage/index_msmarco_passage.py \
  --input /System/Volumes/Data/store/collections/msmarco/passage \
  --index indexes/lucene-index-msmarco-passage

On my personal machine (iMac Pro), running time:

Total 8841823 docs indexed in 1284s

This is roughly 21m, which is consistent with the figures reported in castorini/anserini#2016 (18m on the Java end). Add a bit of time for Python overhead, and this passes the sanity check.
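The sanity check can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted in this comment; the variable names are illustrative, and the comparison assumes the 18m Java figure was measured under comparable conditions.

```python
docs = 8_841_823        # documents indexed, from the log line above
python_seconds = 1284   # wall-clock time of the Python-driven run
java_minutes = 18       # figure quoted from castorini/anserini#2016

python_minutes = python_seconds / 60
docs_per_second = docs / python_seconds
overhead = python_minutes / java_minutes - 1  # relative slowdown vs. Java

summary = (
    f"{python_minutes:.1f} min, "
    f"{docs_per_second:.0f} docs/s, "
    f"{overhead:.0%} over Java"
)
print(summary)  # 21.4 min, 6886 docs/s, 19% over Java
```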

PR not ready for review yet.

@lintool lintool marked this pull request as draft November 15, 2022 02:02
@ola13

ola13 commented Nov 15, 2022

What data format is currently expected as input?

@lintool
Member Author

lintool commented Nov 15, 2022

I'll demonstrate in a test case when I get a chance, but it's the same "standard" JSON format that Pyserini/Anserini takes:

{"id": "0", "contents": "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."}
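A minimal sketch of producing and checking documents in this shape, using only the Python standard library. The helper names `make_doc` and `parse_doc` are hypothetical and not part of Pyserini; the only assumption taken from the example above is that each document carries an `id` and a `contents` field.

```python
import json


def make_doc(doc_id: str, contents: str) -> str:
    """Serialize one document to a JSON string with `id` and `contents` fields."""
    return json.dumps({"id": doc_id, "contents": contents}, ensure_ascii=False)


def parse_doc(line: str) -> dict:
    """Parse a JSON document string and check that both fields are present."""
    doc = json.loads(line)
    missing = {"id", "contents"} - doc.keys()
    if missing:
        raise ValueError(f"document is missing fields: {sorted(missing)}")
    return doc


line = make_doc("0", "The presence of communication amid scientific minds ...")
doc = parse_doc(line)
print(doc["id"], len(doc["contents"]))
```

Using `json.dumps` rather than string formatting ensures that quotes and newlines inside `contents` are escaped correctly.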

I realize now that it takes a string representation of JSON, which creates a lot of needless string manipulation... but let's save that as a TODO for later and get the pipeline working first.

@lintool lintool marked this pull request as ready for review November 16, 2022 12:01
@lintool
Member Author

lintool commented Nov 16, 2022

I added basic documentation and a test case. @crystina-z @ToluClassics this is ready for review now.

@lintool lintool merged commit 4d61b56 into master Nov 16, 2022
@lintool lintool deleted the indexing branch November 16, 2022 16:42
4 participants