Add ability to append to existing index #1451

Merged
merged 2 commits into from
Feb 23, 2023

lintool
Member

lintool commented Feb 23, 2023

Python bindings for castorini/anserini#2062

Requested feature: #1443

This now works:

>>> from pyserini.index.lucene import LuceneIndexer, IndexReader
>>> indexer = LuceneIndexer('index')
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-02-23 15:32:17,020 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:115) - Using DefaultEnglishAnalyzer
2023-02-23 15:32:17,022 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:116) - Stemmer: porter
2023-02-23 15:32:17,022 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:117) - Keep stopwords? false
2023-02-23 15:32:17,023 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:118) - Stopwords file: null
>>> indexer.add('{"id": "0", "contents": "Document 0"}')
>>> indexer.close()
>>> reader = IndexReader("index")
>>> print(reader.stats())
{'total_terms': 2, 'documents': 1, 'non_empty_documents': 1, 'unique_terms': 2}
>>> 
>>> indexer = LuceneIndexer('index', append=True)
2023-02-23 15:32:23,091 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:115) - Using DefaultEnglishAnalyzer
2023-02-23 15:32:23,091 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:116) - Stemmer: porter
2023-02-23 15:32:23,091 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:117) - Keep stopwords? false
2023-02-23 15:32:23,091 INFO  [main] index.SimpleIndexer (SimpleIndexer.java:118) - Stopwords file: null
>>> indexer.add('{"id": "1", "contents": "Document 1"}')
>>> indexer.close()
>>> reader = IndexReader("index")
>>> print(reader.stats())
{'total_terms': 4, 'documents': 2, 'non_empty_documents': 2, 'unique_terms': -1}
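The two `stats()` results above can be compared programmatically. The sketch below is plain Python and does not call Pyserini; `stats_delta` is a hypothetical helper name. It assumes that a value of `-1` (as seen for `unique_terms` after the append) means "not available" rather than a real count, which is an interpretation and is not stated in this PR.

```python
from typing import Dict, Optional


def stats_delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, Optional[int]]:
    """Return the per-key change between two stats dicts.

    A value of -1 on either side is treated as "unknown" (an assumption),
    so the change for that key is reported as None instead of a number.
    """
    delta: Dict[str, Optional[int]] = {}
    for key, old in before.items():
        new = after.get(key, -1)
        delta[key] = None if old < 0 or new < 0 else new - old
    return delta


# Values copied from the session above.
before = {'total_terms': 2, 'documents': 1, 'non_empty_documents': 1, 'unique_terms': 2}
after = {'total_terms': 4, 'documents': 2, 'non_empty_documents': 2, 'unique_terms': -1}

delta = stats_delta(before, after)
print(delta)
# {'total_terms': 2, 'documents': 1, 'non_empty_documents': 1, 'unique_terms': None}
```

A positive `documents` delta is the quickest sign that the second run appended to the index instead of replacing it.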

@theyorubayesian btw, -threads is optional now, with a default.

lintool merged commit 3a8c0f8 into master on Feb 23, 2023
lintool deleted the indexing branch on February 23, 2023 21:49
@httplups

Hi! Is it necessary to pass append=True? Without it, append is False, and I can't tell whether the index is being rebuilt or not. When I tried with append=True, I got an error:

  File "/usr/local/lib/python3.10/dist-packages/pyserini/index/lucene/_indexer.py", line 66, in add_doc_raw
    self.object.addRawDocument(doc)
  File "jnius/jnius_export_class.pxi", line 877, in jnius.JavaMethod.call
  File "jnius/jnius_export_class.pxi", line 971, in jnius.JavaMethod.call_method
  File "jnius/jnius_utils.pxi", line 79, in jnius.check_exception
jnius.JavaException: JVM exception occurred: cannot change field "contents" from index options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS_AND_FREQS java.lang.IllegalArgumentException

@httplups

When I followed your example, without passing Append, the number of terms decreased:

root@6c6cd6ba0676:/app# python3 main.py
{'total_terms': 16, 'documents': 3, 'non_empty_documents': 3, 'unique_terms': 11}
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2024-06-14 19:13:33,315 INFO [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null
{'total_terms': 3, 'documents': 1, 'non_empty_documents': 1, 'unique_terms': 3}
root@6c6cd6ba0676:/app#

@lintool
Member Author

lintool commented Jun 14, 2024

Your first error stems from the fact that you're putting different things into the index - "cannot change field "contents" from index options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS_AND_FREQS"
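The kind of check behind that message can be illustrated with a toy model. This is plain Python, not Lucene or Pyserini code: `ToyIndex`, `IndexOptions`, and `add` are hypothetical names used only for illustration. The sketch assumes the simplest possible rule: the first document written to a field fixes that field's index options, and a later document that uses different options for the same field is rejected.

```python
from enum import Enum
from typing import Dict, List, Tuple


class IndexOptions(Enum):
    DOCS_AND_FREQS = 1
    DOCS_AND_FREQS_AND_POSITIONS = 2


class ToyIndex:
    """Toy stand-in for an index that enforces one set of options per field."""

    def __init__(self) -> None:
        self.field_options: Dict[str, IndexOptions] = {}
        self.docs: List[Tuple[str, str, str]] = []

    def add(self, doc_id: str, field: str, text: str, options: IndexOptions) -> None:
        # The first write to a field records its options; later writes must match.
        existing = self.field_options.setdefault(field, options)
        if existing is not options:
            raise ValueError(
                f'cannot change field "{field}" from index options={existing.name} '
                f'to inconsistent index options={options.name}'
            )
        self.docs.append((doc_id, field, text))


index = ToyIndex()
# Existing index: "contents" was written with positions.
index.add("0", "contents", "Document 0", IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)

try:
    # Appending without positions conflicts with what is already stored.
    index.add("1", "contents", "Document 1", IndexOptions.DOCS_AND_FREQS)
except ValueError as err:
    print(err)
```

In this model the append succeeds only when the new documents are indexed with the same options as the documents already in the index.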

> When I followed your example, without passing Append, the number of terms decreased:
>
> root@6c6cd6ba0676:/app# python3 main.py
> {'total_terms': 16, 'documents': 3, 'non_empty_documents': 3, 'unique_terms': 11}
> WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
> 2024-06-14 19:13:33,315 INFO [main] index.SimpleIndexer (SimpleIndexer.java:141) - Using DefaultEnglishAnalyzer
> 2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:142) - Stemmer: porter
> 2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:143) - Keep stopwords? false
> 2024-06-14 19:13:33,317 INFO [main] index.SimpleIndexer (SimpleIndexer.java:144) - Stopwords file: null
> {'total_terms': 3, 'documents': 1, 'non_empty_documents': 1, 'unique_terms': 3}
> root@6c6cd6ba0676:/app#

Can you show us the entire execution trace?

@NourOM02
Contributor

@lintool The add method you used looks like it is no longer supported:
[image]
When I tried to use the add_doc_dict method, I came across this error:
jnius.JavaException: JVM exception occurred: cannot change field "contents" from index options=DOCS_AND_FREQS_AND_POSITIONS to inconsistent index options=DOCS_AND_FREQS java.lang.IllegalArgumentException
I built my index using the following command:
command = f"python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input '{json_path}' \
  --index '{index_path}' \
  --generator DefaultLuceneDocumentGenerator \
  --threads {threads} \
  --storePositions --storeDocvectors --storeRaw"

I tried setting the args parameter of LuceneIndexer to [--storePositions, --storeDocvectors, --storeRaw], but I got another error:
jnius.JavaException: JVM exception occurred: No argument is allowed: storePositions org.kohsuke.args4j.CmdLineException

Could you please suggest a solution? Thanks in advance!
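As a side note on the indexing command quoted above, the same invocation can be assembled as an argument list instead of a single f-string, which avoids manual shell quoting of the paths. This is only a sketch: `build_index_command` is a hypothetical helper, the flags are copied verbatim from the command in this comment, and the example paths are placeholders. It does not address the args4j error itself.

```python
import shlex
from typing import List


def build_index_command(json_path: str, index_path: str, threads: int) -> List[str]:
    """Build the indexing command from this comment as a list of arguments."""
    return [
        "python", "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", json_path,
        "--index", index_path,
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
    ]


# Placeholder paths, chosen only for illustration.
cmd = build_index_command("data/my docs", "indexes/my-index", threads=4)

# shlex.join quotes arguments that contain spaces, for display or logging.
print(shlex.join(cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` without `shell=True`.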
