Academic Paper Semantic Search is a local search engine that provides plain-text search over a collection of PDF papers.
Search methods include neural-network-based search (BERT-based) and simple dictionary-based matching (BM25).
It requires human intelligence on top of artificial intelligence, so it will not be a local ChatGPT.
The current version is designed to run on laptops without a GPU, so large neural network models are not used. However, the search algorithm is modularized, so you can easily swap in state-of-the-art models. Lite versions are strongly recommended.
If you are interested in fine-tuning large models with limited GPU resources, I strongly recommend one of the BOOM papers below (or the first figure). Training a model with 100+ billion parameters requires a different skill set. It's easy to torture one Nvidia 2080 Ti for a year, but that is not enough.
- Create a virtual environment. `mamba` is recommended over `conda` for installing packages:
conda create -p env/ python=3.9
conda activate env/
pip install --upgrade pip
pip install farm-haystack[sql,only-faiss,inmemorygraph] streamlit st-annotated-text
- You may want the GPU version if possible
conda activate env/
pip install farm-haystack[only-faiss-gpu] transformers[torch]
Windows users may have trouble installing `faiss-gpu` via `farm-haystack`. An alternative is:
conda activate env/
conda install -c conda-forge faiss-gpu
- Copy the `data/db-*` folders to `data/` and run:
../env/python -m streamlit run ui/Search.py --server.runOnSave=true --server.address=127.0.0.1
This script starts the FastAPI query service:
%~dp0./env/python.exe -m uvicorn rest_api.search_rest_gunicorn:app --host 127.0.0.1 --port 7999 --workers 1
This script starts the webserver
%~dp0./env/python -m streamlit run ui/Search.py --server.runOnSave=true --server.address=127.0.0.1
Note: without `--server.address=127.0.0.1`, `streamlit` will listen on all network interfaces and expose the app beyond your machine.
- Install Zotero, and the plug-ins `ZotFile` and `DOI Manager`
  - Tools -> ZotFile Preferences -> use subfolder defined by: `[%a](%y){ %t}`
  - ZotFile Preferences -> Renaming Rules -> Format for all Items & Patents: `[%a](%y){ %t}`
  - ZotFile Preferences -> Tablet Settings: check `use ZotFile to send and get files from tablet`, and set `base folder`
- Import PDF papers into Zotero, obtain the `doi` and clean up metadata
- Select all PDFs, right-click -> Manage Attachments -> Send to Tablet
- Use `Adobe Pro` or `Abbyy` for batch text recognition
- Use Adobe Pro to recognize text and export the PDF to a Word document
- Use Pandoc to convert to plain text:
pandoc -f docx -i file_name.docx -t plain -o file_name.txt
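To convert a whole folder of exported Word documents, the pandoc call above can be wrapped in a short Python script. This is a sketch, not part of the repository; it assumes `pandoc` is on your PATH.

```python
import subprocess
from pathlib import Path

def docx_to_txt_cmd(docx_path: Path) -> list[str]:
    """Build the pandoc command that converts one .docx file to plain text."""
    return [
        "pandoc",
        "-f", "docx",
        "-i", str(docx_path),
        "-t", "plain",
        "-o", str(docx_path.with_suffix(".txt")),
    ]

def batch_convert(folder: str) -> None:
    """Convert every .docx file in `folder`; requires pandoc on PATH."""
    for docx in sorted(Path(folder).glob("*.docx")):
        subprocess.run(docx_to_txt_cmd(docx), check=True)
```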
Use a virtual environment to manage Python packages; many of them may conflict with your currently installed packages.
- Download the GROBID docker image and the python client. The CRF-only image is enough.
- Run `src/extract_text.py` to convert the PDFs to `tei.xml` format and parse them into plain text
- May need spell check
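GROBID's TEI output is plain XML, so the parsing step can be approximated with the standard library. A minimal sketch (the actual logic in `src/extract_text.py` may differ) that pulls paragraph text out of a TEI document:

```python
import xml.etree.ElementTree as ET

# GROBID emits documents in the TEI namespace
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def tei_to_plain_text(tei_xml: str) -> str:
    """Join the text of every <p> element in a TEI document into plain text."""
    root = ET.fromstring(tei_xml)
    paragraphs = ("".join(p.itertext()).strip()
                  for p in root.iterfind(".//tei:p", TEI_NS))
    return "\n".join(p for p in paragraphs if p)
```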
- `mamba` is recommended for installing packages
- Install the haystack python package or docker image
  - faiss must be installed; given the small number of documents, the CPU version is fast enough
  - If possible, `mamba install -c conda-forge libfaiss-avx2` is recommended
  - The `transformers[torch]` package is needed to embed documents
git clone https://github.com/kermitt2/grobid_client_python
- Run `src/build_database.py`
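Before indexing, the plain-text files have to be turned into document records. The exact schema used by `src/build_database.py` isn't shown here, but haystack stores documents as `content` plus `meta`, so a loader along these lines is a reasonable sketch:

```python
from pathlib import Path

def load_documents(txt_folder: str) -> list[dict]:
    """Read each .txt file into a haystack-style {"content", "meta"} dict."""
    docs = []
    for txt in sorted(Path(txt_folder).glob("*.txt")):
        docs.append({
            "content": txt.read_text(encoding="utf-8"),
            "meta": {"name": txt.stem},  # file name kept for display in the UI
        })
    return docs
```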
- Simple dictionary search uses BM25 and an in-memory database
  - Documents are chunked into passages of at most 300 words, with a 10-word overlap
  - The passage length may need tuning for better performance
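The sliding-window chunking described above (at most 300 words per passage, 10 words of overlap) is the kind of splitting haystack's `PreProcessor` performs with `split_by="word"`. A pure-Python sketch of the idea:

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 10) -> list[str]:
    """Split text into passages of at most max_words words,
    with consecutive passages sharing `overlap` words."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):  # last passage reached
            break
    return chunks
```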
- Neural-network-based search uses sentence-transformers to embed passages; the data are stored in FAISS
  - Documents are chunked into passages of at most 100 words
  - Embedding models used (a GPU is needed for fast processing):
    - sentence-transformers/multi-qa-mpnet-base-dot-v1
    - sentence-transformers/msmarco-distilbert-base-tas-b
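Both models above are trained for dot-product similarity, so once passages are embedded, retrieval reduces to a maximum-inner-product search, which is the operation FAISS performs. A minimal numpy sketch of the scoring step (the real pipeline delegates this to haystack and FAISS):

```python
import numpy as np

def dot_product_search(query_vec: np.ndarray, doc_vecs: np.ndarray,
                       top_k: int = 3) -> list[int]:
    """Return indices of the top_k passages by dot-product similarity."""
    scores = doc_vecs @ query_vec          # one score per passage
    return np.argsort(-scores)[:top_k].tolist()
```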
- Databases are in the `data` folder. Copy `db-faiss` and `db-inmemory` to `deploy/data/`
- Copy the haystack-demos and modify the scripts in the `ui` folder. The main part is `webapp.py`
- The sample scripts use the `haystack` API and docker, but you can run your script directly without docker
Windows users who have an Intel CPU and want faster matching speed may want to compile the avx2 version from source. You can link the MKL library for extra speed.
- Install `visual studio 2019` (desktop development with C++), `cuda toolkit`, `Intel OneAPI toolkit`, and `swig`
- Use a conda env to activate the desired python version
- Download the latest `faiss` release.
Assume install with default settings, in
cmd
, activate environment variablesconda activate "path to desire python env" "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
The MKL library will be loaded automatically. If you don't need GPU, set `-DFAISS_ENABLE_GPU=OFF`.
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin\cmake.exe" -B build ^
  -DFAISS_ENABLE_PYTHON=ON ^
  -DFAISS_ENABLE_GPU=ON ^
  -DBUILD_SHARED_LIBS=ON ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DFAISS_OPT_LEVEL=avx2 ^
  -DBUILD_TESTING=OFF
`maxCpuCount` defaults to 1; including the switch without a number uses all cores:
MSBuild.exe build/faiss/faiss_avx2.vcxproj /property:Configuration=Release /maxCpuCount:12
Or you may use Visual Studio to open `faiss/build/ALL_BUILD.vcxproj`, select `release`, and build `swigfaiss_avx2`
- Build the python wheel:
cd build/faiss/python/
python setup.py bdist_wheel
python setup.py install
- Spell check for plain text
- Fine-tune embedding models
- Check quality & runtime of a joint model: combine multiple embedding models for neural-network-based search