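This project uses SentenceTransformer to vectorize dictionary entries and builds a Faiss vector index over them, so that search terms containing typos, homophones, visually similar characters, or missing characters can still recall the intended entry.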
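The recall idea can be illustrated without any third-party packages. The sketch below is not the project's code: a toy character n-gram embedding stands in for the SentenceTransformer model, and a linear scan over squared L2 distances stands in for the Faiss index. The names `embed`, `build_index`, and `search` are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy embedding (assumption): L2-normalised counts of characters and bigrams."""
    grams = list(text) + [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(grams)
    norm = sqrt(sum(v * v for v in counts.values())) or 1.0
    return {gram: v / norm for gram, v in counts.items()}


def squared_l2(a, b):
    """Squared Euclidean distance between two sparse vectors stored as dicts."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in a.keys() | b.keys())


def build_index(entries):
    """Embed every (code, word) dictionary entry once, up front."""
    return [(code, word, embed(word)) for code, word in entries]


def search(index, query, top=1):
    """Return the `top` nearest entries as (code, word, distance) tuples."""
    q = embed(query)
    scored = sorted((squared_l2(q, vec), code, word) for code, word, vec in index)
    return [(code, word, dist) for dist, code, word in scored[:top]]


# Entries taken from the sample responses in this README.
index = build_index([("24317", "阿伐那非片"), ("4598", "双瓜糖安胶囊")])
print(search(index, "阿代那非", top=1))
```

A misspelled query still shares most of its characters with the correct entry, so that entry has the smallest distance. A real sentence embedding model and a Faiss index apply the same principle with learned vectors and a much faster search.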
- Python 3.11+
- 单氨胖光酸
http://localhost:8080/search?word=单氨胖光酸&top=1&pinyin=1
{ "code": 1, "message": "success", "result": [ { "index": "WORD", "code": "15273", "word": "甘草酸单铵半胱氨酸氯化钠注射液", "score": 5, "distance": 0.375848472118378 } ], "micro": 110706 }
- 阿代那非
http://localhost:8080/search?word=阿代那非&top=1&pinyin=1
{ "code": 1, "message": "success", "result": [ { "index": "WORD", "code": "24317", "word": "阿伐那非片", "score": 7, "distance": 0.395916491746902 } ], "micro": 154205 }
- 霜瓜唐安
http://localhost:8080/search?word=霜瓜唐安&top=1&pinyin=1
{ "code": 1, "message": "success", "result": [ { "index": "PINYIN", "code": "4598", "word": "双瓜糖安胶囊", "score": 4, "distance": 0.0730657055974007 } ], "micro": 129010 }
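The same requests can be issued from Python. The following is a minimal client sketch using only the standard library. It assumes the server is listening on localhost:8080 and that responses have the shape shown above. The helper names `build_search_url`, `parse_search_response`, and `search_words` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # assumption: server started with -port=8080


def build_search_url(word, top=1, pinyin=1, base_url=BASE_URL):
    """Build the /search URL; urlencode percent-encodes the Chinese query."""
    query = urlencode({"word": word, "top": top, "pinyin": pinyin})
    return f"{base_url}/search?{query}"


def parse_search_response(body):
    """Extract (word, distance) pairs from a /search JSON response body."""
    payload = json.loads(body)
    return [(hit["word"], hit["distance"]) for hit in payload.get("result", [])]


def search_words(word, top=1, pinyin=1):
    """Query a running server; raises URLError if the service is not up."""
    with urlopen(build_search_url(word, top, pinyin), timeout=10) as resp:
        return parse_search_response(resp.read().decode("utf-8"))


# Usage, with the server running:
#   search_words("阿代那非")
```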
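dict/dict_words.csv: the dictionary file, one entry per line.
- Each entry has two string fields: the entry code and the entry text.
- The dictionary file can also be uploaded through the HTTP POST endpoint /put: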
curl -X POST -F "file=@xxx.csv" http://localhost:8080/put
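A dictionary file in this two-field layout can be generated with Python's csv module. This sketch assumes UTF-8 encoding, no header row, and default CSV quoting, none of which is confirmed by the project. `write_dictionary` is a hypothetical helper.

```python
import csv


def write_dictionary(path, entries):
    """Write (code, word) pairs as a two-column CSV, one entry per row.

    Assumptions: UTF-8 encoding, no header row, default (minimal) quoting.
    """
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for code, word in entries:
            writer.writerow([str(code), word])


# Usage: entries taken from the sample responses in this README.
#   write_dictionary("dict_words.csv", [("24317", "阿伐那非片"), ("4598", "双瓜糖安胶囊")])
```

The resulting file can then be uploaded with the curl command above in place of xxx.csv.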
- Show help
  - From source
    python main.py -h
  - From the executable
    cd /path/to/vector-search && ./vector-search -h
- Build the index
  - From source
    python main.py index -worker=4 -batch=1000 -min=2 -max=4
  - From the executable
    cd /path/to/vector-search && ./vector-search index -worker=4 -batch=1000 -min=2 -max=4
- Start the server
  - From source
    python main.py server -port=8080 -log-level=info
  - From the executable
    cd /path/to/vector-search && ./vector-search server -port=8080 -log-level=info
- Open the help page
- Download the packaged Windows program from the release page
- Create and activate a virtual environment
python -m venv .venv && source .venv/bin/activate
- Install the dependencies
pip install -r requirements.txt
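- Download the pretrained model
from sentence_transformers import SentenceTransformer
# Fetch the multilingual model from the Hugging Face hub
model = SentenceTransformer('sentence-transformers/distiluse-base-multilingual-cased-v1')
# Save a local copy under model/
model.save('model/distiluse-base-multilingual-cased-v1')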
- Package the program
pyinstaller ./vector-search.spec
- This example runs on the CPU. If a GPU is available, you can enable GPU acceleration by replacing the faiss-cpu dependency with faiss-gpu.