A Data-Efficient Nearest-Neighbor Language Model via Lightweight Nets

Qinhao Zhou¹ Xiang Xiang¹ Ke Wang² Yuqi Zhang³

¹School of Artificial Intelligence and Automation, Huazhong University of Science and Technology

²Alibaba Group

³Nat'l Practice Base for Outstanding Engineers (Digital Tech)

Code for our CCMT 2024 paper "A Data-Efficient Nearest-Neighbor Language Model via Lightweight Nets". Please cite our paper if you find this repository helpful in your research:

@inproceedings{zhouxiang-lightweight-ccmt-2024,
    title = "A Data-Efficient Nearest-Neighbor Language Model via Lightweight Nets",
    author = "Qinhao Zhou and Xiang Xiang and Ke Wang and Yuqi Zhang",
    month = Nov,
    year = "2024",
}

This project is based on adaptive kNN-MT. The implementation is built upon fairseq and heavily inspired by knn-lm.

Requirements and Installation

pytorch version >= 1.5.0
python version >= 3.6
faiss-gpu >= 1.6.5
pytorch_scatter = 2.0.5
1.19.0 <= numpy < 1.20.0
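Before running the scripts, it can help to confirm that an environment satisfies the version bounds listed above. The following is a minimal sketch using only the Python standard library. The helper names `parse_version` and `in_range` are hypothetical and not part of this repository, and the parser only reads the leading dotted numeric part of a version string (a suffix such as `+cu101` is ignored).

```python
import re


def parse_version(text):
    """Turn '1.19.5' or '1.5.0+cu101' into a tuple of ints such as (1, 19, 5)."""
    match = re.match(r"\d+(\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a dotted numeric version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_range(version, minimum=None, below=None):
    """Check minimum <= version < below; either bound may be omitted."""
    parsed = parse_version(version)
    if minimum is not None and parsed < parse_version(minimum):
        return False
    if below is not None and parsed >= parse_version(below):
        return False
    return True


# Bounds copied from the requirement list; the version strings are examples.
print(in_range("1.19.5", minimum="1.19.0", below="1.20.0"))  # numpy inside the range
print(in_range("1.20.1", minimum="1.19.0", below="1.20.0"))  # numpy too new
print(in_range("1.5.0+cu101", minimum="1.5.0"))              # pytorch with a build suffix
```

In practice the installed version strings could be read from each package's `__version__` attribute and passed to `in_range`.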

Run the Code

In line with other works based on kNN-MT, our code is designed to support the following four datasets by default:

IT	Medical	Koran	Law
3613350	6903320	524400	19070000
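Assuming the numbers in the table are datastore sizes (the number of key-value entries per domain), a rough storage estimate can be sketched as below. The key dimension (1024), key precision (2 bytes, fp16) and value width (8 bytes) are illustrative assumptions, not values confirmed by this repository, and `datastore_bytes` is a hypothetical helper.

```python
# Entry counts copied from the table above.
DATASTORE_SIZES = {"IT": 3613350, "Medical": 6903320, "Koran": 524400, "Law": 19070000}


def datastore_bytes(num_entries, key_dim=1024, key_bytes=2, value_bytes=8):
    """Raw size of keys plus values, ignoring any index overhead.

    key_dim, key_bytes and value_bytes are assumed defaults (1024-dim fp16
    keys, 64-bit token ids); adjust them to the model actually used.
    """
    return num_entries * (key_dim * key_bytes + value_bytes)


for name, size in DATASTORE_SIZES.items():
    gib = datastore_bytes(size) / 2**30
    print(f"{name}: {size} entries, about {gib:.1f} GiB")
```

Under these assumptions the largest domain needs several times the disk space of the smallest, which is worth checking before the datastore is created.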

The data can be downloaded from this site. The pre-trained model checkpoint from this site needs to be downloaded before running the code.

Train

First, construct the datastore:

bash ./sh/create_datastore.sh
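Conceptually, a kNN-LM style datastore pairs each context representation (the key) with the token that followed it (the value). The toy sketch below illustrates that idea with hand-written 2-dimensional vectors in place of real decoder hidden states. `build_datastore` is a hypothetical helper and does not reproduce what `create_datastore.sh` does internally.

```python
def build_datastore(examples):
    """Collect (key, value) pairs from (hidden_states, target_tokens) examples.

    Each example holds one hidden-state vector per target position; the key is
    that vector and the value is the token observed at that position.
    """
    keys, values = [], []
    for hidden_states, target_tokens in examples:
        if len(hidden_states) != len(target_tokens):
            raise ValueError("need exactly one hidden state per target token")
        for vector, token in zip(hidden_states, target_tokens):
            keys.append(tuple(vector))
            values.append(token)
    return keys, values


# Two toy "sentences" with made-up 2-dim hidden states (illustrative only).
examples = [
    ([[0.1, 0.9], [0.8, 0.2]], ["open", "file"]),
    ([[0.2, 0.8], [0.7, 0.3], [0.5, 0.5]], ["close", "file", "now"]),
]
keys, values = build_datastore(examples)
print(len(keys), values)
```

A real datastore holds one entry per target token in the training data, so it is usually written to disk rather than kept in Python lists.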

Then, use faiss to build the datastore index. This step significantly improves the training speed.

bash ./sh/build_faiss_index.sh
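To see why an index helps, consider the exhaustive search it replaces. The sketch below is a plain linear scan over a toy datastore, written with the standard library only. It is not faiss and not this repository's code, and `knn_search` is a hypothetical name. Its cost grows linearly with the number of keys, which is the cost a prebuilt nearest-neighbor index is meant to avoid.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_search(keys, values, query, k):
    """Return the k nearest (distance, value) pairs by scanning every key."""
    scored = ((squared_l2(key, query), value) for key, value in zip(keys, values))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[0])


# Toy datastore: four 2-dim keys with one token each (illustrative only).
toy_keys = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
toy_values = ["a", "b", "c", "d"]
for distance, token in knn_search(toy_keys, toy_values, query=(0.9, 0.1), k=2):
    print(token, round(distance, 2))
```

With millions of entries per domain, scanning every key for every query token becomes the bottleneck, so the retrieval step is normally delegated to an index built once up front.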

Finally, run the training script:

bash ./sh/train.sh

Test

For inference, run:

bash ./sh/inference.sh
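For background, nearest-neighbor language models in the kNN-LM family typically combine the base model's next-token distribution with a distribution built from the retrieved neighbors. The sketch below shows that generic interpolation. The fixed weight `lam` and the `temperature` are illustrative assumptions, and the function names are hypothetical; how this repository sets these quantities is not described here and may differ.

```python
import math


def knn_distribution(neighbors, temperature=10.0):
    """Softmax over negative distances, with weights summed per retrieved token."""
    if not neighbors:
        raise ValueError("need at least one retrieved neighbor")
    # Subtract the smallest distance for numerical stability; the result is unchanged.
    d_min = min(distance for distance, _ in neighbors)
    weights = {}
    for distance, token in neighbors:
        w = math.exp(-(distance - d_min) / temperature)
        weights[token] = weights.get(token, 0.0) + w
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_model, p_knn, lam=0.5):
    """Mix the two distributions: lam * p_knn + (1 - lam) * p_model."""
    vocab = set(p_model) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1 - lam) * p_model.get(t, 0.0) for t in vocab}


# Toy inputs: retrieved (distance, token) pairs and a made-up model distribution.
neighbors = [(0.5, "file"), (0.7, "file"), (3.0, "disk")]
p_model = {"file": 0.5, "disk": 0.25, "folder": 0.25}
p_final = interpolate(p_model, knn_distribution(neighbors), lam=0.5)
print(max(p_final, key=p_final.get))
```

Tokens that appear among the retrieved neighbors gain probability mass, while tokens supported only by the base model keep a share scaled by `1 - lam`.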
