Dialogue models are often enriched with extensive external knowledge to provide informative responses through a retrieval-augmented pipeline. Nevertheless, retrieval-augmented approaches rely on finely annotated retrieval training data and knowledge-grounded response generation data, making them costly to transfer.
To tackle this challenge, this paper proposes a retrieval-free approach, KiDG, which automatically turns knowledge documents into simulated multi-turn dialogues through a Multi-Document Traversal algorithm. The simulated knowledge-intensive dialogues constructed by KiDG in one domain can easily be used to train and enhance pre-trained dialogue models' knowledge of this domain without costly annotation.
We conduct extensive experiments comparing retrieval-augmented models and a variety of retrieval-free models. We find that dialogue models enhanced with data simulated by KiDG largely outperform state-of-the-art retrieval-free methods, and achieve performance comparable to retrieval-augmented methods while being better and cheaper at domain transfer.
Clone the repo and install the required packages:

```bash
git clone https://github.com/DevoAllen/KiDG.git
cd KiDG
pip install -r requirements.txt
```
Out of respect for these open-source projects, please download them yourself and place them under the `supply` path.
The word embeddings we use are the Tencent AI Lab Embedding Corpora.
```bash
cd supply
mkdir word2vec
wget https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0-s.tar.gz
```
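After the download, extract the archive into `supply/word2vec` (e.g. with `tar -xzf`). The following is a minimal sanity-check sketch, not part of the repo: it assumes the extracted `.txt` file is in standard word2vec text format and that its name matches the archive name, both of which you may need to adjust.

```python
# Minimal sanity check for the Tencent embeddings (illustrative only).
# Assumes the archive was extracted under supply/word2vec and that the file
# is in standard word2vec text format; the exact file name is an assumption.
from gensim.models import KeyedVectors

emb_path = "supply/word2vec/tencent-ailab-embedding-zh-d200-v0.2.0-s.txt"
wv = KeyedVectors.load_word2vec_format(emb_path, binary=False)

print(wv.vector_size)                   # should report 200-dimensional vectors
print(wv.most_similar("知识", topn=5))   # nearest neighbours of a sample word
```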
The knowledge graph we use is ownthink. Please put the downloaded file under `supply/KG`.
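As a quick way to inspect the downloaded dump, the sketch below assumes ownthink ships as a CSV of (entity, attribute, value) triples named `ownthink_v2.csv`; the file name and column layout are assumptions, so adjust them to the release you actually download.

```python
# Peek at the first few ownthink triples (illustrative only; the file name
# and the (entity, attribute, value) column layout are assumptions).
import csv

kg_path = "supply/KG/ownthink_v2.csv"  # hypothetical file name
with open(kg_path, encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)              # e.g. ["实体", "属性", "值"]
    for i, row in enumerate(reader):
        print(row)                     # one (entity, attribute, value) triple per row
        if i >= 4:
            break
```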
In this paper, we trained a chinese-roberta-wwm-ext-large model with SimCSE as the backbone for BERTScore to improve performance. You can download it from sentEmbed and place it under the `supply/sentEmbed` path.
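A hedged sketch of how the downloaded sentence encoder can be used to score sentence similarity is shown below; loading it with `transformers` and taking the `[CLS]` vector as the sentence embedding are assumptions, not necessarily the exact pooling the repo uses.

```python
# Score sentence similarity with the SimCSE-trained encoder under supply/sentEmbed.
# Illustrative only: [CLS] pooling is an assumption about the pooling strategy.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "supply/sentEmbed"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]   # [CLS] vector per sentence

a, b = embed(["今天天气很好", "今天天气不错"])
print(torch.cosine_similarity(a, b, dim=0).item())
```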
```bash
bash runKiDG.sh
```
This script builds the KiDG graph and traverses it to obtain sentence sequences.
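To give a rough intuition for what "traverse the graph to obtain sentence sequences" means, here is a purely illustrative greedy walk over a sentence-similarity graph. It is not the Multi-Document Traversal algorithm from the paper or the logic inside `runKiDG.sh`; it only shows the shape of the output, namely an ordered sequence of sentences.

```python
# Purely illustrative: a greedy nearest-neighbour walk over sentence embeddings.
# This is NOT the paper's Multi-Document Traversal algorithm.
import numpy as np

def greedy_walk(embeddings: np.ndarray, start: int = 0, length: int = 5):
    """Hop repeatedly to the most similar not-yet-visited sentence."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    visited, current = [start], start
    while len(visited) < min(length, len(emb)):
        sims = emb @ emb[current]
        sims[visited] = -np.inf        # never revisit a sentence
        current = int(np.argmax(sims))
        visited.append(current)
    return visited                     # an ordered sentence sequence (by index)

rng = np.random.default_rng(0)
print(greedy_walk(rng.normal(size=(10, 200))))  # e.g. [0, 7, 3, ...]
```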
We use bart-chinese-large as the backbone for the inpainting model. Please collect dialogue corpora on your own.
Training the inpainting model is essentially the token-infilling task of BART: given a dialogue, some utterances are masked and the model learns to reconstruct them.
For a code reference, see the BART training examples.
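Below is a hedged sketch of how a single dialogue can be turned into such an infilling example: one utterance is replaced by the mask token and the model is trained to reconstruct the original dialogue. The checkpoint id `fnlp/bart-large-chinese` and the choice to mask exactly one utterance are assumptions for illustration, not the repo's exact setup.

```python
# Illustrative construction of one BART text-infilling training example from a
# dialogue. The model id and single-utterance masking are assumptions.
import random
from transformers import BertTokenizer, BartForConditionalGeneration

model_name = "fnlp/bart-large-chinese"   # hypothetical checkpoint; use your own
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

dialogue = ["你好，最近在看什么书？", "在看一本讲知识图谱的书。", "听起来不错，讲了什么内容？"]

# Corrupt the dialogue by masking one utterance; the original dialogue is the target.
masked_idx = random.randrange(len(dialogue))
corrupted = [tokenizer.mask_token if i == masked_idx else u for i, u in enumerate(dialogue)]

inputs = tokenizer(" ".join(corrupted), return_tensors="pt")
labels = tokenizer(" ".join(dialogue), return_tensors="pt").input_ids

loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss          # standard seq2seq reconstruction loss
print(float(loss))
```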
This repo benefits from CPT, SimCSE, ownthink, Tencent AI Lab Embedding Corpora.
Thanks for their wonderful work!
If you find our project helpful, please star our repo and cite our paper as follows:
```bibtex
@inproceedings{wang-etal-2023-retrieval,
    title = "Retrieval-free Knowledge Injection through Multi-Document Traversal for Dialogue Models",
    author = "Wang, Rui and
      Bao, Jianzhu and
      Mi, Fei and
      Chen, Yi and
      Wang, Hongru and
      Wang, Yasheng and
      Li, Yitong and
      Shang, Lifeng and
      Wong, Kam-Fai and
      Xu, Ruifeng",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2023",
    url = "https://aclanthology.org/2023.acl-long.364",
    doi = "10.18653/v1/2023.acl-long.364",
    pages = "6608--6619",
}
```