Code and resources for our EMNLP-Findings'23 paper "DelucionQA: Detecting Hallucinations in Domain-specific Question Answering"

boschresearch/DelucionQA

This repository contains the dataset proposed in the EMNLP 2023 (Findings) paper titled "DelucionQA: Detecting Hallucinations in Domain-specific Question Answering."

If you have any difficulty downloading the dataset, please raise an issue in this repository or contact us at sadat.mobashir@gmail.com.

Abstract

Hallucination is a well-known phenomenon in language generation for large language models (LLMs). The existence of hallucinatory responses is found in almost all application scenarios e.g., summarization, question-answering (QA) etc. For applications requiring high reliability (e.g., customer-facing assistants), the potential existence of hallucination in LLM-generated text is a critical problem. The amount of hallucination can be reduced by leveraging information retrieval to provide relevant background information to the LLM. However, LLMs can still generate hallucinatory content for various reasons (e.g., prioritizing its parametric knowledge over the context, failure to capture the relevant information from the context, etc.). Detecting hallucinations through automated methods is thus paramount. To facilitate research in this direction, we introduce a sophisticated dataset, DelucionQA, that captures hallucinations made by retrieval-augmented LLMs for a domain-specific QA task. Furthermore, we propose a set of hallucination detection methods to serve as baselines for future works from the research community. Analysis and case study are also provided to share valuable insights on hallucination phenomena in the target scenario.

Dataset Description

DelucionQA is derived from the 2023 Jeep Gladiator car manual. First, we use an LLM to generate candidate questions based on the manual. Then, using various information retrieval methods, we retrieve context relevant to each question. Next, we prompt ChatGPT to answer the question based on the information available in the context. After generating the <question, retrieved context, generated answer> triples, we use the Amazon Mechanical Turk (MTurk) platform to annotate whether each sentence of the generated answer in a given triple is supported by, conflicted with, or neither supported by nor conflicted with the context. Based on the sentence-level annotations, we assign a binary "Hallucinated"/"Not Hallucinated" label to each triple. For an in-depth description of the dataset construction process, please refer to our paper.
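As a rough illustration of the last step, the sketch below aggregates sentence-level labels into a triple-level label. The aggregation rule (any sentence that is not supported makes the triple "Hallucinated"), the function name, and the sentence label strings are assumptions made for illustration; the paper describes the rule that was actually used.

```python
def triple_label(sentence_labels):
    """Aggregate sentence-level labels into one triple-level label.

    Assumed rule (not confirmed by this README): a triple counts as
    "Not Hallucinated" only if every sentence is labeled "supported".
    The label strings used here are hypothetical.
    """
    if all(label == "supported" for label in sentence_labels):
        return "Not Hallucinated"
    return "Hallucinated"


print(triple_label(["supported", "supported"]))
print(triple_label(["supported", "conflicted"]))
print(triple_label(["supported", "neither"]))
```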

Data Statistics

| Split | #Questions | #Triples | #Hallucinated | #Not Hallucinated |
| --- | --- | --- | --- | --- |
| Train | 513 | 1,151 | 392 | 759 |
| Dev | 100 | 216 | 94 | 122 |
| Test | 300 | 671 | 252 | 419 |
| Total | 913 | 2,038 | 738 | 1,300 |

Table 1: Number of unique questions, number of triples and label distribution in each split of DelucionQA.
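To get a feel for the class balance, the counts in Table 1 can be turned into per-split hallucination rates. The snippet below is a small sketch that only uses the numbers from the table; the dictionary layout and function name are illustrative choices.

```python
# Triple-level counts copied from Table 1.
splits = {
    "Train": {"triples": 1151, "hallucinated": 392, "not_hallucinated": 759},
    "Dev": {"triples": 216, "hallucinated": 94, "not_hallucinated": 122},
    "Test": {"triples": 671, "hallucinated": 252, "not_hallucinated": 419},
}


def hallucinated_share(counts):
    """Percentage of triples in a split that are labeled "Hallucinated"."""
    return 100.0 * counts["hallucinated"] / counts["triples"]


for name, counts in splits.items():
    # Sanity check: the two label counts add up to the number of triples.
    assert counts["hallucinated"] + counts["not_hallucinated"] == counts["triples"]
    print(f"{name}: {hallucinated_share(counts):.1f}% hallucinated")
```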

Files

=> Our dataset is located in the "./data/DelucionQA_final/" directory. The files named "train.csv", "dev.csv", and "test.csv" contain the training, development, and test sets, respectively. Each file has the following columns:

* 'sample_id': a unique id for each sample.
* 'Retreival Setting': the information retrieval method that was used to retrieve the context in each sample.
* 'Question': the question posed about the car manual.
* 'Context': the retrieved context relevant to the question, in numeric format. Please follow the instructions below to convert the context to textual format.
* 'Answer': answer generated by the LLM for the given Question and Context.
* 'Answer_sent_tokenized': sentence-tokenized version of the generated Answer.
* 'Sentence_labels': labels for each sentence in the answer indicating whether it is supported/conflicted/neither supported nor conflicted with respect to the Context.
* 'Label': label indicating whether the answer contains hallucination or not.
* 'Answerable': True/False labels indicating whether the Question is answerable based on the retrieved Context.
* 'Does_not_answer': True/False labels indicating whether the answer implies "I don't know" i.e., the LLM refuses to provide an answer.
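A minimal sketch of reading one of these files with Python's standard `csv` module is shown below. The inline rows are invented placeholders that mimic only a subset of the columns listed above, and the assumption that 'Label' stores the strings "Hallucinated"/"Not Hallucinated" is ours; check the real files for the exact values.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows (not taken from the dataset), covering only
# some of the columns described above.
SAMPLE_CSV = """sample_id,Question,Answer,Label,Answerable,Does_not_answer
1,How do I fold the mirrors?,Push the mirror toward the door.,Not Hallucinated,True,False
2,What is the towing capacity?,It is unlimited.,Hallucinated,True,False
"""


def load_rows(handle):
    """Read all rows of a CSV file object into a list of dicts."""
    return list(csv.DictReader(handle))


def label_counts(rows):
    """Count how often each value of the 'Label' column occurs."""
    return Counter(row["Label"] for row in rows)


rows = load_rows(io.StringIO(SAMPLE_CSV))
print(label_counts(rows))

# With the real data, the same helpers could be used like this:
# with open("./data/DelucionQA_final/train.csv", newline="", encoding="utf-8") as f:
#     rows = load_rows(f)
```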

=> We provide a collection of 240 samples which we excluded from the three splits of DelucionQA. These samples were selected using a rule-based method which detects if the answer provided by ChatGPT implies "I don't know." These examples are included in a file named "unanswerable.csv" along with their 'Answerable' labels. Note that there can still be examples which are not answerable in the train/test/dev splits which were undetected by our rule-based method (i.e., they imply "I don't know" in a more subtle manner).
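To give an idea of what such a rule-based filter can look like, here is a small sketch. The patterns and the function name are hypothetical examples and are not the rules we used to build "unanswerable.csv".

```python
import re

# Hypothetical refusal patterns, for illustration only.
REFUSAL_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bnot (?:mentioned|provided|specified) in the (?:context|manual)\b",
    r"\bno information\b",
]


def implies_i_dont_know(answer):
    """Return True if the answer matches one of the refusal patterns."""
    text = answer.lower()
    return any(re.search(pattern, text) for pattern in REFUSAL_PATTERNS)


print(implies_i_dont_know("I don't know the answer to that."))
print(implies_i_dont_know("Press the unlock button twice."))
```

A fixed pattern list like this misses indirect refusals, which is the reason some unanswerable examples can remain in the splits.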

Reconstructing the context

Due to licensing issues, we are releasing the context retrieved for each question from the car-manual in a numeric format. Here are the instructions to convert it to a textual format:

  • Create a Python virtual environment (e.g., with conda), then install the Python packages (we tested our code with Python 3.11):

    install.sh
    

    The installation might need sudo privileges to run "playwright install-deps".

  • From the project folder, run the script that (1) crawls the data and (2) performs the conversion:

    run.sh
    
  • run.sh contains two Python commands. The first one crawls the car-manual data into a file named 'data.jsonl':

    PYTHONPATH=src python -m crawler.main
    
    • params:
      • input_url_file: each line contains the metadata of a target vehicle as tab-separated fields in the format (BrandName ModelName Year URL_to_Vehicle).
      • output_folder: location of the crawled data. A 'data.jsonl' file will eventually be generated in the './data/data_for_index' folder (Jeep Gladiator 2023).
  • After the data has been crawled, the 2nd step is to run the following command:

    PYTHONPATH=src python -m context_reconstruction.main_convert_context_to_textual
    
    • params:
      • base: location of a directory containing the train, test and dev files in CSV format.
      • full_text_location: location of the 'data.jsonl' file produced by the crawling step, which contains the full text of the Jeep Gladiator 2023 manual.
  • Please note:

  1. We use hydra for configuration, so please run the commands from the project root folder so that the hydra config can be read. Crawling (1st step) creates a 'data.jsonl' file, which is one of the inputs for the 2nd step (context conversion).
  2. The crawler is implemented against the website's current content and might need updating if the target page changes substantially.
  3. The crawler has a 'headless' parameter in the file 'config/config_crawl.yaml', which is set to True by default so that the website is visited headlessly (no browser GUI is opened). If 'headless' is set to False, a browser (Chromium) is opened to visit the target URL and follow the links in the target page for crawling.
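The tab-separated line format expected in `input_url_file` (see the crawler params above) can be sketched as follows. The `VehicleEntry` type, the helper names, and the example URL are placeholders introduced for illustration; they are not part of the repository code.

```python
from typing import NamedTuple


class VehicleEntry(NamedTuple):
    """One line of input_url_file: BrandName, ModelName, Year, URL_to_Vehicle."""
    brand: str
    model: str
    year: str
    url: str


def parse_vehicle_line(line):
    """Split one tab-separated line into its four fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}")
    return VehicleEntry(*fields)


def format_vehicle_line(entry):
    """Join the four fields back into one tab-separated line."""
    return "\t".join(entry)


# The URL below is a placeholder, not the real manual location.
entry = parse_vehicle_line("Jeep\tGladiator\t2023\thttps://example.com/manual\n")
print(entry.brand, entry.model, entry.year)
```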

Baseline Performance

We experiment with several baseline methods for DelucionQA. Their Macro-F1 scores on all three splits are shown below. The baseline methods are described in detail in our paper.

| Method | Train | Dev | Test |
| --- | --- | --- | --- |
| Sim-cosine | 70.03 | 74.78 | 69.45 |
| Sim-overlap | 75.59 | 76.84 | 71.09 |
| Sim-hybrid | 75.94 | 76.84 | 70.81 |
| Keyword-match | 53.86 | 50.57 | 52.77 |

Table 2: Macro F1 scores of our baseline methods on the three splits of DelucionQA.
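For reference, Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below computes it in plain Python on a toy example; the function names and the toy labels are illustrative and this is not our evaluation script. The table reports scores on a 0 to 100 scale, so multiply the result by 100 to compare.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels: 1 = "Hallucinated", 0 = "Not Hallucinated".
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
print(f"Macro F1: {100 * macro_f1(y_true, y_pred):.2f}")
```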

Citation

If you use this dataset, please cite our paper:

@inproceedings{sadat-etal-2023-delucionqa,
    title = "{D}elucion{QA}: Detecting Hallucinations in Domain-specific Question Answering",
    author = "Sadat, Mobashir  and
      Zhou, Zhengyu  and
      Lange, Lukas  and
      Araki, Jun  and
      Gundroo, Arsalan  and
      Wang, Bingqing  and
      Menon, Rakesh  and
      Parvez, Md  and
      Feng, Zhe",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.59",
    doi = "10.18653/v1/2023.findings-emnlp.59",
    pages = "822--835",
}

License

The code in this repository is open-sourced under the AGPL-3.0 license. See the LICENSE file for details. For a list of other open source components included in this project, see the file 3rd-party-licenses.txt.

The data folder contains files that are used for reconstructing the DelucionQA data, which are licensed under Creative Commons Attribution 4.0 International License (CC-BY-4.0).

Contact

Please contact us at zhengyu.zhou2@bosch.com, msadat3@uic.edu, sadat.mobashir@gmail.com with any questions.
