Skip to content

WikiChat is an improved RAG. It stops the hallucination of large language models by retrieving data from a corpus.

License

Notifications You must be signed in to change notification settings

stanford-oval/WikiChat

Repository files navigation

WikiChat Logo

WikiChat
arXiv Github Stars

Stopping the Hallucination of Large Language Models

Online demo: https://wikichat.genie.stanford.edu

Demo.Video.mp4

Table of Contents

Introduction

Large language model (LLM) chatbots like ChatGPT and GPT-4 get things wrong a lot, especially if the information you are looking for is recent ("Tell me about the 2024 Super Bowl.") or about less popular topics ("What are some good movies to watch from [insert your favorite foreign director]?"). WikiChat uses Wikipedia and the following 7-stage pipeline to makes sure its responses are factual. Each numbered stage involves one or more LLM calls.

WikiChat Pipeline

Check out our paper for more details: Sina J. Semnani, Violet Z. Yao*, Heidi C. Zhang*, and Monica S. Lam. 2023. WikiChat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on Wikipedia. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore. Association for Computational Linguistics.

🚨 Announcements

  • (August 22, 2024) WikiChat 2.0 is now available! Key updates include:

    • Multilingual Support: By default, retrieves information from 10 different Wikipedias: 🇺🇸 English, 🇨🇳 Chinese, 🇪🇸 Spanish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇩🇪 German, 🇮🇷 Farsi, 🇯🇵 Japanese, 🇫🇷 French, and 🇮🇹 Italian.

    • Improved Information Retrieval

      • Now supports retrieval from structured data such as tables, infoboxes, and lists, in addition to text.
      • Has the highest quality public Wikipedia preprocessing scripts
      • Uses the state-of-the-art multilingual retrieval model BGE-M3.
      • Uses Qdrant for scalable vector search.
      • Uses RankGPT to rerank search results.
    • Free Multilingual Wikipedia Search API: We offer a high-quality, free (but rate-limited) search API for access to 10 Wikipedias, encompassing over 180M vector embeddings.

    • Expanded LLM Compatibility: Supports 100+ LLMs through a unified interface, thanks to LiteLLM.

    • Optimized Pipeline: Option for a faster and more cost-effective pipeline by merging the "generate" and "extract claim" stages of WikiChat.

    • LangChain Compatibility: Fully compatible with LangChain 🦜️🔗.

    • And Much More!

  • (June 20, 2024) WikiChat won the 2024 Wikimedia Research Award!

    The @Wikimedia Research Award of the Year 2024 goes to "WikiChat: Stopping the hallucination of large language model chatbots by few-shot grounding on Wikipedia" ⚡

    📜 https://t.co/d2M8Qrarkw pic.twitter.com/P2Sh47vkyi

    — Wiki Workshop 2024 (@wikiworkshop) June 20, 2024
  • (May 16, 2024) Our follow-up paper "🍝 SPAGHETTI: Open-Domain Question Answering from Heterogeneous Data Sources with Retrieval and Semantic Parsing" is accepted to the Findings of ACL 2024. This paper adds support for structured data like tables, infoboxes and lists.

  • (January 8, 2024) Distilled LLaMA-2 models are released. You can run these models locally for a cheaper and faster alternative to paid APIs.

  • (December 8, 2023) We present our work at EMNLP 2023.

  • (October 27, 2023) The camera-ready version of our paper is now available on arXiv.

  • (October 06, 2023) Our paper is accepted to the Findings of EMNLP 2023.

Installation

Installing WikiChat involves the following steps:

  1. Install dependencies
  2. Configure the LLM of your choice. WikiChat supports over 100 LLMs, including models from OpenAI, Azure, Anthropic, Mistral, HuggingFace, Together.ai, and Groq.
  3. Select an information retrieval source. This can be any HTTP endpoint that conforms to the interface defined in retrieval/retriever_server.py. We provide instructions and scripts for the following options:
    1. Use our free, rate-limited API for Wikipedia in 10 languages.
    2. Download and host our provided Wikipedia index yourself.
    3. Create and run a new custom index from your own documents.
  4. Run WikiChat with your desired configuration.
  5. [Optional] Deploy WikiChat for multi-user access. We provide code to deploy a simple front-end and backend, as well as instructions to connect to an Azure Cosmos DB database for storing conversations.

System Requirements

This project has been tested with Python 3.10 on Ubuntu 20.04 LTS (Focal Fossa), but it should be compatible with many other Linux distributions. If you plan to use this on Windows WSL or macOS, or with a different Python version, be prepared for potential troubleshooting during installation.

Hardware requirements vary based on your intended use:

  1. Basic Usage: Running WikiChat with LLM APIs and our Wikipedia search API has minimal hardware requirements and should work on most systems.

  2. Local Search Index: If you intend to host a search index locally, ensure you have sufficient disk space for the index. For large indices, retrieval latency is heavily dependant on disk speed, so we recommend using SSDs and preferably NVMe drives. For example, storage-optimized VMs like Standard_L8s_v3 on Azure are suitable for this.

  3. Local LLM: If you plan to use WikiChat with a local LLM, a GPU is necessary to host the model.

  4. Creating a New Retrieval Index: If you want to index a collection, you need a GPU to embed documents to vectors. The default embedding model (BAAI/BGE-M3) requires at least 13GB of GPU memory to run.

Install Dependencies

First, clone the repository:

git clone https://github.com/stanford-oval/WikiChat.git
cd WikiChat

We recommend using the conda environment specified in conda_env.yaml. This environment includes Python 3.10, pip, gcc, g++, make, Redis, and all required Python packages.

Ensure you have either Conda, Anaconda, or Miniconda installed. Then create and activate the conda environment:

conda env create --file conda_env.yaml
conda activate wikichat
python -m spacy download en_core_web_sm  # Spacy is only needed for certain WikiChat configurations

If you see Error: Redis lookup failed after running the chatbot, it probably means Redis is not properly installed. You can try reinstalling it by following its official documentation.

Keep this environment activated for all subsequent commands.

Install Docker for your operating system by following the instructions at https://docs.docker.com/engine/install/. WikiChat uses Docker primarily for creating and serving vector databases for retrieval, specifically 🤗 Text Embedding Inference and Qdrant. On recent Ubuntu versions, you can try running inv install-docker. For other operating systems, follow the instructions on the docker website.

WikiChat uses invoke to add custom commands for various purposes. To see all available commands and their descriptions, run:

invoke --list

or the shorthand:

inv -l

For more details about a specific command, use:

inv [command name] --help

These commands are implemented in the tasks/ folder.

Configure the LLM of Your Choice

WikiChat is compatible with various LLMs, including models from OpenAI, Azure, Anthropic, Mistral, Together.ai, and Groq. You can also use WikiChat with many locally hosted models via HuggingFace.

To configure your LLM:

  1. Fill out the appropriate fields in llm_config.yaml.

  2. Create a file named API_KEYS (which is included in .gitignore).

  3. In the API_KEYS file, set the API key for the LLM endpoint you want to use. The name of the API key should match the name you provided in llm_config.yaml under api_key. For example, if you're using OpenAI models via openai.com and Mistral endpoints, your API_KEYS file might look like this:

# Fill in the following values with your API keys. Make sure there is not extra space after the key.
# Changes to this file are ignored by git, so you can safely store your keys here during development.
OPENAI_API_KEY=[Your OpenAI API key from https://platform.openai.com/api-keys]
MISTRAL_API_KEY=[Your Mistral API key from https://console.mistral.ai/api-keys/]

Note that locally hosted models do NOT need an API key, but you need to provide an OpenAI-compatible endpoint in api_base. The code has been tested with 🤗 Text Generation Inference.

Configure Information Retrieval

Option 1 (Default): Use our free rate-limited Wikipedia search API

By default, WikiChat retrieves information from 10 Wikipedias via the endpoint at https://wikichat.genie.stanford.edu/search/. If you want to just try WikiChat, you do not need to modify anything.

Option 2: Download and host our Wikipedia index

  1. Download the August 1, 2024 index of 10 Wikipedia languages from 🤗 Hub and extract it:
inv download-wikipedia-index --workdir ./workdir

Note that this index contains ~180M vector embeddings and therefore requires at least 500 GB of empty disk space. It uses Qdrant's binary quantization to reduce RAM requirements to 55 GB without sacrificing accuracy or latency.

  1. Start a FastAPI server similar to option 1 that responds to HTTP POST requests:
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
  1. Start WikiChat by passing in the URL of this retriever. For example:
inv demo --retriever-endpoint "http://0.0.0.0:<port number>/search"

Note that this server and its embedding model run on CPU, and do not require GPU. For better performance, on compatible systems you can add --use-onnx to use the ONNX version of the embedding model, to significantly lower the embedding latency.

Option 3: Build your own index

To build a Wikipedia index

The following command will download, preprocess, and index the latest HTML dump of the Kurdish Wikipedia, which we use in this example for its relatively small size.

inv index-wikipedia-dump  --embedding-model BAAI/bge-m3 --workdir ./workdir --language ku

To index custom documents

  1. Preprocess your data into a JSON Lines file (with .jsonl or .jsonl.gz file extension) where each line has the following fields:
{"id": "integer", "content_string": "string", "article_title": "string", "full_section_title": "string", "block_type": "string", "language": "string", "last_edit_date": "string (optional)", "num_tokens": "integer (optional)"}

content_string should be the chunked text of your documents. We recommend chunking to less than 500 tokens of the embedding model's tokenizer. See this for an overview on chunking methods. block_type and language are only used to provide filtering on search results. If you do not need them, you can simply set them to block_type=text and language=en. The script will feed full_section_title and content_string to the embedding model to create embedding vectors.

See preprocessing/preprocess_wikipedia_html_dump.py for details on how this is implemented for Wikipedia HTML dumps.

  1. Run the indexing command:
inv index-collection --collection-path <path to preprocessed JSONL> --collection-name <name>

This command starts docker containers for 🤗 Text Embedding Inference (one per available GPU). By default, it uses the docker image compatible with NVIDIA GPUs with Ampere 80 architecture, e.g. A100. Support for some other GPUs is also available, but you would need to choose the right docker image from available docker images.

  1. (Optional) Add a payload index
python retrieval/add_payload_index.py

This will enable queries that filter on language or block_type. Note that for large indices, it might take several minutes for the index to become available again.

  1. After indexing, load and use the index as in option 2. For example:
inv start-retriever --embedding-model BAAI/bge-m3 --retriever-port <port number>
curl -X POST 0.0.0.0:5100/search -H "Content-Type: application/json" -d '{"query": ["What is GPT-4?", "What is LLaMA-3?"], "num_blocks": 3}'
  1. Start WikiChat by passing in the URL of this retriever. For example:
inv demo --retriever-endpoint "http://0.0.0.0:<port number>/search"

To upload a Qdrant index to 🤗 Hub:

  1. Split the index into smaller parts:
tar -cvf - <path to the Qdrant index folder> | pigz -p 14 | split --bytes=10GB --numeric-suffixes=0 --suffix-length=4 - <path to the output folder>/qdrant_index.tar.gz.part-
  1. Upload the resulting parts:
python retrieval/upload_folder_to_hf_hub.py --folder_path <path to the output folder> --repo_id <Repo ID on 🤗 Hub>

Run WikiChat in Terminal

You can run different configurations of WikiChat using commands like these:

inv demo --engine gpt-4o # engine can be any value configured in llm_config, for example, mistral-large, claude-sonnet-35, local
inv demo --pipeline generate_and_correct # available pipelines are early_combine, generate_and_correct and retrieve_and_generate
inv demo --temperature 0.9 # changes the temperature of the user-facing stages like refinement

For a full list of all available options, you can run inv demo --help

[Optional] Deploy WikiChat for Multi-user Access

This repository provides code to deploy a web-based chat interface via Chainlit, and store user conversations to a Cosmos DB database. These are implemented in backend_server.py and database.py respectively. If you want to use other databases or front-ends, you need to modify these files. For development, it should be straightforward to remove the dependency on Cosmos DB and simply store conversations in memory. You can also configure chatbot parameters defined in backend_server.py, for example to use a different LLM or add/remove stages of WikiChat.

Set up Cosmos DB

After creating an instance via Azure, obtain the connection string and add this value in API_KEYS.

COSMOS_CONNECTION_STRING=[Your Cosmos DB connection string]

Run Chainlit

Running this will start the backend and front-end servers. You can then access the front-end at the specified port (5001 by default). inv chainlit --backend-port 5001

The Free Rate-limited Wikipedia Search API

You can use this API endpoint for prototyping high-quality RAG systems. See https://wikichat.genie.stanford.edu/search/redoc for the full specification.

Note that we do not provide any guarantees about this endpoint, and it is not suitable for production.

Wikipedia Preprocessing: Why is it Difficult?

(Coming soon...)

We publicly release preprocessed Wikipedia in 10 languages.

Other Commands

Run a Distilled Model for Lower Latency and Cost

WikiChat 2.0 is not compatible with fine-tuned LLaMA-2 checkpoints released. Please refer to v1.0 for now.

Simulate Conversations

To evaluate a chatbot, you can simulate conversations using a user simulator. The subset parameter can be one of head, tail, or recent, corresponding to the three subsets introduced in the WikiChat paper. You can also specify the language of the user (WikiChat always replies in the user's language). This script reads the topic (i.e., a Wikipedia title and article) from the corresponding benchmark/topics/{subset}_articles_{language}.json file. Use --num-dialogues to set the number of simulated dialogues to generate, and --num-turns to specify the number of turns in each dialogue.

inv simulate-users --num-dialogues 1 --num-turns 2 --simulation-mode passage --language en --subset head

Depending on the engine you are using, this might take some time. The simulated dialogues and log files will be saved in benchmark/simulated_dialogues/. You can also provide any of the pipeline parameters from above. You can experiment with different user characteristics by modifying user_characteristics in benchmark/user_simulator.py.

License

WikiChat code, and models and data are released under Apache-2.0 license.

Citation

If you have used code or data from this repository, please cite the following papers:

@inproceedings{semnani-etal-2023-wikichat,
    title = "{W}iki{C}hat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on {W}ikipedia",
    author = "Semnani, Sina  and
      Yao, Violet  and
      Zhang, Heidi  and
      Lam, Monica",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.157",
    pages = "2387--2413",
}

@inproceedings{zhang-etal-2024-spaghetti,
    title = "{SPAGHETTI}: Open-Domain Question Answering from Heterogeneous Data Sources with Retrieval and Semantic Parsing",
    author = "Zhang, Heidi  and
      Semnani, Sina  and
      Ghassemi, Farhad  and
      Xu, Jialiang  and
      Liu, Shicheng  and
      Lam, Monica",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.96",
    pages = "1663--1678",
}