Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

Task

This repository contains the dataset and code of the paper:

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

COLING 2025

Datasets

Our evaluation data are released in the data folder. These files are processed versions of data found online.

Code

All of the code used to run our evaluations is provided in data. Both the WPQ and the Local Order Quiz are run using this, the Token Overlap method is run here, the Canonical Order is run here, and Min-K% is run here.
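For readers unfamiliar with Min-K%, the scoring step can be sketched as follows. As the method is commonly described, it averages the log-probabilities of the k% least likely tokens in a text, and a higher average is taken as a hint that the model may have seen the text during training. This is only an illustrative sketch and not the implementation used in this repository: the function name `min_k_percent_score`, the default `k`, and the sample log-probabilities are all assumptions made for the example.

```python
from typing import Sequence


def min_k_percent_score(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Return the mean log-probability of the k fraction of least likely tokens.

    Hypothetical helper for illustration; `token_log_probs` is assumed to hold
    one log-probability per token, as produced by some language model.
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in the interval (0, 1]")

    # Keep at least one token so that short inputs still yield a score.
    n_lowest = max(1, int(len(token_log_probs) * k))

    # Sort ascending so the least likely tokens come first.
    lowest = sorted(token_log_probs)[:n_lowest]
    return sum(lowest) / n_lowest


# Made-up per-token log-probabilities, for demonstration only.
example_log_probs = [-0.1, -2.5, -0.3, -4.0, -0.2, -1.0, -0.05, -0.4, -3.0, -0.6]
score = min_k_percent_score(example_log_probs, k=0.2)
print(score)
```

In practice the score is compared against a threshold chosen on held-out data; the threshold and the choice of `k` are left open here.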

Here we provide an example of setting up the environment:

Setup

# Environment setup
conda create -n contamination python=3.9 -y
conda activate contamination

# Install dependencies
pip install -r requirements.txt