Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

This repository contains the dataset and code of the paper:

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

COLING 2025

Datasets

Our evaluation data are released in the data folder. These files are processed versions of data that are publicly available online.

Code

All of the code used to run our evaluations is provided in data. Both the WPQ and the Local Order Quiz are run using this, while the Token Overlap method is run here, the Canonical Order is run here, and Min-K% is run here.
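For orientation, the idea behind Min-K% can be sketched as follows. This is a minimal illustration, not the code in this repository: the function name `min_k_percent_score`, its parameters, and the default value of `k` are assumptions, and the per-token log-probabilities are assumed to have been obtained from a language model beforehand. The score is the mean log-probability of the fraction `k` of tokens that the model finds least likely.

```python
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    Hypothetical helper for illustration only; `token_logprobs` holds one
    log-probability per token of the text being checked.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in the interval (0, 1]")

    # Number of tokens to keep; always keep at least one.
    n = max(1, int(len(token_logprobs) * k))

    # Sort ascending so the least likely tokens come first.
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


if __name__ == "__main__":
    # Made-up log-probabilities, not real model output.
    example = [-1.0, -3.0, -0.5, -5.0]
    print(min_k_percent_score(example, k=0.5))
```

A higher score (closer to zero) is usually read as a hint that the text may have appeared in the training data, but any decision threshold has to be chosen per model and dataset.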

Below is an example of setting up the environment.

Setup

# Environment setup
conda create -n contamination python=3.9 -y
conda activate contamination

# Install dependencies
pip install -r requirements.txt
