Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

This repository contains the dataset and code of the paper:

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

COLING 2025

Datasets

Our evaluation data are released in the datasets folder. These files are processed versions of data found online.

Code

All of the code used to run our evaluations is provided in the code folder. The WPQ and Local Order Quiz are both run using this, while the Token Overlap method is run here, the Canonical Order method is run here, and Min-K% is run here.
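For readers unfamiliar with Min-K%, the following is a minimal sketch of the scoring idea as it is commonly described: average the log-probabilities of the k% least likely tokens in a text, where a score closer to zero suggests the text is more likely to have been seen in training. This is not the implementation shipped in this repository. The function name `min_k_percent_score`, the default `k=0.2`, and the example log-probabilities are all assumptions made for illustration; in practice the log-probabilities would come from the model under test.

```python
from typing import Sequence


def min_k_percent_score(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Average log-probability of the k fraction of lowest-probability tokens.

    Hypothetical helper for illustration; not taken from this repository.
    """
    if not token_log_probs:
        raise ValueError("token_log_probs must not be empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so the average is always defined.
    n_keep = max(1, int(len(token_log_probs) * k))
    # Sorting ascending puts the least likely tokens first.
    lowest = sorted(token_log_probs)[:n_keep]
    return sum(lowest) / n_keep


# Made-up per-token log-probabilities, for illustration only.
example_log_probs = [-0.1, -2.3, -0.05, -4.0, -0.7, -1.2, -0.3, -3.1, -0.2, -0.9]
score = min_k_percent_score(example_log_probs, k=0.2)
print(f"Min-K% score: {score:.2f}")
```

The score is then compared against a threshold to decide whether a sample is flagged as contaminated. The threshold and the value of k are tuning choices that this sketch does not fix.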

The following is an example of setting up the environment.

Setup

# Environment setup
conda create -n contamination python=3.9 -y
conda activate contamination

# Install dependencies
pip install -r requirements.txt
