vsamuel2003/data-contamination

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

Task

This repository contains the dataset and code of the paper:

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

COLING 2025

Datasets

Our evaluation data are released in the data folder. These files are processed versions of data that are publicly available online.

Code

All of the code used to run our evaluations is provided in data. Both the WPQ and the Local Order Quiz are run using this, the Token Overlap method is run here, the Canonical Order is run here, and Min-K% is run here.
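For readers unfamiliar with Min-K%, the following is a minimal sketch of the scoring idea only, not the code in this repository: given per-token log-probabilities for a text, average the k fraction of tokens with the lowest log-probability. The function name `min_k_percent_score`, the default `k=0.2`, and the example values are assumptions made for illustration; in practice the log-probabilities would come from the model under evaluation.

```python
from typing import Sequence


def min_k_percent_score(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of least likely tokens.

    Hypothetical helper for illustration; not taken from this repository.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so that short texts still produce a score.
    n_select = max(1, int(len(token_logprobs) * k))
    # Sorting ascending puts the least likely tokens first.
    lowest = sorted(token_logprobs)[:n_select]
    return sum(lowest) / n_select


# Made-up log-probabilities for a 10-token text (illustrative only).
example_logprobs = [-0.1, -2.5, -0.3, -4.0, -0.2, -1.0, -0.05, -0.4, -3.0, -0.6]
score = min_k_percent_score(example_logprobs, k=0.2)
print(score)
```

A higher (less negative) score means even the least likely tokens were fairly predictable to the model, which Min-K% style methods treat as a signal that the text may have been seen during training. Any decision threshold would need to be calibrated separately.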

Here we provide an example of setting up the environment:

Setup

# Environment setup
conda create -n contamination python=3.9 -y
conda activate contamination

# Install dependencies
pip install -r requirements.txt
