metabench

A Sparse Benchmark to Measure General Ability in LLMs

🤗 metabench distills the Open LLM Leaderboard[1] to less than 3% of its original size
🧑‍🏫 item selection is based on item response theory analyses of over 5000 LLMs
🔄 scores for the six benchmarks[1] can be reconstructed with <0.9% mean absolute error on average
☄️ the Open LLM Leaderboard score can be reconstructed with <0.5% mean absolute error

This repo contains the source code for dataset scraping in Python and statistical analysis in R.
For details, please read our preprint.

Setup

The R programming language is required for running metabench. Once R is installed, you can set up all R dependencies by running this line in the repository's root directory:

Rscript setup.R
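
For reference, a bootstrap script of this kind usually just installs any packages that are missing. The lines below are an illustrative sketch only; the package names are placeholders, and the authoritative dependency list lives in setup.R itself:

# Illustrative sketch only -- the real dependency list is defined in setup.R.
required <- c("mirt", "mgcv", "data.table")  # placeholder package names
missing  <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing, repos = "https://cloud.r-project.org")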

Testing your LLM with metabench

Step 1 - evaluate your LLM using the lm-evaluation-harness

lm-eval --model hf \
    --model_args pretrained={model_id} \
    --tasks metabench{version}{permute} \
    --output_path path/to/metabench/harness-results \
    --log_samples # this saves the instance-level results as a JSONL file
  • {model_id} is your Hugging Face model ID,
  • {version} is "" for the main version or "_secondary" for the repeated evaluation version,
  • {permute} is "" for the unpermuted responses and "_permute" for the permuted responses.
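
For example, a run on the main, unpermuted version with an illustrative model ID (replace it with your own) would look like this:

lm-eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks metabench \
    --output_path path/to/metabench/harness-results \
    --log_samples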

Step 2 - reconstruct the full points

Rscript reconstruct.R {model_id} {ver} {per}
  • {ver} is "A" for the main version and "B" for the repeated evaluation version,
  • {per} is "False" for the unpermuted responses and "True" for the permuted responses.
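
Continuing the illustrative example above (main version, unpermuted responses):

Rscript reconstruct.R meta-llama/Llama-2-7b-hf A False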

How metabench was constructed

  1. Collect item-wise accuracies from all available LLMs for each benchmark on the Open LLM Leaderboard.
  2. Remove items based on simple statistics such as variance.
  3. Perform cross-validated subsampling to 350 items per benchmark.
  4. Fit variants of IRT models to the remaining items, infer item information from the item parameters, and select the most informative items to construct metabench (see the sketch after this list).
  5. Use the model fits to estimate the benchmark-specific abilities and reconstruct the original (normalized) benchmark scores, as well as their mean, using a generalized additive model with cross-validation.
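
The sketch below illustrates steps 4 and 5 for a single benchmark, assuming the mirt and mgcv R packages and simulated placeholder data; it is not the pipeline used in analysis/, just a minimal picture of the technique (2PL fit, item-information ranking, ability-to-score GAM):

library(mirt)   # IRT model fitting
library(mgcv)   # generalized additive models

# Placeholder data: simulate 500 "LLMs" answering 20 items of varying difficulty.
set.seed(1)
theta_true <- rnorm(500)
difficulty <- seq(-2, 2, length.out = 20)
p_correct  <- plogis(outer(theta_true, difficulty, "-"))
responses  <- matrix(rbinom(length(p_correct), 1, p_correct), nrow = 500)

# Step 4 (sketch): fit a 2PL model, compute each item's information over an
# ability grid, and keep the most informative items.
fit <- mirt(as.data.frame(responses), 1, itemtype = "2PL", verbose = FALSE)
theta_grid <- seq(-4, 4, length.out = 101)
info <- sapply(seq_len(ncol(responses)), function(j)
  sum(iteminfo(extract.item(fit, j), theta_grid)))
keep <- order(info, decreasing = TRUE)[1:10]   # e.g. retain the 10 most informative items

# Step 5 (sketch): estimate abilities from the reduced item set and map them
# back to the original normalized benchmark score with a GAM.
fit_small <- mirt(as.data.frame(responses[, keep]), 1, itemtype = "2PL", verbose = FALSE)
theta_hat <- fscores(fit_small)[, 1]
score     <- rowMeans(responses)               # normalized full-benchmark score
gam_fit   <- gam(score ~ s(theta_hat))
mean(abs(fitted(gam_fit) - score))             # in-sample mean absolute error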

Data

If you wish to reproduce our results, you can find the complete datasets used in this project on Zenodo.
Simply download and extract data.tar.gz to data inside your .../metabench/ directory.
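
Assuming the archive has been downloaded into the repository root, extraction can also be done from R; whether exdir should be "data" or "." depends on whether the archive already contains a top-level data/ folder:

# Extract the downloaded archive; adjust exdir if data.tar.gz already
# contains a top-level data/ directory.
untar("data.tar.gz", exdir = "data")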

Folders

  • analysis: Statistical analyses (preprocessing, cross-validated random sampling, item response theory, information filtering, factor analysis, computerized adaptive testing simulations)
  • bash: Templates for running scripts on a compute cluster with Slurm
  • figures: Scripts for generating the figures shown in the paper
  • scraping: Scripts for downloading and processing publicly available item-wise responses by LLMs

Citing the Project

To cite metabench in publications:

@article{metabench,
  author  = {Alex Kipnis and Konstantinos Voudouris and Luca M. Schulze Buschoff and Eric Schulz},
  title   = {metabench - A Sparse Benchmark to Measure General Ability in Large Language Models},
  journal = {arXiv preprint arXiv:2407.12844},
  year    = {2024},
}

Footnotes

  1. ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande
