- metabench distills the Open LLM Leaderboard[^1] to less than 3% of its original size
- item selection is based on item response theory analyses of over 5000 LLMs
- scores for the six benchmarks[^1] can be reconstructed with, on average, <0.9% mean absolute error
- the Open LLM Leaderboard score can be reconstructed with <0.5% mean absolute error
This repo contains the source code for dataset scraping in Python and statistical analysis in R.
For details, please read our preprint.
The R programming language is required for running metabench. Once R is installed, you can set up all R dependencies by running this line in the current directory:
Rscript setup.R
Step 1 - evaluate your LLM using the lm-evaluation-harness
lm-eval --model hf \
--model_args pretrained={model_id} \
--tasks metabench{version}{permute} \
--output_path path/to/metabench/harness-results \
--log_samples # this saves the instance-level results as a jsonl
`{model_id}` is your Hugging Face model ID,\
`{version}` is `""` for the main version or `"_secondary"` for the repeated evaluation version,\
`{permute}` is `""` for the unpermuted responses and `"_permute"` for the permuted responses.
Step 2 - reconstruct the full scores
Rscript reconstruct.R {model_id} {ver} {per}
`{ver}` is `"A"` for the main version and `"B"` for the repeated evaluation version,\
`{per}` is `"False"` for the unpermuted responses and `"True"` for the permuted responses.
- Collect item-wise accuracies from all available LLMs for each benchmark on the Open LLM Leaderboard.
- Remove items based on simple statistics like variance.
- Perform cross-validated subsampling to 350 items per benchmark.
- Fit variants of IRT models to the remaining items, infer item information from the item parameters and select the most informative items to construct metabench.
- Use the model fits to estimate the benchmark-specific abilities and reconstruct the original (normalized) benchmark scores as well as their mean using a generalized additive model with cross-validation (a minimal sketch of these last two steps follows this list).
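The R sketch below illustrates the core of the last two steps: ranking items by their Fisher information under a fitted IRT model, then mapping the estimated ability back to a benchmark score with a GAM. It assumes the `mirt` and `mgcv` packages and hypothetical objects (`responses`, `full_scores`, `k`); it is a simplified illustration, not the project's actual pipeline.

```r
# Illustrative sketch only: `responses` is assumed to be a binary matrix
# (LLMs x items) for one benchmark, `full_scores` the original normalized
# benchmark scores, and `k` the number of items to keep.
library(mirt)   # item response theory models
library(mgcv)   # generalized additive models

select_items <- function(responses, k = 100) {
  # Fit a unidimensional 2PL model to the item responses.
  fit <- mirt(responses, model = 1, itemtype = "2PL", verbose = FALSE)

  # Evaluate each item's information on an ability grid and rank items
  # by their total information.
  theta_grid <- matrix(seq(-4, 4, length.out = 101))
  info <- sapply(seq_len(ncol(responses)), function(j) {
    sum(iteminfo(extract.item(fit, j), Theta = theta_grid))
  })
  order(info, decreasing = TRUE)[seq_len(k)]
}

reconstruct_score <- function(responses, selected, full_scores) {
  # Refit the IRT model on the selected items only and estimate abilities.
  fit <- mirt(responses[, selected], model = 1, itemtype = "2PL", verbose = FALSE)
  theta <- fscores(fit, method = "EAP")[, 1]

  # Map the estimated ability to the original normalized benchmark score
  # with a GAM (shown here without the cross-validation loop).
  df <- data.frame(score = full_scores, theta = theta)
  gam(score ~ s(theta), data = df)
}
```

In the full analysis, the cross-validated subsampling, the comparison of IRT model variants, and the cross-validated GAM described above wrap around these two steps; see the `analysis` directory for the complete code.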
If you wish to reproduce our results, please find the complete datasets used in this project on Zenodo.
Simply download and extract `data.tar.gz` to `data` inside your `.../metabench/` directory.
- analysis: Statistical analyses (preprocessing, cross-validated random sampling, item response theory, information filtering, factor analysis, computerized adaptive testing simulations)
- bash: Templates for running scripts on a compute cluster with Slurm
- figures: Scripts for generating the figures shown in the paper
- scraping: Scripts for downloading and processing publicly available item-wise responses by LLMs
To cite metabench in publications:
@article{metabench,
author = {Alex Kipnis and Konstantinos Voudouris and Luca M. Schulze Buschoff and Eric Schulz},
title = {metabench - A Sparse Benchmark to Measure General Ability in Large Language Models},
journal = {arXiv preprint arXiv:2407.12844},
year = {2024},
}
Footnotes
[^1]: ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande