metabench

A Sparse Benchmark to Measure General Ability in LLMs

🤗 metabench distills the Open LLM Leaderboard[1] to less than 3% of its original size
🧑‍🏫 item selection is based on item response theory analyses of over 5000 LLMs
🔄 scores for the six benchmarks[1] can be reconstructed with <0.9% mean absolute error on average
☄️ the Open LLM Leaderboard score can be reconstructed with <0.5% mean absolute error

This repo contains the source code for dataset scraping in Python and statistical analysis in R.
For details, please read our preprint.

Setup

The R programming language is required for running metabench. Once R is installed, you can set up all R dependencies by running this line in the repository's root directory:

Rscript setup.R
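
For reference, a bootstrap script of this kind usually just installs any packages that are missing. The lines below are an illustrative sketch only; the package names are placeholders, and the authoritative dependency list lives in setup.R itself:

# Illustrative sketch only -- the real dependency list is defined in setup.R.
required <- c("mirt", "mgcv", "data.table")  # placeholder package names
missing  <- setdiff(required, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing, repos = "https://cloud.r-project.org")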

Testing your LLM with metabench

Step 1 - evaluate your LLM using the lm-evaluation-harness

lm-eval --model hf \
    --model_args pretrained={model_id} \
    --tasks metabench{version}{permute} \
    --output_path path/to/metabench/harness-results \
    --log_samples # this saves the instance-level results as a JSONL file
  • {model_id} is your Hugging Face model ID,
  • {version} is "" for the main version or "_secondary" for the repeated evaluation version,
  • {permute} is "" for the unpermuted responses and "_permute" for the permuted responses.
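
For example, a run on the main, unpermuted version with an illustrative model ID (replace it with your own) would look like this:

lm-eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks metabench \
    --output_path path/to/metabench/harness-results \
    --log_samples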

Step 2 - reconstruct the full points

Rscript reconstruct.R {model_id} {ver} {per}
  • {ver} is "A" for the main version and "B" for the repeated evaluation version,
  • {per} is "False" for the unpermuted responses and "True" for the permuted responses.
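
Continuing the illustrative example above (main version, unpermuted responses):

Rscript reconstruct.R meta-llama/Llama-2-7b-hf A False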

How metabench was constructed

  1. Collect item-wise accuracies from all available LLMs for each benchmark on the Open LLM Leaderboard.
  2. Remove items based on simple statistics such as variance.
  3. Perform cross-validated subsampling to 350 items per benchmark.
  4. Fit variants of IRT models to the remaining items, infer item information from the item parameters, and select the most informative items to construct metabench (see the sketch after this list).
  5. Use the model fits to estimate the benchmark-specific abilities and reconstruct the original (normalized) benchmark scores, as well as their mean, using a generalized additive model with cross-validation.
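
The sketch below illustrates steps 4 and 5 for a single benchmark, assuming the mirt and mgcv R packages and simulated placeholder data; it is not the pipeline used in analysis/, just a minimal picture of the technique (2PL fit, item-information ranking, ability-to-score GAM):

library(mirt)   # IRT model fitting
library(mgcv)   # generalized additive models

# Placeholder data: simulate 500 "LLMs" answering 20 items of varying difficulty.
set.seed(1)
theta_true <- rnorm(500)
difficulty <- seq(-2, 2, length.out = 20)
p_correct  <- plogis(outer(theta_true, difficulty, "-"))
responses  <- matrix(rbinom(length(p_correct), 1, p_correct), nrow = 500)

# Step 4 (sketch): fit a 2PL model, compute each item's information over an
# ability grid, and keep the most informative items.
fit <- mirt(as.data.frame(responses), 1, itemtype = "2PL", verbose = FALSE)
theta_grid <- seq(-4, 4, length.out = 101)
info <- sapply(seq_len(ncol(responses)), function(j)
  sum(iteminfo(extract.item(fit, j), theta_grid)))
keep <- order(info, decreasing = TRUE)[1:10]   # e.g. retain the 10 most informative items

# Step 5 (sketch): estimate abilities from the reduced item set and map them
# back to the original normalized benchmark score with a GAM.
fit_small <- mirt(as.data.frame(responses[, keep]), 1, itemtype = "2PL", verbose = FALSE)
theta_hat <- fscores(fit_small)[, 1]
score     <- rowMeans(responses)               # normalized full-benchmark score
gam_fit   <- gam(score ~ s(theta_hat))
mean(abs(fitted(gam_fit) - score))             # in-sample mean absolute error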

Data

If you wish to reproduce our results, you can find the complete datasets used in this project on Zenodo.
Simply download and extract data.tar.gz to data inside your .../metabench/ directory.
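
Assuming the archive has been downloaded into the repository root, extraction can also be done from R; whether exdir should be "data" or "." depends on whether the archive already contains a top-level data/ folder:

# Extract the downloaded archive; adjust exdir if data.tar.gz already
# contains a top-level data/ directory.
untar("data.tar.gz", exdir = "data")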

Folders

  • analysis: Statistical analyses (preprocessing, cross-validated random sampling, item response theory, information filtering, factor analysis, computerized adaptive testing simulations)
  • bash: Templates for running scripts on a compute cluster with Slurm
  • figures: Scripts for generating the figures shown in the paper
  • scraping: Scripts for downloading and processing publicly available item-wise responses by LLMs

Citing the Project

To cite metabench in publications:

@article{metabench,
  author  = {Alex Kipnis and Konstantinos Voudouris and Luca M. Schulze Buschoff and Eric Schulz},
  title   = {metabench - A Sparse Benchmark to Measure General Ability in Large Language Models},
  journal = {arXiv preprint arXiv:2407.12844},
  year    = {2024},
}

Footnotes

  1. ARC, GSM8K, HellaSwag, MMLU, TruthfulQA and WinoGrande
