RealCritic is a novel benchmark for evaluating language models' critique capabilities through a closed-loop, effectiveness-driven approach. Unlike traditional open-loop evaluations, which judge critique text in isolation, RealCritic measures critique quality by how much a critique improves the solution it is applied to.
- Closed-loop Evaluation: Measures critique quality through the improvement it yields in refined solutions rather than through isolated judgments (see the sketch after this list)
- Comprehensive Assessment: First benchmark to evaluate self-critique, cross-critique, and iterative critique capabilities
- Novel Findings: Reveals significant gaps between reasoning-based and traditional LLMs in critique abilities
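The closed-loop protocol can be summarized in a few lines. The sketch below is illustrative only; the helper functions `solve`, `critique_and_refine`, and `is_correct` are hypothetical stand-ins for the benchmark's generation and answer-checking steps, not functions from this repository.

```python
# Minimal sketch of closed-loop, effectiveness-driven critique scoring:
# a critique counts as good only if the refined solution it produces
# is correct more often than the original solution.

def closed_loop_score(problems, solve, critique_and_refine, is_correct):
    """Return (baseline accuracy, post-critique accuracy) over the problems."""
    base_hits, refined_hits = 0, 0
    for problem in problems:
        solution = solve(problem)                         # initial attempt
        refined = critique_and_refine(problem, solution)  # critique + correction
        base_hits += is_correct(problem, solution)
        refined_hits += is_correct(problem, refined)
    n = len(problems)
    return base_hits / n, refined_hits / n
```

The gap between the two accuracies is what the benchmark cares about: a critique that does not change (or that degrades) the final answer scores poorly, no matter how plausible it reads.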
- ARC Challenge: Multiple choice science questions
- GSM8K: Grade school math problems
- MATH: Advanced mathematics
- College Math: Undergraduate mathematics
- Minerva Math: Scientific reasoning
- GPQA: Graduate-level science questions
- MMLU STEM: Science/tech/engineering/math topics
- Olympiad Bench: Competition-level problems
- Direct CoT: Baseline problem-solving ability
- Self-Critique: Model critiques its own solutions
- Cross-Critique: Model critiques other models' solutions
- Iterative Critique: Multi-round critique-and-refinement process (sketched below)
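The iterative mode repeats the critique-and-correct step for several rounds. A minimal sketch, again with hypothetical `critique` and `correct` helpers rather than the repository's actual API:

```python
# Minimal sketch of multi-round (iterative) critique: each round critiques
# the current solution, then produces a corrected solution conditioned on
# that critique.

def iterative_refine(problem, initial_solution, critique, correct, max_rounds=3):
    solution = initial_solution
    history = []
    for _ in range(max_rounds):
        feedback = critique(problem, solution)           # critique current answer
        solution = correct(problem, solution, feedback)  # refine using the critique
        history.append((feedback, solution))
    return solution, history
```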
Please follow the installation instructions in the Qwen2.5-Math repository.
#!/bin/bash
# Cross-critique evaluation with a locally hosted model.
PROMPT_TYPE="critic_and_correct"        # system prompt key, defined in utils.py
USER_PROMPT_TYPE="critic_and_correct"   # user prompt key, defined in utils.py
MODEL_NAME_OR_PATH="path/to/your/model"
DATA_DIR="data/mix_data"
DATASETS="arc-challenge,college_math,gpqa,gsm8k,math,minerva_math,mmlu_stem,olympiadbench"
MULTI_TURN=1                            # multi-turn setting passed to the critique scripts
export CUDA_VISIBLE_DEVICES="0,1"

bash sh/local/run_cross_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN

# Other modes:
# bash sh/local/run_self_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# bash sh/local/run_direct_cot.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS
#!/bin/bash
# Direct chain-of-thought evaluation with an API-served model.
PROMPT_TYPE="direct-cot"        # system prompt key, defined in utils.py
USER_PROMPT_TYPE="default-cot"  # user prompt key, defined in utils.py
MODEL_NAME_OR_PATH="gpt-4o"
DATA_DIR="data/mix_data"
DATASETS="arc-challenge,college_math,gpqa,gsm8k,math,minerva_math,mmlu_stem,olympiadbench"
MULTI_TURN=1                    # multi-turn setting passed to the critique scripts
export DASHSCOPE_API_KEY="your_api_key"
export DASHSCOPE_BASE_URL="https://api.dashscope.com/v1"

bash sh/run_direct_cot.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS

# Other modes:
# bash sh/run_cross_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# bash sh/run_self_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# DATA_NAME is the dataset to evaluate (see the dataset list above).
python3 -u model_eval_critic.py \
    --data_dir path/for/critic/math_eval \
    --model_name_or_path /post/check/model \
    --output_dir /output/dir/ \
    --data_name ${DATA_NAME} \
    --seed 0 \
    --temperature 0 \
    --top_p 1
All prompts are defined in utils.py:
# System prompts
SYSTEM_PROMPT_TEMPLATES = {
    "direct-cot": "...",
    "critic_and_correct": "...",  # must include the "critic" keyword
}

# User prompts
USER_PROMPT_TEMPLATE = {
    "default-cot": {
        "type": "cot",
        "template": "..."
    },
    "critic_and_correct": {
        "type": "critic",
        "template": "..."
    }
}
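A new prompt pair can be registered by extending the same dictionaries. The sketch below is illustrative: the entry name `my_critic_and_correct` and the system prompt wording are placeholders, and it assumes (per the comment above) that critique-style entries need the "critic" keyword.

```python
# In utils.py: register a custom prompt pair (name and wording are illustrative).
SYSTEM_PROMPT_TEMPLATES["my_critic_and_correct"] = (
    "You are a careful reviewer. Critique the given solution step by step, "
    "then provide a corrected solution."
)

USER_PROMPT_TEMPLATE["my_critic_and_correct"] = {
    "type": "critic",   # same schema as the built-in entries above
    "template": "...",  # fill in with your instruction text
}
```

The new name can then be passed as PROMPT_TYPE / USER_PROMPT_TYPE in the run scripts shown earlier.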
The code in this repository is built upon Qwen2.5-Math. We thank the Qwen team for their excellent work and open-source contributions.
If you find this work useful, please cite:
@article{tang2025realcritic,
  title={RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques},
  author={Tang, Zhengyang and Li, Ziniu and Xiao, Zhenyang and Ding, Tian and Sun, Ruoyu and Wang, Benyou and Liu, Dayiheng and Huang, Fei and Liu, Tianyu and Yu, Bowen and Lin, Junyang},
  journal={arXiv preprint arXiv:2501.14492},
  year={2025}
}