RealCritic is a novel benchmark for evaluating language models' critique capabilities through a closed-loop, effectiveness-driven approach. Unlike traditional open-loop evaluations, which judge critique text in isolation, RealCritic measures critique quality by how much a critique improves the solution it is applied to.
- Closed-loop Evaluation: Measures critique quality through the improvement it yields in refined solutions rather than through isolated judgments (see the sketch after this list)
- Comprehensive Assessment: First benchmark to evaluate self-critique, cross-critique, and iterative critique capabilities
- Novel Findings: Reveals significant gaps between reasoning-based and traditional LLMs in critique abilities
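The closed-loop protocol can be summarized in a few lines. The sketch below is illustrative only; the helper functions `solve`, `critique_and_refine`, and `is_correct` are hypothetical stand-ins for the benchmark's generation and answer-checking steps, not functions from this repository.

```python
# Minimal sketch of closed-loop, effectiveness-driven critique scoring:
# a critique counts as good only if the refined solution it produces
# is correct more often than the original solution.

def closed_loop_score(problems, solve, critique_and_refine, is_correct):
    """Return (baseline accuracy, post-critique accuracy) over the problems."""
    base_hits, refined_hits = 0, 0
    for problem in problems:
        solution = solve(problem)                         # initial attempt
        refined = critique_and_refine(problem, solution)  # critique + correction
        base_hits += is_correct(problem, solution)
        refined_hits += is_correct(problem, refined)
    n = len(problems)
    return base_hits / n, refined_hits / n
```

The gap between the two accuracies is what the benchmark cares about: a critique that does not change (or that degrades) the final answer scores poorly, no matter how plausible it reads.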
- ARC Challenge: Multiple choice science questions
- GSM8K: Grade school math problems
- MATH: Advanced mathematics
- College Math: Undergraduate mathematics
- Minerva Math: Scientific reasoning
- GPQA: Graduate-level science questions
- MMLU STEM: Science/tech/engineering/math topics
- Olympiad Bench: Competition-level problems
- Direct CoT: Baseline problem-solving ability
- Self-Critique: Model critiques its own solutions
- Cross-Critique: Model critiques other models' solutions
- Iterative Critique: Multi-round critique-and-refinement process (sketched below)
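The iterative mode repeats the critique-and-correct step for several rounds. A minimal sketch, again with hypothetical `critique` and `correct` helpers rather than the repository's actual API:

```python
# Minimal sketch of multi-round (iterative) critique: each round critiques
# the current solution, then produces a corrected solution conditioned on
# that critique.

def iterative_refine(problem, initial_solution, critique, correct, max_rounds=3):
    solution = initial_solution
    history = []
    for _ in range(max_rounds):
        feedback = critique(problem, solution)           # critique current answer
        solution = correct(problem, solution, feedback)  # refine using the critique
        history.append((feedback, solution))
    return solution, history
```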
Please follow the installation instructions in the Qwen2.5-Math repository.
#!/bin/bash
# Cross-critique evaluation with a locally hosted model.
PROMPT_TYPE="critic_and_correct"        # system prompt key, defined in utils.py
USER_PROMPT_TYPE="critic_and_correct"   # user prompt key, defined in utils.py
MODEL_NAME_OR_PATH="path/to/your/model"
DATA_DIR="data/mix_data"
DATASETS="arc-challenge,college_math,gpqa,gsm8k,math,minerva_math,mmlu_stem,olympiadbench"
MULTI_TURN=1                            # multi-turn setting passed to the critique scripts
export CUDA_VISIBLE_DEVICES="0,1"

bash sh/local/run_cross_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN

# Other modes:
# bash sh/local/run_self_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# bash sh/local/run_direct_cot.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS
#!/bin/bash
# Direct chain-of-thought evaluation with an API-served model.
PROMPT_TYPE="direct-cot"        # system prompt key, defined in utils.py
USER_PROMPT_TYPE="default-cot"  # user prompt key, defined in utils.py
MODEL_NAME_OR_PATH="gpt-4o"
DATA_DIR="data/mix_data"
DATASETS="arc-challenge,college_math,gpqa,gsm8k,math,minerva_math,mmlu_stem,olympiadbench"
MULTI_TURN=1                    # multi-turn setting passed to the critique scripts
export DASHSCOPE_API_KEY="your_api_key"
export DASHSCOPE_BASE_URL="https://api.dashscope.com/v1"

bash sh/run_direct_cot.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS

# Other modes:
# bash sh/run_cross_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# bash sh/run_self_critic.sh $PROMPT_TYPE $USER_PROMPT_TYPE $MODEL_NAME_OR_PATH $DATA_DIR $DATASETS $MULTI_TURN
# DATA_NAME is the dataset to evaluate (see the dataset list above).
python3 -u model_eval_critic.py \
    --data_dir path/for/critic/math_eval \
    --model_name_or_path /post/check/model \
    --output_dir /output/dir/ \
    --data_name ${DATA_NAME} \
    --seed 0 \
    --temperature 0 \
    --top_p 1
All prompts are defined in utils.py:
# System prompts
SYSTEM_PROMPT_TEMPLATES = {
    "direct-cot": "...",
    "critic_and_correct": "...",  # must include the "critic" keyword
}

# User prompts
USER_PROMPT_TEMPLATE = {
    "default-cot": {
        "type": "cot",
        "template": "..."
    },
    "critic_and_correct": {
        "type": "critic",
        "template": "..."
    }
}
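A new prompt pair can be registered by extending the same dictionaries. The sketch below is illustrative: the entry name `my_critic_and_correct` and the system prompt wording are placeholders, and it assumes (per the comment above) that critique-style entries need the "critic" keyword.

```python
# In utils.py: register a custom prompt pair (name and wording are illustrative).
SYSTEM_PROMPT_TEMPLATES["my_critic_and_correct"] = (
    "You are a careful reviewer. Critique the given solution step by step, "
    "then provide a corrected solution."
)

USER_PROMPT_TEMPLATE["my_critic_and_correct"] = {
    "type": "critic",   # same schema as the built-in entries above
    "template": "...",  # fill in with your instruction text
}
```

The new name can then be passed as PROMPT_TYPE / USER_PROMPT_TYPE in the run scripts shown earlier.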
The code in this repository is built upon Qwen2.5-Math. We thank the Qwen team for their excellent work and open-source contributions.
If you find this work useful, please cite:
@article{tang2025realcritic,
  title={RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques},
  author={Tang, Zhengyang and Li, Ziniu and Xiao, Zhenyang and Ding, Tian and Sun, Ruoyu and Wang, Benyou and Liu, Dayiheng and Huang, Fei and Liu, Tianyu and Yu, Bowen and Lin, Junyang},
  journal={arXiv preprint arXiv:2501.14492},
  year={2025}
}