This repo includes the papers discussed in our survey paper A Survey on LLM-as-a-Judge.
Feel free to cite it if you find our survey useful for your research:
@article{gu2024surveyllmasajudge,
  title   = {A Survey on LLM-as-a-Judge},
  author  = {Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Yuanzhuo Wang and Jian Guo},
  year    = {2024},
  journal = {arXiv preprint arXiv:2411.15594}
}
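If it helps, here is a minimal LaTeX usage sketch; the file name `references.bib` is an assumption, and the BibTeX entry above is assumed to be saved in it:

```latex
\documentclass{article}
\begin{document}
LLM-as-a-Judge has recently been surveyed in depth~\cite{gu2024surveyllmasajudge}.

\bibliographystyle{plain}
\bibliography{references} % assumes the BibTeX entry above lives in references.bib
\end{document}
```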
-
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Preprint
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes [Paper] [Code], 2024.07
-
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Preprint
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang [Paper]
-
Argument Quality Assessment in the Age of Instruction-Following Large Language Models
COLING 2024
Henning Wachsmuth, Gabriella Lapesa, Elena Cabrio, Anne Lauscher, Joonsuk Park, Eva Maria Vecchi, Serena Villata, Timon Ziegenbein [Paper]
-
A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators
AAAI 2024
Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li [Paper] [Code], 2024.01
-
Large Language Models Cannot Self-Correct Reasoning Yet
ICLR 2024
Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou [Paper], 2024.05
-
Large Language Models are not Fair Evaluators
ACL 2024
Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, Zhifang Sui [Paper] [Code], 2023.08
-
Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models
ACL 2024
Abhishek Kumar, Sarfaroz Yunusov, Ali Emami [Paper] [Code], 2024.06
-
Are LLM-based Evaluators Confusing NLG Quality Criteria?
ACL 2024
Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan [Paper] [Code], 2024.06
-
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
ACL 2024 Findings
Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki [Paper], 2024.05
-
Can Large Language Models Be an Alternative to Human Evaluations?
ACL 2023
Cheng-Han Chiang, Hung-yi Lee [Paper], 2023.05
-
Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks
EMNLP 2023
Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan [Paper] [Code], 2023.10
-
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
NewSumm @ EMNLP 2023
Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou [Paper] [Code], 2023.10
-
Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency?
NAACL 2024 Findings
Nathan Brake, Thomas Schaaf [Paper], 2024.04
-
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks
COLING 2024
-
Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study
Preprint
Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu [Paper] [Code], 2023.09
-
Humans or LLMs as the Judge? A Study on Judgement Biases
EMNLP 2024
Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang [Paper], 2024.06
-
On the Limitations of Fine-tuned Judge Models for LLM Evaluation
Preprint
Hui Huang, Yingqi Qu, Hongli Zhou, Jing Liu, Muyun Yang, Bing Xu, Tiejun Zhao [Paper] [Code], 2024.06
-
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
EMNLP 2024
Vyas Raina, Adian Liusie, Mark Gales [Paper] [Code], 2024.07
-
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Preprint
Hui Wei, Shenghua He, Tian Xia, Andy Wong, Jingyang Lin, Mei Han [Paper] [Code], 2024.08
-
On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
ICLR 2024 (oral)
Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu [Paper] [Code], 2023.12
-
Generative Judge for Evaluating Alignment
ICLR 2024
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu [Paper] [Code], 2023.12
-
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization
ICLR 2024
Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang [Paper] [Code], 2024.05
-
Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph
ICLR 2024
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, Jian Guo [Paper] [Code], 2024.05
-
HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition
ACL 2024
Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang [Paper], 2024.02
-
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation
ACL 2024
Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng [Paper] [Code], 2024.06
-
FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model
ACL 2024
Yebin Lee, Imseong Park, Myungjoo Kang [Paper] [Code], 2024.06
-
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
ACL 2024
Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang [Paper] [Code], 2024.06
-
ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
ACL 2024
Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, Linqi Song [Paper] [Code], 2024.06
-
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
ACL 2024
Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang [Paper] [Code], 2024.06
-
Aligning Large Language Models by On-Policy Self-Judgment
ACL 2024
Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu [Paper] [Code], 2024.06
-
FineSurE: Fine-grained Summarization Evaluation using LLMs
ACL 2024
Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour [Paper] [Code], 2024.07
-
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
ACL 2024 Findings
Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao [Paper] [Code], 2024.06
-
LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models
NLP4ConvAI @ ACL 2023
Yen-Ting Lin, Yun-Nung Chen [Paper], 2023.05
-
G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment
EMNLP 2023
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu [Paper] [Code], 2023.05
-
TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models
EMNLP 2023
Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor [Paper] [Code], 2023.10
-
INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback
EMNLP 2023
Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, Lei Li [Paper], 2023.10
-
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation
EMNLP 2023
Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi [Paper] [Code], 2023.10
-
Revisiting Automated Topic Model Evaluation with Large Language Models
EMNLP 2023 (short)
Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, Elliott Ash [Paper] [Code], 2023.10
-
CLAIR: Evaluating Image Captions with Large Language Models
EMNLP 2023 (short)
David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny [Paper] [Code], 2023.10
-
GENRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models
NAACL 2024
Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han [Paper] [Code], 2024.02
-
GPTScore: Evaluate as You Desire
NAACL 2024
Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu [Paper] [Code], 2023.02
-
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
NAACL 2024
Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li [Paper], 2024.06
-
A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models
NAACL 2024 (short)
Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun [Paper] [Code], 2024.03
-
SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation
NAACL 2024 Findings
Hangfeng He, Hongming Zhang, Dan Roth [Paper] [Code], 2024.06
-
Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators
COLM 2024
Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier [Paper] [Code], 2024.08
-
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
NeurIPS 2023
Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang [Paper] [Code], 2023.05
-
Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning
NeurIPS 2023
Beichen Zhang, Kun Zhou, Xilin Wei, Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen [Paper] [Code], 2023.06
-
RRHF: Rank Responses to Align Language Models with Human Feedback without tears
NeurIPS 2023
Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang [Paper] [Code], 2023.10
-
Reflexion: Language Agents with Verbal Reinforcement Learning
NeurIPS 2023
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao [Paper] [Code], 2023.10
-
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
NeurIPS 2023
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang [Paper] [Code], 2023.10
-
Self-Evaluation Guided Beam Search for Reasoning
NeurIPS 2023
Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, Michael Xie [Paper] [Code], 2023.10
-
Benchmarking Foundation Models with Language-Model-as-an-Examiner
NeurIPS 2023
Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou [Paper] [Code], 2023.11
-
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
NeurIPS 2023
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica [Paper] [Code], 2023.12
-
Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Blog
-
Human-like Summarization Evaluation with ChatGPT
Preprint
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan [Paper], 2023.04
-
WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Preprint
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang [Paper] [Code] [Model], 2023.08
-
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Preprint
Lianghui Zhu, Xinggang Wang, Xinlong Wang [Paper] [Code], 2023.10
-
Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
Preprint
Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu [Paper] [Code], 2023.12
-
JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models
Preprint
-
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Preprint
Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto [Paper] [Code], 2024.04
-
OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Preprint
Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, Sanghyuk Choi [Paper] [Code], 2024.07
-
DHP Benchmark: Are LLMs Good NLG Evaluators?
Preprint
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu [Paper], 2024.08
-
Generative Verifiers: Reward Modeling as Next-Token Prediction
Preprint
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal [Paper], 2024.08
-
Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation
Preprint
Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket [Paper], 2024.09
-
LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization
Preprint
Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, Partha Pratim Chakrabarti [Paper] [Code], 2024.09
-
Reasoning with Language Model is Planning with World Model
EMNLP 2023
Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, Zhiting Hu [Paper] [Code] [Reasoners] [Blog], 2023.05
-
Solving Math Word Problems via Cooperative Reasoning induced Language Models
ACL 2023
Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, Yujiu Yang [Paper] [Code], 2023.07
-
Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language
NeurIPS 2023
-
Deductive Verification of Chain-of-Thought Reasoning
NeurIPS 2023
Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su [Paper] [Code], 2023.10
-
Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning
NeurIPS 2023
Xiaoming Shi, Siqiao Xue, Kangrui Wang, Fan Zhou, James Zhang, Jun Zhou, Chenhao Tan, Hongyuan Mei [Paper] [Code], 2023.10
-
DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models
NeurIPS 2023
Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, Sibei Yang [Paper] [Code], 2023.10
-
Learning to Reason and Memorize with Self-Notes
NeurIPS 2023
Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar [Paper], 2023.10
-
Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning
NeurIPS 2023
Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, Cewu Lu [Paper] [Code], 2023.11
-
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
NeurIPS 2023
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, Karthik Narasimhan [Paper] [Code], 2023.12
-
Understanding Social Reasoning in Language Models with Language Models
NeurIPS 2023
Kanishk Gandhi, Jan-Philipp Fraenken, Tobias Gerstenberg, Noah Goodman [Paper] [Code], 2023.12
-
Automatic model selection with large language models for reasoning
EMNLP 2023 Findings
James Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Michael Xie [Paper] [Code], 2023.10
-
Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation
Preprint
Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang [Paper] [Code], 2024.02
-
Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning
Preprint
Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li, Deli Chen, Y.Wu, Zhifang Sui [Paper], 2024.02
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Preprint
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov [Paper], 2024.08
-
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
EMNLP 2024
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra [Paper]
-
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
EMNLP 2024
Zhiting Fan, Ruizhe Chen, Ruiling Xu, Zuozhu Liu [Paper]
-
Are LLMs Good Zero-Shot Fallacy Classifiers?
EMNLP 2024
Fengjun Pan, Xiaobao Wu, Zongrui Li, Anh Tuan Luu [Paper]
-
Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
EMNLP 2024
Alex Chandler, Devesh Surve, Hui Su [Paper]
-
Split and Merge: Aligning Position Biases in LLM-based Evaluators
EMNLP 2024
Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu [Paper]
-
Annotation alignment: Comparing LLM and human annotations of conversational safety
EMNLP 2024
Rajiv Movva, Pang Wei Koh, Emma Pierson [Paper]
-
RealVul: Can We Detect Vulnerabilities in Web Applications with LLM?
EMNLP 2024
Di Cao, Yong Liao, Xiuwei Shang [Paper]
-
A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
EMNLP 2024
Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram [Paper]
-
RepEval: Effective Text Evaluation with LLM Representation
EMNLP 2024
Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou [Paper]
-
Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons
EMNLP 2024
Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales [Paper]
-
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
EMNLP 2024
Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao [Paper]
-
An LLM Feature-based Framework for Dialogue Constructiveness Assessment
EMNLP 2024
Lexin Zhou, Youmna Farag, Andreas Vlachos [Paper]
-
I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation
EMNLP 2024
Cheng-Kuang Wu, Zhi Rui Tam, Chao-Chung Wu, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen [Paper]
-
Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation
EMNLP 2024
Juhwan Choi, Jungmin Yun, Kyohoon Jin, YoungBin Kim [Paper]
-
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
EMNLP 2024
Yicheng Gao, Gonghan Xu, Zhe Wang, Arman Cohan [Paper]
-
Evaluating Mathematical Reasoning Beyond Accuracy
Preprint
Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu [Paper]
-
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models
Preprint
Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang [Paper]
-
Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization
Preprint
Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He [Paper]
-
LLaVA-RLHF: Aligning Large Multimodal Models with Factually Augmented RLHF
ACL 2024 Findings
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell [Paper]
-
AlpaGasus: Training A Better Alpaca with Fewer Data
ICLR 2024
Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin [Paper]
-
Concept-skill Transferability-based Data Selection for Large Vision-Language Models
EMNLP 2024
Jaewoo Lee, Boyang Li, Sung Ju Hwang [Paper]
-
Less is More: High-value Data Selection for Visual Instruction Tuning
Preprint
Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen [Paper]
-
Data-Juicer: A One-Stop Data Processing System for Large Language Models
Preprint
Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou [Paper]
-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
ECCV 2024
Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin [Paper]
-
Visual Instruction Tuning
NeurIPS 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee [Paper]
-
VBench: Comprehensive Benchmark Suite for Video Generative Models
Preprint
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu [Paper]
-
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Preprint
Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma [Paper]
-
Agent-as-a-Judge: Evaluate Agents with Agents
Preprint
Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber [Paper]