Awesome LLM-as-a-Judge

This repo includes the papers discussed in our survey paper A Survey on LLM-as-a-Judge.

Reference

Feel free to cite our survey if you find it useful for your research:

```bibtex
@article{gu2024surveyllmasajudge,
	title   = {A Survey on LLM-as-a-Judge},
	author  = {Jiawei Gu and Xuhui Jiang and Zhichao Shi and Hexiang Tan and Xuehao Zhai and Chengjin Xu and Wei Li and Yinghan Shen and Shengjie Ma and Honghao Liu and Yuanzhuo Wang and Jian Guo},
	year    = {2024},
	journal = {arXiv preprint arXiv:2411.15594}
}
```

Overview of LLM-as-a-Judge

[Figure: overview of LLM-as-a-Judge]

Evaluation Pipelines of LLM-as-a-Judge

[Figure: evaluation pipelines of LLM-as-a-Judge]
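The pipeline in the figure boils down to three stages: build an evaluation prompt, query the judge model, and post-process its output into a score or verdict. As a concrete (unofficial) illustration, below is a minimal pairwise-judging sketch in Python; the prompt wording, the `judge_pair` helper, and the `gpt-4o-mini` model choice are assumptions made for this README, not a reference implementation from any listed paper.

```python
# Minimal sketch of a pairwise LLM-as-a-Judge pipeline (illustrative only).
# Assumes the `openai` Python package (v1+) and an OPENAI_API_KEY in the
# environment; the prompt wording and model name are placeholder choices.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
question below and decide which is better.

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}

Answer with exactly one of: "A", "B", or "Tie"."""


def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Return the judge model's verdict: 'A', 'B', or 'Tie'."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,        # deterministic verdicts aid reproducibility
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return completion.choices[0].message.content.strip()


# Judge twice with the response order swapped: if the two verdicts
# disagree, treat the comparison as a tie rather than trusting either.
response_x = "Overfitting is when a model fits noise in the training data."
response_y = "Overfitting means the model is too small."
v1 = judge_pair("Explain overfitting.", response_x, response_y)
v2 = judge_pair("Explain overfitting.", response_y, response_x)
```

Swapping the response order and judging twice is a common mitigation for position bias, an issue examined by several papers in the Analysis section below.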

Improvement Strategies for LLM-as-a-Judge

[Figure: improvement strategies for LLM-as-a-Judge]

Paper List

1. Survey

  1. Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges Preprint

    Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes [Paper] [Code], 2024.07

  2. Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge Preprint

    Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, Nuno Moniz, Tian Gao, Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V Chawla, Xiangliang Zhang [Paper]

  3. Argument Quality Assessment in the Age of Instruction-Following Large Language Models COLING2024

    Henning Wachsmuth, Gabriella Lapesa, Elena Cabrio, Anne Lauscher, Joonsuk Park, Eva Maria Vecchi, Serena Villata, Timon Ziegenbein [Paper]

2. Analysis

  1. A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators AAAI 2024

    Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li [Paper] [Code], 2024.01

  2. Large Language Models Cannot Self-Correct Reasoning Yet ICLR 2024

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou [Paper], 2024.05

  3. Large Language Models are not Fair Evaluators ACL 2024

    Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Lingpeng Kong, Qi Liu, Tianyu Liu, Zhifang Sui [Paper] [Code], 2023.08

  4. Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models ACL 2024

    Abhishek Kumar, Sarfaroz Yunusov, Ali Emami [Paper] [Code], 2024.06

  5. Are LLM-based Evaluators Confusing NLG Quality Criteria? ACL 2024

    Xinyu Hu, Mingqi Gao, Sen Hu, Yang Zhang, Yicheng Chen, Teng Xu, Xiaojun Wan [Paper] [Code], 2024.06

  6. Likelihood-based Mitigation of Evaluation Bias in Large Language Models ACL 2024 findings

    Masanari Ohi, Masahiro Kaneko, Ryuto Koike, Mengsay Loem, Naoaki Okazaki [Paper], 2024.05

  7. Can Large Language Models Be an Alternative to Human Evaluations? ACL 2023

    Cheng-Han Chiang, Hung-yi Lee [Paper], 2023.05

  8. Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks EMNLP 2023

    Andrea Sottana, Bin Liang, Kai Zou, Zheng Yuan [Paper] [Code], 2023.10

  9. Is ChatGPT a Good NLG Evaluator? A Preliminary Study NewSumm @ EMNLP 2023

    Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, Jie Zhou [Paper] [Code], 2023.10

  10. Comparing Two Model Designs for Clinical Note Generation; Is an LLM a Useful Evaluator of Consistency? NAACL 2024 findings

    Nathan Brake, Thomas Schaaf [Paper], 2024.04

  11. Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks COLING 2024

    Ruiyang Zhou, Lu Chen, Kai Yu [Paper] [Dataset], 2024.05

  12. Exploring the Use of Large Language Models for Reference-Free Text Quality Evaluation: An Empirical Study Preprint

    Yi Chen, Rui Wang, Haiyun Jiang, Shuming Shi, Ruifeng Xu [Paper] [Code], 2023.09

  13. Humans or LLMs as the Judge? A Study on Judgement Biases Preprint

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang [Paper], 2024.06

  14. On the Limitations of Fine-tuned Judge Models for LLM Evaluation Preprint

    Hui Huang, Yingqi Qu, Hongli Zhou, Jing Liu, Muyun Yang, Bing Xu, Tiejun Zhao [Paper] [Code], 2024.06

  15. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment Preprint

    Vyas Raina, Adian Liusie, Mark Gales [Paper] [Code], 2024.07

  16. Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates Preprint

    Hui Wei, Shenghua He, Tian Xia, Andy Wong, Jingyang Lin, Mei Han [Paper] [Code], 2024.08

3. Auto-Evaluator

  1. On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs ICLR 2024 (oral)

    Jen-tse Huang, Wenxuan Wang, Eric John Li, Man Ho Lam, Shujie Ren, Youliang Yuan, Wenxiang Jiao, Zhaopeng Tu, Michael R. Lyu [Paper] [Code], 2023.12

  2. Generative Judge for Evaluating Alignment ICLR 2024

    Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu [Paper] [Code], 2023.12

  3. PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization ICLR 2024

    Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, Yue Zhang [Paper] [Code], 2024.05

  4. Think-on-Graph: Deep and Responsible Reasoning of Large Language Model on Knowledge Graph ICLR 2024

    Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel M. Ni, Heung-Yeung Shum, Jian Guo [Paper] [Code], 2024.05

  5. HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition ACL 2024

    Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang [Paper], 2024.02

  6. Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation ACL 2024

    Xiaoying Zhang, Baolin Peng, Ye Tian, Jingyan Zhou, Lifeng Jin, Linfeng Song, Haitao Mi, Helen Meng [Paper] [Code], 2024.06

  7. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model ACL 2024

    Yebin Lee, Imseong Park, Myungjoo Kang [Paper] [Code], 2024.06

  8. KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models ACL 2024

    Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang [Paper] [Code], 2024.06

  9. ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models ACL 2024

    Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, Yunlong Feng, Xiaoguang Li, Yasheng Wang, Lifeng Shang, Qun Liu, Linqi Song [Paper] [Code], 2024.06

  10. CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024

    Pei Ke, Bosi Wen, Andrew Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, Jie Tang, Minlie Huang [Paper] [Code], 2024.06

  11. Aligning Large Language Models by On-Policy Self-Judgment ACL 2024

    Sangkyu Lee, Sungdong Kim, Ashkan Yousefpour, Minjoon Seo, Kang Min Yoo, Youngjae Yu [Paper] [Code], 2024.06

  12. FineSurE: Fine-grained Summarization Evaluation using LLMs ACL 2024

    Hwanjun Song, Hang Su, Igor Shalyminov, Jason Cai, Saab Mansour [Paper] [Code], 2024.07

  13. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models ACL 2024 findings

    Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, Jing Shao [Paper] [Code], 2024.06

  14. LLM-EVAL: Unified Multi-Dimensional Automatic Evaluation for Open-Domain Conversations with Large Language Models NLP4ConvAI @ ACL 2023

    Yen-Ting Lin, Yun-Nung Chen [Paper], 2023.05

  15. G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment EMNLP 2023

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu [Paper] [Code], 2023.05

  16. TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models EMNLP 2023

    Zorik Gekhman, Jonathan Herzig, Roee Aharoni, Chen Elkind, Idan Szpektor [Paper] [Code], 2023.10

  17. INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback EMNLP 2023

    Wenda Xu, Danqing Wang, Liangming Pan, Zhenqiao Song, Markus Freitag, William Wang, Lei Li [Paper], 2023.10

  18. FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation EMNLP 2023

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, Hannaneh Hajishirzi [Paper] [Code], 2023.10

  19. Revisiting Automated Topic Model Evaluation with Large Language Models EMNLP 2023 (short)

    Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, Elliott Ash [Paper] [Code], 2023.10

  20. CLAIR: Evaluating Image Captions with Large Language Models EMNLP 2023 (short)

    David Chan, Suzanne Petryk, Joseph Gonzalez, Trevor Darrell, John Canny [Paper] [Code], 2023.10

  21. GENRES: Rethinking Evaluation for Generative Relation Extraction in the Era of Large Language Models NAACL 2024

    Pengcheng Jiang, Jiacheng Lin, Zifeng Wang, Jimeng Sun, Jiawei Han [Paper] [Code], 2024.02

  22. GPTScore: Evaluate as You Desire NAACL 2024

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu [Paper] [Code], 2023.02

  23. Branch-Solve-Merge Improves Large Language Model Evaluation and Generation NAACL 2024

    Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, Xian Li [Paper], 2024.06

  24. A Multi-Aspect Framework for Counter Narrative Evaluation using Large Language Models NAACL 2024 (short)

    Jaylen Jones, Lingbo Mo, Eric Fosler-Lussier, Huan Sun [Paper] [Code], 2024.03

  25. SocREval: Large Language Models with the Socratic Method for Reference-free Reasoning Evaluation NAACL 2024 findings

    Hangfeng He, Hongming Zhang, Dan Roth [Paper] [Code], 2024.06

  26. Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators COLM 2024

    Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier [Paper] [Code], 2024.08

  27. LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation NeurIPS 2023

    Yujie Lu, Xianjun Yang, Xiujun Li, Xin Eric Wang, William Yang Wang [Paper] [Code], 2023.05

  28. Evaluating and Improving Tool-Augmented Computation-Intensive Math Reasoning NeurIPS 2023

    Beichen Zhang, Kun Zhou, Xilin Wei, Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen [Paper] [Code], 2023.06

  29. RRHF: Rank Responses to Align Language Models with Human Feedback without tears NeurIPS 2023

    Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang [Paper] [Code], 2023.10

  30. Reflexion: Language Agents with Verbal Reinforcement Learning NeurIPS 2023

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao [Paper] [Code], 2023.10

  31. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation NeurIPS 2023

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang [Paper] [Code], 2023.10

  32. Self-Evaluation Guided Beam Search for Reasoning NeurIPS 2023

    Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, Michael Xie [Paper] [Code], 2023.10

  33. Benchmarking Foundation Models with Language-Model-as-an-Examiner NeurIPS 2023

    Yushi Bai, Jiahao Ying, Yixin Cao, Xin Lv, Yuze He, Xiaozhi Wang, Jifan Yu, Kaisheng Zeng, Yijia Xiao, Haozhe Lyu, Jiayin Zhang, Juanzi Li, Lei Hou [Paper] [Code], 2023.11

  34. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena NeurIPS 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica [Paper] [Code], 2023.12

  35. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality Blog

    The Vicuna Team [Code] [Blog], 2023.03

  36. Human-like Summarization Evaluation with ChatGPT Preprint

    Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan [Paper], 2023.04

  37. WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct Preprint

    Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, Dongmei Zhang [Paper] [Code] [Model], 2023.08

  38. JudgeLM: Fine-tuned Large Language Models are Scalable Judges Preprint

    Lianghui Zhu, Xinggang Wang, Xinlong Wang [Paper] [Code], 2023.10

  39. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs Preprint

    Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, Fei Wu [Paper] [Code], 2023.12

  40. JADE: A Linguistics-based Safety Evaluation Platform for Large Language Models Preprint

    Mi Zhang, Xudong Pan, Min Yang [Paper] [Code], 2023.12

  41. Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators Preprint

    Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto [Paper] [Code], 2024.04

  42. OffsetBias: Leveraging Debiased Data for Tuning Evaluators Preprint

    Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, Sanghyuk Choi [Paper] [Code], 2024.07

  43. DHP Benchmark: Are LLMs Good NLG Evaluators? Preprint

    Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu [Paper], 2024.08

  44. Generative Verifiers: Reward Modeling as Next-Token Prediction Preprint

    Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, Rishabh Agarwal [Paper], 2024.08

  45. Towards Leveraging Large Language Models for Automated Medical Q&A Evaluation Preprint

    Jack Krolik, Herprit Mahal, Feroz Ahmad, Gaurav Trivedi, Bahador Saket [Paper], 2024.09

  46. LLMs as Evaluators: A Novel Approach to Evaluate Bug Report Summarization Preprint

    Abhishek Kumar, Sonia Haiduc, Partha Pratim Das, Partha Pratim Chakrabarti [Paper] [Code], 2024.09

4. Reasoning

  1. Reasoning with Language Model is Planning with World Model EMNLP 2023

    Shibo Hao, Yi Gu, Haodi Ma, Joshua Hong, Zhen Wang, Daisy Wang, Zhiting Hu [Paper] [Code] [Reasoners] [Blog], 2023.05

  2. Solving Math Word Problems via Cooperative Reasoning induced Language Models ACL 2023

    Xinyu Zhu, Junjie Wang, Lin Zhang, Yuxiang Zhang, Yongfeng Huang, Ruyi Gan, Jiaxing Zhang, Yujiu Yang [Paper] [Code], 2023.07

  3. Human-like Few-Shot Learning via Bayesian Reasoning over Natural Language NeurIPS 2023

    Kevin Ellis [Paper] [Code], 2023.09

  4. Deductive Verification of Chain-of-Thought Reasoning NeurIPS 2023

    Zhan Ling, Yunhao Fang, Xuanlin Li, Zhiao Huang, Mingu Lee, Roland Memisevic, Hao Su [Paper] [Code], 2023.10

  5. Language Models Can Improve Event Prediction by Few-Shot Abductive Reasoning NeurIPS 2023

    Xiaoming Shi, Siqiao Xue, Kangrui Wang, Fan Zhou, James Zhang, Jun Zhou, Chenhao Tan, Hongyuan Mei [Paper] [Code], 2023.10

  6. DDCoT: Duty-Distinct Chain-of-Thought Prompting for Multimodal Reasoning in Language Models NeurIPS 2023

    Ge Zheng, Bin Yang, Jiajin Tang, Hong-Yu Zhou, Sibei Yang [Paper] [Code], 2023.10

  7. Learning to Reason and Memorize with Self-Notes NeurIPS 2023

    Jack Lanchantin, Shubham Toshniwal, Jason Weston, Arthur Szlam, Sainbayar Sukhbaatar [Paper], 2023.10

  8. Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning NeurIPS 2023

    Xiaoqian Wu, Yong-Lu Li, Jianhua Sun, Cewu Lu [Paper] [Code], 2023.11

  9. Tree of Thoughts: Deliberate Problem Solving with Large Language Models NeurIPS 2023

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, Karthik Narasimhan [Paper] [Code], 2023.12

  10. Understanding Social Reasoning in Language Models with Language Models NeurIPS 2023

    Kanishk Gandhi, Jan-Philipp Fraenken, Tobias Gerstenberg, Noah Goodman [Paper] [Code], 2023.12

  11. Automatic model selection with large language models for reasoning EMNLP 2023 findings

    James Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Michael Xie [Paper] [Code], 2023.10

  12. Everything of Thoughts: Defying the Law of Penrose Triangle for Thought Generation Preprint

    Ruomeng Ding, Chaoyun Zhang, Lu Wang, Yong Xu, Minghua Ma, Wei Zhang, Si Qin, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang [Paper] [Code], 2024.02

  13. Math-Shepherd: A Label-Free Step-by-Step Verifier for LLMs in Mathematical Reasoning Preprint

    Peiyi Wang, Lei Li, Zhihong Shao, R.X. Xu, Damai Dai, Yifei Li, Deli Chen, Y. Wu, Zhifang Sui [Paper], 2024.02

  14. Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents Preprint

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov [Paper], 2024.08

5. New

  1. Finding Blind Spots in Evaluator LLMs with Interpretable Checklists EMNLP 2024

    Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra [Paper]

  2. BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs EMNLP 2024

    Zhiting Fan, Ruizhe Chen, Ruiling Xu, Zuozhu Liu [Paper]

  3. Are LLMs Good Zero-Shot Fallacy Classifiers? EMNLP 2024

    Fengjun Pan, Xiaobao Wu, Zongrui Li, Anh Tuan Luu [Paper]

  4. Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors EMNLP 2024

    Alex Chandler, Devesh Surve, Hui Su [Paper]

  5. Split and Merge: Aligning Position Biases in LLM-based Evaluators EMNLP 2024

    Zongjie Li, Chaozheng Wang, Pingchuan Ma, Daoyuan Wu, Shuai Wang, Cuiyun Gao, Yang Liu [Paper]

  6. Annotation Alignment: Comparing LLM and Human Annotations of Conversational Safety EMNLP 2024

    Rajiv Movva, Pang Wei Koh, Emma Pierson [Paper]

  7. Humans or LLMs as the Judge? A Study on Judgement Biases EMNLP 2024

    Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang [Paper]

  8. RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? EMNLP 2024

    Di Cao, Yong Liao, Xiuwei Shang [Paper]

  9. A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024

    Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram [Paper]

  10. Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment EMNLP 2024

    Vyas Raina, Adian Liusie, Mark Gales [Paper]

  11. RepEval: Effective Text Evaluation with LLM Representation EMNLP 2024

    Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou [Paper]

  12. Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons EMNLP 2024

    Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales [Paper]

  13. SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales EMNLP 2024

    Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao [Paper]

  14. An LLM Feature-based Framework for Dialogue Constructiveness Assessment EMNLP 2024

    Lexin Zhou, Youmna Farag, Andreas Vlachos [Paper]

  15. I Need Help! Evaluating LLM’s Ability to Ask for Users’ Support: A Case Study on Text-to-SQL Generation EMNLP 2024

    Cheng-Kuang Wu, Zhi Rui Tam, Chao-Chung Wu, Chieh-Yen Lin, Hung-yi Lee, Yun-Nung Chen [Paper]

  16. Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation EMNLP 2024

    Juhwan Choi, Jungmin Yun, Kyohoon Jin, YoungBin Kim [Paper]

  17. Bayesian Calibration of Win Rate Estimation with LLM Evaluators EMNLP 2024

    Yicheng Gao, Gonghan Xu, Zhe Wang, Arman Cohan [Paper]

  18. Evaluating Mathematical Reasoning Beyond Accuracy Preprint

    Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu [Paper]

  19. MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models Preprint

    Shuai Peng, Di Fu, Liangcai Gao, Xiuqin Zhong, Hongguang Fu, Zhi Tang [Paper]

  20. Beyond Hallucinations: Enhancing LVLMs through Hallucination-Aware Direct Preference Optimization Preprint

    Zhiyuan Zhao, Bin Wang, Linke Ouyang, Xiaoyi Dong, Jiaqi Wang, Conghui He [Paper]

  21. LLaVA-RLHF: Aligning Large Multimodal Models with Factually Augmented RLHF ACL 2024 findings

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell [Paper]

  22. AlpaGasus: Training a Better Alpaca with Fewer Data ICLR 2024

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin [Paper]

  23. Concept-skill Transferability-based Data Selection for Large Vision-Language Models EMNLP 2024

    Jaewoo Lee, Boyang Li, Sung Ju Hwang [Paper]

  24. Less is More: High-value Data Selection for Visual Instruction Tuning Preprint

    Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen [Paper]

  25. Data-Juicer: A One-Stop Data Processing System for Large Language Models Preprint

    Daoyuan Chen, Yilun Huang, Zhijian Ma, Hesen Chen, Xuchen Pan, Ce Ge, Dawei Gao, Yuexiang Xie, Zhaoyang Liu, Jinyang Gao, Yaliang Li, Bolin Ding, Jingren Zhou [Paper]

  26. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions ECCV 2024

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin [Paper]

  27. Visual Instruction Tuning NeurIPS 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee [Paper]

  28. VBench: Comprehensive Benchmark Suite for Video Generative Models Preprint

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, Ziwei Liu [Paper]

  29. RevisEval: Improving LLM-as-a-Judge via Response-Adapted References Preprint

    Qiyuan Zhang, Yufei Wang, Tiezheng Yu, Yuxin Jiang, Chuhan Wu, Liangyou Li, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Fuyuan Lyu, Chen Ma [Paper]

  30. Agent-as-a-Judge: Evaluate Agents with Agents Preprint

    Mingchen Zhuge, Changsheng Zhao, Dylan Ashley, Wenyi Wang, Dmitrii Khizbullin, Yunyang Xiong, Zechun Liu, Ernie Chang, Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber [Paper]
