ACLUE |
ACLUE is an evaluation benchmark for ancient Chinese language comprehension. |
African Languages LLM Eval Leaderboard |
African Languages LLM Eval Leaderboard tracks progress and ranks performance of LLMs on African languages. |
AgentBoard |
AgentBoard is a benchmark for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates. |
AGIEval |
AGIEval is a human-centric benchmark to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. |
Aiera Leaderboard |
Aiera Leaderboard evaluates LLM performance on financial intelligence tasks, including speaker assignments, speaker change identification, abstractive summarizations, calculation-based Q&A, and financial sentiment tagging. |
AIR-Bench |
AIR-Bench is a benchmark to evaluate heterogeneous information retrieval capabilities of language models. |
AI Energy Score Leaderboard |
AI Energy Score Leaderboard tracks and compares different models in energy efficiency. |
ai-benchmarks |
ai-benchmarks contains a handful of evaluation results for the response latency of popular AI services. |
AlignBench |
AlignBench is a multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. |
AlpacaEval |
AlpacaEval is an automatic evaluator designed for instruction-following LLMs. |
ANGO |
ANGO is a generation-oriented Chinese language model evaluation benchmark. |
Arabic Tokenizers Leaderboard |
Arabic Tokenizers Leaderboard compares the efficiency of LLM tokenizers in parsing Arabic across its different dialects and forms. |
Arena-Hard-Auto |
Arena-Hard-Auto is a benchmark for instruction-tuned LLMs. |
AutoRace |
AutoRace focuses on the direct evaluation of LLM reasoning chains with the AutoRace metric (Automated Reasoning Chain Evaluation). |
Auto Arena |
Auto Arena is a benchmark in which various language model agents engage in peer-battles to evaluate their performance. |
Auto-J |
Auto-J hosts evaluation results on the pairwise response comparison and critique generation tasks. |
BABILong |
BABILong is a benchmark for evaluating the performance of language models in processing arbitrarily long documents with distributed facts. |
BBL |
BBL (BIG-bench Lite) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance, while being far cheaper to evaluate than the full set of more than 200 programmatic and JSON tasks in BIG-bench. |
BeHonest |
BeHonest is a benchmark to evaluate honesty - awareness of knowledge boundaries (self-knowledge), avoidance of deceit (non-deceptiveness), and consistency in responses (consistency) - in LLMs. |
BenBench |
BenBench is a benchmark to evaluate the extent to which LLMs have been trained verbatim on a benchmark's training set relative to its test set in order to inflate capabilities. |
BenCzechMark |
BenCzechMark (BCM) is a multitask and multimetric Czech language benchmark for LLMs with a unique scoring system that utilizes the theory of statistical significance. |
BiGGen-Bench |
BiGGen-Bench is a comprehensive benchmark to evaluate LLMs across a wide variety of tasks. |
BotChat |
BotChat is a benchmark to evaluate the multi-round chatting capabilities of LLMs through a proxy task. |
CaselawQA |
CaselawQA is a benchmark comprising legal classification tasks derived from the Supreme Court and Songer Court of Appeals legal databases. |
CFLUE |
CFLUE is a benchmark to evaluate LLMs' understanding and processing capabilities in the Chinese financial domain. |
Ch3Ef |
Ch3Ef is a benchmark to evaluate alignment with human expectations using 1002 human-annotated samples across 12 domains and 46 tasks based on the HHH (helpful, honest, harmless) principle. |
Chain-of-Thought Hub |
Chain-of-Thought Hub is a benchmark to evaluate the reasoning capabilities of LLMs. |
Chatbot Arena |
Chatbot Arena hosts a chatbot arena where various LLMs compete based on user satisfaction. |
ChemBench |
ChemBench is a benchmark to evaluate the chemical knowledge and reasoning abilities of LLMs. |
Chinese SimpleQA |
Chinese SimpleQA is a Chinese benchmark to evaluate the factuality ability of language models to answer short questions. |
CLEM Leaderboard |
CLEM is a framework designed for the systematic evaluation of chat-optimized LLMs as conversational agents. |
CLEVA |
CLEVA is a benchmark to evaluate LLMs on 31 tasks using 370K Chinese queries from 84 diverse datasets and 9 metrics. |
Chinese Large Model Leaderboard |
Chinese Large Model Leaderboard is a platform to evaluate the performance of Chinese LLMs. |
CMB |
CMB is a multi-level medical benchmark in Chinese. |
CMMLU |
CMMLU is a benchmark to evaluate the performance of LLMs in various subjects within the Chinese cultural context. |
CMMMU |
CMMMU is a benchmark to evaluate LMMs on tasks demanding college-level subject knowledge and deliberate reasoning in a Chinese context. |
CommonGen |
CommonGen is a benchmark to evaluate generative commonsense reasoning by testing machines on their ability to compose coherent sentences using a given set of common concepts. |
CompMix |
CompMix is a benchmark for heterogeneous question answering. |
Compression Rate Leaderboard |
Compression Rate Leaderboard aims to evaluate tokenizer performance on different languages. |
Compression Leaderboard |
Compression Leaderboard is a platform to evaluate the compression performance of LLMs. |
CopyBench |
CopyBench is a benchmark to evaluate the copying behavior and utility of language models as well as the effectiveness of methods to mitigate copyright risks. |
CoTaEval |
CoTaEval is a benchmark to evaluate the feasibility and side effects of copyright takedown methods for LLMs. |
ConvRe |
ConvRe is a benchmark to evaluate LLMs' ability to comprehend converse relations. |
CriticEval |
CriticEval is a benchmark to evaluate LLMs' ability to make critique responses. |
CS-Bench |
CS-Bench is a bilingual benchmark designed to evaluate LLMs' performance across 26 computer science subfields, focusing on knowledge and reasoning. |
CUTE |
CUTE is a benchmark to test the orthographic knowledge of LLMs. |
CyberMetric |
CyberMetric is a benchmark to evaluate the cybersecurity knowledge of LLMs. |
CzechBench |
CzechBench is a benchmark to evaluate Czech language models. |
C-Eval |
C-Eval is a Chinese evaluation suite for LLMs. |
Decentralized Arena Leaderboard |
Decentralized Arena hosts a decentralized and democratic platform for LLM evaluation, automating and scaling assessments across diverse, user-defined dimensions, including mathematics, logic, and science. |
DecodingTrust |
DecodingTrust is a platform to evaluate the trustworthiness of LLMs. |
Domain LLM Leaderboard |
Domain LLM Leaderboard is a platform to evaluate the popularity of domain-specific LLMs. |
Enterprise Scenarios leaderboard |
Enterprise Scenarios Leaderboard tracks and evaluates the performance of LLMs on real-world enterprise use cases. |
EQ-Bench |
EQ-Bench is a benchmark to evaluate aspects of emotional intelligence in LLMs. |
European LLM Leaderboard |
European LLM Leaderboard tracks and compares performance of LLMs in European languages. |
EvalGPT.ai |
EvalGPT.ai hosts a chatbot arena to compare and rank the performance of LLMs. |
Eval Arena |
Eval Arena measures noise levels, model quality, and benchmark quality by comparing model pairs across several LLM evaluation benchmarks with example-level analysis and pairwise comparisons. |
Factuality Leaderboard |
Factuality Leaderboard compares the factual capabilities of LLMs. |
FanOutQA |
FanOutQA is a high-quality, multi-hop, multi-document benchmark for LLMs using English Wikipedia as its knowledge base. |
FastEval |
FastEval is a toolkit for quickly evaluating instruction-following and chat language models on various benchmarks with fast inference and detailed performance insights. |
FELM |
FELM is a meta-benchmark to evaluate factuality evaluation for LLMs. |
FinEval |
FinEval is a benchmark to evaluate financial domain knowledge in LLMs. |
Fine-tuning Leaderboard |
Fine-tuning Leaderboard is a platform to rank and showcase models that have been fine-tuned using open-source datasets or frameworks. |
Flames |
Flames is a highly adversarial Chinese benchmark for evaluating LLMs' value alignment across fairness, safety, morality, legality, and data protection. |
FollowBench |
FollowBench is a multi-level fine-grained constraints following benchmark to evaluate the instruction-following capability of LLMs. |
Forbidden Question Dataset |
Forbidden Question Dataset is a benchmark containing 160 questions from 160 policy-violating categories, with corresponding targets for evaluating jailbreak methods. |
FuseReviews |
FuseReviews aims to advance grounded text generation tasks, including long-form question-answering and summarization. |
GAIA |
GAIA aims to test fundamental abilities that an AI assistant should possess. |
GAVIE |
GAVIE is a GPT-4-assisted benchmark for evaluating hallucination in LMMs by scoring accuracy and relevancy without relying on human-annotated groundtruth. |
GPT-Fathom |
GPT-Fathom is an LLM evaluation suite, benchmarking 10+ leading LLMs as well as OpenAI's legacy models on 20+ curated benchmarks across 7 capability categories, all under aligned settings. |
GrailQA |
Strongly Generalizable Question Answering (GrailQA) is a large-scale, high-quality benchmark for question answering on knowledge bases (KBQA) over Freebase, with 64,331 questions annotated with both answers and corresponding logical forms in different syntaxes (e.g., SPARQL, S-expression). |
GTBench |
GTBench is a benchmark to evaluate and rank LLMs' reasoning abilities in competitive environments through game-theoretic tasks, e.g., board and card games. |
Guerra LLM AI Leaderboard |
Guerra LLM AI Leaderboard compares and ranks the performance of LLMs across quality, price, performance, context window, and others. |
Hallucinations Leaderboard |
Hallucinations Leaderboard aims to track, rank and evaluate hallucinations in LLMs. |
HalluQA |
HalluQA is a benchmark to evaluate the phenomenon of hallucinations in Chinese LLMs. |
Hebrew LLM Leaderboard |
Hebrew LLM Leaderboard tracks and ranks language models according to their success on various Hebrew-language tasks. |
HellaSwag |
HellaSwag is a benchmark to evaluate common-sense reasoning in LLMs. |
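As a hedged aside, one common way to run HellaSwag locally is EleutherAI's lm-evaluation-harness, which is not itself listed here; the sketch below assumes `pip install lm-eval` and a Hugging Face causal LM, and argument names may differ across harness versions.

```python
# Hypothetical sketch: scoring a Hugging Face model on HellaSwag with
# EleutherAI's lm-evaluation-harness (an assumption, not part of this list).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # Hugging Face transformers backend
    model_args="pretrained=gpt2",  # any causal LM checkpoint
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])  # accuracy metrics for the task
```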
Hughes Hallucination Evaluation Model leaderboard |
Hughes Hallucination Evaluation Model leaderboard is a platform to evaluate how often a language model introduces hallucinations when summarizing a document. |
Icelandic LLM leaderboard |
Icelandic LLM leaderboard tracks and compares models on Icelandic-language tasks. |
IFEval |
IFEval is a benchmark to evaluate LLMs' instruction following capabilities with verifiable instructions. |
IL-TUR |
IL-TUR is a benchmark for evaluating language models on monolingual and multilingual tasks focused on understanding and reasoning over Indian legal documents. |
Indic LLM Leaderboard |
Indic LLM Leaderboard is a platform to track and compare the performance of Indic LLMs. |
Indico LLM Leaderboard |
Indico LLM Leaderboard evaluates and compares the accuracy of various language models across providers, datasets, and capabilities like text classification, key information extraction, and generative summarization. |
InstructEval |
InstructEval is a suite to evaluate instruction selection methods in the context of LLMs. |
Italian LLM-Leaderboard |
Italian LLM-Leaderboard tracks and compares LLMs in Italian-language tasks. |
JailbreakBench |
JailbreakBench is a benchmark for evaluating LLM vulnerabilities through adversarial prompts. |
Japanese Chatbot Arena |
Japanese Chatbot Arena hosts the chatbot arena, where various LLMs compete based on their performance in Japanese. |
Japanese Language Model Financial Evaluation Harness |
Japanese Language Model Financial Evaluation Harness is a harness for Japanese language model evaluation in the financial domain. |
Japanese LLM Roleplay Benchmark |
Japanese LLM Roleplay Benchmark is a benchmark to evaluate the performance of Japanese LLMs in character roleplay. |
JMED-LLM |
JMED-LLM (Japanese Medical Evaluation Dataset for Large Language Models) is a benchmark for evaluating LLMs in the Japanese medical domain. |
JMMMU |
JMMMU (Japanese MMMU) is a multimodal benchmark to evaluate LMM performance in Japanese. |
JustEval |
JustEval is a powerful tool designed for fine-grained evaluation of LLMs. |
KoLA |
KoLA is a benchmark to evaluate the world knowledge of LLMs. |
LaMP |
LaMP (Language Models Personalization) is a benchmark to evaluate personalization capabilities of language models. |
Language Model Council |
Language Model Council (LMC) is a benchmark to evaluate tasks that are highly subjective and often lack majoritarian human agreement. |
LawBench |
LawBench is a benchmark to evaluate the legal capabilities of LLMs. |
La Leaderboard |
La Leaderboard evaluates and tracks LLM memorization, reasoning, and linguistic capabilities across Spain, Latin America, and the Caribbean. |
LogicKor |
LogicKor is a benchmark to evaluate the multidisciplinary thinking capabilities of Korean LLMs. |
LongICL Leaderboard |
LongICL Leaderboard is a platform for long in-context learning evaluation of LLMs. |
LooGLE |
LooGLE is a benchmark to evaluate the long context understanding capabilities of LLMs. |
LAiW |
LAiW is a benchmark to evaluate Chinese legal language understanding and reasoning. |
LLM Benchmarker Suite |
LLM Benchmarker Suite is a benchmark to evaluate the comprehensive capabilities of LLMs. |
Large Language Model Assessment in English Contexts |
Large Language Model Assessment in English Contexts is a platform to evaluate LLMs in the English context. |
Large Language Model Assessment in the Chinese Context |
Large Language Model Assessment in the Chinese Context is a platform to evaluate LLMs in the Chinese context. |
LIBRA |
LIBRA is a benchmark for evaluating LLMs' capabilities in understanding and processing long Russian text. |
LibrAI-Eval GenAI Leaderboard |
LibrAI-Eval GenAI Leaderboard focuses on the balance between capability and safety of LLMs in English. |
LiveBench |
LiveBench is a benchmark for LLMs to minimize test set contamination and enable objective, automated evaluation across diverse, regularly updated tasks. |
LLMEval |
LLMEval is a benchmark to evaluate the quality of open-domain conversations with LLMs. |
Llmeval-Gaokao2024-Math |
Llmeval-Gaokao2024-Math is a benchmark for evaluating LLMs on 2024 Gaokao-level math problems in Chinese. |
LLMHallucination Leaderboard |
LLMHallucination Leaderboard evaluates LLMs based on an array of hallucination-related benchmarks. |
LLMPerf |
LLMPerf is a tool to evaluate the performance of LLMs using both load and correctness tests. |
LLMs Disease Risk Prediction Leaderboard |
LLMs Disease Risk Prediction Leaderboard is a platform to evaluate LLMs on disease risk prediction. |
LLM Leaderboard |
LLM Leaderboard tracks and evaluates LLM providers, enabling selection of the optimal API and model for user needs. |
LLM Leaderboard for CRM |
CRM LLM Leaderboard is a platform to evaluate the efficacy of LLMs for business applications. |
LLM Observatory |
LLM Observatory is a benchmark that assesses and ranks LLMs based on their performance in avoiding social biases across categories like LGBTIQ+ orientation, age, gender, politics, race, religion, and xenophobia. |
LLM Price Leaderboard |
LLM Price Leaderboard tracks and compares LLM costs based on one million tokens. |
LLM Rankings |
LLM Rankings offers a real-time comparison of language models based on normalized token usage for prompts and completions, updated frequently. |
LLM Roleplay Leaderboard |
LLM Roleplay Leaderboard evaluates human and AI performance in a social werewolf game for NPC development. |
LLM Safety Leaderboard |
LLM Safety Leaderboard aims to provide a unified evaluation for language model safety. |
LLM Use Case Leaderboard |
LLM Use Case Leaderboard tracks and evaluates LLMs in business use cases. |
LLM-AggreFact |
LLM-AggreFact is a fact-checking benchmark that aggregates the most up-to-date publicly available datasets on grounded factuality evaluation. |
LLM-Leaderboard |
LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs. |
LLM-Perf Leaderboard |
LLM-Perf Leaderboard aims to benchmark the performance of LLMs with different hardware, backends, and optimizations. |
LMExamQA |
LMExamQA is a benchmarking framework where a language model acts as an examiner to generate questions and evaluate responses in a reference-free, automated manner for comprehensive, equitable assessment. |
LongBench |
LongBench is a benchmark for assessing the long context understanding capabilities of LLMs. |
Loong |
Loong is a long-context benchmark for evaluating LLMs' multi-document QA abilities across financial, legal, and academic scenarios. |
Low-bit Quantized Open LLM Leaderboard |
Low-bit Quantized Open LLM Leaderboard tracks and compares LLMs quantized with different quantization algorithms. |
LV-Eval |
LV-Eval is a long-context benchmark with five length levels and advanced techniques for accurate evaluation of LLMs on single-hop and multi-hop QA tasks across bilingual datasets. |
LucyEval |
LucyEval offers a thorough assessment of LLMs' performance in various Chinese contexts. |
L-Eval |
L-Eval is a Long Context Language Model (LCLM) evaluation benchmark to evaluate performance in handling extensive context. |
M3KE |
M3KE is a massive multi-level multi-subject knowledge evaluation benchmark to measure the knowledge acquired by Chinese LLMs. |
MetaCritique |
MetaCritique is a judge that evaluates human-written or LLM-generated critiques by generating critiques of those critiques. |
MINT |
MINT is a benchmark to evaluate LLMs' ability to solve tasks with multi-turn interactions by using tools and leveraging natural language feedback. |
Mirage |
Mirage is a benchmark for medical information retrieval-augmented generation, featuring 7,663 questions from five medical QA datasets and tested with 41 configurations using the MedRAG toolkit. |
MedBench |
MedBench is a benchmark to evaluate the mastery of knowledge and reasoning abilities in medical LLMs. |
MedS-Bench |
MedS-Bench is a medical benchmark that evaluates LLMs across 11 task categories using 39 diverse datasets. |
Meta Open LLM leaderboard |
The Meta Open LLM leaderboard serves as a central hub for consolidating data from various open LLM leaderboards into a single, user-friendly visualization page. |
MIMIC Clinical Decision Making Leaderboard |
MIMIC Clinical Decision Making Leaderboard tracks and evaluates LLMs in realistic clinical decision-making for abdominal pathologies. |
MixEval |
MixEval is a benchmark to evaluate LLMs by strategically mixing off-the-shelf benchmarks. |
ML.ENERGY Leaderboard |
ML.ENERGY Leaderboard evaluates the energy consumption of LLMs. |
MMedBench |
MMedBench is a medical benchmark to evaluate LLMs in multilingual comprehension. |
MMLU |
MMLU is a benchmark to evaluate the performance of LLMs across a wide array of natural language understanding tasks. |
MMLU-by-task Leaderboard |
MMLU-by-task Leaderboard provides a platform for evaluating and comparing various ML models across different language understanding tasks. |
MMLU-Pro |
MMLU-Pro is a more challenging version of MMLU to evaluate the reasoning capabilities of LLMs. |
ModelScope LLM Leaderboard |
ModelScope LLM Leaderboard is a platform to evaluate LLMs objectively and comprehensively. |
Model Evaluation Leaderboard |
Model Evaluation Leaderboard tracks and evaluates text generation models based on their performance across various benchmarks using the Mosaic Eval Gauntlet framework. |
MSNP Leaderboard |
MSNP Leaderboard tracks and evaluates quantized GGUF models' performance on various GPU and CPU combinations using single-node setups via Ollama. |
MSTEB |
MSTEB is a benchmark for measuring the performance of text embedding models in Spanish. |
MTEB |
MTEB is a massive benchmark for measuring the performance of text embedding models on diverse embedding tasks across 112 languages. |
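For context, a minimal sketch of scoring an embedding model on a single MTEB task is shown below; it assumes the `mteb` and `sentence-transformers` Python packages, and the exact API may vary between mteb releases.

```python
# Minimal sketch, assuming the `mteb` package's classic MTEB runner API.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # any embedding model
evaluation = MTEB(tasks=["Banking77Classification"])  # one MTEB task name
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```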
MTEB Arena |
MTEB Arena hosts a model arena for dynamic, real-world assessment of embedding models through user-based query and retrieval comparisons. |
MT-Bench-101 |
MT-Bench-101 is a fine-grained benchmark for evaluating LLMs in multi-turn dialogues. |
MY Malay LLM Leaderboard |
MY Malay LLM Leaderboard aims to track, rank, and evaluate open LLMs on Malay tasks. |
NoCha |
NoCha is a benchmark to evaluate how well long-context language models can verify claims written about fictional books. |
NPHardEval |
NPHardEval is a benchmark to evaluate the reasoning abilities of LLMs through the lens of computational complexity classes. |
Occiglot Euro LLM Leaderboard |
Occiglot Euro LLM Leaderboard compares LLMs in the main European languages covered by the Okapi benchmark and Belebele (French, Italian, German, Spanish, and Dutch). |
OlympiadBench |
OlympiadBench is a bilingual multimodal scientific benchmark featuring 8,476 Olympiad-level mathematics and physics problems with expert-level step-by-step reasoning annotations. |
OlympicArena |
OlympicArena is a benchmark to evaluate the advanced capabilities of LLMs across a broad spectrum of Olympic-level challenges. |
oobabooga |
Oobabooga is a benchmark to perform repeatable performance tests of LLMs with the oobabooga web UI. |
OpenEval |
OpenEval is a platform to assess and evaluate Chinese LLMs. |
OpenLLM Turkish leaderboard |
OpenLLM Turkish leaderboard tracks progress and ranks the performance of LLMs in Turkish. |
Openness Leaderboard |
Openness Leaderboard tracks and evaluates models' transparency in terms of open access to weights, data, and licenses, exposing models that fall short of openness standards. |
Openness Leaderboard |
Openness Leaderboard is a tool that tracks the openness of instruction-tuned LLMs, evaluating their transparency, data, and model availability. |
OpenResearcher |
OpenResearcher contains the benchmarking results on various RAG-related systems as a leaderboard. |
Open Arabic LLM Leaderboard |
Open Arabic LLM Leaderboard tracks progress and ranks the performance of LLMs in Arabic. |
Open Chinese LLM Leaderboard |
Open Chinese LLM Leaderboard aims to track, rank, and evaluate open Chinese LLMs. |
Open CoT Leaderboard |
Open CoT Leaderboard tracks LLMs' abilities to generate effective chain-of-thought reasoning traces. |
Open Dutch LLM Evaluation Leaderboard |
Open Dutch LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in Dutch. |
Open Financial LLM Leaderboard |
Open Financial LLM Leaderboard aims to evaluate and compare the performance of financial LLMs. |
Open ITA LLM Leaderboard |
Open ITA LLM Leaderboard tracks progress and ranks the performance of LLMs in Italian. |
Open Ko-LLM Leaderboard |
Open Ko-LLM Leaderboard tracks progress and ranks the performance of LLMs in Korean. |
Open LLM Leaderboard |
Open LLM Leaderboard tracks progress and ranks the performance of LLMs in English. |
Open Medical-LLM Leaderboard |
Open Medical-LLM Leaderboard aims to track, rank, and evaluate open LLMs in the medical domain. |
Open MLLM Leaderboard |
Open MLLM Leaderboard aims to track, rank and evaluate LLMs and chatbots. |
Open MOE LLM Leaderboard |
Open MOE LLM Leaderboard assesses the performance and efficiency of various Mixture of Experts (MoE) LLMs. |
Open Multilingual LLM Evaluation Leaderboard |
Open Multilingual LLM Evaluation Leaderboard tracks progress and ranks the performance of LLMs in multiple languages. |
Open PL LLM Leaderboard |
Open PL LLM Leaderboard is a platform for assessing the performance of various LLMs in Polish. |
Open Portuguese LLM Leaderboard |
Open PT LLM Leaderboard aims to evaluate and compare LLMs on Portuguese-language tasks. |
Open Taiwan LLM leaderboard |
Open Taiwan LLM leaderboard showcases the performance of LLMs on various Taiwanese Mandarin language understanding tasks. |
Open-LLM-Leaderboard |
Open-LLM-Leaderboard evaluates LLMs in language understanding and reasoning by transitioning from multiple-choice questions (MCQs) to open-style questions. |
OPUS-MT Dashboard |
OPUS-MT Dashboard is a platform to track and compare machine translation models across multiple language pairs and metrics. |
OR-Bench |
OR-Bench is a benchmark to evaluate over-refusal caused by enhanced safety alignment in LLMs. |
ParsBench |
ParsBench provides toolkits for benchmarking LLMs based on the Persian language. |
Persian LLM Leaderboard |
Persian LLM Leaderboard provides a reliable evaluation of LLMs in the Persian language. |
Pinocchio ITA leaderboard |
Pinocchio ITA leaderboard tracks and evaluates LLMs in the Italian language. |
PL-MTEB |
PL-MTEB (Polish Massive Text Embedding Benchmark) is a benchmark for evaluating text embeddings in Polish across 28 NLP tasks. |
Polish Medical Leaderboard |
Polish Medical Leaderboard evaluates language models on Polish board certification examinations. |
Powered-by-Intel LLM Leaderboard |
Powered-by-Intel LLM Leaderboard evaluates, scores, and ranks LLMs that have been pre-trained or fine-tuned on Intel Hardware. |
PubMedQA |
PubMedQA is a benchmark to evaluate biomedical research question answering. |
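As an illustrative aside, PubMedQA instances can be loaded with the Hugging Face `datasets` library; the dataset id `pubmed_qa`, the `pqa_labeled` config, and the field names below are assumptions based on the commonly used Hub copy.

```python
# Hedged sketch: loading the labeled PubMedQA split via Hugging Face datasets.
from datasets import load_dataset

ds = load_dataset("pubmed_qa", "pqa_labeled", split="train")  # assumed id/config
example = ds[0]
print(example["question"])        # research question
print(example["final_decision"])  # yes / no / maybe answer
```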
PromptBench |
PromptBench is a benchmark to evaluate the robustness of LLMs on adversarial prompts. |
QAConv |
QAConv is a benchmark for question answering using complex, domain-specific, and asynchronous conversations as the knowledge source. |
QuALITY |
QuALITY is a benchmark for evaluating multiple-choice question-answering with a long context. |
RABBITS |
RABBITS is a benchmark to evaluate the robustness of LLMs by evaluating their handling of synonyms, specifically brand and generic drug names. |
Rakuda |
Rakuda is a benchmark to evaluate LLMs based on how well they answer a set of open-ended questions about Japanese topics. |
RedTeam Arena |
RedTeam Arena is a red-teaming platform for LLMs. |
Red Teaming Resistance Benchmark |
Red Teaming Resistance Benchmark is a benchmark to evaluate the robustness of LLMs against red teaming prompts. |
ReST-MCTS* |
ReST-MCTS* is a reinforced self-training method that uses tree search and process reward inference to collect high-quality reasoning traces for training policy and reward models without manual step annotations. |
Reviewer Arena |
Reviewer Arena hosts the reviewer arena, where various LLMs compete based on their performance in critiquing academic papers. |
RoleEval |
RoleEval is a bilingual benchmark to evaluate the memorization, utilization, and reasoning of role knowledge in LLMs. |
RPBench Leaderboard |
RPBench-Auto is an automated pipeline for evaluating LLMs using 80 personas for character-based role-playing and 80 scenes for scene-based role-playing. |
Russian Chatbot Arena |
Russian Chatbot Arena hosts a chatbot arena where various LLMs compete in Russian based on user satisfaction. |
Russian SuperGLUE |
Russian SuperGLUE is a benchmark for Russian language models, focusing on logic, commonsense, and reasoning tasks. |
R-Judge |
R-Judge is a benchmark to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. |
Safety Prompts |
Safety Prompts is a benchmark to evaluate the safety of Chinese LLMs. |
SafetyBench |
SafetyBench is a benchmark to evaluate the safety of LLMs. |
SALAD-Bench |
SALAD-Bench is a benchmark for evaluating the safety and security of LLMs. |
ScandEval |
ScandEval is a benchmark to evaluate LLMs on tasks in Scandinavian languages as well as German, Dutch, and English. |
Science Leaderboard |
Science Leaderboard is a platform to evaluate LLMs' capabilities to solve science problems. |
SciGLM |
SciGLM is a suite of scientific language models that use a self-reflective instruction annotation framework to enhance scientific reasoning by generating and revising step-by-step solutions to unlabelled questions. |
SciKnowEval |
SciKnowEval is a benchmark to evaluate LLMs based on their proficiency in studying extensively, enquiring earnestly, thinking profoundly, discerning clearly, and practicing assiduously. |
SCROLLS |
SCROLLS is a benchmark to evaluate the reasoning capabilities of LLMs over long texts. |
SeaExam |
SeaExam is a benchmark to evaluate LLMs for Southeast Asian (SEA) languages. |
SEAL LLM Leaderboards |
SEAL LLM Leaderboards is an expert-driven private evaluation platform for LLMs. |
SeaEval |
SeaEval is a benchmark to evaluate the performance of multilingual LLMs in understanding and reasoning with natural language, as well as comprehending cultural practices, nuances, and values. |
SEA HELM |
SEA HELM is a benchmark to evaluate LLMs' performance across English and Southeast Asian tasks, focusing on chat, instruction-following, and linguistic capabilities. |
SecEval |
SecEval is a benchmark to evaluate cybersecurity knowledge of foundation models. |
Self-Improving Leaderboard |
Self-Improving Leaderboard (SIL) is a dynamic platform that continuously updates test datasets and rankings to provide real-time performance insights for open-source LLMs and chatbots. |
Spec-Bench |
Spec-Bench is a benchmark to evaluate speculative decoding methods across diverse scenarios. |
StructEval |
StructEval is a benchmark to evaluate LLMs by conducting structured assessments across multiple cognitive levels and critical concepts. |
Subquadratic LLM Leaderboard |
Subquadratic LLM Leaderboard evaluates LLMs with subquadratic/attention-free architectures (e.g., RWKV and Mamba). |
SuperBench |
SuperBench is a comprehensive system of tasks and dimensions to evaluate the overall capabilities of LLMs. |
SuperGLUE |
SuperGLUE is a benchmark to evaluate the performance of LLMs on a set of challenging language understanding tasks. |
SuperLim |
SuperLim is a benchmark to evaluate the language understanding capabilities of LLMs in Swedish. |
Swahili LLM-Leaderboard |
Swahili LLM-Leaderboard is a joint community effort to create one central leaderboard for LLMs in Swahili. |
S-Eval |
S-Eval is a comprehensive, multi-dimensional safety benchmark with 220,000 prompts designed to evaluate LLM safety across various risk dimensions. |
TableQAEval |
TableQAEval is a benchmark to evaluate LLM performance in modeling long tables and comprehension capabilities, such as numerical and multi-hop reasoning. |
TAT-DQA |
TAT-DQA is a benchmark to evaluate LLMs on discrete reasoning over documents that combine both structured and unstructured information. |
TAT-QA |
TAT-QA is a benchmark to evaluate LLMs on discrete reasoning over documents that combine both tabular and textual content. |
Thai LLM Leaderboard |
Thai LLM Leaderboard aims to track and evaluate LLMs on Thai-language tasks. |
The Pile |
The Pile is a benchmark to evaluate the world knowledge and reasoning ability of LLMs. |
TOFU |
TOFU is a benchmark to evaluate the unlearning performance of LLMs in realistic scenarios. |
Toloka LLM Leaderboard |
Toloka LLM Leaderboard is a benchmark to evaluate LLMs based on authentic user prompts and expert human evaluation. |
Toolbench |
ToolBench is a platform for training, serving, and evaluating LLMs specifically for tool learning. |
Toxicity Leaderboard |
Toxicity Leaderboard evaluates the toxicity of LLMs. |
Trustbit LLM Leaderboards |
Trustbit LLM Leaderboards is a platform that provides benchmarks for building and shipping products with LLMs. |
TrustLLM |
TrustLLM is a benchmark to evaluate the trustworthiness of LLMs. |
TuringAdvice |
TuringAdvice is a benchmark for evaluating language models' ability to generate helpful advice for real-life, open-ended situations. |
TutorEval |
TutorEval is a question-answering benchmark which evaluates how well an LLM tutor can help a user understand a chapter from a science textbook. |
T-Eval |
T-Eval is a benchmark for evaluating the tool utilization capability of LLMs. |
UGI Leaderboard |
UGI Leaderboard measures and compares the uncensored and controversial information known by LLMs. |
UltraEval |
UltraEval is an open-source framework for transparent and reproducible benchmarking of LLMs across various performance dimensions. |
Vals AI |
Vals AI is a platform evaluating generative AI accuracy and efficacy on real-world legal tasks. |
VCR |
Visual Commonsense Reasoning (VCR) is a benchmark for cognition-level visual understanding, requiring models to answer visual questions and provide rationales for their answers. |
ViDoRe |
ViDoRe is a benchmark to evaluate retrieval models on their capacity to match queries to relevant documents at the page level. |
VLLMs Leaderboard |
VLLMs Leaderboard aims to track, rank and evaluate open LLMs and chatbots. |
VMLU |
VMLU is a benchmark to evaluate overall capabilities of foundation models in Vietnamese. |
WildBench |
WildBench is a benchmark for evaluating language models on challenging tasks that closely resemble real-world applications. |
Xiezhi |
Xiezhi is a benchmark for holistic domain knowledge evaluation of LLMs. |
Yanolja Arena |
Yanolja Arena hosts a model arena to evaluate the capabilities of LLMs in summarizing and translating text. |
Yet Another LLM Leaderboard |
Yet Another LLM Leaderboard is a platform for tracking, ranking, and evaluating open LLMs and chatbots. |
ZebraLogic |
ZebraLogic is a benchmark evaluating LLMs' logical reasoning using Logic Grid Puzzles, a type of Constraint Satisfaction Problem (CSP). |
ZeroSumEval |
ZeroSumEval is a competitive evaluation framework for LLMs using multiplayer simulations with clear win conditions. |