
clembench: Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents

Chalamalasetti, K., Götze, J., Hakimov, S., Madureira, B., Sadler, P., & Schlangen, D. (2023). clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents. In Proceedings of EMNLP 2023.

We propose a complementary way of evaluating LLMs that combines the control (and the reproducibility that comes from automation) offered by reference-based evaluation with the interactivity that chatbot-type, preference-based evaluation puts to the test. This is achieved through game play in well-defined conversational (dialogue) games. We have implemented a set of games (such as Wordle or Taboo, or games in which one player must describe to another player what to do) that current models can play in self-play. Each game comes with metrics that measure the quality of the game play. Across the set of games, we calculate an overall score per model (which we call the clemscore); it serves as an indicator of how well the model can follow fine-grained instructions and how well it can simulate goal-directed conversational behaviour.
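The aggregation into a single per-model score can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: it assumes that each game reports a percentage of episodes played to completion and a mean quality score on a 0-100 scale, and that the overall score is the average played fraction multiplied by the average quality score. The function name `aggregate_score`, the game key `instruction_giving`, and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical per-game results for one model (illustrative numbers only,
# not real benchmark data).
#   "played":  percentage of episodes played to completion (0-100)
#   "quality": mean quality score over the played episodes (0-100)
results = {
    "taboo": {"played": 80.0, "quality": 50.0},
    "wordle": {"played": 100.0, "quality": 20.0},
    "instruction_giving": {"played": 60.0, "quality": 80.0},
}


def aggregate_score(per_game):
    """Combine per-game metrics into one score per model.

    Assumption: the score is the macro-averaged played fraction times the
    macro-averaged quality score, so a model is penalised both for failing
    to finish games and for playing them badly.
    """
    avg_played = mean(g["played"] for g in per_game.values())
    avg_quality = mean(g["quality"] for g in per_game.values())
    return avg_played / 100.0 * avg_quality


score = aggregate_score(results)
print(f"overall score: {score:.1f}")  # prints "overall score: 40.0"
```

Under this assumed definition, macro-averaging over games keeps a game with many episodes from dominating the result, and multiplying by the played fraction means a model cannot score well by finishing only the episodes it finds easy.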

@inproceedings{chalamalasetti-etal-2023-clembench,
    title = "clembench: Using Game Play to Evaluate Chat-Optimized Language Models as Conversational Agents",
    author = {Chalamalasetti, Kranti  and
      G{\"o}tze, Jana  and
      Hakimov, Sherzod  and
      Madureira, Brielen  and
      Sadler, Philipp  and
      Schlangen, David},
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.689",
    pages = "11174--11219"
}

Links

Pinned

  1. clembench-runs Public

    The full outputs generated by running the benchmark on different LLMs.

    1 7

  2. clembench.github.io Public

    Website for clembench results

    HTML 2

  3. clembench-leaderboard Public

    Leaderboard to show the evaluated LLMs

    Python 1 1

  4. clemgame-template Public template

    Template repository for developing games

    Python 1

Repositories

Showing 9 of 9 repositories
  • clembench-runs Public

    The full outputs generated by running the benchmark on different LLMs.

    1 7 2 0 Updated Apr 9, 2025
  • clemgame-template Public template

    Template repository for developing games

    Python 0 Apache-2.0 1 0 0 Updated Apr 9, 2025
  • clemcore Public Forked from clp-research/clemcore

    A Framework for the Systematic Evaluation of Chat-Optimized Language Models as Conversational Agents and an Extensible Benchmark

    Python 5 MIT 38 0 0 Updated Apr 9, 2025
  • clembench Public Forked from clp-research/clembench

    Collection of games to be run with the clembench framework

    Python 0 9 0 0 Updated Apr 9, 2025
  • llm-calculator Public

    LLM Calculator

    Python 0 Apache-2.0 1 0 0 Updated Mar 14, 2025
  • clembench.github.io Public

    Website for clembench results

    HTML 0 2 0 0 Updated Mar 11, 2025
  • multimodal-clem-leaderboard Public
    Python 0 Apache-2.0 1 0 0 Updated Mar 7, 2025
  • clembench-leaderboard Public

    Leaderboard to show the evaluated LLMs

    Python 1 MIT 1 1 0 Updated Mar 7, 2025
  • .github Public
    0 0 0 0 Updated Dec 11, 2023
