This repository contains the dataset and code of the paper (Under Review):
PersonaGym: Evaluating Persona Agents and LLMs
The personas used in our experiments are located in the personas file, and the current list of static environments is located in the environments file.
# Environment setup

```bash
conda create -n PersonaGym python=3.9 -y
conda activate PersonaGym

# install dependencies
pip install -r requirements.txt
```
Currently, our framework supports the evaluation of any model available through the OpenAI, Anthropic, or TogetherAI APIs.
To start evaluating one or more personas, begin by inputting your OpenAI, Anthropic, and TogetherAI API keys here:
```python
OPENAI_API_KEY = "Insert OpenAI key here"
CLAUDE_API_KEY = "Insert Claude key here"
LLAMA_API_KEY = "Insert Llama key here"
```
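If you prefer not to hardcode keys, one option is to read them from environment variables instead. The sketch below is only an illustration and is not part of the repository's documented workflow; the environment variable names are assumptions that simply mirror the assignments above.

```python
import os

# Hedged sketch: read the API keys from environment variables rather than
# hardcoding them. The variable names mirror the assignments above but are
# otherwise an assumption, not something the repository defines.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "")
LLAMA_API_KEY = os.environ.get("LLAMA_API_KEY", "")

# Warn early if a key is missing so a failed evaluation run is easier to debug.
for name, key in [("OPENAI_API_KEY", OPENAI_API_KEY),
                  ("CLAUDE_API_KEY", CLAUDE_API_KEY),
                  ("LLAMA_API_KEY", LLAMA_API_KEY)]:
    if not key:
        print(f"Warning: {name} is not set")
```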
Then move to the code directory and run the run.py file. The following flags are available:

- --persona_list: a string list of one or more personas to evaluate.
- --model: the model API name (e.g., meta-llama/Llama-2-70b-chat-hf).
- --model_name: the name used when saving results for the model being evaluated.
- --save_name: a unique name under which to save the score in the scores directory.
- --saved_questions (optional): loads already generated questions from a subdirectory within the questions directory, which allows evaluation progress to be continued.
- --saved_responses (optional): the directory path to already generated persona agent responses.
- --benchmark: enables running on our benchmark; currently, this flag should be set to benchmark-v1.
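For reference, the sketch below shows how this command-line interface could be wired up with argparse. It is a hedged illustration of the flags described above, not the actual implementation of run.py, and details such as defaults, required arguments, and how the persona list string is parsed are assumptions.

```python
import argparse
import ast

# Hedged sketch of the CLI described above; the real run.py may differ.
parser = argparse.ArgumentParser(description="Evaluate persona agents with PersonaGym")
parser.add_argument("--persona_list", type=str, default=None,
                    help='String list of personas, e.g. \'["a high school physics teacher"]\'')
parser.add_argument("--model", type=str, required=True,
                    help="Model API name, e.g. meta-llama/Llama-2-70b-chat-hf")
parser.add_argument("--model_name", type=str, required=True,
                    help="Name used when saving results for the evaluated model")
parser.add_argument("--save_name", type=str, default=None,
                    help="Unique name under which scores are saved in the scores directory")
parser.add_argument("--saved_questions", type=str, default=None,
                    help="Subdirectory within the questions directory holding already generated questions")
parser.add_argument("--saved_responses", type=str, default=None,
                    help="Directory path to already generated persona agent responses")
parser.add_argument("--benchmark", type=str, default=None,
                    help="Set to benchmark-v1 to evaluate on the benchmark")
args = parser.parse_args()

# The persona list is passed as a quoted Python-style list string, so parse it
# with ast.literal_eval (an assumption about the expected format).
personas = ast.literal_eval(args.persona_list) if args.persona_list else []
```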
An example of running the run.py file is included below:

```bash
python run.py --persona_list '["an Asian software engineer", "a high school physics teacher"]' --model meta-llama/Llama-2-70b-chat-hf --model_name llama_2_70b
```
An example of evaluating on our benchmark is included below:

```bash
python run.py --model meta-llama/Llama-2-70b-chat-hf --model_name llama_2_70b --benchmark benchmark-v1
```