This repository has been archived by the owner on Jun 9, 2024. It is now read-only.

local runs, home_path config, submodule miniagi #50

Merged
merged 9 commits into from
Jul 4, 2023
63 changes: 63 additions & 0 deletions .github/workflows/mini-agi.yml
@@ -0,0 +1,63 @@
name: mini-agi Regression Test

on:
workflow_dispatch:
branches: [master]
push:
branches: [stable, master, ci-test*]

jobs:
regression-tests:
permissions:
pull-requests: write
contents: write
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
matrix:
python-version: ['3.10']

steps:
- name: Checkout repository
uses: actions/checkout@v3
with:
fetch-depth: 0
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
submodules: true

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}

- id: get_date
name: Get date
run: echo "date=$(date +'%Y-%m-%d')" >> $GITHUB_OUTPUT

- name: Install Poetry
run: |
curl -sSL https://install.python-poetry.org | python -

- name: Set up Poetry cache
uses: actions/cache@v2
with:
path: |
~/.cache/pypoetry
.venv
key: ${{ runner.os }}-poetry-${{ hashFiles('**/pyproject.toml') }}-${{ hashFiles('**/poetry.lock') }}-${{ steps.get_date.outputs.date }}

- name: Set up venv and install Python dependencies
run: |
poetry install --only main
poetry build

- name: Run regression tests
run: |
cd agent/mini-agi
make install
source venv/bin/activate
pip install ../../dist/agbenchmark-0.1.0-py3-none-any.whl
agbenchmark start --reg
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
4 changes: 4 additions & 0 deletions .gitmodules
@@ -6,6 +6,10 @@
path = agent/gpt-engineer
url = https://github.com/merwanehamadi/gpt-engineer.git
branch = benchmark-integration
[submodule "agent/mini-agi"]
path = agent/mini-agi
url = https://github.com/SilenNaihin/mini-agi.git
branch = benchmark-integration
[submodule "agent/smol-developer"]
path = agent/smol-developer
url = https://github.com/merwanehamadi/developer.git
128 changes: 7 additions & 121 deletions README.md
@@ -2,127 +2,13 @@

A repo built to benchmark the performance of agents of all kinds, regardless of how they are set up and how they work.

## As a user
### Scores:

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill agent
3. `agbenchmark start`
- `--category challenge_category` to run tests in a specific category
- `--mock` to only run mock tests, if they exist for each test
- `--noreg` to skip any tests that have passed in the past. If you run without this flag and a previously passing challenge fails, it is removed from the regression tests
4. We call boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics
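The boilerplate in step 2 is an entry script (by default `benchmarks.py`) that agbenchmark calls with the task as a command-line argument. A minimal sketch, assuming your agent exposes a single run function (the names here are illustrative):

```python
# benchmarks.py -- minimal sketch of the entry script agbenchmark invokes.
# How the task is handed to your agent is up to you; the print is a stand-in.
import sys


def run_specific_agent(task: str) -> None:
    # Replace this with code that runs (and later kills) your agent.
    print(f"Agent received task: {task}")


if __name__ == "__main__":
    # agbenchmark passes the task string as the first CLI argument.
    task = sys.argv[1] if len(sys.argv) > 1 else "example task"
    run_specific_agent(task)
```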
Scoring of agents will go here, both overall and by category.

## Contributing
### Integrated Agents

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the existing mocks

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `cp .env_example .env`
6. `agbenchmark start --mock`
Keep config the same and watch the logs :)

### To run with mini-agi

1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
2. `pip install -r requirements.txt`
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10 or higher installed
4. Follow the commands above, but remove the mock flag: `agbenchmark start`

- To add a requirement: `poetry add <requirement>`.

Feel free to create PRs to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.

If you push at any point and break things (it happens to everyone), fix it ASAP. Step 1 is to revert `master` to the last working commit.

Let people know what your beautiful code does, and document everything well.

Share your progress :)

### Pytest

An example of a test is below. Use it as a template: change the class name, the `.json` name, the test's dependencies and name, and the scoring logic.

```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
"""Testing if LLM can write to a file"""

def get_file_path(self) -> str: # all tests must implement this method
return os.path.join(os.path.dirname(__file__), "w_file_data.json")

@pytest.mark.depends(on=[], name="basic_write_file")
def test_method(self, workspace):
# implement scoring logic by looking at workspace
...
```

All challenges inherit from a parent class, which carries the pytest mark and any methods specific to their category.

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
pass
```

Add the below to create a file in the workspace before running a challenge. Only use this when a test needs a file created in the workspace beforehand, such as the read_file test.
```python
@pytest.fixture(
scope="module", autouse=True
) # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
Challenge.write_to_file(
workspace, self.data.ground.files[0], "this is how we're doing"
)
```

#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)

## Workspace

If the `--mock` flag is used, the workspace is at `agbenchmark/mocks/workspace`. Otherwise, for mini-agi it is at `C:/Users/<name>/miniagi`; it is set automatically in the config.

#### Dataset

Manually created challenges, existing challenges within Auto-GPT, and https://osu-nlp-group.github.io/Mind2Web/

## Repo

```
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
| |-- Challenge.py **easy challenge creation class**
| |-- config.json **workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
| | |-- code/
| | |-- memory/
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/
| | |-- basic_abilities/ **every llm should pass these challenges**
| | |-- regression/ **challenges that already passed**
```

## How to add new agents to agbenchmark?
Example with smol developer.

1. Create a GitHub branch with your agent, following the same pattern as this example:

https://github.com/smol-ai/developer/pull/114/files

2. Create the submodule and the GitHub workflow following the same pattern as this example:

https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/pull/48/files
- Auto-GPT
- gpt-engineer
- mini-agi
- smol-developer
126 changes: 126 additions & 0 deletions agbenchmark/README.md
@@ -0,0 +1,126 @@
## As a user

1. `pip install auto-gpt-benchmarks`
2. Add boilerplate code to run and kill agent
3. `agbenchmark start`
- `--category challenge_category` to run tests in a specific category
- `--mock` to only run mock tests, if they exist for each test
- `--noreg` to skip any tests that have passed in the past. If you run without this flag and a previously passing challenge fails, it is removed from the regression tests
4. We call boilerplate code for your agent
5. Show pass rate of tests, logs, and any other metrics

## Contributing

##### Diagrams: https://whimsical.com/agbenchmark-5n4hXBq1ZGzBwRsK4TVY7x

### To run the existing mocks

1. clone the repo `auto-gpt-benchmarks`
2. `pip install poetry`
3. `poetry shell`
4. `poetry install`
5. `cp .env_example .env`
6. `agbenchmark start --mock`
Keep config the same and watch the logs :)

### To run with mini-agi

1. Navigate to `auto-gpt-benchmarks/agent/mini-agi`
2. `pip install -r requirements.txt`
3. `cp .env_example .env`, set `PROMPT_USER=false` and add your `OPENAI_API_KEY=`. Set `MODEL="gpt-3.5-turbo"` if you don't have access to `gpt-4` yet. Also make sure you have Python 3.10 or higher installed
4. Follow the commands above, but remove the mock flag: `agbenchmark start`

- To add a requirement: `poetry add <requirement>`.

Feel free to create PRs to merge with `main` at will (but also feel free to ask for review). If you can't, send a message in the R&D chat for access.

If you push at any point and break things (it happens to everyone), fix it ASAP. Step 1 is to revert `master` to the last working commit.

Let people know what your beautiful code does, and document everything well.

Share your progress :)

### Pytest

An example of a test is below. Use it as a template: change the class name, the `.json` name, the test's dependencies and name, and the scoring logic.

```python
import pytest
from agbenchmark.tests.basic_abilities.BasicChallenge import BasicChallenge
import os


class TestWriteFile(BasicChallenge):
"""Testing if LLM can write to a file"""

def get_file_path(self) -> str: # all tests must implement this method
return os.path.join(os.path.dirname(__file__), "w_file_data.json")

@pytest.mark.depends(on=[], name="basic_write_file")
def test_method(self, workspace):
# implement scoring logic by looking at workspace
...
```

All challenges inherit from a parent class, which carries the pytest mark and any methods specific to their category.

```python
@pytest.mark.basic
class BasicChallenge(Challenge):
pass
```

Add the below to create a file in the workspace before running a challenge. Only use this when a test needs a file created in the workspace beforehand, such as the read_file test.

```python
@pytest.fixture(
scope="module", autouse=True
) # this is specific to setting up a file for the test, not all tests have this
def setup_module(self, workspace):
Challenge.write_to_file(
workspace, self.data.ground.files[0], "this is how we're doing"
)
```

#### The main Challenge class has all the parametrization and loading logic so that all tests can inherit from it. It lives within [this file](https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/blob/master/agbenchmark/Challenge.py)

## Workspace

If the `--mock` flag is used, the workspace is at `agbenchmark/mocks/workspace`. Otherwise, for mini-agi it is at `C:/Users/<name>/miniagi`; it is set automatically in the config.

#### Dataset

Manually created challenges, existing challenges within Auto-GPT, and https://osu-nlp-group.github.io/Mind2Web/

## Repo

```
|-- auto-gpt-benchmarks/ **main project directory**
| |-- metrics.py **combining scores, metrics, final evaluation**
| |-- start_benchmark.py **entry point from cli**
| |-- conftest.py **config, workspace creation + teardown, regression test markers, parameterization**
| |-- Challenge.py **easy challenge creation class**
| |-- config.json **workspace folder**
| |-- challenges/ **challenges across different domains**
| | |-- adaptability/
| | |-- basic_abilities/
| | |-- code/
| | |-- memory/
| | |-- retrieval/
| | |-- web_navigation/
| | |-- writing/
| |-- tests/
| | |-- basic_abilities/ **every llm should pass these challenges**
| | |-- regression/ **challenges that already passed**
```

## How to add new agents to agbenchmark?

Example with smol developer.

1. Create a GitHub branch with your agent, following the same pattern as this example:

https://github.com/smol-ai/developer/pull/114/files

2. Create the submodule and the GitHub workflow following the same pattern as this example:

https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks/pull/48/files
11 changes: 5 additions & 6 deletions agbenchmark/agent_interface.py
@@ -1,4 +1,3 @@
import importlib
import os
import subprocess
import sys
@@ -29,18 +28,18 @@ def run_agent(
mock_manager.delegate(mock_func)
else:
timeout = config["cutoff"]
print(f"Running Python function '{config['func_path']}' with timeout {timeout}")
print(
f"Running Python function '{config['entry_path']}' with timeout {timeout}"
)

# Get the current working directory
cwd = os.getcwd()

# Add current directory to Python's import path
sys.path.append(cwd)
sys.path.append(os.path.join(cwd, config["home_path"]))

module_name = config["func_path"].replace("/", ".").rstrip(".py")
module = importlib.import_module(module_name)

command = [sys.executable, "benchmarks.py", str(task)]
command = [sys.executable, config["entry_path"], str(task)]
process = subprocess.Popen(
command,
stdout=subprocess.PIPE,
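The diff above swaps the hardcoded `benchmarks.py` for the configurable `entry_path`. A minimal sketch of the resulting invocation, with illustrative config values (the real ones come from the benchmark config and CLI prompts):

```python
import subprocess
import sys

# Illustrative values; agbenchmark reads these from its config.
config = {"entry_path": "benchmarks.py", "cutoff": 60}
task = "Write the word 'Washington' to a .txt file"

# Launch the agent's entry script with the task as its CLI argument.
command = [sys.executable, config["entry_path"], str(task)]
process = subprocess.Popen(
    command,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
try:
    out, err = process.communicate(timeout=config["cutoff"])
except subprocess.TimeoutExpired:
    process.kill()  # kill the agent if it exceeds the cutoff
    out, err = process.communicate()
```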
2 changes: 1 addition & 1 deletion agbenchmark/start_benchmark.py
@@ -38,7 +38,7 @@ def start(category: str, reg: bool, mock: bool) -> int:
default=os.path.join(Path.home(), "workspace"),
)

config["func_path"] = click.prompt(
config["entry_path"] = click.prompt(
"Please enter the path to your run_specific_agent function implementation",
default="/benchmarks.py",
)
@@ -1,6 +1,7 @@
import os
from pathlib import Path
from typing import Any, Dict
import pytest

from agbenchmark.tests.basic_abilities.basic_challenge import BasicChallenge

@@ -11,6 +12,7 @@ class TestWriteFile(BasicChallenge):
def get_file_path(self) -> str: # all tests must implement this method
return os.path.join(os.path.dirname(__file__), "w_file_data.json")

@pytest.mark.depends(name="basic_write_file")
def test_method(self, config: Dict[str, Any]) -> None:
self.setup_challenge(config)

2 changes: 1 addition & 1 deletion agent/Auto-GPT