Rework (#40)

aorwall authored Jan 17, 2025
1 parent ddac5f6 commit c8658a1
Showing 117 changed files with 105,579 additions and 6,577 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -160,5 +160,6 @@ notebooks/local_experiments.ipynb
playground
logs
Pipfile
experiments
evals
test_results
experiments
96 changes: 95 additions & 1 deletion README.md
@@ -47,6 +47,13 @@ I have focused on testing my ideas, and the project is currently a bit messy. My

## Environment Setup

Install dependencies:
```bash
poetry install
```

## Environment Variables

Before running the evaluation, you'll need:
1. At least one LLM provider API key (e.g., OpenAI, Anthropic, etc.)
2. A Voyage AI API key from [voyageai.com](https://voyageai.com) to use the pre-embedded vector stores for SWE-Bench instances.
@@ -86,7 +93,94 @@ export TESTBED_API_KEY="<your-key>"
export TESTBED_BASE_URL="<your-base-url>"
```
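
For reference, the provider keys from the list above can be exported the same way. The exact variable names depend on your provider setup; the names below are common defaults and are shown here only as an assumption:

```bash
# Assumed variable names; adjust to your LLM provider
export ANTHROPIC_API_KEY="<your-key>"  # or OPENAI_API_KEY, etc.
export VOYAGE_API_KEY="<your-key>"     # Voyage AI key for the pre-embedded vector stores
```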

## Example
## Verified Models

Default model configurations are provided for verified models. Note that other models may work but have not been extensively tested. When specifying just the `--model` argument, the following configurations are used:

| Model | Response Format | Message History | Thoughts in Action |
|-------|----------------|-----------------|-------------------|
| claude-3-5-sonnet-20241022 | tool_call | messages | no |
| claude-3-5-haiku-20241022 | tool_call | messages | no |
| gpt-4o-2024-11-20 | tool_call | messages | yes |
| gpt-4o-mini-2024-07-18 | tool_call | messages | yes |
| deepseek/deepseek-chat | react | react | yes |
| gemini/gemini-2.0-flash-exp | tool_call | messages | yes |
| openrouter/meta-llama/llama-3.1-70b-instruct | react | react | no |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | react | react | no |
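
For example, selecting one of these models with `--model` alone applies its default response format and message history (a minimal sketch; the full command is documented under Run Evaluation below):

```bash
# deepseek/deepseek-chat defaults to the ReAct response format and message history
poetry run python -m moatless.benchmark.run_evaluation --model deepseek/deepseek-chat
```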

## Verify Setup

Before running the full evaluation, you can verify your setup using the integration test script:

```bash
# Run a single model test
poetry run scripts/run_integration_tests.py --model claude-3-5-sonnet-20241022
```

The script runs the model against a sample SWE-Bench instance.

Results are saved in `test_results/integration_test_<timestamp>/`.


## Run Evaluation

The evaluation script supports various configuration options through command line arguments:

```bash
poetry run python -m moatless.benchmark.run_evaluation [OPTIONS]
```

Required arguments:
- `--model MODEL`: Model to use for evaluation (e.g., 'claude-3-5-sonnet-20241022', 'gpt-4o')

Optional arguments:
- Model settings:
  - `--model MODEL`: Model identifier. Can be a verified model from the table above or any custom model identifier.
  - `--api-key KEY`: API key for the model
  - `--base-url URL`: Base URL for the model API
  - `--response-format FORMAT`: Response format ('tool_call' or 'react'). Defaults to 'tool_call' for custom models
  - `--message-history TYPE`: Message history type ('messages', 'summary', 'react', 'messages_compact', 'instruct'). Defaults to 'messages' for custom models
  - `--thoughts-in-action`: Enable thoughts in action
  - `--temperature FLOAT`: Temperature for model sampling. Defaults to 0.0

- Dataset settings:
  - `--split SPLIT`: Dataset split to use. Defaults to 'lite'
  - `--instance-ids ID [ID ...]`: Specific instance IDs to evaluate

- Loop settings:
  - `--max-iterations INT`: Maximum number of iterations
  - `--max-cost FLOAT`: Maximum cost in dollars

- Runner settings:
  - `--num-workers INT`: Number of parallel workers. Defaults to 10
  - `--evaluation-name NAME`: Custom name for the evaluation run
  - `--rerun-errors`: Rerun instances that previously errored
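
For example, a run can be bounded by iteration count and cost using the flags above (a sketch; adjust the values to your budget):

```bash
poetry run python -m moatless.benchmark.run_evaluation \
    --model gpt-4o-mini-2024-07-18 \
    --max-iterations 30 \
    --max-cost 1.0 \
    --num-workers 4
```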

Available dataset splits that can be specified with the `--split` argument:

| Split Name | Description | Instance Count |
|------------|-------------|----------------|
| lite | All instances from the lite dataset | 300 |
| verified | All instances from the verified dataset | 500 |
| verified_mini | [MariusHobbhahn/swe-bench-verified-mini](https://huggingface.co/datasets/MariusHobbhahn/swe-bench-verified-mini), a subset of SWE-Bench Verified | 50 |
| lite_and_verified_solvable | Instances that exist in both lite and verified datasets and have at least one solved submission to SWE-Bench | 84 |

Example usage:
```bash
# Run evaluation with Claude 3.5 Sonnet using the ReAct format
poetry run python -m moatless.benchmark.run_evaluation \
--model claude-3-5-sonnet-20241022 \
--response-format react \
--message-history react \
--num-workers 10

# Run specific instances with GPT-4o
poetry run python -m moatless.benchmark.run_evaluation \
--model gpt-4o \
--instance-ids "django__django-16379"
```
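
A different dataset split can be selected with `--split` (a sketch using the splits listed above):

```bash
# Evaluate the 50-instance verified_mini subset
poetry run python -m moatless.benchmark.run_evaluation \
    --model claude-3-5-haiku-20241022 \
    --split verified_mini
```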

## Code Example

Basic setup using the `AgenticLoop` to solve a SWE-Bench instance.

1 change: 0 additions & 1 deletion moatless/actions/__init__.py
@@ -1,5 +1,4 @@
from moatless.actions.append_string import AppendString
from moatless.actions.code_change import RequestCodeChange
from moatless.actions.create_file import CreateFile
from moatless.actions.find_class import FindClass
from moatless.actions.find_code_snippet import FindCodeSnippet
60 changes: 4 additions & 56 deletions moatless/actions/action.py
@@ -6,12 +6,7 @@

from pydantic import BaseModel, ConfigDict

from moatless.actions.model import (
ActionArguments,
Observation,
RewardScaleEntry,
FewShotExample,
)
from moatless.actions.schema import ActionArguments, Observation, RewardScaleEntry, FewShotExample
from moatless.file_context import FileContext
from moatless.index import CodeIndex
from moatless.repository.repository import Repository
@@ -79,41 +74,6 @@ def get_evaluation_criteria(cls, trajectory_length: int | None = None) -> List[s
"Repetitive or Redundant Actions: Detect if the agent is repeating the same unsuccessful or redundant actions without making progress. Pay close attention to the agent's history and outputs indicating lack of progress.",
]

@classmethod
def get_reward_scale(cls, trajectory_length) -> List[RewardScaleEntry]:
return [
RewardScaleEntry(
min_value=75,
max_value=100,
description="The action significantly advances the solution.",
),
RewardScaleEntry(
min_value=50,
max_value=74,
description="The action contributes positively towards solving the problem.",
),
RewardScaleEntry(
min_value=25,
max_value=49,
description="The action is acceptable but may have some issues.",
),
RewardScaleEntry(
min_value=0,
max_value=24,
description="The action has minimal impact or minor negative consequences.",
),
RewardScaleEntry(
min_value=-49,
max_value=-1,
description="The code change is inappropriate, unhelpful, introduces new issues, or redundantly repeats previous changes without making further progress. The Git diff does not align with instructions or is unnecessary.",
),
RewardScaleEntry(
min_value=-100,
max_value=-50,
description="The code change is counterproductive, causing significant setbacks or demonstrating persistent repetition without learning. The agent fails to recognize completed tasks and continues to attempt redundant actions.",
),
]

@staticmethod
def generate_reward_scale_entries(
descriptions: List[Tuple[int, int, str]],
@@ -154,14 +114,7 @@ def get_value_function_prompt(cls) -> str:
Get the base prompt for the value function.
This method can be overridden in subclasses to provide action-specific prompts.
"""
return """Your role is to evaluate the **last executed action** of the search tree that our AI agents are traversing, to help us determine the best trajectory to solve a programming issue. The agent is responsible for identifying and modifying the correct file(s) in response to the problem statement.
Important: While line numbers may be referenced in the initial problem description, they can shift as changes are made to the file. Focus on whether the agent is modifying the correct logical parts of the code, rather than strictly matching the initially mentioned line numbers. What matters is that the right section of code is being modified, even if its current line number differs from what was originally specified.
At this stage, the agent is still working on the solution. Your task is twofold:
1. **Evaluation**: Assess whether the change done by the **last executed action** is appropriate for addressing the problem and whether the agent is on the right path to resolving the issue. Verify that the correct sections of code are being modified, regardless of their current line numbers.
2. **Alternative Feedback**: Independently of your evaluation, provide guidance for an alternative problem-solving branch. This ensures parallel exploration of different solution paths.
"""
pass

@classmethod
def get_few_shot_examples(cls) -> List[FewShotExample]:
@@ -172,9 +125,7 @@ def get_few_shot_examples(cls) -> List[FewShotExample]:
return []

@classmethod
def get_action_by_args_class(
cls, args_class: Type[ActionArguments]
) -> Optional[Type["Action"]]:
def get_action_by_args_class(cls, args_class: Type[ActionArguments]) -> Optional[Type["Action"]]:
"""
Get the Action subclass corresponding to the given ActionArguments subclass.
@@ -186,10 +137,7 @@ def get_action_by_args_class(
"""

def search_subclasses(current_class):
if (
hasattr(current_class, "args_schema")
and current_class.args_schema == args_class
):
if hasattr(current_class, "args_schema") and current_class.args_schema == args_class:
return current_class
for subclass in current_class.__subclasses__():
result = search_subclasses(subclass)
19 changes: 7 additions & 12 deletions moatless/actions/append_string.py
@@ -1,12 +1,12 @@
import re
from typing import List

from pydantic import Field
from pydantic import Field, ConfigDict

from moatless.actions.action import Action
from moatless.actions.action import Action, FewShotExample
from moatless.actions.code_action_value_mixin import CodeActionValueMixin
from moatless.actions.code_modification_mixin import CodeModificationMixin
from moatless.actions.model import ActionArguments, FewShotExample, Observation
from moatless.actions.schema import ActionArguments, Observation
from moatless.file_context import FileContext
from moatless.index.code_index import CodeIndex
from moatless.repository.file import do_diff
@@ -20,13 +20,10 @@ class AppendStringArgs(ActionArguments):
Append text content to the end of a file.
"""

path: str = Field(..., description="Path to the file to append to")
new_str: str = Field(
..., description="Text content to append at the end of the file"
)
model_config = ConfigDict(title="AppendString")

class Config:
title = "AppendString"
path: str = Field(..., description="Path to the file to append to")
new_str: str = Field(..., description="Text content to append at the end of the file")

def format_args_for_llm(self) -> str:
return f"""<path>{self.path}</path>
@@ -36,9 +33,7 @@ def format_args_for_llm(self) -> str:

@classmethod
def format_schema_for_llm(cls) -> str:
return cls.format_xml_schema(
{"path": "file/path.py", "new_str": "\ncontent to append at end of file\n"}
)
return cls.format_xml_schema({"path": "file/path.py", "new_str": "\ncontent to append at end of file\n"})

@classmethod
def get_few_shot_examples(cls) -> List[FewShotExample]: