Rework (#40)

aorwall authored Jan 17, 2025
1 parent ddac5f6 commit c8658a1
Showing 117 changed files with 105,579 additions and 6,577 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -160,5 +160,6 @@ notebooks/local_experiments.ipynb
playground
logs
Pipfile
experiments
evals
test_results
experiments
96 changes: 95 additions & 1 deletion README.md
@@ -47,6 +47,13 @@ I have focused on testing my ideas, and the project is currently a bit messy. My

## Environment Setup

Install dependencies:
```bash
poetry install
```

## Environment Variables

Before running the evaluation, you'll need:
1. At least one LLM provider API key (e.g., OpenAI, Anthropic, etc.)
2. A Voyage AI API key from [voyageai.com](https://voyageai.com) to use the pre-embedded vector stores for SWE-Bench instances.
@@ -86,7 +93,94 @@ export TESTBED_API_KEY="<your-key>"
export TESTBED_BASE_URL="<your-base-url>"
```
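
For reference, the provider keys from the list above can be exported the same way. The exact variable names depend on your provider setup; the names below are common defaults and are shown here only as an assumption:

```bash
# Assumed variable names; adjust to your LLM provider
export ANTHROPIC_API_KEY="<your-key>"  # or OPENAI_API_KEY, etc.
export VOYAGE_API_KEY="<your-key>"     # Voyage AI key for the pre-embedded vector stores
```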

## Example
## Verified Models

Default model configurations are provided for verified models. Note that other models may work but have not been extensively tested. When specifying just the `--model` argument, the following configurations are used:

| Model | Response Format | Message History | Thoughts in Action |
|-------|----------------|-----------------|-------------------|
| claude-3-5-sonnet-20241022 | tool_call | messages | no |
| claude-3-5-haiku-20241022 | tool_call | messages | no |
| gpt-4o-2024-11-20 | tool_call | messages | yes |
| gpt-4o-mini-2024-07-18 | tool_call | messages | yes |
| deepseek/deepseek-chat | react | react | yes |
| gemini/gemini-2.0-flash-exp | tool_call | messages | yes |
| openrouter/meta-llama/llama-3.1-70b-instruct | react | react | no |
| openrouter/qwen/qwen-2.5-coder-32b-instruct | react | react | no |
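
For example, selecting one of these models with `--model` alone applies its default response format and message history (a minimal sketch; the full command is documented under Run Evaluation below):

```bash
# deepseek/deepseek-chat defaults to the ReAct response format and message history
poetry run python -m moatless.benchmark.run_evaluation --model deepseek/deepseek-chat
```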

## Verify Setup

Before running the full evaluation, you can verify your setup using the integration test script:

```bash
# Run a single model test
poetry run scripts/run_integration_tests.py --model claude-3-5-sonnet-20241022
```

The script runs the model against a sample SWE-Bench instance.

Results are saved in `test_results/integration_test_<timestamp>/`.


## Run Evaluation

The evaluation script supports various configuration options through command line arguments:

```bash
poetry run python -m moatless.benchmark.run_evaluation [OPTIONS]
```

Required arguments:
- `--model MODEL`: Model to use for evaluation (e.g., 'claude-3-5-sonnet-20241022', 'gpt-4o')

Optional arguments:
- Model settings:
  - `--model MODEL`: Model identifier. Can be a verified model from the table above or any custom model identifier.
  - `--api-key KEY`: API key for the model
  - `--base-url URL`: Base URL for the model API
  - `--response-format FORMAT`: Response format ('tool_call' or 'react'). Defaults to 'tool_call' for custom models
  - `--message-history TYPE`: Message history type ('messages', 'summary', 'react', 'messages_compact', 'instruct'). Defaults to 'messages' for custom models
  - `--thoughts-in-action`: Enable thoughts in action
  - `--temperature FLOAT`: Temperature for model sampling. Defaults to 0.0

- Dataset settings:
  - `--split SPLIT`: Dataset split to use. Defaults to 'lite'
  - `--instance-ids ID [ID ...]`: Specific instance IDs to evaluate

- Loop settings:
  - `--max-iterations INT`: Maximum number of iterations
  - `--max-cost FLOAT`: Maximum cost in dollars

- Runner settings:
  - `--num-workers INT`: Number of parallel workers. Defaults to 10
  - `--evaluation-name NAME`: Custom name for the evaluation run
  - `--rerun-errors`: Rerun instances that previously errored
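
For example, a run can be bounded by iteration count and cost using the flags above (a sketch; adjust the values to your budget):

```bash
poetry run python -m moatless.benchmark.run_evaluation \
    --model gpt-4o-mini-2024-07-18 \
    --max-iterations 30 \
    --max-cost 1.0 \
    --num-workers 4
```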

Available dataset splits that can be specified with the `--split` argument:

| Split Name | Description | Instance Count |
|------------|-------------|----------------|
| lite | All instances from the lite dataset | 300 |
| verified | All instances from the verified dataset | 500 |
| verified_mini | [MariusHobbhahn/swe-bench-verified-mini](https://huggingface.co/datasets/MariusHobbhahn/swe-bench-verified-mini), a subset of SWE-Bench Verified | 50 |
| lite_and_verified_solvable | Instances that exist in both lite and verified datasets and have at least one solved submission to SWE-Bench | 84 |

Example usage:
```bash
# Run evaluation with Claude 3.5 Sonnet using the ReAct format
poetry run python -m moatless.benchmark.run_evaluation \
--model claude-3-5-sonnet-20241022 \
--response-format react \
--message-history react \
--num-workers 10

# Run specific instances with GPT-4o
poetry run python -m moatless.benchmark.run_evaluation \
--model gpt-4o \
--instance-ids "django__django-16379"
```
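
A different dataset split can be selected with `--split` (a sketch using the splits listed above):

```bash
# Evaluate the 50-instance verified_mini subset
poetry run python -m moatless.benchmark.run_evaluation \
    --model claude-3-5-haiku-20241022 \
    --split verified_mini
```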

## Code Example

Basic setup using the `AgenticLoop` to solve a SWE-Bench instance.

1 change: 0 additions & 1 deletion moatless/actions/__init__.py
@@ -1,5 +1,4 @@
from moatless.actions.append_string import AppendString
from moatless.actions.code_change import RequestCodeChange
from moatless.actions.create_file import CreateFile
from moatless.actions.find_class import FindClass
from moatless.actions.find_code_snippet import FindCodeSnippet
60 changes: 4 additions & 56 deletions moatless/actions/action.py
@@ -6,12 +6,7 @@

from pydantic import BaseModel, ConfigDict

from moatless.actions.model import (
ActionArguments,
Observation,
RewardScaleEntry,
FewShotExample,
)
from moatless.actions.schema import ActionArguments, Observation, RewardScaleEntry, FewShotExample
from moatless.file_context import FileContext
from moatless.index import CodeIndex
from moatless.repository.repository import Repository
@@ -79,41 +74,6 @@ def get_evaluation_criteria(cls, trajectory_length: int | None = None) -> List[s
"Repetitive or Redundant Actions: Detect if the agent is repeating the same unsuccessful or redundant actions without making progress. Pay close attention to the agent's history and outputs indicating lack of progress.",
]

@classmethod
def get_reward_scale(cls, trajectory_length) -> List[RewardScaleEntry]:
return [
RewardScaleEntry(
min_value=75,
max_value=100,
description="The action significantly advances the solution.",
),
RewardScaleEntry(
min_value=50,
max_value=74,
description="The action contributes positively towards solving the problem.",
),
RewardScaleEntry(
min_value=25,
max_value=49,
description="The action is acceptable but may have some issues.",
),
RewardScaleEntry(
min_value=0,
max_value=24,
description="The action has minimal impact or minor negative consequences.",
),
RewardScaleEntry(
min_value=-49,
max_value=-1,
description="The code change is inappropriate, unhelpful, introduces new issues, or redundantly repeats previous changes without making further progress. The Git diff does not align with instructions or is unnecessary.",
),
RewardScaleEntry(
min_value=-100,
max_value=-50,
description="The code change is counterproductive, causing significant setbacks or demonstrating persistent repetition without learning. The agent fails to recognize completed tasks and continues to attempt redundant actions.",
),
]

@staticmethod
def generate_reward_scale_entries(
descriptions: List[Tuple[int, int, str]],
@@ -154,14 +114,7 @@ def get_value_function_prompt(cls) -> str:
Get the base prompt for the value function.
This method can be overridden in subclasses to provide action-specific prompts.
"""
return """Your role is to evaluate the **last executed action** of the search tree that our AI agents are traversing, to help us determine the best trajectory to solve a programming issue. The agent is responsible for identifying and modifying the correct file(s) in response to the problem statement.
Important: While line numbers may be referenced in the initial problem description, they can shift as changes are made to the file. Focus on whether the agent is modifying the correct logical parts of the code, rather than strictly matching the initially mentioned line numbers. What matters is that the right section of code is being modified, even if its current line number differs from what was originally specified.
At this stage, the agent is still working on the solution. Your task is twofold:
1. **Evaluation**: Assess whether the change done by the **last executed action** is appropriate for addressing the problem and whether the agent is on the right path to resolving the issue. Verify that the correct sections of code are being modified, regardless of their current line numbers.
2. **Alternative Feedback**: Independently of your evaluation, provide guidance for an alternative problem-solving branch. This ensures parallel exploration of different solution paths.
"""
pass

@classmethod
def get_few_shot_examples(cls) -> List[FewShotExample]:
@@ -172,9 +125,7 @@ def get_few_shot_examples(cls) -> List[FewShotExample]:
return []

@classmethod
def get_action_by_args_class(
cls, args_class: Type[ActionArguments]
) -> Optional[Type["Action"]]:
def get_action_by_args_class(cls, args_class: Type[ActionArguments]) -> Optional[Type["Action"]]:
"""
Get the Action subclass corresponding to the given ActionArguments subclass.
@@ -186,10 +137,7 @@ def get_action_by_args_class(
"""

def search_subclasses(current_class):
if (
hasattr(current_class, "args_schema")
and current_class.args_schema == args_class
):
if hasattr(current_class, "args_schema") and current_class.args_schema == args_class:
return current_class
for subclass in current_class.__subclasses__():
result = search_subclasses(subclass)
19 changes: 7 additions & 12 deletions moatless/actions/append_string.py
@@ -1,12 +1,12 @@
import re
from typing import List

from pydantic import Field
from pydantic import Field, ConfigDict

from moatless.actions.action import Action
from moatless.actions.action import Action, FewShotExample
from moatless.actions.code_action_value_mixin import CodeActionValueMixin
from moatless.actions.code_modification_mixin import CodeModificationMixin
from moatless.actions.model import ActionArguments, FewShotExample, Observation
from moatless.actions.schema import ActionArguments, Observation
from moatless.file_context import FileContext
from moatless.index.code_index import CodeIndex
from moatless.repository.file import do_diff
@@ -20,13 +20,10 @@ class AppendStringArgs(ActionArguments):
Append text content to the end of a file.
"""

path: str = Field(..., description="Path to the file to append to")
new_str: str = Field(
..., description="Text content to append at the end of the file"
)
model_config = ConfigDict(title="AppendString")

class Config:
title = "AppendString"
path: str = Field(..., description="Path to the file to append to")
new_str: str = Field(..., description="Text content to append at the end of the file")

def format_args_for_llm(self) -> str:
return f"""<path>{self.path}</path>
@@ -36,9 +33,7 @@ def format_args_for_llm(self) -> str:

@classmethod
def format_schema_for_llm(cls) -> str:
return cls.format_xml_schema(
{"path": "file/path.py", "new_str": "\ncontent to append at end of file\n"}
)
return cls.format_xml_schema({"path": "file/path.py", "new_str": "\ncontent to append at end of file\n"})

@classmethod
def get_few_shot_examples(cls) -> List[FewShotExample]: