MGDebugger is a hierarchical LLM code debugging method designed to isolate, identify, and resolve errors at various levels of granularity. Using a hierarchical bottom-up debugging approach, MGDebugger systematically progresses from individual subfunctions to the overall system, enabling precise error detection and correction.
With MGDebugger, developers can efficiently debug complex codes and functions by performing granular analysis, reducing debugging time, and improving the success rate of resolving complex issues.
Before running MGDebugger, ensure your environment meets the following requirements:
-
Python: Version 3.8 or later.
-
vLLM: Version 0.6.0 or later. Required for model loading and inference. You can follow the official installation guide to set it up.
-
Additional dependencies: Install all necessary Python packages using the following command:
pip install -r requirements.txt
To launch the vLLM server with the DeepSeek-Coder-V2-Lite-Instruct
model, execute the following command:
python -m vllm.entrypoints.openai.api_server \
--model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct \
--trust-remote-code \
--dtype auto \
--api-key token-abc123s \
--port 18889
This will initialize the model and start the server on port 18889
.
We've prepared a demo code snippet to showcase MGDebugger's debugging capabilities. You can run the demo by executing the following command after starting the vLLM server:
python demo.py
Once the vLLM server is up and running, start MGDebugger by executing:
python main.py
Tip: You can modify the
MODEL
andinput_seeds
parameters in theconfig.py
file to test different models and input configurations.
MGDebugger automatically stores all debugging and error logs in the output_data
directory. You can review these logs to gain deeper insights into debugging details and performance analysis.
The table below highlights the performance of different methods compared to the baseline (No-Debugging) on the HumanEval and MBPP datasets using the DeepSeek-Coder-V2-Lite model.
Method | HumanEval Acc. (%) | Δ Acc. (%) | HumanEval RSR (%) | MBPP Acc. (%) | Δ Acc. (%) | MBPP RSR (%) |
---|---|---|---|---|---|---|
No-Debugging | 76.8 | -- | -- | 67.2 | -- | -- |
Simple Feedback | 82.3 | +5.5 | 23.7 | 69.4 | +2.2 | 6.7 |
Self-Edit | 82.9 | +6.1 | 26.3 | 71.2 | +4.0 | 12.2 |
LDB (Block) | 84.1 | +7.3 | 31.6 | 74.0 | +6.8 | 20.7 |
Self-Debugging (Expl.) | 87.2 | +10.4 | 44.7 | 73.4 | +6.2 | 18.9 |
Self-Debugging (Trace) | 86.0 | +9.2 | 39.5 | 72.6 | +5.3 | 16.5 |
Reflexion | 90.9 | +14.1 | 60.5 | 76.6 | +9.4 | 28.7 |
Our Approach | 94.5 | +17.7 | 76.3 | 80.0 | +12.8 | 39.0 |
Our approach achieved the highest accuracy on both HumanEval and MBPP datasets, with a remarkable improvement of +17.7% and +12.8% in accuracy over the baseline, respectively. The Repair Success Rate (RSR) was also significantly higher than other methods, demonstrating the effectiveness of our debugging strategy in fixing diverse code issues.
We warmly welcome contributions to MGDebugger! We appreciate your feedback and look forward to building MGDebugger together with the community!