This repository contains the implementation code for the paper "Outcome-Refining Process Supervision for Code Generation".
- First, ensure you have Conda installed. If not, please install Miniconda or Anaconda.
- Create and activate a conda environment:

```bash
# Create a new environment named orps
conda create -n orps python=3.12.7
# Activate the environment
conda activate orps
```

- Install dependencies using `requirements.txt`:

```bash
pip install -r requirements.txt
```

Note: Make sure you have activated the conda environment before installing the dependencies.
We recommend using Hugging Face's Text Generation Inference (TGI) for model deployment and inference. Here are the steps to set up:
- Deploy TGI by following the instructions at Text Generation Inference.
- After deployment, you will get a base URL for your model. Configure this URL in the configuration file to run the code.
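As an optional sanity check, you can send a single request to TGI's `/generate` endpoint to confirm the server is reachable before putting the URL in your config. The snippet below is a minimal sketch; the base URL and prompt are placeholders, and the repository's `remote_hf` mode handles this communication for you during experiments.

```python
# Optional sanity check (not part of this repository): send one request to a
# deployed TGI server to confirm the base URL is reachable.
# The base URL and prompt below are placeholders.
import requests

TGI_BASE_URL = "http://localhost:8080"  # replace with the URL of your deployment

payload = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 64},
}
response = requests.post(f"{TGI_BASE_URL}/generate", json=payload, timeout=60)
response.raise_for_status()
print(response.json()["generated_text"])
```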
Our repository supports multiple communication modes with language models:
- Remote HF (Recommended)
  - Uses TGI for efficient model serving
  - Provides better performance and resource management
  - Configure the base URL in the config file after TGI deployment
- Alternative Modes
  - `local_hf`: For local Hugging Face model inference
  - `openai`: For using OpenAI's API

While we support multiple modes, we recommend using the `remote_hf` mode with TGI for optimal performance.
The code uses JSON configuration files to specify experiment parameters. Template configuration files can be found in the `./config/templates` directory.

To run an experiment:

```bash
python run.py -c path/to/your/config.json
```
To improve efficiency, our code supports running experiments on split datasets in parallel. This is controlled by the `data_group_tuple` parameter in the configuration file.

For example, to split your dataset into 3 parts, you would create three configuration files with:

```
// Config 1: config_split_0.json
"data_group_tuple": [3, 0]
// Config 2: config_split_1.json
"data_group_tuple": [3, 1]
// Config 3: config_split_2.json
"data_group_tuple": [3, 2]
```

Where:
- The first number (3) indicates the total number of splits
- The second number (0, 1, 2) indicates which split this configuration will process
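The exact slicing logic lives in the repository's data loading code; purely as an illustration of what a `[total_splits, split_index]` pair implies, one plausible scheme is a strided partition like the sketch below. The function name and strided layout are assumptions, not the repository's implementation (which may, for instance, use contiguous chunks instead).

```python
# Illustrative only: one plausible way to map data_group_tuple = [total, index]
# onto a subset of the dataset. The repository may use a different scheme.
def select_split(dataset, data_group_tuple):
    total_splits, split_index = data_group_tuple
    return [ex for i, ex in enumerate(dataset) if i % total_splits == split_index]

# Example: with [3, 1], items 1, 4, 7, ... of the dataset are processed.
problems = list(range(10))
print(select_split(problems, [3, 1]))  # -> [1, 4, 7]
```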
After creating the split configuration files, you can manually run them in parallel using tmux:
- Create new tmux sessions for each split:

```bash
# Create and start session for split 0
tmux new-session -d -s split0 "python run.py -c config_split_0.json"
# Create and start session for split 1
tmux new-session -d -s split1 "python run.py -c config_split_1.json"
# Create and start session for split 2
tmux new-session -d -s split2 "python run.py -c config_split_2.json"
```

- Monitor the sessions:

```bash
# List all running tmux sessions
tmux list-sessions
# Attach to a specific session
tmux attach-session -t split0  # For monitoring split 0
tmux attach-session -t split1  # For monitoring split 1
tmux attach-session -t split2  # For monitoring split 2
```
- Useful tmux commands while monitoring:
  - `Ctrl+b d`: Detach from current session
  - `Ctrl+b s`: List and switch between sessions
  - `tmux kill-session -t split0`: Kill a specific session
  - `tmux kill-server`: Kill all sessions
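If you would rather not check on the splits by hand, a small helper (not part of the repository) can poll tmux until every split session has exited. It assumes the `split0`, `split1`, `split2` session names used above.

```python
# Optional helper (not part of the repository): wait until all tmux sessions
# named split0, split1, split2 have finished, polling every 60 seconds.
import subprocess
import time

SESSION_NAMES = {"split0", "split1", "split2"}

while True:
    result = subprocess.run(
        ["tmux", "list-sessions", "-F", "#{session_name}"],
        capture_output=True, text=True,
    )
    running = set(result.stdout.split()) & SESSION_NAMES
    if not running:
        print("All splits finished.")
        break
    print(f"Still running: {sorted(running)}")
    time.sleep(60)
```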
To simplify the process of running multiple splits, we provide a `run_tmux.py` script that automates the creation and execution of split configurations using tmux sessions.

Usage:

```bash
python run_tmux.py \
    --num-partitions 8 \
    --base-config path/to/base/config.json \
    --output-config-dir path/to/output/configs \
    --python-path path/to/python \
    --output-log-dir path/to/logs
```
This will:
- Create multiple configuration files with different data splits
- Launch separate tmux sessions for each split
- Run experiments in parallel
Parameters:
- `num-partitions`: Number of splits to create
- `base-config`: Path to your base configuration file
- `output-config-dir`: Directory to store generated config files
- `python-path`: Path to the Python interpreter
- `output-log-dir`: Directory for log files
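Conceptually, the script does something like the sketch below: duplicate the base config once per partition, set `data_group_tuple`, and launch one detached tmux session per partition. This is an illustration under stated assumptions, not the actual `run_tmux.py`; the function name, log redirection, and file layout are made up.

```python
# Conceptual sketch only -- not the repository's run_tmux.py.
import json
import subprocess
from pathlib import Path

def launch_partitions(base_config, output_config_dir, python_path, output_log_dir, num_partitions):
    base = json.loads(Path(base_config).read_text())
    Path(output_config_dir).mkdir(parents=True, exist_ok=True)
    Path(output_log_dir).mkdir(parents=True, exist_ok=True)

    for i in range(num_partitions):
        # Copy the base config and assign this partition's data split.
        config = dict(base)
        config["data_group_tuple"] = [num_partitions, i]
        config_path = Path(output_config_dir) / f"config_split_{i}.json"
        config_path.write_text(json.dumps(config, indent=2))

        # Run each split in its own detached tmux session, logging to a file.
        log_path = Path(output_log_dir) / f"split_{i}.log"
        command = f"{python_path} run.py -c {config_path} > {log_path} 2>&1"
        subprocess.run(["tmux", "new-session", "-d", "-s", f"split{i}", command], check=True)

# Example (paths are placeholders):
# launch_partitions("config.json", "configs", "python", "logs", num_partitions=8)
```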
Before evaluating the results, you need to calculate the standard runtime metrics for your execution environment. This is crucial because runtime metrics are environment-dependent, and we need these baseline values for proper normalization.
Use `calculate_standard_runtime_metrics.py`:

```bash
python calculate_standard_runtime_metrics.py \
    --dataset_path path/to/your/dataset \
    --dataset_type lbpp \
    --output_path path/to/save/standard/metrics
```

Parameters:
- `dataset_path`: Path to your dataset
- `dataset_type`: Type of dataset. Use:
  - `lbpp` for the LBPP dataset
  - `lbpp_alter` for MBPP or HumanEval datasets
- `output_path`: Where to save the standard metrics
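The evaluation scripts take care of the normalization itself. As a rough illustration of why the baseline matters: a measured runtime only becomes comparable across machines once it is divided by that machine's standard runtime. The function name and numbers below are illustrative assumptions, not the repository's API.

```python
# Illustration only: normalizing a measured runtime by the standard runtime of
# the same machine makes scores comparable across environments.
def normalize_runtime(measured_ns, standard_ns):
    return measured_ns / standard_ns

# A solution taking 2.0e9 ns on a machine whose standard runtime is 1.6e9 ns
# scores 1.25x the baseline, regardless of how fast the machine is in absolute terms.
print(normalize_runtime(2.0e9, 1.6e9))  # -> 1.25
```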
After running experiments, use `report_scores.py` to calculate the results. The results will be stored in the directory specified by `output_path/step_name` in your configuration file.

```bash
python report_scores.py \
    --output_path path/to/experiment/output \
    --select_method success_ratio-time_enabled_ns \
    --standard_metrics_path path/to/standard/metrics \
    --result_path path/to/save/results
```
Parameters:
- `output_path`: Path to the experiment output (should match `output_path/step_name` from your config)
- `select_method`: Strategy for selecting the best solutions. Important: this must match the `select_best_node_method` specified in your configuration file. For example, if your config uses `"select_best_node_method": "success_ratio-time_enabled_ns"`, you must use the same value here.
- `standard_metrics_path`: Path to the standard metrics calculated in Step 1
- `result_path`: Where to save the final results
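For intuition about compound method names like `success_ratio-time_enabled_ns`, one plausible reading is "prefer the highest success ratio, then the lowest runtime". The sketch below illustrates that ordering only; the exact logic in `report_scores.py` may differ, and the candidate fields are hypothetical.

```python
# Hedged illustration of a compound selection key like "success_ratio-time_enabled_ns":
# maximize success_ratio first, then prefer a lower runtime. Field names are hypothetical.
candidates = [
    {"id": "a", "success_ratio": 1.0, "time_enabled_ns": 3.2e8},
    {"id": "b", "success_ratio": 1.0, "time_enabled_ns": 2.1e8},
    {"id": "c", "success_ratio": 0.8, "time_enabled_ns": 1.0e8},
]
best = max(candidates, key=lambda c: (c["success_ratio"], -c["time_enabled_ns"]))
print(best["id"])  # -> "b"
```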